Bachelor's Thesis BCLR-2021-71

Bibliographic data
Egger, Simon: Distributed Fast Fourier Transform for heterogeneous GPU systems.
Universität Stuttgart, Faculty of Computer Science, Electrical Engineering and Information Technology, Bachelor's Thesis No. 71 (2021).
105 pages, English.
Abstract

The Fast Fourier Transform (FFT) is a numerical method that converts input data into a frequency-domain representation. A wide range of applications requires the computation of three-dimensional FFTs, which makes the use of Graphics Processing Units (GPUs) in distributed systems particularly appealing. The most common approach to distributed computation is to partition the global input data, resulting in either slab decomposition or pencil decomposition.

For large numbers of processes, it is well known that slab decomposition provides only limited scalability and is generally outperformed by pencil decomposition. This often leaves their performance comparison on fewer GPUs as a blind spot: we found that slab decomposition generally dominates for larger input sizes when utilizing fewer GPUs, which is consistent with simple theoretical models. An exception to this rule arises when the processor grid of pencil decomposition is specifically aligned to fully utilize the available NVLink interconnects.
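The structural difference between the two schemes can be sketched in a single process: slab decomposition splits the data along one axis and needs one global redistribution, while pencil decomposition splits along two axes and needs two. The following NumPy sketch is purely illustrative (the thesis distributes these partitions across GPUs with cuFFT and MPI); the function names and the simulated process counts are assumptions, not the thesis' API.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 8)) + 1j * rng.standard_normal((8, 8, 8))


def slab_fft(x, P):
    # Slab decomposition: split along x into P slabs; each "process"
    # computes a 2D FFT over its local (y, z) planes. One global
    # redistribution then makes the x-direction local for the final
    # batch of 1D FFTs.
    slabs = np.split(x, P, axis=0)
    partial = np.concatenate([np.fft.fftn(s, axes=(1, 2)) for s in slabs], axis=0)
    # <-- the single all-to-all redistribution would happen here
    return np.fft.fft(partial, axis=0)


def pencil_fft(x, P1, P2):
    # Pencil decomposition: split along x and y into P1 * P2 pencils;
    # three 1D FFT stages (z, then y, then x), separated by two global
    # redistributions.
    out = np.empty_like(x)
    for i in np.array_split(np.arange(x.shape[0]), P1):
        for j in np.array_split(np.arange(x.shape[1]), P2):
            out[np.ix_(i, j)] = np.fft.fft(x[np.ix_(i, j)], axis=2)
    out = np.fft.fft(out, axis=1)   # after the first redistribution
    return np.fft.fft(out, axis=0)  # after the second redistribution


# Both schemes reproduce the reference 3D FFT, since the transform is
# separable and the axis order does not matter.
ref = np.fft.fftn(x)
assert np.allclose(slab_fft(x, 4), ref)
assert np.allclose(pencil_fft(x, 2, 2), ref)
```

The extra redistribution is why pencil decomposition tends to lose on few GPUs, while its finer partitioning (up to P1 * P2 processes instead of P) is what gives it better scalability at large process counts.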

In addition to the default implementations of slab decomposition and pencil decomposition, we propose Realigned as a possible optimization for both decomposition methods, taking advantage of cuFFT's advanced data layout. Most notably, Realigned reduces the additional memory requirements of pencil decomposition and computes the 1D FFTs in y-direction more efficiently.
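cuFFT's advanced data layout lets a plan describe strided, batched transforms directly (via its input/output stride and distance parameters), so the y-direction FFTs can operate on the data in place instead of first packing it into a contiguous layout. As a rough single-process analogy, assumed here only for illustration, NumPy expresses the same idea by transforming along a non-contiguous axis without an explicit transpose:

```python
import numpy as np

nx, ny, nz = 4, 6, 8
# C-contiguous complex data: z is the fastest-varying dimension.
x = (np.arange(nx * ny * nz) + 0j).reshape(nx, ny, nz)

# Naive layout handling: transpose so y becomes contiguous, transform,
# transpose back -- this is the packing step Realigned tries to avoid.
naive = np.swapaxes(np.fft.fft(np.swapaxes(x, 1, 2), axis=2), 1, 2)

# Strided-layout analogy: each ny-point transform strides through memory
# with stride nz (cuFFT: istride = nz) and consecutive transforms of a
# batch start one complex word apart (cuFFT: idist = 1). NumPy expresses
# this access pattern as axis=1 on the original array, with no copy.
direct = np.fft.fft(x, axis=1)

assert np.allclose(naive, direct)
```

In the actual implementation the strided description is handed to the cuFFT plan, which removes the intermediate transposed buffer; this is where the reduced memory footprint of the Realigned variant comes from.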

Since both decomposition methods require a global redistribution of the intermediate results, we further compare the performance of different Peer2Peer and All2All communication techniques. In particular, we introduce Peer2Peer-Streams, which avoids the need for additional synchronization and allows communication to fully overlap with the packing phase. Our performance benchmarks show that this approach generally performs best for large input sizes on test systems with a limited number of GPUs when MPI is used without CUDA-awareness. Furthermore, we utilize custom MPI datatypes and adopt MPI_Type for GPUs, which dramatically reduces the additional memory requirements and avoids the need for a packing and unpacking phase altogether. By identifying a redistributed partition as a batch of slices, where each slice consists of the maximum number of contiguous, complex-valued words, we found that MPI_Type is often a worthwhile option when neither the sent nor the received partitions are composed of one-dimensional slices.
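The "batch of slices" view can be made concrete: in a C-contiguous 3D array of complex words (z fastest), a rectangular partition decomposes into some number of maximal contiguous runs, and that count determines how expensive packing would be versus describing the layout with a derived MPI datatype. The helper below is a hypothetical illustration under that assumption, not code from the thesis:

```python
def slice_batch(shape, part):
    """Return (number of slices, contiguous complex words per slice) for a
    partition [x0:x1, y0:y1, z0:z1] of a C-contiguous array of `shape`."""
    (nx, ny, nz) = shape
    (sx, sy, sz) = (hi - lo for (lo, hi) in part)
    if sz == nz:            # full z-rows: whole (y, z) ranges fuse together
        if sy == ny:        # full (y, z) planes: the partition is one block
            return 1, sx * ny * nz
        return sx, sy * nz  # one contiguous slice per x-index
    return sx * sy, sz      # one short slice per (x, y) pencil row

# Slab-style partition of a 16^3 grid: a single contiguous block, so a
# derived datatype brings little benefit over a plain send.
assert slice_batch((16, 16, 16), ((0, 4), (0, 16), (0, 16))) == (1, 1024)

# Pencil partition split in y: a handful of long slices.
assert slice_batch((16, 16, 16), ((0, 4), (0, 4), (0, 16))) == (4, 64)

# Partition split in z: many 4-word slices -- here explicit packing gets
# expensive, and describing the slices with a derived MPI datatype
# (avoiding the pack/unpack buffers entirely) becomes attractive.
assert slice_batch((16, 16, 16), ((0, 4), (0, 16), (0, 4))) == (64, 4)
```

The last case mirrors the abstract's conclusion: the fewer and longer the contiguous slices on both the send and receive side, the less a datatype-based redistribution has to offer; once both sides fragment into many short slices, MPI_Type pays off.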

Full text and other links
Full text
Department(s): Universität Stuttgart, Institute for Parallel and Distributed Systems, Simulation of Large Systems
Supervisors: Schulte, Prof. Miriam; Brunn, Malte
Entry date: February 3, 2022