Getov, Vladimir and Wei, Yuan and Carter, Larry and Gatlin, Kang Su (1999) Performance optimisations of the NPB FT kernel by special-purpose unroller. In: Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing (PDP '99). IEEE, pp. 84-88. ISBN 0769500595Full text not available from this repository.
The fast Fourier transform (FFT) is the cornerstone of many supercomputer applications and therefore needs careful performance tuning. Most often, however the real performance of the FFT implementations is far below the acceptable figures. In this paper we explore several strategies for performance optimisations of the FFT computation, such as enhancing instruction-level parallelism, loop merging, and reducing the memory loads and stores by using a special-purpose automatic loop unroller. Our approach is based on the principle of complete unrolling which we apply to modify the FT kernel of the NAS Parallel Benchmarks (NPB). In experiments on two different IBM SP2 platforms, our automatically generated unrolled FFT subroutine is shown to improve the performance between 40% and 53% in comparison with the original code. Further the execution time of the entire 3-D FFT mega-step of the benchmark is faster than when calls to a similar FFT subroutine from the vendor-optimised PESSL numerical library are used. Preliminary results suggest that the completely unrolled code also outperforms FFTW another high-performance FFT package. Finally, our approach for automatic generation of moderately optimised but specialised codes requires only a modest amount of programming effort.
|Item Type:||Book Section|
|Subjects:||University of Westminster > Science and Technology > Electronics and Computer Science, School of (No longer in use)|
|Depositing User:||Miss Nina Watts|
|Date Deposited:||12 Feb 2010 11:53|
|Last Modified:||12 Feb 2010 11:53|
Actions (login required)
|Edit Item (Repository staff only)|