Getov, Vladimir and Wei, Yuan and Carter, Larry and Gatlin, Kang Su (1999) Performance optimisations of the NPB FT kernel by special-purpose unroller. In: Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing (PDP '99). IEEE, pp. 84-88. ISBN 0769500595
Full text not available from this repository.
Official URL: http://dx.doi.org/10.1109/EMPDP.1999.746649
The fast Fourier transform (FFT) is the cornerstone of many supercomputer applications and therefore needs careful performance tuning. Most often, however the real performance of the FFT implementations is far below the acceptable figures. In this paper we explore several strategies for performance optimisations of the FFT computation, such as enhancing instruction-level parallelism, loop merging, and reducing the memory loads and stores by using a special-purpose automatic loop unroller. Our approach is based on the principle of complete unrolling which we apply to modify the FT kernel of the NAS Parallel Benchmarks (NPB). In experiments on two different IBM SP2 platforms, our automatically generated unrolled FFT subroutine is shown to improve the performance between 40% and 53% in comparison with the original code. Further the execution time of the entire 3-D FFT mega-step of the benchmark is faster than when calls to a similar FFT subroutine from the vendor-optimised PESSL numerical library are used. Preliminary results suggest that the completely unrolled code also outperforms FFTW another high-performance FFT package. Finally, our approach for automatic generation of moderately optimised but specialised codes requires only a modest amount of programming effort.
|Item Type:||Book Section|
|Research Community:||University of Westminster > Electronics and Computer Science, School of|
|Deposited On:||12 Feb 2010 11:53|
|Last Modified:||12 Feb 2010 11:53|
Repository Staff Only: item control page