Research on Mahalanobis Distance Algorithm Optimization Based on OpenCL

Bibliographic Details
Published in: 2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Systems (HPCC, CSS, ICESS), pp. 84-91
Main Authors: Qingchun Xie, Yunquan Zhang, Haipeng Jia, Yongquan Lu
Format: Conference Proceeding
Language: English
Published: IEEE, 01.08.2014

Summary: The Mahalanobis distance algorithm is widely used in machine learning and classification, and accelerating it on GPUs has practical significance, especially for applications with demanding real-time requirements. However, due to the complexity of GPU hardware architectures, optimizing the algorithm while achieving high performance portability across different GPU platforms remains difficult. The traditional Mahalanobis distance algorithm also contains a large amount of redundant computation and is hard to parallelize. In this paper, we not only shed light on general ways to map mature algorithms onto GPU devices to implement the Mahalanobis distance, such as using a fixed number of workgroups and vectorization, but also put forward five valuable performance-optimization methods for heterogeneous computing platforms, such as a two-dimensional NDRange workgroup arrangement, Uberkernels, and binary reduction with the help of LDS (local data share) within one workgroup. We further demonstrate the high performance of our implementation by comparing it with a well-optimized CPU version from the OpenCV library. Experimental results show speedups of 10.59x on an NVIDIA C2050, 8.55x on an AMD HD5850, and 16.08x on an AMD HD7970, verifying the effectiveness of the proposed optimization methods.
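For reference, the quantity the paper accelerates is the Mahalanobis distance d(x, mu) = sqrt((x - mu)^T * Sigma^-1 * (x - mu)), where mu is the distribution mean and Sigma its covariance matrix. Below is a minimal NumPy sketch of the baseline (CPU) computation; the function name and toy data are illustrative, not taken from the paper, and none of the GPU-side optimizations described in the abstract are reflected here:

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """Mahalanobis distance of sample x from a distribution with
    mean mu and covariance matrix cov:
        sqrt((x - mu)^T * cov^-1 * (x - mu))
    """
    diff = x - mu
    # Solve cov @ y = diff instead of forming the explicit inverse,
    # which is cheaper and numerically safer.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Toy check: with an identity covariance the Mahalanobis distance
# reduces to the ordinary Euclidean distance.
mu = np.zeros(2)
cov = np.eye(2)
print(mahalanobis(np.array([3.0, 4.0]), mu, cov))  # -> 5.0
```

The per-sample work is dominated by the matrix-vector solve; the paper's GPU version instead batches many samples, which is what makes workgroup layout and in-workgroup reduction (the LDS binary reduction mentioned above) the performance-critical choices.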
DOI: 10.1109/HPCC.2014.19