PADDi: Highly Scalable Parallel Algorithm for Discord Discovery on Multi-GPU Clusters


Bibliographic Details
Published in: Lobachevskii Journal of Mathematics, Vol. 46, No. 4, pp. 1480–1494
Main Authors: Kraeva, Y. A.; Zymbler, M. L.
Format: Journal Article
Language: English
Published: Moscow: Pleiades Publishing, 01.04.2025 (Springer Nature B.V.)

Summary: Currently, in a wide spectrum of subject domains, time series data mining requires efficient subsequence anomaly discovery in a very long time series that cannot be entirely placed in RAM. At present, one of the best approaches to this problem is to formalize the anomaly as a discord: a given-length subsequence that is maximally far away from its non-overlapping nearest neighbor. In this article, we introduce a novel parallel algorithm called PADDi (PALMAD-based Anomaly Discovery on Distributed GPUs), which discovers arbitrary-length discords in a very long time series on a high-performance cluster whose nodes are each equipped with multiple GPUs. The algorithm exploits two-level parallelism: first, the time series is divided into equal-length fragments stored on disks associated with the cluster nodes; second, each fragment is split into equal-length segments processed by the GPUs of the respective node. Data exchanges between nodes and calculations on the GPUs within a node are implemented with the MPI and CUDA technologies, respectively. The algorithm proceeds as follows. First, in each segment processed by one GPU, the algorithm selects potential discords and then discards false positives, producing a local candidate set. Next, the local candidate sets are exchanged among the cluster nodes in an "all-to-all" manner, yielding a global candidate set. Each cluster node then refines the global candidates against its own fragment, obtaining a local resulting set of true-positive discords. Finally, each cluster node sends its local resulting set to a master node, which outputs the end result as the intersection of the received local resulting sets.
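The discord definition above (a fixed-length subsequence maximally far from its non-overlapping nearest neighbor) can be illustrated with a brute-force, single-machine sketch. This is not the authors' PALMAD/PADDi implementation, just a minimal Python baseline showing what the distributed algorithm computes:

```python
import math

def discord(ts, m):
    """Return (index, distance) of the top-1 discord of length m: the
    subsequence whose nearest NON-OVERLAPPING neighbor is farthest away.
    Brute force, O(n^2) subsequence comparisons."""
    n = len(ts) - m + 1
    best_idx, best_dist = -1, -1.0
    for i in range(n):
        # Nearest neighbor among subsequences that do not overlap ts[i:i+m]
        nn = min(
            math.dist(ts[i:i + m], ts[j:j + m])
            for j in range(n) if abs(i - j) >= m
        )
        if nn > best_dist:
            best_idx, best_dist = i, nn
    return best_idx, best_dist
```

On a flat series with a single spike, e.g. `discord([0.0]*8 + [10.0] + [0.0]*8, 3)`, the discord is the first subsequence touching the spike, since every non-overlapping neighbor of such a subsequence is far away.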
Extensive experiments over real-world and synthetic million-length time series on various configurations of two high-performance clusters with different GPU models on board (48 to 64 GPUs in total) showed that our algorithm's scalability remains linear, without stagnation or degradation.
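The select-exchange-refine-intersect workflow described in the summary can also be sketched in miniature. The simulation below is hypothetical and single-process (the names `local_candidates`, `refine`, and the distance threshold `r` are our own, not from the paper, which ranks discords by nearest-neighbor distance rather than thresholding): each fragment selects candidates whose in-fragment non-overlapping nearest neighbor is at least `r` away, the candidate sets are merged as the all-to-all exchange would merge them, each "node" refines the global set against its fragment, and the "master" intersects the survivors.

```python
import math

def local_candidates(ts, frag, m, r):
    """Stand-in for the per-GPU phase: within fragment frag = (lo, hi),
    keep subsequences whose nearest non-overlapping neighbor INSIDE the
    fragment is at least r away. Returns global start indices."""
    lo, hi = frag
    idx = range(lo, hi - m + 1)
    cand = set()
    for i in idx:
        nn = min((math.dist(ts[i:i + m], ts[j:j + m])
                  for j in idx if abs(i - j) >= m),
                 default=float("inf"))
        if nn >= r:
            cand.add(i)
    return cand

def refine(ts, frag, cand, m, r):
    """Stand-in for per-node refinement: a global candidate survives on a
    node only if no non-overlapping subsequence of the node's fragment
    lies closer than r to it."""
    lo, hi = frag
    return {c for c in cand
            if all(math.dist(ts[c:c + m], ts[j:j + m]) >= r
                   for j in range(lo, hi - m + 1) if abs(c - j) >= m)}

# Miniature run: a flat series with one spike, split into two fragments
# that overlap by m - 1 points so no boundary subsequence is lost.
ts = [0.0] * 10 + [5.0] + [0.0] * 10
m, r = 3, 3.0
frags = [(0, 12), (10, 21)]

# Per-fragment candidate selection, then an "all-to-all"-style union.
global_cand = set().union(*(local_candidates(ts, f, m, r) for f in frags))

# Each "node" refines the global set; the "master" intersects the results.
discords = set.intersection(*(refine(ts, f, global_cand, m, r) for f in frags))
```

The intersection at the end mirrors the master-node step in the summary: a candidate is a true discord only if it keeps its distance from every non-overlapping subsequence in every fragment, i.e. it survives refinement on all nodes.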
ISSN: 1995-0802; 1818-9962
DOI: 10.1134/S1995080225606198