A Case Study of Data Management Challenges Presented in Large-Scale Machine Learning Workflows
Published in | 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid) pp. 71 - 81 |
---|---|
Main Authors | , , , , , , , |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.05.2023 |
Subjects | |
Summary: | Running scientific workflow applications on high-performance computing (HPC) systems provides promising results in terms of accuracy and scalability. An example is particle track reconstruction research in high-energy physics, which consists of multiple machine-learning tasks. However, as modern HPC systems scale up, researchers spend more effort on coordinating the individual workflow tasks due to their increasing demands on computational power, large memory footprints, and data movement among various storage devices. These issues are further exacerbated when intermediate result data must be shared among different tasks, each of which is optimized to fulfill its own design goals, such as the shortest time or minimal memory footprint. In this paper, we investigate the data management challenges presented in scientific workflows. We observe that individual tasks, such as data generation, data curation, model training, and inference, often use data layouts that are best only for their own I/O performance but orthogonal to the needs of their successor tasks. We propose various solutions that employ alternative data structures and layouts chosen with two consecutively running workflow tasks in mind. Our experimental results show up to a 16.46x speedup in initialization time and a 3.42x speedup in I/O time, compared to previous approaches. |
---|---|
DOI: | 10.1109/CCGrid57682.2023.00017 |