Dataset Placement and Data Loading Optimizations for Cloud-Native Deep Learning Workloads

Bibliographic Details
Published in	2023 IEEE 26th International Symposium on Real-Time Distributed Computing (ISORC), pp. 107-116
Main Authors	Kang, Zhuangwei; Min, Ziran; Zhou, Shuang; Barve, Yogesh D.; Gokhale, Aniruddha
Format	Conference Proceeding
Language	English
Published	IEEE 01.05.2023
Subjects	Cache System; Cloudnative Data Management; Deep learning; Deep Learning Training; Graphics processing units; Loading; Runtime; System Software; Systematics; Time-frequency analysis; Training

Summary:	The primary challenge facing cloud-based deep learning (DL) systems is efficiently orchestrating large-scale datasets in diverse formats while providing high-performance data loading. To that end, we present DLCache, a cloud-native dataset management and runtime-aware data-loading solution for DL training jobs. DLCache supports the low-latency, high-throughput I/O requirements of DL training jobs that use cloud buckets as persistent data storage and a dedicated computation cluster for training. DLCache comprises four layers: a control plane, a metadata plane, an operator plane, and a multi-tier storage plane. These layers integrate with the Kubernetes ecosystem, which provides ease of deployment, scalability, and self-healing. For efficient memory utilization, DLCache uses an on-the-fly, best-effort caching mechanism that auto-scales the cache according to runtime configurations, resource constraints, and training speeds. When making cache eviction decisions, DLCache considers the frequency and freshness of data access as well as data preparation costs, which reduces completion time for DL workloads. Evaluations of DLCache on the ImageNet-ILSVRC and LibriSpeech datasets, under various runtime configurations and with simulated GPU computation times, showed improvements in data loading throughput of up to 147.49% and 156.67%, respectively, compared to the popular PyTorch framework.
ISSN:	2770-162X
DOI:	10.1109/ISORC58943.2023.00023
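
Illustrative sketch (not from the paper): the summary states that eviction weighs access frequency, access freshness, and data preparation cost, but this record does not give the actual policy. The Python below is only a minimal sketch of one way those three signals could be combined, under stated assumptions. The class names, the scoring formula (hits times preparation cost, discounted by age), and the logical clock are all hypothetical and should not be read as DLCache's implementation.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CacheEntry:
    key: str
    size_bytes: int
    prep_cost_s: float    # assumed cost to re-fetch and re-decode the item
    hits: int = 1         # access frequency
    last_access: int = 0  # logical clock tick of the latest access (freshness)


class CostAwareCache:
    """Hypothetical best-effort cache; tracks metadata only, not payloads."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.clock = 0
        self.entries: Dict[str, CacheEntry] = {}

    def _score(self, entry: CacheEntry) -> float:
        # Assumed formula: frequent, expensive, recently used items score high.
        age = self.clock - entry.last_access
        return entry.hits * entry.prep_cost_s / (1 + age)

    def get(self, key: str) -> bool:
        """Record an access; return True on a hit and False on a miss."""
        self.clock += 1
        entry = self.entries.get(key)
        if entry is None:
            return False
        entry.hits += 1
        entry.last_access = self.clock
        return True

    def put(self, key: str, size_bytes: int, prep_cost_s: float) -> bool:
        """Admit an item, evicting the lowest-scoring entries if needed."""
        self.clock += 1
        if key in self.entries:
            return True
        if size_bytes > self.capacity_bytes:
            return False  # best effort: skip items that can never fit
        while self.used_bytes + size_bytes > self.capacity_bytes:
            victim = min(self.entries.values(), key=self._score)
            self.used_bytes -= victim.size_bytes
            del self.entries[victim.key]
        self.entries[key] = CacheEntry(
            key, size_bytes, prep_cost_s, last_access=self.clock
        )
        self.used_bytes += size_bytes
        return True
```

Under this assumed scoring, an item that is cheap to prepare can be evicted before a costlier one even if the cheap item was accessed more often and more recently, which is the behavior that separates a cost-aware policy from plain LRU or LFU.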