Dataset Placement and Data Loading Optimizations for Cloud-Native Deep Learning Workloads

The primary challenge facing cloud-based deep learning systems is the need for efficient orchestration of large-scale datasets with diverse data formats and provisioning of high-performance data loading capabilities. To that end, we present DLCache, a cloud-native dataset management and runtime-awar...

Full description

Saved in:
Bibliographic Details
Published in2023 IEEE 26th International Symposium on Real-Time Distributed Computing (ISORC) pp. 107 - 116
Main Authors Kang, Zhuangwei, Min, Ziran, Zhou, Shuang, Barve, Yogesh D., Gokhale, Aniruddha
Format Conference Proceeding
LanguageEnglish
Published IEEE 01.05.2023
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:The primary challenge facing cloud-based deep learning systems is the need for efficient orchestration of large-scale datasets with diverse data formats and provisioning of high-performance data loading capabilities. To that end, we present DLCache, a cloud-native dataset management and runtime-aware data-loading solution for deep learning training jobs. DLCache supports the low-latency and high-throughput I/O requirements of DL training jobs using cloud buckets as persistent data storage and a dedicated computation cluster for training. DLCache comprises four layers: a control plane, a metadata plane, an operator plane, and a multi-tier storage plane, which are seamlessly integrated with the Kubernetes ecosystem thereby providing ease of deployment, scalability, and self-healing. For efficient memory utilization, DLCache is designed with an on-the-fly and best-effort caching mechanism that can auto-scale the cache according to runtime configurations, resource constraints, and training speeds. DLCache considers both frequency and freshness of data access as well as data preparation costs in making effective cache eviction decisions that result in reduced completion time for deep learning workloads. Results of evaluating DLCache on the Imagenet-ILSVRC and LibriSpeech datasets under various runtime configurations and simulated GPU computation time experiments showed up to a 147.49% and 156.67% improvement in data loading throughput, respectively, compared to the popular PyTorch framework.
ISSN:2770-162X
DOI:10.1109/ISORC58943.2023.00023