Access Trends of In-network Cache for Scientific Data

Bibliographic Details
Published in	arXiv.org
Main Authors	Han, Ruize; Sim, Alex; Wu, Kesheng; Monga, Inder; Guok, Chin; Würthwein, Frank; Davila, Diego; Balcas, Justas; Newman, Harvey
Format	Paper; Journal Article
Language	English
Published	Ithaca: Cornell University Library, arXiv.org, 11.05.2022
Subjects	Caching; Communications traffic; Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Learning; Computer Science - Networking and Internet Architecture; Computer Science - Performance; Data retrieval; Machine learning; Traffic volume

More Information
Summary:	Scientific collaborations increasingly rely on large volumes of data for their work, and many of them employ tiered systems to replicate the data to their worldwide user communities. Each user in the community often selects a different subset of data for their analysis tasks; however, members of a research group often work on related research topics that require similar data objects. Thus, a significant amount of data sharing is possible. In this work, we study the access traces of a federated storage cache known as the Southern California Petabyte Scale Cache. By studying the access patterns of this caching system and its potential for reducing network traffic, we aim to explore the predictability of cache use and the potential for more general in-network data caching. Our study shows that this distributed storage cache is able to reduce the network traffic volume by a factor of 2.35 during a part of the study period. We further show that machine learning models could predict cache utilization with an accuracy of 0.88. This demonstrates that such cache usage is predictable, which could be useful for managing complex networking resources such as in-network caching.
ISSN:	2331-8422
DOI:	10.48550/arxiv.2205.05563