Harmonizing Repair and Maintenance in LRC-Coded Storage

Bibliographic Details
Published in: Proceedings - Symposium on Reliable Distributed Systems, pp. 1-11
Main Authors: Cheng, Keyun; Wu, Si; Li, Xiaolu; Lee, Patrick P.C.
Format: Conference Proceeding
Language: English
Published: IEEE, 30.09.2024
ISSN: 2575-8462
DOI: 10.1109/SRDS64841.2024.00012

Summary: Modern storage systems not only introduce data redundancy for fault tolerance, but also conduct regular maintenance operations on storage nodes for system robustness. Erasure coding provides storage-efficient redundancy and has been widely deployed in production, yet it also incurs substantial bandwidth and I/O overhead due to the repair of storage failures. In particular, maintenance operations make storage nodes temporarily unavailable and lead to data unavailability, thereby incurring repair overhead for erasure-coded storage. In this paper, we study Locally Repairable Codes (LRCs), a class of practical repair-efficient erasure codes, and show that there exists an inherent performance trade-off between the repair and maintenance operations of LRCs in data center settings, such that the repair performance in regular (i.e., no-maintenance) and maintenance modes cannot be simultaneously optimized. To this end, we design a configurable data placement scheme that operates along the trade-off subject to fault-tolerance constraints. We prototype our data placement scheme atop Hadoop HDFS and show how it balances the performance trade-off of repair and maintenance operations in real network environments.