A Data Deduplication Framework of Disk Images with Adaptive Block Skipping

Bibliographic Details
Published in	Journal of computer science and technology Vol. 31; no. 4; pp. 820 - 835
Main Authors	Zhou, Bing; Wen, Jiang-Tao
Format	Journal Article
Language	English
Published	New York: Springer US, 01.07.2016; Springer Nature B.V.
Subjects	Algorithms; Artificial Intelligence; Computer Science; Data Structures and Information Theory; Datasets; Disks; Hash based algorithms; Heuristic; Indexing; Information management; Information Systems Applications (incl. Internet); Laboratories; Metadata; Performance evaluation; Random access memory; Regular Paper; Reproduction; Software Engineering; Studies; Theory of Computation; deletion operation; matching process; image data; deletion framework; disk image; adaptive; duplicate data; United States > US; adaptive block skipping; metadata; data deduplication
Summary:	We describe an efficient and easily applicable data deduplication framework that uses heuristic-prediction-based adaptive block skipping on real-world datasets such as disk images, reducing deduplication-related overheads and improving deduplication throughput while maintaining good deduplication efficiency. Under the framework, deduplication operations are skipped for data chunks that heuristic prediction marks as likely non-duplicates. Skipping is combined with a hit-and-matching extension process that identifies duplicates within skipped blocks, and with a hysteresis-based hash indexing process that updates the hash indices for skipped chunks when they are re-encountered. For performance evaluation, the proposed framework was integrated into the existing data domain and sparse indexing deduplication algorithms. Experimental results on a real-world dataset of 1.0 TB of disk images showed that adaptive block skipping significantly reduced deduplication-related overheads: deduplication throughput improved by 30%-80% when deduplication metadata were stored on disk for data domain, and RAM usage fell by 25%-40% with a 15%-20% throughput improvement when an in-RAM sparse index was used in sparse indexing. In both cases, the corresponding reduction in deduplication ratio was below 5%.
Bibliography:	11-2296/TP. Keywords: data deduplication, metadata, adaptive block skipping
ISSN:	1000-9000 1860-4749
DOI:	10.1007/s11390-016-1665-z
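
Illustrative sketch:	The summary describes skipping deduplication work for chunks predicted to be non-duplicates. The Python sketch below is one possible reading of that idea, not the authors' implementation. Everything in it is an assumption made for illustration: the function name dedup_with_skipping, the parameters miss_threshold and max_skip, the rule "start skipping after a run of misses and double the skip window up to a cap", and the pending set used to imitate hysteresis (a skipped chunk is only indexed once it is seen a second time). The hit-and-matching extension step mentioned in the summary is omitted.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Content fingerprint of a chunk (SHA-1 chosen here only for illustration)."""
    return hashlib.sha1(chunk).hexdigest()


def dedup_with_skipping(chunks, miss_threshold=2, max_skip=4):
    """Deduplicate a chunk stream, skipping index lookups after runs of misses.

    Hypothetical parameters (not from the paper):
      miss_threshold: consecutive index misses needed before skipping starts.
      max_skip:       upper bound on the adaptive skip window length.
    """
    index = set()    # fingerprints known to the (expensive) full index
    pending = set()  # fingerprints of chunks that were skipped once
    stats = {"lookups": 0, "duplicates": 0, "stored": 0, "skipped": 0}
    miss_run = 0     # consecutive misses observed on real index lookups
    skip_left = 0    # chunks still to be skipped in the current window
    skip_len = 1     # current skip window length, grows while misses persist

    for chunk in chunks:
        fp = fingerprint(chunk)

        if skip_left > 0:
            # Predicted non-duplicate: store it without consulting the index.
            skip_left -= 1
            stats["skipped"] += 1
            stats["stored"] += 1
            if fp in pending:
                # Assumed hysteresis rule: index a skipped chunk on re-encounter.
                pending.discard(fp)
                index.add(fp)
            else:
                pending.add(fp)
            continue

        stats["lookups"] += 1
        if fp in index:
            # Hit: a duplicate region may follow, so stop skipping.
            stats["duplicates"] += 1
            miss_run = 0
            skip_len = 1
            continue

        # Miss: store and index the chunk, then decide whether to skip ahead.
        stats["stored"] += 1
        index.add(fp)
        pending.discard(fp)
        miss_run += 1
        if miss_run >= miss_threshold:
            skip_left = skip_len
            skip_len = min(skip_len * 2, max_skip)

    return stats


if __name__ == "__main__":
    stream = [f"block-{i}".encode() for i in range(10)]
    print(dedup_with_skipping(stream))
```

The trade-off in this sketch mirrors the one reported in the summary: skipped chunks avoid index lookups, at the cost of occasionally storing a duplicate that falls inside a skip window.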