Titian: Data Provenance Support in Spark

Bibliographic Details
Published in	Proceedings of the VLDB Endowment Vol. 9; no. 3; p. 216
Main Authors	Interlandi, Matteo; Shah, Kshitij; Tetali, Sai Deep; Gulzar, Muhammad Ali; Yoo, Seunghyun; Kim, Miryung; Millstein, Todd; Condie, Tyson
Format	Journal Article
Language	English
Published	United States 01.11.2015

Summary:	Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. Today's DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds, orders of magnitude faster than alternative solutions, while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.
ISSN:	2150-8097
DOI:	10.14778/2850583.2850595
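
Illustration: the summary describes data provenance as tracking data through transformations so that an output can be traced back to the inputs that produced it. The idea can be sketched in plain Python as below. This is a toy model under stated assumptions, not Titian's API and not Spark code: the helper names (with_lineage, flat_map, reduce_by_key, trace_back) and the sample log lines are hypothetical, chosen only to show how lineage sets might be carried alongside records.

```python
from collections import defaultdict

# Toy lineage model (hypothetical helpers; NOT Titian's or Spark's API).
# Every record travels with the set of input positions it was derived from.

def with_lineage(lines):
    """Tag each input line with a one-element lineage set: its own position."""
    return [(line, frozenset([i])) for i, line in enumerate(lines)]

def flat_map(records, fn):
    """Apply fn to each value; every output inherits its parent's lineage."""
    return [(out, src) for value, src in records for out in fn(value)]

def reduce_by_key(records, fn):
    """Combine values per key and union the lineage of all contributors."""
    values, sources = {}, defaultdict(frozenset)
    for (key, value), src in records:
        values[key] = fn(values[key], value) if key in values else value
        sources[key] = sources[key] | src
    return {key: (values[key], sources[key]) for key in values}

def trace_back(result, key, lines):
    """Return the input lines that contributed to the result for key."""
    _, src = result[key]
    return [lines[i] for i in sorted(src)]

# Made-up sample input for the sketch.
log_lines = [
    "INFO job started",
    "ERROR disk full",
    "INFO job finished",
    "ERROR timeout",
]

tagged = with_lineage(log_lines)
pairs = flat_map(tagged, lambda line: [(line.split()[0], 1)])
counts = reduce_by_key(pairs, lambda a, b: a + b)

# Trace the "ERROR" count back to the input lines that produced it.
error_inputs = trace_back(counts, "ERROR", log_lines)
```

Here trace_back plays the role of a backward lineage query: starting from a suspicious output (the ERROR count), it recovers the input records responsible. A system working at scale would need to store and query this lineage far more efficiently than the in-memory sets used in this sketch.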