Titian: Data Provenance Support in Spark
Published in | Proceedings of the VLDB Endowment Vol. 9; no. 3; p. 216 |
---|---|
Main Authors | , , , , , , , |
Format | Journal Article |
Language | English |
Published | United States, 01.11.2015 |
Summary: | Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. Today's DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds, orders of magnitude faster than alternative solutions, while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time. |
---|---|
ISSN: | 2150-8097 |
DOI: | 10.14778/2850583.2850595 |
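
The idea of data provenance described in the summary (tracking data through transformations so that a suspicious output can be traced back to its inputs) can be illustrated with a small, self-contained sketch. This is plain Python, not Spark and not Titian's API or implementation; the helper names (`map_with_lineage`, `reduce_by_key_with_lineage`, `trace_back`) and the sample input are hypothetical and exist only for illustration.

```python
# Toy illustration of lineage tracking; NOT Titian's API or Spark code.
# Every record carries the set of input line IDs that contributed to it.

lines = ["ok 3", "ok 4", "error 900", "ok 5"]  # hypothetical input data
tagged = [({i}, line) for i, line in enumerate(lines)]  # (lineage, value)


def map_with_lineage(records, fn):
    """Apply fn to each value; the lineage set passes through unchanged."""
    return [(lineage, fn(value)) for lineage, value in records]


def reduce_by_key_with_lineage(records, combine):
    """Combine values per key and union the lineage of every contributor."""
    acc = {}
    for lineage, (key, value) in records:
        if key in acc:
            old_lineage, old_value = acc[key]
            acc[key] = (old_lineage | lineage, combine(old_value, value))
        else:
            acc[key] = (set(lineage), value)
    return acc


def parse(line):
    """Split a 'key number' line into a (key, int) pair."""
    key, number = line.split()
    return key, int(number)


parsed = map_with_lineage(tagged, parse)
totals = reduce_by_key_with_lineage(parsed, lambda a, b: a + b)


def trace_back(key):
    """Return the input lines that contributed to the output for key."""
    lineage, _ = totals[key]
    return [lines[i] for i in sorted(lineage)]
```

Calling `trace_back("error")` on this toy data returns the single input line behind the outlier total. A real system has to capture the same associations across distributed stages and shuffles, which is where the capture overhead mentioned in the summary comes from.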