Automatic String Data Validation with Pattern Discovery
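Main Authors | Lin, Xinwei; Zhao, Jing; Di, Peng; Xiao, Chuan; Mao, Rui; Ji, Yan; Onizuka, Makoto; Ding, Zishuo; Shang, Weiyi; Qin, Jianbin |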
---|---|
Format | Journal Article |
Language | English |
Published | 06.08.2024 |
Abstract | In enterprise data pipelines, data insertions occur periodically and may
impact downstream services if data quality issues are not addressed. Typically,
such problems can be investigated and fixed by on-call engineers, but locating
the cause of such problems and fixing errors are often time-consuming.
Therefore, automatic data validation is a better solution to defend the system
and downstream services by enabling early detection of errors and providing
detailed error messages for quick resolution. This paper proposes a
self-validating data management system with automatic pattern discovery
techniques to verify the correctness of semi-structured string data in
enterprise data pipelines. Our solution extracts patterns from historical data
and detects erroneous incoming data in a top-down fashion. High-level
information of historical data is analyzed to discover the format skeleton of
correct values. Fine-grained semantic patterns are then extracted to strike a
balance between generalization and specification of the discovered pattern,
thus covering as many correct values as possible while avoiding over-fitting.
To tackle cold start and rapid data growth, we propose an incremental update
strategy and an example generalization strategy. Experiments on large-scale
industrial and public datasets demonstrate the effectiveness and efficiency of
our method compared to alternative solutions. Furthermore, a case study on an
industrial platform (Ant Group Inc.) with thousands of applications shows that
our system captures meaningful data patterns in daily operations and helps
engineers quickly identify errors. |
---|---|
Author | Lin, Xinwei; Zhao, Jing; Di, Peng; Xiao, Chuan; Mao, Rui; Ji, Yan; Onizuka, Makoto; Ding, Zishuo; Shang, Weiyi; Qin, Jianbin |
BackLink | https://doi.org/10.48550/arXiv.2408.03005 (View paper in arXiv) |
ContentType | Journal Article |
Copyright | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
DOI | 10.48550/arxiv.2408.03005 |
DatabaseName | arXiv Computer Science; arXiv.org |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
OpenAccessLink | https://arxiv.org/abs/2408.03005 |
PublicationDate | 2024-08-06 |
SecondaryResourceType | preprint |
SourceID | arxiv |
SourceType | Open Access Repository |
SubjectTerms | Computer Science - Databases |
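
The abstract describes a top-down approach: first discover a coarse format skeleton from historical values, then refine it with finer-grained constraints, and finally check incoming values against the result. The following Python sketch is only an illustration of that general idea and is not the paper's algorithm. The function names (`tokenize`, `skeleton`, `discover_patterns`, `validate`), the character classes `D` and `L`, the `min_support` threshold, and the sample order IDs are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Split a string into runs of ASCII digits, runs of ASCII letters,
# or single characters of any other kind (treated as separators).
TOKEN_RE = re.compile(r"[0-9]+|[A-Za-z]+|[^0-9A-Za-z]")


def tokenize(value):
    return TOKEN_RE.findall(value)


def token_class(token):
    # Coarse class of a token: D = digits, L = letters, else the literal itself.
    if "0" <= token[0] <= "9":
        return "D"
    if token[0].isascii() and token[0].isalpha():
        return "L"
    return token


def skeleton(value):
    # High-level format skeleton, e.g. ("L", "-", "D", "-", "D").
    return tuple(token_class(t) for t in tokenize(value))


def discover_patterns(history, min_support=0.1):
    """Learn one pattern per sufficiently frequent skeleton (illustrative only)."""
    groups = defaultdict(list)
    for value in history:
        groups[skeleton(value)].append(tokenize(value))

    patterns = {}
    for skel, rows in groups.items():
        if len(rows) / len(history) < min_support:
            continue  # rare skeletons are treated as noise in this sketch
        slots = []
        for position, cls in enumerate(skel):
            tokens = {row[position] for row in rows}
            if len(tokens) == 1:
                # Every historical value agrees here: keep the exact literal.
                slots.append(("literal", next(iter(tokens))))
            else:
                # Values differ: generalize to the class plus a length range.
                lengths = [len(t) for t in tokens]
                slots.append((cls, min(lengths), max(lengths)))
        patterns[skel] = slots
    return patterns


def validate(value, patterns):
    """Return (ok, message) for an incoming value."""
    skel = skeleton(value)
    slots = patterns.get(skel)
    if slots is None:
        return False, f"unknown format skeleton {''.join(skel)!r}"
    for position, (token, slot) in enumerate(zip(tokenize(value), slots)):
        if slot[0] == "literal":
            if token != slot[1]:
                return False, f"token {position}: expected {slot[1]!r}, got {token!r}"
        else:
            _, lo, hi = slot
            if not lo <= len(token) <= hi:
                return False, f"token {position}: length {len(token)} outside [{lo}, {hi}]"
    return True, "ok"


# Hypothetical historical values and a few incoming values to check.
history = ["ORD-2024-0001", "ORD-2024-0002", "ORD-2023-0917", "ORD-2024-1203"]
patterns = discover_patterns(history)
for incoming in ["ORD-2024-0456", "ORD-2024-45", "ord_2024_0456"]:
    print(incoming, validate(incoming, patterns))
```

Keeping a literal where all historical values agree and falling back to a class with a length range elsewhere is one simple way to trade off specificity against generalization. With very few historical examples this sketch will over-fit to literals, which is the kind of cold-start problem the abstract says the paper addresses with its own strategies.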