Automatic String Data Validation with Pattern Discovery

Bibliographic Details
Main Authors	Lin, Xinwei; Zhao, Jing; Di, Peng; Xiao, Chuan; Mao, Rui; Ji, Yan; Onizuka, Makoto; Ding, Zishuo; Shang, Weiyi; Qin, Jianbin
Format	Journal Article
Language	English
Published	2024-08-06
Subjects	Computer Science - Databases

Abstract	In enterprise data pipelines, data insertions occur periodically and may impact downstream services if data quality issues are not addressed. Typically, such problems can be investigated and fixed by on-call engineers, but locating the cause of such problems and fixing errors are often time-consuming. Therefore, automatic data validation is a better solution to defend the system and downstream services by enabling early detection of errors and providing detailed error messages for quick resolution. This paper proposes a self-validating data management system with automatic pattern discovery techniques to verify the correctness of semi-structural string data in enterprise data pipelines. Our solution extracts patterns from historical data and detects erroneous incoming data in a top-down fashion. High-level information of historical data is analyzed to discover the format skeleton of correct values. Fine-grained semantic patterns are then extracted to strike a balance between generalization and specification of the discovered pattern, thus covering as many correct values as possible while avoiding over-fitting. To tackle cold start and rapid data growth, we propose an incremental update strategy and an example generalization strategy. Experiments on large-scale industrial and public datasets demonstrate the effectiveness and efficiency of our method compared to alternative solutions. Furthermore, a case study on an industrial platform (Ant Group Inc.) with thousands of applications shows that our system captures meaningful data patterns in daily operations and helps engineers quickly identify errors.
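
The coarse first step described in the abstract (deriving a format skeleton from historical values and flagging incoming values that do not fit) can be illustrated with a small sketch. This is not the authors' algorithm: the function names, the D/A token classes, and the min_support threshold are assumptions chosen for illustration, and the fine-grained semantic patterns, incremental update, and example generalization mentioned in the abstract are not modeled here.

```python
import re
from collections import Counter

# Hypothetical token classes: a run of digits or a run of ASCII letters.
_TOKEN = re.compile(r"[0-9]+|[A-Za-z]+")


def skeleton(value: str) -> str:
    """Collapse a string to a coarse format skeleton.

    Each run of digits becomes 'D' and each run of letters becomes 'A';
    all other characters (punctuation, whitespace) are kept as literals.
    """
    return _TOKEN.sub(lambda m: "D" if m.group()[0].isdigit() else "A", value)


def discover_skeletons(history, min_support=0.05):
    """Return the skeletons that cover at least min_support of the history.

    min_support is an assumed threshold that drops rare skeletons, which
    are more likely to be historical errors than valid formats.
    """
    counts = Counter(skeleton(v) for v in history)
    total = sum(counts.values())
    return {s for s, c in counts.items() if total and c / total >= min_support}


def validate(value, accepted):
    """Check one incoming value; return (is_valid, message)."""
    s = skeleton(value)
    if s in accepted:
        return True, "ok"
    return False, (
        f"value {value!r} has skeleton {s!r}, "
        f"expected one of {sorted(accepted)}"
    )


if __name__ == "__main__":
    history = ["ORD-1001", "ORD-1002", "ORD-20031", "ORD-7"]
    accepted = discover_skeletons(history)
    for incoming in ["ORD-555", "ORD_555", "555-ORD"]:
        print(incoming, validate(incoming, accepted))
```

A skeleton this coarse accepts any letter run followed by a hyphen and a digit run, so it would also pass a value such as "XYZ-1". Narrowing the pattern without over-fitting to the observed values is the role the abstract assigns to the fine-grained pattern step.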
Copyright	http://arxiv.org/licenses/nonexclusive-distrib/1.0
DOI	10.48550/arxiv.2408.03005
IsPeerReviewed	false
OpenAccessLink	https://arxiv.org/abs/2408.03005
SecondaryResourceType	preprint