Automatic String Data Validation with Pattern Discovery

Bibliographic Details
Main Authors Lin, Xinwei; Zhao, Jing; Di, Peng; Xiao, Chuan; Mao, Rui; Ji, Yan; Onizuka, Makoto; Ding, Zishuo; Shang, Weiyi; Qin, Jianbin
Format Journal Article
Language English
Published 06.08.2024
Subjects Computer Science - Databases
Abstract In enterprise data pipelines, data insertions occur periodically and may impact downstream services if data quality issues are not addressed. Typically, on-call engineers investigate and fix such problems, but locating the cause and fixing the errors is often time-consuming. Automatic data validation is therefore a better way to protect the system and downstream services, because it detects errors early and provides detailed error messages for quick resolution. This paper proposes a self-validating data management system that uses automatic pattern discovery techniques to verify the correctness of semi-structured string data in enterprise data pipelines. Our solution extracts patterns from historical data and detects erroneous incoming data in a top-down fashion. High-level information about the historical data is first analyzed to discover the format skeleton of correct values. Fine-grained semantic patterns are then extracted to strike a balance between generalization and specificity of the discovered pattern, thus covering as many correct values as possible while avoiding over-fitting. To tackle cold start and rapid data growth, we propose an incremental update strategy and an example generalization strategy. Experiments on large-scale industrial and public datasets demonstrate the effectiveness and efficiency of our method compared to alternative solutions. Furthermore, a case study on an industrial platform (Ant Group Inc.) with thousands of applications shows that our system captures meaningful data patterns in daily operations and helps engineers quickly identify errors.
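The top-down idea in the abstract (learn a coarse format skeleton from historical values, then flag incoming values that do not fit) can be illustrated with a small sketch. This is not the authors' algorithm: the function names skeleton, discover_skeletons and validate, the D/L token classes, and the min_support threshold are all assumptions made for illustration, and the fine-grained semantic patterns, incremental updates and example generalization described in the paper are not modeled here.

```python
import re
from collections import Counter

# Hypothetical token classes: a run of digits becomes "D", a run of ASCII
# letters becomes "L"; every other character (separators) is kept verbatim.
_TOKEN = re.compile(r"[0-9]+|[A-Za-z]+")


def skeleton(value: str) -> str:
    """Map a string to a coarse format skeleton in a single pass."""
    return _TOKEN.sub(lambda m: "D" if m.group()[0].isdigit() else "L", value)


def discover_skeletons(history, min_support=0.05):
    """Keep skeletons whose share of the historical values is at least
    min_support (an assumed threshold, not taken from the paper)."""
    counts = Counter(skeleton(v) for v in history)
    total = sum(counts.values())
    return {s for s, c in counts.items() if c / total >= min_support}


def validate(value, accepted):
    """Return (ok, message); the message explains why a value was rejected."""
    s = skeleton(value)
    if s in accepted:
        return True, ""
    return False, f"value {value!r} has skeleton {s!r}, expected one of {sorted(accepted)}"


history = ["2024-08-06", "2024-08-07", "2023-12-31", "2024/01/01"]
accepted = discover_skeletons(history, min_support=0.3)
ok, message = validate("08 Aug 2024", accepted)
```

In this toy history the slash-separated date falls below the support threshold, so only the dash-separated skeleton is accepted, and the rejected value comes back with a message naming the skeleton it had and the ones that were expected. A real system would need finer patterns than this to avoid accepting values such as "1-2-3".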
Author_xml – sequence: 1
  givenname: Xinwei
  surname: Lin
  fullname: Lin, Xinwei
– sequence: 2
  givenname: Jing
  surname: Zhao
  fullname: Zhao, Jing
– sequence: 3
  givenname: Peng
  surname: Di
  fullname: Di, Peng
– sequence: 4
  givenname: Chuan
  surname: Xiao
  fullname: Xiao, Chuan
– sequence: 5
  givenname: Rui
  surname: Mao
  fullname: Mao, Rui
– sequence: 6
  givenname: Yan
  surname: Ji
  fullname: Ji, Yan
– sequence: 7
  givenname: Makoto
  surname: Onizuka
  fullname: Onizuka, Makoto
– sequence: 8
  givenname: Zishuo
  surname: Ding
  fullname: Ding, Zishuo
– sequence: 9
  givenname: Weiyi
  surname: Shang
  fullname: Shang, Weiyi
– sequence: 10
  givenname: Jianbin
  surname: Qin
  fullname: Qin, Jianbin
BackLink https://doi.org/10.48550/arXiv.2408.03005 (View paper in arXiv)
ContentType Journal Article
Copyright http://arxiv.org/licenses/nonexclusive-distrib/1.0
Copyright_xml – notice: http://arxiv.org/licenses/nonexclusive-distrib/1.0
DBID AKY
GOX
DOI 10.48550/arxiv.2408.03005
DatabaseName arXiv Computer Science
arXiv.org
Database_xml – sequence: 1
  dbid: GOX
  name: arXiv.org
  url: http://arxiv.org/find
  sourceTypes: Open Access Repository
DeliveryMethod fulltext_linktorsrc
ExternalDocumentID 2408_03005
GroupedDBID AKY
GOX
ID FETCH-arxiv_primary_2408_030053
IEDL.DBID GOX
IngestDate Thu Aug 08 12:20:23 EDT 2024
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
MergedId FETCHMERGED-arxiv_primary_2408_030053
OpenAccessLink https://arxiv.org/abs/2408.03005
ParticipantIDs arxiv_primary_2408_03005
PublicationCentury 2000
PublicationDate 2024-08-06
PublicationDateYYYYMMDD 2024-08-06
PublicationDate_xml – month: 08
  year: 2024
  text: 2024-08-06
  day: 06
PublicationDecade 2020
PublicationYear 2024
SecondaryResourceType preprint
SourceID arxiv
SourceType Open Access Repository
SubjectTerms Computer Science - Databases
Title Automatic String Data Validation with Pattern Discovery
URI https://arxiv.org/abs/2408.03005
linkProvider Cornell University