METHOD AND SYSTEM FOR CLEANSING AND DE-DUPLICATING DATA

Bibliographic Details
Main Authors DeMARCO Sofia, LAKSHMIKANTHAN Jayant, CASSIDY Hugh
Format Patent
Language English
Published 26.10.2017
Abstract A method and system for cleansing and de-duplicating data in a database are provided. The method includes filtering garbage records from a plurality of records based on data fields and applying cleansing rules to create a cleansed database. Similarity vectors are generated, where each vector corresponds to a pairwise comparison of distinct data entries in the cleansed database. Matching rules are applied to label each vector as matched, unmatched, or unclassified. The method analyzes the vectors labeled as matched and unmatched to train a machine learning model to identify duplicates in the cleansed database. The unclassified vectors are then labeled as matched or unmatched by applying the machine learning model to them. Thereafter, the method processes all the vectors labeled as matched to create clusters of records that are duplicates of each other. Finally, the records in each cluster are merged using predefined consolidated rules to obtain a de-duplicated cleansed database.
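
The steps in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the two-field schema (`name`, `email`), the garbage filter, the cleansing rule, the `hi`/`lo` thresholds, and the longest-value merge rule are all hypothetical, and a nearest-centroid classifier stands in for the unspecified machine learning model. Field similarity uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher
from itertools import combinations

FIELDS = ("name", "email")  # hypothetical schema, not from the patent

def is_garbage(rec):
    # Assumed filter: a record without a usable name is garbage.
    return not rec.get("name", "").strip()

def cleanse(rec):
    # Assumed cleansing rule: trim and lowercase every field.
    return {f: rec.get(f, "").strip().lower() for f in FIELDS}

def similarity_vector(a, b):
    # One similarity score in [0, 1] per field for a pair of records.
    return tuple(SequenceMatcher(None, a[f], b[f]).ratio() for f in FIELDS)

def rule_label(vec, hi=0.9, lo=0.4):
    # Assumed matching rules: every field very similar -> matched,
    # every field very different -> unmatched, anything else -> unclassified.
    if min(vec) >= hi:
        return "matched"
    if max(vec) <= lo:
        return "unmatched"
    return "unclassified"

def train(labeled):
    # Stand-in model: one mean vector (centroid) per rule-assigned label.
    model = {}
    for label in ("matched", "unmatched"):
        vecs = [v for v, l in labeled if l == label]
        if not vecs:
            raise ValueError(f"rules produced no '{label}' training vectors")
        model[label] = tuple(sum(col) / len(vecs) for col in zip(*vecs))
    return model

def predict(model, vec):
    # Assign the label whose centroid is closest in squared Euclidean distance.
    def dist(c):
        return sum((x - y) ** 2 for x, y in zip(vec, c))
    return min(model, key=lambda label: dist(model[label]))

def cluster(n, matched_pairs):
    # Union-find: records linked by any chain of matched pairs share a cluster.
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in matched_pairs:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def merge(records):
    # Assumed merge rule: keep the longest value seen for each field.
    return {f: max((r[f] for r in records), key=len) for f in FIELDS}

def deduplicate(raw):
    clean = [cleanse(r) for r in raw if not is_garbage(r)]
    pairs = list(combinations(range(len(clean)), 2))
    vectors = {p: similarity_vector(clean[p[0]], clean[p[1]]) for p in pairs}
    labels = {p: rule_label(v) for p, v in vectors.items()}
    model = train([(vectors[p], l) for p, l in labels.items()
                   if l != "unclassified"])
    final = {p: predict(model, vectors[p]) if l == "unclassified" else l
             for p, l in labels.items()}
    matched = [p for p, l in final.items() if l == "matched"]
    return [merge([clean[i] for i in g]) for g in cluster(len(clean), matched)]
```

Calling `deduplicate` on a list of dictionaries returns one merged record per cluster. Comparing every pair is quadratic in the number of records, so a sketch like this only suits small inputs; the abstract does not say how the pairwise comparison is scaled.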
ContentType Patent
DBID EVB
DatabaseName esp@cenet
DatabaseTitleList
Database_xml – sequence: 1
  dbid: EVB
  name: esp@cenet
  url: http://worldwide.espacenet.com/singleLineSearch?locale=en_EP
  sourceTypes: Open Access Repository
DeliveryMethod fulltext_linktorsrc
Discipline Medicine
Chemistry
Sciences
Physics
ExternalDocumentID US2017308557A1
GroupedDBID EVB
ID FETCH-epo_espacenet_US2017308557A13
IEDL.DBID EVB
IngestDate Fri Jul 19 14:34:45 EDT 2024
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
MergedId FETCHMERGED-epo_espacenet_US2017308557A13
Notes Application Number: US201715488388
OpenAccessLink https://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20171026&DB=EPODOC&CC=US&NR=2017308557A1
ParticipantIDs epo_espacenet_US2017308557A1
PublicationCentury 2000
PublicationDate 20171026
PublicationDateYYYYMMDD 2017-10-26
PublicationDate_xml – month: 10
  year: 2017
  text: 20171026
  day: 26
PublicationDecade 2010
PublicationYear 2017
RelatedCompanies LeanTaas
RelatedCompanies_xml – name: LeanTaas
SourceID epo
SourceType Open Access Repository
SubjectTerms CALCULATING
COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
COMPUTING
COUNTING
ELECTRIC DIGITAL DATA PROCESSING
PHYSICS
Title METHOD AND SYSTEM FOR CLEANSING AND DE-DUPLICATING DATA
URI https://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20171026&DB=EPODOC&locale=&CC=US&NR=2017308557A1