METHOD AND SYSTEM FOR CLEANSING AND DE-DUPLICATING DATA
Main Authors | DeMARCO Sofia, LAKSHMIKANTHAN Jayant, CASSIDY Hugh |
---|---|
Format | Patent |
Language | English |
Published | 26.10.2017 |
Subjects | |
Abstract | A method and system for cleansing and de-duplicating data in a database are provided. The method includes filtering garbage records from a plurality of records based on data fields, and applying cleansing rules to create a cleansed database. Similarity vectors are generated, where each vector corresponds to a pairwise comparison of distinct data entries in the cleansed database. Matching rules are applied to label each vector as matched, unmatched, or unclassified. The method analyzes the vectors labeled as matched and unmatched to train a machine learning model to identify duplicates in the cleansed database. Unclassified vectors are then labeled as matched or unmatched by applying the machine learning model to them. Thereafter, the method processes all the vectors labeled as matched to create clusters of records that are duplicates of each other. Finally, the records in each cluster are merged, using predefined consolidated rules, to obtain a de-duplicated cleansed database. |
---|---|
Author | DeMARCO Sofia; LAKSHMIKANTHAN Jayant; CASSIDY Hugh |
Author_xml | – fullname: DeMARCO Sofia – fullname: LAKSHMIKANTHAN Jayant – fullname: CASSIDY Hugh |
ContentType | Patent |
DBID | EVB |
DatabaseName | esp@cenet |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: EVB name: esp@cenet url: http://worldwide.espacenet.com/singleLineSearch?locale=en_EP sourceTypes: Open Access Repository |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Medicine Chemistry Sciences Physics |
ExternalDocumentID | US2017308557A1 |
GroupedDBID | EVB |
ID | FETCH-epo_espacenet_US2017308557A13 |
IEDL.DBID | EVB |
IngestDate | Fri Jul 19 14:34:45 EDT 2024 |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-epo_espacenet_US2017308557A13 |
Notes | Application Number: US201715488388 |
OpenAccessLink | https://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20171026&DB=EPODOC&CC=US&NR=2017308557A1 |
ParticipantIDs | epo_espacenet_US2017308557A1 |
PublicationCentury | 2000 |
PublicationDate | 20171026 |
PublicationDateYYYYMMDD | 2017-10-26 |
PublicationDate_xml | – month: 10 year: 2017 text: 20171026 day: 26 |
PublicationDecade | 2010 |
PublicationYear | 2017 |
RelatedCompanies | LeanTaas |
RelatedCompanies_xml | – name: LeanTaas |
Score | 3.1088417 |
SourceID | epo |
SourceType | Open Access Repository |
SubjectTerms | CALCULATING; COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS; COMPUTING; COUNTING; ELECTRIC DIGITAL DATA PROCESSING; PHYSICS |
Title | METHOD AND SYSTEM FOR CLEANSING AND DE-DUPLICATING DATA |
URI | https://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20171026&DB=EPODOC&locale=&CC=US&NR=2017308557A1 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
linkProvider | European Patent Office |
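
The pipeline summarized in the abstract (filter and cleanse, build pairwise similarity vectors, label them with rules, train a model on the labeled vectors, classify the remainder, cluster, merge) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the `name` and `phone` fields, the thresholds in `rule_label`, the nearest-centroid classifier standing in for the unspecified machine learning model, and the consolidation rule that keeps the longest name and the first non-empty phone number.

```python
from difflib import SequenceMatcher
from itertools import combinations

def cleanse(records):
    """Drop garbage records and normalise fields (hypothetical rules)."""
    cleaned = []
    for rec in records:
        name = " ".join(rec.get("name", "").lower().split())
        if not name:  # assumed garbage rule: a record without a name is unusable
            continue
        phone = "".join(ch for ch in rec.get("phone", "") if ch.isdigit())
        cleaned.append({"name": name, "phone": phone})
    return cleaned

def similarity_vector(a, b):
    """One vector per record pair: (name similarity, phone equality)."""
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    phone_eq = 1.0 if a["phone"] and a["phone"] == b["phone"] else 0.0
    return (name_sim, phone_eq)

def rule_label(vec):
    """Matching rules with assumed thresholds; everything else is left open."""
    name_sim, phone_eq = vec
    if name_sim >= 0.9 and phone_eq == 1.0:
        return "matched"
    if name_sim < 0.4 and phone_eq == 0.0:
        return "unmatched"
    return "unclassified"

def train(labelled):
    """Stand-in model: one centroid per class (assumes both classes occur)."""
    model = {}
    for label in ("matched", "unmatched"):
        vecs = [v for v, l in labelled if l == label]
        model[label] = tuple(sum(col) / len(vecs) for col in zip(*vecs))
    return model

def predict(model, vec):
    """Assign the class whose centroid is nearest (squared Euclidean distance)."""
    def dist(label):
        return sum((x - y) ** 2 for x, y in zip(vec, model[label]))
    return min(model, key=dist)

def deduplicate(records):
    recs = cleanse(records)
    pairs = {(i, j): similarity_vector(recs[i], recs[j])
             for i, j in combinations(range(len(recs)), 2)}
    labels = {p: rule_label(v) for p, v in pairs.items()}
    model = train([(pairs[p], l) for p, l in labels.items() if l != "unclassified"])
    for p, l in list(labels.items()):
        if l == "unclassified":
            labels[p] = predict(model, pairs[p])

    # Cluster records connected by "matched" pairs using union-find.
    parent = list(range(len(recs)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (i, j), l in labels.items():
        if l == "matched":
            parent[find(i)] = find(j)
    clusters = {}
    for i, rec in enumerate(recs):
        clusters.setdefault(find(i), []).append(rec)

    # Assumed consolidation rule: longest name, first non-empty phone.
    return [{"name": max((r["name"] for r in c), key=len),
             "phone": next((r["phone"] for r in c if r["phone"]), "")}
            for c in clusters.values()]

records = [
    {"name": "John Smith", "phone": "555-1234"},
    {"name": "john  smith", "phone": "5551234"},
    {"name": "J. Smith", "phone": "555 1234"},
    {"name": "Alice Wong", "phone": "555-9999"},
    {"name": "", "phone": "000"},
]
result = deduplicate(records)
```

In this sample, the record with an empty name is filtered out, the two "John Smith" variants are matched by rule, and the "J. Smith" pairs fall between the thresholds, so their label comes from the stand-in model. A production system would likely use richer comparison features, blocking to avoid comparing every pair, and a stronger classifier.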