METHOD AND SYSTEM FOR CLEANSING AND DE-DUPLICATING DATA
Method and system for cleansing and de-duplicating data in database are provided. The method includes filtering garbage records from a plurality of records based on data fields, and applying cleansing rules to create a cleansed database. A similarity vector is generated, where each vector correspond...
Saved in:
Main Authors | , , |
---|---|
Format | Patent |
Language | English |
Published |
26.10.2017
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Method and system for cleansing and de-duplicating data in database are provided. The method includes filtering garbage records from a plurality of records based on data fields, and applying cleansing rules to create a cleansed database. A similarity vector is generated, where each vector corresponds to pairwise comparison of distinct data entries in cleansed database. Matching rules are applied to label each vector as one of matched, unmatched and unclassified. The method analyzes the vectors labeled as matched and unmatched to train a machine learning model to identify duplicates in the cleansed database. Unclassified vectors in the cleansed database are labeled as matched or unmatched by applying machine learning model on unclassified vectors. Thereafter, the method processes all the vectors labeled as matched to create clusters of records that are duplicates of each other. Further, records in each cluster are merged to obtain de-duplicated cleansed database using predefined consolidated rules. |
---|---|
Bibliography: | Application Number: US201715488388 |