Enhancing Data Quality at ETL Stage of Data Warehousing
Data usually comes into data warehouses from multiple sources having different formats and are specifically categorized into three groups (i.e., structured, semi-structured, and unstructured). Various data mining technologies are used to collect, refine, and analyze the data which further leads to t...
Saved in:
Published in | International journal of data warehousing and mining Vol. 17; no. 1; pp. 74 - 91 |
---|---|
Main Authors | , |
Format | Journal Article |
Language | English |
Published |
Hershey
IGI Global
01.01.2021
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Data usually comes into data warehouses from multiple sources having different formats and are specifically categorized into three groups (i.e., structured, semi-structured, and unstructured). Various data mining technologies are used to collect, refine, and analyze the data which further leads to the problem of data quality management. Data purgation occurs when the data is subject to ETL methodology in order to maintain and improve the data quality. The data may contain unnecessary information and may have inappropriate symbols which can be defined as dummy values, cryptic values, or missing values. The present work has improved the expectation-maximization algorithm with dot product to handle cryptic data, DBSCAN method with Gower metrics to ensure dummy values, Wards algorithm with Minkowski distance to improve the results of contradicting data and K-means algorithm along with Euclidean distance metrics to handle missing values in a dataset. These distance metrics have improved the data quality and also helped in providing consistent data to be loaded into a data warehouse. |
---|---|
Bibliography: | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 |
ISSN: | 1548-3924 1548-3932 |
DOI: | 10.4018/IJDWM.2021010105 |