Enhancing Data Quality at ETL Stage of Data Warehousing

Bibliographic Details
Published in	International Journal of Data Warehousing and Mining, Vol. 17, no. 1, pp. 74-91
Main Authors	Gupta, Neha; Jolly, Sakshi
Format	Journal Article
Language	English
Published	Hershey: IGI Global, 01.01.2021
Subjects	Algorithms; Data mining; Data warehouses; Euclidean geometry; Quality management; Unstructured data

Summary:	Data usually arrives in a data warehouse from multiple sources in different formats and falls into three categories: structured, semi-structured, and unstructured. Various data mining technologies are used to collect, refine, and analyze this data, which in turn raises the problem of data quality management. Data purgation takes place when the data is put through the ETL process, in order to maintain and improve data quality. The data may contain unnecessary information and inappropriate symbols, which can be classified as dummy values, cryptic values, or missing values. The present work improves the expectation-maximization algorithm with the dot product to handle cryptic data, the DBSCAN method with the Gower metric to handle dummy values, Ward's algorithm with the Minkowski distance to improve results on contradicting data, and the K-means algorithm with the Euclidean distance metric to handle missing values in a dataset. These distance metrics improved the data quality and helped provide consistent data to be loaded into a data warehouse.
ISSN:	1548-3924, 1548-3932
DOI:	10.4018/IJDWM.2021010105
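
Illustration: the Summary names K-means with the Euclidean distance as the technique applied to missing values, but this record does not describe the procedure itself. The following is only a minimal sketch of one common way such an approach can work, assuming complete rows are clustered first and each incomplete row is then filled from its nearest centroid, measured over the fields it does have. It is not the authors' improved algorithm. The function names (euclidean, init_centroids, kmeans, impute), the farthest-point initialization, and the sample records are all hypothetical.

```python
import math

def euclidean(a, b, dims):
    """Euclidean distance between a and b, restricted to the given dimensions."""
    return math.sqrt(sum((a[d] - b[d]) ** 2 for d in dims))

def init_centroids(rows, k):
    """Deterministic farthest-point start (an assumption made for this sketch)."""
    dims = range(len(rows[0]))
    centroids = [list(rows[0])]
    while len(centroids) < k:
        # Pick the row whose nearest existing centroid is farthest away.
        farthest = max(rows, key=lambda r: min(euclidean(r, c, dims) for c in centroids))
        centroids.append(list(farthest))
    return centroids

def kmeans(rows, k, iterations=20):
    """Plain k-means on complete rows; returns a list of k centroids."""
    dims = range(len(rows[0]))
    centroids = init_centroids(rows, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for row in rows:
            nearest = min(range(k), key=lambda c: euclidean(row, centroids[c], dims))
            clusters[nearest].append(row)
        for c, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def impute(rows, k=2):
    """Replace None entries with the matching coordinate of the nearest centroid."""
    complete = [r for r in rows if None not in r]
    centroids = kmeans(complete, k)
    filled = []
    for row in rows:
        # Compare against centroids using only the fields that are present.
        observed = [d for d, v in enumerate(row) if v is not None]
        best = min(centroids, key=lambda c: euclidean(row, c, observed))
        filled.append([best[d] if v is None else v for d, v in enumerate(row)])
    return filled

# Hypothetical sample: two well-separated groups plus two incomplete rows.
records = [
    [1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
    [9.0, 10.0], [9.2, 9.8], [8.8, 10.2],
    [1.1, None], [None, 10.1],
]
cleaned = impute(records, k=2)
for row in cleaned:
    print(row)
```

In this sketch the row [1.1, None] is matched to the low-valued group and the row [None, 10.1] to the high-valued group, so each gap is filled from the centroid of the group it most resembles. A real ETL stage would also need to scale the fields and deal with categorical values, which this sketch leaves out.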