Enhancing Data Quality at ETL Stage of Data Warehousing

Bibliographic Details
Published in International journal of data warehousing and mining Vol. 17; no. 1; pp. 74 - 91
Main Authors Gupta, Neha; Jolly, Sakshi
Format Journal Article
Language English
Published Hershey IGI Global 01.01.2021
Summary: Data usually arrives in data warehouses from multiple sources in different formats and falls into three groups: structured, semi-structured, and unstructured. Various data mining technologies are used to collect, refine, and analyze the data, which in turn leads to the problem of data quality management. Data purgation occurs when the data is subjected to the ETL methodology in order to maintain and improve data quality. The data may contain unnecessary information and inappropriate symbols, which can be classified as dummy values, cryptic values, or missing values. The present work improves the expectation-maximization algorithm with dot product to handle cryptic data, the DBSCAN method with the Gower metric to handle dummy values, Ward's algorithm with Minkowski distance to improve the results for contradicting data, and the K-means algorithm with the Euclidean distance metric to handle missing values in a dataset. These distance metrics improved data quality and helped provide consistent data to be loaded into a data warehouse.
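
To make the last technique in the summary concrete, the following is a minimal sketch of how K-means with Euclidean distance could be used to fill missing values during the transform step of ETL. It is not the authors' improved algorithm, whose details the summary does not give. The function names (`euclidean`, `kmeans`, `impute`), the seeding of centroids from the first k complete rows, and the sample data are all assumptions made for illustration.

```python
import math

def euclidean(a, b, dims):
    """Euclidean distance between rows a and b over the given column indices."""
    return math.sqrt(sum((a[d] - b[d]) ** 2 for d in dims))

def kmeans(rows, k, iterations=20):
    """Plain Lloyd's k-means on complete rows.

    Assumption: the first k rows seed the centroids, which keeps the
    sketch deterministic; real pipelines usually use a smarter seeding.
    """
    k = min(k, len(rows))
    centroids = [list(r) for r in rows[:k]]
    dims = range(len(rows[0]))
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for r in rows:
            nearest = min(range(k), key=lambda c: euclidean(r, centroids[c], dims))
            clusters[nearest].append(r)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def impute(rows, k=2):
    """Fill None entries with the matching coordinate of the nearest centroid.

    Centroids are fitted on complete rows only (at least one is required);
    distance for an incomplete row is measured over its observed columns.
    """
    complete = [r for r in rows if None not in r]
    centroids = kmeans(complete, k)
    filled = []
    for r in rows:
        observed = [d for d, v in enumerate(r) if v is not None]
        best = min(centroids, key=lambda c: euclidean(r, c, observed))
        filled.append([best[d] if v is None else v for d, v in enumerate(r)])
    return filled

# Hypothetical two-column sample with two obvious groups and two gaps.
data = [
    [1.0, 2.0],
    [9.0, 10.0],
    [1.2, 1.8],
    [9.2, 9.8],
    [1.1, None],
    [None, 10.1],
]
result = impute(data, k=2)
```

In this sketch each incomplete row is matched to a cluster using only the columns it actually has, so a single missing field does not prevent the row from being loaded. Rows that are already complete pass through unchanged.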
ISSN: 1548-3924, 1548-3932
DOI: 10.4018/IJDWM.2021010105