Enhancing Data Quality at ETL Stage of Data Warehousing

Bibliographic Details
Published in International journal of data warehousing and mining Vol. 17; no. 1; pp. 74-91
Main Authors Gupta, Neha; Jolly, Sakshi
Format Journal Article
Language English
Published Hershey: IGI Global, 01.01.2021
ISSN 1548-3924
EISSN 1548-3932
DOI 10.4018/IJDWM.2021010105

Abstract Data usually arrives in data warehouses from multiple sources in different formats and falls into three groups: structured, semi-structured, and unstructured. Various data mining technologies are used to collect, refine, and analyze the data, which in turn raises the problem of data quality management. Data purgation occurs when the data is put through the ETL methodology in order to maintain and improve data quality. The data may contain unnecessary information and inappropriate symbols, which can be classed as dummy values, cryptic values, or missing values. The present work improves the expectation-maximization algorithm with the dot product to handle cryptic data, the DBSCAN method with the Gower metric to handle dummy values, Ward's algorithm with the Minkowski distance to improve results on contradicting data, and the K-means algorithm with the Euclidean distance metric to handle missing values in a dataset. These distance metrics improved data quality and helped provide consistent data for loading into a data warehouse.
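Illustration (not from the article): the Gower metric named in the abstract compares records that mix numeric and categorical fields. The sketch below is a minimal pure-Python version under stated assumptions; the function name `gower_distance`, the `ranges` argument, and the sample records are hypothetical and do not reproduce the authors' implementation.

```python
def gower_distance(a, b, ranges):
    """Gower distance between two records given as equal-length lists.

    ranges[i] is the observed range (max - min) of a numeric column,
    or None when the column is categorical (an assumption of this sketch).
    Fields that are None in either record are skipped.
    """
    total, used = 0.0, 0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            continue  # a missing field contributes nothing
        if r is None:
            # categorical column: 0 if equal, 1 otherwise
            total += 0.0 if x == y else 1.0
        elif r > 0:
            # numeric column: absolute difference scaled by the range
            total += abs(x - y) / r
        # a numeric column with zero range contributes 0
        used += 1
    return total / used if used else 0.0


# Hypothetical records: [age, city]; age has range 20, city is categorical.
d = gower_distance([30, "NY"], [40, "NY"], [20, None])
```

A matrix of such pairwise distances could then be passed to a density-based clusterer that accepts precomputed distances, which is one plausible way to pair Gower with DBSCAN; the article itself should be consulted for the authors' actual procedure.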
Author Gupta, Neha
Jolly, Sakshi
AuthorAffiliation Manav Rachna International Institute of Research and Studies, Faridabad, India
CitedBy_id crossref_primary_10_1109_ACCESS_2022_3148131
ContentType Journal Article
Copyright Copyright © 2021, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
Discipline Computer Science
EISSN 1548-3932
EndPage 91
IsPeerReviewed true
IsScholarly true
Issue 1
ORCID 0000-0003-0905-5457
PageCount 18
PublicationDate 2021-01-01T00:00:00
PublicationPlace Hershey
PublicationTitle International journal of data warehousing and mining
PublicationYear 2021
Publisher IGI Global
StartPage 74
SubjectTerms Algorithms
Data mining
Data warehouses
Euclidean geometry
Quality management
Unstructured data
Title Enhancing Data Quality at ETL Stage of Data Warehousing
URI http://services.igi-global.com/resolvedoi/resolve.aspx?doi=10.4018/IJDWM.2021010105
https://www.proquest.com/docview/2931887008