Enhancing Data Quality at ETL Stage of Data Warehousing
Published in | International journal of data warehousing and mining Vol. 17; no. 1; pp. 74 - 91 |
---|---|
Main Authors | Gupta, Neha; Jolly, Sakshi |
Format | Journal Article |
Language | English |
Published | Hershey, IGI Global, 01.01.2021 |
ISSN | 1548-3924 1548-3932 |
DOI | 10.4018/IJDWM.2021010105 |
Abstract | Data usually arrives in data warehouses from multiple sources in different formats and falls into three categories: structured, semi-structured, and unstructured. Various data mining technologies are used to collect, refine, and analyze the data, which in turn raises the problem of data quality management. Data purgation occurs when the data is subjected to the ETL methodology in order to maintain and improve data quality. The data may contain unnecessary information and inappropriate symbols, which can be classified as dummy values, cryptic values, or missing values. The present work improves the expectation-maximization algorithm with the dot product to handle cryptic data, the DBSCAN method with the Gower metric to handle dummy values, Ward's algorithm with the Minkowski distance to improve the results for contradicting data, and the K-means algorithm with the Euclidean distance metric to handle missing values in a dataset. These distance metrics improved the data quality and helped provide consistent data to be loaded into a data warehouse. |
---|---|
Author | Gupta, Neha; Jolly, Sakshi |
AuthorAffiliation | Manav Rachna International Institute of Research and Studies, Faridabad, India |
Copyright | Copyright © 2021, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited. |
SubjectTerms | Algorithms; Data mining; Data warehouses; Euclidean geometry; Quality management; Unstructured data |