A missing data imputation approach using clustering and maximum likelihood estimation

Bibliographic Details
Published in 2017 Medical Technologies National Congress (TIPTEKNO), pp. 1-4
Main Authors Albayrak, Muammer; Turhan, Kemal; Kurt, Burcin
Format Conference Proceeding
Language English
Published IEEE 01.10.2017
Abstract Missing data is a data mining problem that adversely affects data analysis and decision making, and it is frequently encountered in healthcare data for a variety of reasons. It remains an important research topic because the success of any imputation method depends on many factors, such as the characteristics of the data and the type of missingness. In this study, an approach to the missing data problem based on clustering and maximum likelihood estimation (MLE) is proposed. To test the proposed method, the "Mesothelioma" data set, prepared by the Dicle University Medical School and uploaded to the UCI international open-source database, was used. In the first step, new data sets were created from it that follow the missing data patterns Missing completely at random (MCAR), Missing at random (MAR), and Missing not at random (MNAR). In the second step, to increase the computational success of the MLE method, these new data sets are divided into clusters by a k-means clustering process that excludes the 3 features with missing data. In the last step, the features with missing values are added back to the clusters, the missing data are completed with the MLE method within each cluster, and the clusters are merged to obtain the complete data set. The data sets produced by these three steps (data reduction, clustering, and data completion) were compared with the original data set according to the root mean square error (RMSE) criterion, and an average of 96.5% success was achieved.
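The three steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy data, the function names (kmeans, impute, rmse), the choice of k, and the MCAR masking are all assumptions made for illustration. In particular, the MLE step is simplified to a univariate Gaussian model per cluster, for which the maximum likelihood estimate of the mean is the sample mean of the observed values; the paper's actual MLE procedure is not specified in this record.

```python
import math
import random

def kmeans(rows, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign each row to its nearest center (squared Euclidean distance).
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centers[c])))
                  for r in rows]
        # Move each center to the mean of its current members.
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def impute(data, missing_cols, k=2, seed=0):
    """Cluster on the fully observed columns, then fill None cells per cluster."""
    complete_cols = [j for j in range(len(data[0])) if j not in missing_cols]
    labels = kmeans([[row[j] for j in complete_cols] for row in data], k, seed=seed)
    filled = [list(row) for row in data]
    for j in missing_cols:
        observed_all = [row[j] for row in data if row[j] is not None]
        for c in range(k):
            obs = [row[j] for row, l in zip(data, labels)
                   if l == c and row[j] is not None]
            # Simplifying assumption: univariate Gaussian per cluster, whose MLE
            # mean is the sample mean. Fall back to the global mean if the
            # cluster has no observed value in this column.
            src = obs or observed_all
            mu = sum(src) / len(src)
            for i, l in enumerate(labels):
                if l == c and filled[i][j] is None:
                    filled[i][j] = mu
    return filled

def rmse(original, filled, mask):
    """Root mean square error over the (row, col) cells listed in mask."""
    return math.sqrt(sum((original[i][j] - filled[i][j]) ** 2 for i, j in mask)
                     / len(mask))

# Hypothetical toy data: two groups of 20 rows, three numeric features.
rng = random.Random(1)
original = [[rng.gauss(m, 0.1), rng.gauss(m, 0.1), rng.gauss(10 * m, 0.5)]
            for m in (0.0, 5.0) for _ in range(20)]

# MCAR: hide 8 of the 40 values in column 2, chosen uniformly at random.
mask = [(i, 2) for i in rng.sample(range(len(original)), 8)]
data = [list(r) for r in original]
for i, j in mask:
    data[i][j] = None

filled = impute(data, missing_cols=[2], k=2)
error = rmse(original, filled, mask)
print(f"RMSE on imputed cells: {error:.3f}")
```

Clustering only on the fully observed features mirrors the abstract's description of excluding the features with missing data from k-means. MAR and MNAR masks would need the removal probability to depend on observed or on the hidden values respectively, which this sketch does not attempt.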
Author_xml – sequence: 1
  givenname: Muammer
  surname: Albayrak
  fullname: Albayrak, Muammer
  email: m.albayrak@ktu.edu.tr
  organization: Dept. of Biostat. & Med. Inf., Karadeniz Tech. Univ., Trabzon, Turkey
– sequence: 2
  givenname: Kemal
  surname: Turhan
  fullname: Turhan, Kemal
  email: kemalturhan@ktu.edu.tr
  organization: Dept. of Biostat. & Med. Inf., Karadeniz Tech. Univ., Trabzon, Turkey
– sequence: 3
  givenname: Burcin
  surname: Kurt
  fullname: Kurt, Burcin
  email: burcinkurt@ktu.edu.tr
  organization: Dept. of Biostat. & Med. Inf., Karadeniz Tech. Univ., Trabzon, Turkey
DOI 10.1109/TIPTEKNO.2017.8238064
EISBN 153860633X
9781538606339
SubjectTerms Clustering
Clustering methods
CMLE approach
Data analysis
Data mining
Decision making
Maximum likelihood estimation
Missing data
MLE
Proteins
Root mean square
URI https://ieeexplore.ieee.org/document/8238064