A missing data imputation approach using clustering and maximum likelihood estimation
Published in | 2017 Medical Technologies National Congress (TIPTEKNO) pp. 1 - 4 |
---|---|
Main Authors | Albayrak, Muammer; Turhan, Kemal; Kurt, Burcin |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.10.2017 |
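Subjects | Clustering; Clustering methods; CMLE approach; Data analysis; Data mining; Decision making; Maximum likelihood estimation; Missing data; MLE; Proteins; Root mean square |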
Abstract | Missing data is a common data mining problem, frequently encountered in healthcare data for a variety of reasons, that adversely affects data analysis and decision making. It remains an important research topic because the success of an imputation method depends on many factors, such as the characteristics of the data and the type of missingness. This study proposes an approach to the missing data problem based on clustering and maximum likelihood estimation (MLE). The proposed method was tested on the "Mesothelioma" data set, prepared by the Dicle University Medical School and uploaded to the UCI open-access database. In the first step, new data sets were created that follow the missing data patterns missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In the second step, these data sets were divided into clusters by k-means clustering, with the 3 features containing missing data excluded, in order to improve the performance of the MLE method. In the last step, the features with missing values were added back to the clusters, the missing data were completed with the MLE method within each cluster, and the clusters were merged to obtain the complete data set. The data sets obtained from these three steps (data reduction, clustering and data completion) were compared with the original data set according to the root mean square error (RMSE) criterion, and an average success of 96.5% was achieved. |
---|---|
Author | Kurt, Burcin; Albayrak, Muammer; Turhan, Kemal |
Author_xml | 1. Albayrak, Muammer (m.albayrak@ktu.edu.tr), Dept. of Biostat. & Med. Inf., Karadeniz Tech. Univ., Trabzon, Turkey; 2. Turhan, Kemal (kemalturhan@ktu.edu.tr), Dept. of Biostat. & Med. Inf., Karadeniz Tech. Univ., Trabzon, Turkey; 3. Kurt, Burcin (burcinkurt@ktu.edu.tr), Dept. of Biostat. & Med. Inf., Karadeniz Tech. Univ., Trabzon, Turkey |
ContentType | Conference Proceeding |
DOI | 10.1109/TIPTEKNO.2017.8238064 |
EISBN | 153860633X 9781538606339 |
EndPage | 4 |
ExternalDocumentID | 8238064 |
Genre | orig-research |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
PageCount | 4 |
PublicationDate | 2017-Oct. |
PublicationTitle | 2017 Medical Technologies National Congress (TIPTEKNO) |
PublicationTitleAbbrev | TIPTEKNO |
PublicationYear | 2017 |
Publisher | IEEE |
StartPage | 1 |
SubjectTerms | Clustering; Clustering methods; CMLE approach; Data analysis; Data mining; Decision making; Maximum likelihood estimation; Missing data; MLE; Proteins; Root mean square |
Title | A missing data imputation approach using clustering and maximum likelihood estimation |
URI | https://ieeexplore.ieee.org/document/8238064 |
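
The three steps described in the abstract (creating data sets with missing values, clustering on the fully observed features, and completing the missing values per cluster before scoring with RMSE) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record gives no details of their MLE procedure, so the sketch assumes each feature is normally distributed within a cluster and fills a gap with the maximum likelihood estimate of the cluster mean, which is the sample mean of the observed values. The function names (`make_mcar`, `kmeans`, `impute_by_cluster`, `rmse`), the farthest-point seeding, and the toy data are all hypothetical and chosen only for illustration; only the MCAR pattern is shown.

```python
import math
import random


def make_mcar(rows, cols, rate, seed=0):
    """Blank out cells in `cols` independently with probability `rate` (MCAR)."""
    rng = random.Random(seed)
    return [[None if (j in cols and rng.random() < rate) else v
             for j, v in enumerate(row)] for row in rows]


def kmeans(points, k, iters=20):
    """Lloyd's k-means with deterministic farthest-point seeding; returns labels."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def impute_by_cluster(rows, missing_cols, k=2):
    """Cluster on the fully observed columns, then fill None cells per cluster."""
    full_cols = [j for j in range(len(rows[0])) if j not in missing_cols]
    labels = kmeans([[r[j] for j in full_cols] for r in rows], k)
    filled = [list(r) for r in rows]
    for j in missing_cols:
        observed_all = [r[j] for r in rows if r[j] is not None]
        for c in range(k):
            obs = [r[j] for r, lab in zip(rows, labels) if lab == c and r[j] is not None]
            # Assumed model: normal within a cluster, so the MLE of the mean is the
            # sample mean of observed values; fall back to the global mean if the
            # cluster has no observed value for this column.
            src = obs or observed_all
            mu = sum(src) / len(src)
            for i, r in enumerate(rows):
                if labels[i] == c and r[j] is None:
                    filled[i][j] = mu
    return filled


def rmse(truth, estimate, cols):
    """Root mean square error between two complete data sets over `cols`."""
    errs = [(t[j] - e[j]) ** 2 for t, e in zip(truth, estimate) for j in cols]
    return math.sqrt(sum(errs) / len(errs))


# Toy data (hypothetical): two well-separated groups; column 2 will have gaps.
truth = [[0.0, 0.1, 1.0], [0.2, 0.0, 1.2], [0.1, 0.3, 0.8],
         [10.0, 10.1, 5.0], [10.2, 9.9, 5.5], [9.8, 10.0, 4.5]]
holed = [list(r) for r in truth]
holed[0][2] = None
holed[4][2] = None
filled = impute_by_cluster(holed, missing_cols=[2], k=2)
print(rmse(truth, filled, cols=[2]))
```

Clustering first means each gap is filled from rows that resemble it on the observed features, rather than from the whole data set. A fuller treatment would replace the per-cluster mean with a multivariate estimate (for example via expectation-maximization) and would generate MAR and MNAR masks that depend on other columns or on the hidden value itself.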