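An Unsupervised Association-Based Categorical Variable Selection Method for High-Dimensional Categorical Data (고차원 범주형 자료를 위한 비지도 연관성 기반 범주형 변수 선택 방법)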

Bibliographic Details
Published in	品質經營學會誌 Vol. 47; no. 3; pp. 537 - 552
Main Authors	Changki Lee (이창기), Uk Jung (정욱)
Format	Journal Article
Language	Korean
Published	The Korean Society for Quality Management (한국품질경영학회), 30.09.2019
Subjects	Association-based Dissimilarity; Distance Metric; Feature Selection; High-dimensional Categorical Data; Unsupervised Learning; Interdisciplinary Research

More Information
Summary:	Purpose: Advances in information technology have made high-dimensional categorical data easy to use. This study proposes a novel method for selecting appropriate categorical variables in such data. Methods: The proposed feature selection method consists of three steps. (1) The first step defines the goodness-to-pick measure. In this paper, a categorical variable is considered relevant if it is related to other variables; accordingly, the goodness-to-pick measure computes the normalized conditional entropy of a variable with respect to the other variables. (2) The second step finds the relevant feature subset within the original variable set by deciding whether each variable is relevant. (3) The third step eliminates redundant variables from the relevant feature subset. Results: In the experiments, the proposed feature selection method generally yielded better classification performance on high-dimensional categorical data than using no feature selection, especially as the number of irrelevant categorical variables increased. In addition, as the number of irrelevant categorical variables with imbalanced category values increased, the accuracy gap between the proposed method and the existing methods compared widened. Conclusion: The experimental results confirm that the proposed method consistently produces high classification accuracy on high-dimensional categorical data. The proposed method is therefore promising for effective use in high-dimensional settings.
Bibliography:	The Korean Society for Quality Management KISTI1.1003/JNL.JAKO201929860938115
ISSN:	1229-1889 2287-9005
DOI:	10.7469/JKSQM.2019.47.3.537
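
The three steps in the summary can be illustrated with a minimal Python sketch. The record does not give the paper's exact formulas, so everything specific here is an assumption made for illustration: the normalization H(X|Y)/H(X), scoring a variable by its best-associated partner, the two thresholds, the greedy redundancy rule, and all function names. It should not be read as the authors' implementation.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def normalized_conditional_entropy(x, y):
    """H(X|Y) / H(X): 0 when Y determines X, 1 when X is independent of Y."""
    hx = entropy(x)
    if hx == 0:
        return 1.0  # assumption: a constant variable shows no association
    h_x_given_y = entropy(list(zip(x, y))) - entropy(y)  # H(X,Y) - H(Y)
    return h_x_given_y / hx


def goodness_to_pick(data, name):
    """Assumed score: association with the best-matching other variable."""
    nce = [normalized_conditional_entropy(data[name], data[other])
           for other in data if other != name]
    return 1.0 - min(nce, default=1.0)


def select_features(data, relevance_threshold=0.5, redundancy_threshold=0.1):
    # Step 1: score every variable without using class labels.
    scores = {name: goodness_to_pick(data, name) for name in data}
    # Step 2: keep variables whose score reaches the (assumed) threshold.
    relevant = [n for n in data if scores[n] >= relevance_threshold]
    # Step 3: greedily drop variables almost determined by a kept variable.
    selected = []
    for name in sorted(relevant, key=lambda n: -scores[n]):
        if all(normalized_conditional_entropy(data[name], data[kept])
               > redundancy_threshold for kept in selected):
            selected.append(name)
    return scores, selected


# Toy data: "b" is a relabeling of "a"; "noise" is independent of both.
data = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": ["x", "x", "y", "y", "x", "x", "y", "y"],
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],
}
scores, selected = select_features(data)
print(selected)  # → ['a']
```

In this toy data, "noise" is filtered out in step 2 because no other variable reduces its entropy, and "b" is removed in step 3 because "a" fully determines it.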