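An Unsupervised Association-Based Categorical Variable Selection Method for High-Dimensional Categorical Data (고차원 범주형 자료를 위한 비지도 연관성 기반 범주형 변수 선택 방법)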
Published in | 品質經營學會誌 Vol. 47; no. 3; pp. 537 - 552 |
---|---|
Main Authors | , , , |
Format | Journal Article |
Language | Korean |
Published | 한국품질경영학회 30.09.2019 |
Subjects | |
Summary: | Purpose: Advances in information technology have made high-dimensional categorical data easy to use. This study proposes a novel method for selecting relevant categorical variables in high-dimensional categorical data.
Methods: The proposed feature selection method consists of three steps: (1) The first step defines the goodness-to-pick measure. In this paper, a categorical variable is considered relevant if it is related to other variables. Following this definition, the goodness-to-pick measure is computed as the normalized conditional entropy with respect to the other variables. (2) The second step finds the relevant feature subset within the original variable set by deciding, for each variable, whether it is relevant. (3) The third step eliminates redundant variables from the relevant feature subset.
Results: In experiments on high-dimensional categorical data, the proposed feature selection method generally yielded better classification performance than using no feature selection, especially as the number of irrelevant categorical variables increased. In addition, as the number of irrelevant categorical variables with imbalanced category values increased, the accuracy gap between the proposed method and the existing methods it was compared against widened.
Conclusion: The experimental results confirm that the proposed method consistently produces high classification accuracy on high-dimensional categorical data. The proposed method is therefore promising for use in high-dimensional settings. |
---|---|
Bibliography: | The Korean Society for Quality Management KISTI1.1003/JNL.JAKO201929860938115 |
ISSN: | 1229-1889 2287-9005 |
DOI: | 10.7469/JKSQM.2019.47.3.537 |
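
The three steps described in the Methods summary can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the score of a variable is taken to be the smallest normalized conditional entropy H(X | Y) / H(X) over all other variables Y, and the function names, the two thresholds (`relevance_threshold`, `redundancy_threshold`), and the toy data are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def normalized_conditional_entropy(x, y):
    """H(X | Y) / H(X): 0 means Y determines X, 1 means Y tells nothing about X."""
    hx = entropy(x)
    if hx == 0:
        # Assumption: a constant variable is treated as unrelated to everything.
        return 1.0
    # H(X | Y) = H(X, Y) - H(Y); clamp tiny negative rounding errors to 0.
    h_cond = max(0.0, entropy(list(zip(x, y))) - entropy(y))
    return h_cond / hx


def goodness_to_pick(data, name):
    """Step 1 (assumed form): best association of `name` with any other variable."""
    return min(
        normalized_conditional_entropy(data[name], data[other])
        for other in data
        if other != name
    )


def select_features(data, relevance_threshold=0.9, redundancy_threshold=0.1):
    """Steps 2 and 3 under the assumptions stated in the lead-in."""
    # Step 2: keep variables that are sufficiently associated with another variable.
    scores = {name: goodness_to_pick(data, name) for name in data}
    relevant = [name for name in data if scores[name] <= relevance_threshold]

    # Step 3: visit relevant variables from strongest to weakest association and
    # drop any variable that is almost fully determined by one already kept.
    kept = []
    for name in sorted(relevant, key=lambda v: scores[v]):
        is_redundant = any(
            normalized_conditional_entropy(data[name], data[k]) <= redundancy_threshold
            for k in kept
        )
        if not is_redundant:
            kept.append(name)
    return kept


# Hypothetical toy data: "b" is a relabeled copy of "a", "c" is partially
# related to "a", and "d" is unrelated to every other variable.
data = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": ["x", "x", "x", "x", "y", "y", "y", "y"],
    "c": [0, 0, 0, 1, 1, 1, 1, 0],
    "d": [0, 1, 0, 1, 0, 1, 0, 1],
}
selected = select_features(data)
print(selected)
```

On this toy data the unrelated variable `d` is screened out in step 2 and the duplicate `b` is removed in step 3. The threshold values are arbitrary choices for the example; in practice they would need to be tuned or replaced by whatever decision rule the paper actually uses.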