A lightweight filter based feature selection approach for multi-label text classification
Published in | Journal of ambient intelligence and humanized computing Vol. 14; no. 9; pp. 12345 - 12357 |
---|---|
Main Authors | , |
Format | Journal Article |
Language | English |
Published | Berlin/Heidelberg; Springer Berlin Heidelberg; 01.09.2023; Springer Nature B.V |
Subjects | |
Summary: | Multi-label Text Classification (MTC) is a challenging task in Natural Language Processing (NLP) whose goal is to assign a set of labels to a document. Applying term weighting schemes in MTC produces a high-dimensional feature space, which causes substantial problems for multi-label learning algorithms. Feature Selection (FS) approaches are an effective way to deal with these problems. This paper proposes a Lightweight Term-weighting FS (LwTwFS) approach based on a modified Chi-square (CHI) filter-based FS method. The modified CHI approach accounts for Inter-Class Concentration (ICC) and Intra-Class Dispersion (ICD), and it is strengthened by adding positive and negative correlations. A novel modified equation is introduced to distribute the features among the categories (here, the multiple labels) in the corpus. The proposed modified CHI-based FS approach operates on the output of a term weighting-based Feature Extraction (FE) approach. A Multi-Layer Perceptron (MLP) is used in the classification phase because of its adaptive learning property, that is, its ability to learn how to perform tasks from data provided during training or from prior experience. Two publicly available multi-label corpora are used for experimental verification: the Arxiv Academic Paper Dataset (AAPD) and the Reuters Corpus Volume I (RCVI-V2). According to the results, LwTwFS combined with the MLP classifier surpasses other combinations in terms of Jaccard Score (JS), Hamming Loss (HL), Ranking Loss (RL), Precision (Pr), Recall (Re), F-micro, and F-macro. For the AAPD corpus, the LwTwFS method achieves the best JS, HL, RL, Pr, F-micro, and F-macro values of 0.9636, 0.0121, 0.0303, 0.9636, 0.9882, and 0.9894, respectively. For the RCVI-V2 corpus, the LwTwFS method achieves the best JS, Pr, Re, F-micro, and F-macro values of 1.0000 and the best HL and RL values of 0.0000.
Empirical results on two widely used benchmark multi-label text corpora show that LwTwFS achieves competitive performance, especially when labels are limited. |
---|---|
ISSN: | 1868-5137 1868-5145 |
DOI: | 10.1007/s12652-022-04335-5 |
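
The summary describes a filter-based FS step built on the Chi-square statistic and an evaluation using Hamming Loss and Jaccard Score. As a rough illustration only, the sketch below computes the classic (unmodified) Chi-square score for term/label pairs, keeps the top-k terms, and evaluates two of the named metrics. It is not the paper's modified CHI equation, which this record does not reproduce. The function names, the max-over-labels aggregation, and the toy corpus are assumptions made for this example.

```python
from typing import List, Sequence, Set


def chi_square(docs: Sequence[Set[str]], labels: Sequence[Set[str]],
               term: str, label: str) -> float:
    """Classic chi-square statistic for one (term, label) pair.

    Built from the 2x2 contingency table over documents:
    a = term and label, b = term only, c = label only, d = neither.
    """
    a = b = c = d = 0
    for tokens, doc_labels in zip(docs, labels):
        has_term = term in tokens
        has_label = label in doc_labels
        if has_term and has_label:
            a += 1
        elif has_term:
            b += 1
        elif has_label:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or label is constant across the corpus
    return n * (a * d - c * b) ** 2 / denom


def select_top_k(docs: Sequence[Set[str]], labels: Sequence[Set[str]],
                 k: int) -> List[str]:
    """Rank terms by their best score over all labels (an assumed aggregation)."""
    vocab = sorted(set().union(*docs))
    label_set = sorted(set().union(*labels))
    scores = {t: max(chi_square(docs, labels, t, l) for l in label_set)
              for t in vocab}
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]


def hamming_loss(y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]],
                 n_labels: int) -> float:
    """Fraction of (document, label) slots that are predicted wrongly."""
    errors = sum(len(t ^ p) for t, p in zip(y_true, y_pred))
    return errors / (len(y_true) * n_labels)


def jaccard_score(y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]]) -> float:
    """Sample-averaged Jaccard similarity between true and predicted label sets."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        union = t | p
        total += len(t & p) / len(union) if union else 1.0
    return total / len(y_true)


# Hypothetical toy corpus: each document is a set of tokens with a set of labels.
docs = [{"neural", "network"}, {"neural", "graph"},
        {"market", "stock"}, {"market", "graph"}]
labels = [{"cs"}, {"cs", "math"}, {"econ"}, {"econ", "math"}]

if __name__ == "__main__":
    print(select_top_k(docs, labels, 3))
    predicted = [{"cs"}, {"cs"}, {"econ"}, {"econ", "math"}]
    print(hamming_loss(labels, predicted, 3), jaccard_score(labels, predicted))
```

Because the difference `a * d - c * b` is squared, the classic statistic gives the same score to a term that is positively associated with a label and to one that is negatively associated with it. The summary states that the modified approach adds positive and negative correlations; how it does so is not shown here.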