Wikipedia-based cross-language text classification

Bibliographic Details
Published in	Information Sciences, Vol. 406-407; pp. 12-28
Main Authors	Mouriño García, Marcos Antonio; Pérez Rodríguez, Roberto; Anido Rifón, Luis
Format	Journal Article
Language	English
Published	Elsevier Inc 01.09.2017
Subjects	Bag of concepts; Bag of words; Cross-language text classification; Document representation; Hybrid; Wikipedia Miner
Summary:	This paper applies a Wikipedia-based bag of concepts (WikiBoC) document representation to cross-language text classification (CLTC). The main objective is to alleviate the major drawbacks of state-of-the-art CLTC approaches, which typically rely on machine translation (MT) of documents represented as bags of words (BoW). We propose a technique called cross-language concept matching (CLCM), which converts concept-based document representations from one language to another using Wikipedia correspondences between concepts in different languages, and therefore does not rely on automated full-text translation. We describe two proposals. The first uses the WikiBoC representation in conjunction with CLCM (WikiBoC-CLCM) to classify documents written in a language L1 with an SVM algorithm trained on documents written in another language L2. The second is a hybrid model for representing documents that combines WikiBoC-CLCM with the classic BoW-MT approach. To evaluate the two proposals, we conducted several experiments with three cross-lingual corpora: the JRC-Acquis corpus and two purpose-built corpora composed of Wikipedia articles. The first proposal outperforms state-of-the-art approaches when training sequences are short, achieving performance increases of up to 233.33%. The second proposal outperforms state-of-the-art approaches across the whole range of training sequences, achieving performance increases of up to 23.78%. The results show the benefits of the WikiBoC-CLCM approach: concepts extracted from documents add useful information to the classifier and thus improve its performance.
ISSN:	0020-0255, 1872-6291
DOI:	10.1016/j.ins.2017.04.024
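
The summary describes CLCM as converting a concept-based representation from one language to another through Wikipedia correspondences between concepts. A minimal sketch of that idea follows. It is not the authors' implementation: the link table `ES_TO_EN`, the function names `clcm` and `hybrid`, and the sample weights are hypothetical, and the feature concatenation in `hybrid` is only one assumed way of combining concepts with translated words, since the summary does not say how the hybrid model is built.

```python
from collections import Counter
from typing import Dict, Mapping

# Hypothetical toy table of Wikipedia cross-language correspondences
# (Spanish article title -> English article title). A real system would
# derive this from Wikipedia's interlanguage links.
ES_TO_EN: Dict[str, str] = {
    "Aprendizaje automático": "Machine learning",
    "Máquina de vectores de soporte": "Support vector machine",
    "Traducción automática": "Machine translation",
}


def clcm(boc: Mapping[str, float], links: Mapping[str, str]) -> Dict[str, float]:
    """Map a bag of concepts from the source language to the target language.

    Concepts with no known correspondence are dropped (an assumption of
    this sketch). Weights of source concepts that map to the same target
    concept are summed.
    """
    mapped: Counter = Counter()
    for concept, weight in boc.items():
        target = links.get(concept)
        if target is not None:
            mapped[target] += weight
    return dict(mapped)


def hybrid(boc: Mapping[str, float], bow: Mapping[str, float]) -> Dict[str, float]:
    """Concatenate concept and word features (assumed combination scheme).

    Prefixes keep the two feature namespaces from colliding.
    """
    features = {f"concept:{c}": w for c, w in boc.items()}
    features.update({f"word:{t}": w for t, w in bow.items()})
    return features


# A Spanish document represented as concept -> weight (made-up values).
doc_es = {
    "Aprendizaje automático": 2.0,
    "Traducción automática": 1.0,
    "Concepto sin enlace": 3.0,  # no English counterpart in the toy table
}
doc_en = clcm(doc_es, ES_TO_EN)
print(doc_en)  # {'Machine learning': 2.0, 'Machine translation': 1.0}
```

The mapped dictionary lives in the same concept space as English training documents, so it could be vectorized and passed to a classifier trained on that language without translating the full text.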