Comparative evaluation of text classification techniques using a large diverse Arabic dataset

Bibliographic Details
Published in	Language Resources and Evaluation, Vol. 47, no. 2, pp. 513-538
Main Authors	Khorsheed, Mohammad S.; Al-Thubaity, Abdulmohsen O.
Format	Journal Article
Language	English
Published	Dordrecht: Springer Netherlands (Springer Nature B.V.), 01.06.2013
Subjects	Algorithms; Applied linguistics; Arabic language; Benchmarking; Boolean data; Classification; Computational Linguistics; Computer Science; Data visualization; Datasets; Decision trees; English language; Experimentation; Human; Information classification; Language and Literature; Linguistics; Literature reviews; News content; Original Paper; Poetry; Religious poetry; Social Sciences; Support vector machines; Texts; Training; Weighting methods; Words; Writers; Arabic text categorization; Machine learning; Arabic text classification; Automatic classification; Arabic Text; Natural language processing; Algorithm; Categorization

Summary:	A vast amount of valuable human knowledge is recorded in documents. The rapid growth in the number of machine-readable documents for public or private access necessitates the use of automatic text classification. While a lot of effort has been put into Western languages, mostly English, minimal experimentation has been done with Arabic. This paper presents, first, an up-to-date review of the work done in the field of Arabic text classification and, second, a large and diverse dataset that can be used for benchmarking Arabic text classification algorithms. The different techniques derived from the literature review are illustrated by their application to the proposed dataset. The results of various feature selection methods, weighting methods, and classification algorithms show, on average, the superiority of the support vector machine, followed by the decision tree algorithm (C4.5) and Naïve Bayes. The best classification accuracy was 97% for the Islamic Topics dataset, and the lowest was 61% for the Arabic Poems dataset.
ISSN:	1574-020X, 1572-8412, 1574-0218
DOI:	10.1007/s10579-013-9221-8