Language morphology offset: Text classification on a Croatian–English parallel corpus

Bibliographic Details
Published in	Information Processing & Management, Vol. 44, no. 1, pp. 325–339
Main Authors	Malenica, M., Šmuc, T., Šnajder, J., Dalbelo Bašić, B.
Format	Journal Article
Language	English
Published	Kidlington: Elsevier Ltd, 2008
Subjects	Automatic classification; Classification; Comparative analysis; Computerized information retrieval; Content analysis; Corpus analysis; Croatian; English; Exact sciences and technology; Feature selection; Indexing. Classification. Abstracting; Indexing. Classification. Abstracting. Syntheses; Information and communication sciences; Information and document structure and analysis; Information processing; Information processing and retrieval; Information science. Documentation; Language; Languages; Lemmatization; Morphological analysis; Morphological normalisation; Morphology; Sciences and techniques of general use; Stemming; Studies; Support vector machine; Support vector machines; SVM; Text; Text categorization; Text classification
Summary:	We investigate how, and to what extent, the morphological complexity of a language influences text classification using support vector machines (SVM). The Croatian–English parallel corpus provides the basis for a direct comparison of two languages of radically different morphological complexity. We quantified, compared, and statistically tested the effects of morphological normalisation on SVM classifier performance in a series of parallel experiments on both languages, carried out over a wide range of feature subset sizes obtained by different feature selection methods, and applying different levels of morphological normalisation. We also quantified the trade-off between feature space size and performance for different levels of morphological normalisation, and compared the results for both languages. Our experiments show that the improvements in SVM classifier performance due to morphological normalisation are statistically significant. They are greater for small and medium numbers of features, especially for Croatian, whereas for large numbers of features the improvements are rather small and may be negligible in practice for both languages.
ISSN:	0306-4573, 1873-5371
DOI:	10.1016/j.ipm.2006.12.007