An Improved Stemming Approach Using HMM for a Highly Inflectional Language

Bibliographic Details
Published in	Computational Linguistics and Intelligent Text Processing, pp. 164-173
Main Authors	Saharia, Navanath; Konwar, Kishori M.; Sharma, Utpal; Kalita, Jugal K.
Format	Book Chapter
Language	English
Published	Berlin, Heidelberg: Springer Berlin Heidelberg, 2013
Series	Lecture Notes in Computer Science
Subjects	Indic Language; Root Word; SIGIR Forum; South Asian Language; Southeast Asian Natural

More Information
Summary:	Stemming is a common method for morphological normalization of natural language texts. Modern information retrieval systems rely on such normalization techniques for automatic document processing tasks. High-quality stemming is difficult in highly inflectional Indic languages, and little research has been performed on designing algorithms for stemming texts in these languages. In this study, we focus on the problem of stemming texts in Assamese, a low-resource Indic language spoken in the North-Eastern part of India by approximately 30 million people. Stemming is hard in Assamese because morphological inflections commonly appear as single-letter suffixes: more than 50% of the inflections in Assamese take this form. Such single-letter inflections cause ambiguity when predicting the underlying root word. Therefore, we propose a new method that combines a rule-based algorithm for predicting multiple-letter suffixes with an HMM-based algorithm for predicting single-letter suffixes. The combined approach can predict morphologically inflected words with 92% accuracy.
ISBN:	9783642372469 3642372465
ISSN:	0302-9743 1611-3349
DOI:	10.1007/978-3-642-37247-6_14
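
Illustrative sketch (not from the chapter): the summary describes a two-part stemmer, with rules for multiple-letter suffixes and an HMM for single-letter suffixes. The Python below is one possible reading of that idea, under loud assumptions. The suffix lists, the root lexicon, the two hidden states (KEEP and STRIP), and every probability are hypothetical placeholders chosen for illustration; they are not the authors' rules, data, or model, and the romanized tokens are toy strings only.

```python
import math

# ASSUMPTION: hypothetical suffix inventories, for illustration only.
MULTI_SUFFIXES = ["bor", "khon", "ok"]
SINGLE_SUFFIXES = {"e", "k", "t"}
# ASSUMPTION: a tiny toy root lexicon used by the emission scores.
KNOWN_ROOTS = {"manuh", "kitap", "ghar"}

# Hidden state per word: keep the final letter, or strip it as a suffix.
STATES = ("KEEP", "STRIP")
# ASSUMPTION: made-up start and transition probabilities.
START = {"KEEP": 0.6, "STRIP": 0.4}
TRANS = {
    "KEEP": {"KEEP": 0.5, "STRIP": 0.5},
    "STRIP": {"KEEP": 0.7, "STRIP": 0.3},
}


def strip_multi(word):
    """Rule-based step: remove the longest matching multiple-letter suffix."""
    for suffix in sorted(MULTI_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)], True
    return word, False


def emission(state, word):
    """Toy emission score for observing `word` in a given hidden state."""
    if state == "STRIP":
        if len(word) < 3 or word[-1] not in SINGLE_SUFFIXES:
            return 1e-6  # stripping is implausible here
        return 0.9 if word[:-1] in KNOWN_ROOTS else 0.3
    return 0.9 if word in KNOWN_ROOTS else 0.4


def viterbi(words):
    """Return the most likely KEEP/STRIP sequence for a non-empty word list."""
    scores = [
        {s: math.log(START[s]) + math.log(emission(s, words[0])) for s in STATES}
    ]
    backpointers = []
    for word in words[1:]:
        row, pointer = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[-1][p] + math.log(TRANS[p][s]))
            row[s] = (
                scores[-1][prev]
                + math.log(TRANS[prev][s])
                + math.log(emission(s, word))
            )
            pointer[s] = prev
        scores.append(row)
        backpointers.append(pointer)
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]


def stem_sentence(words):
    """Apply the rule-based step first; fall back to the HMM decision."""
    if not words:
        return []
    path = viterbi(words)
    stems = []
    for word, state in zip(words, path):
        stem, matched = strip_multi(word)
        if not matched and state == "STRIP":
            stem = word[:-1]
        stems.append(stem)
    return stems


print(stem_sentence(["manuhbor", "kitapt", "ghar"]))
```

In this sketch the rule-based step always takes precedence, and the HMM is consulted only for words with no multiple-letter match. That ordering is a design choice of the sketch, not something the summary specifies.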