An Improved Stemming Approach Using HMM for a Highly Inflectional Language

Bibliographic Details
Published in Computational Linguistics and Intelligent Text Processing, pp. 164-173
Main Authors Saharia, Navanath; Konwar, Kishori M.; Sharma, Utpal; Kalita, Jugal K.
Format Book Chapter
Language English
Published Berlin, Heidelberg Springer Berlin Heidelberg 2013
Series Lecture Notes in Computer Science
Summary: Stemming is a common method for morphological normalization of natural language texts. Modern information retrieval systems rely on such normalization techniques for automatic document processing tasks. High-quality stemming is difficult in highly inflectional Indic languages, and little research has been done on stemming algorithms for texts in these languages. In this study, we focus on the problem of stemming texts in Assamese, a low-resource Indic language spoken in the North-Eastern part of India by approximately 30 million people. Stemming is hard in Assamese because morphological inflections commonly appear as single-letter suffixes: more than 50% of the inflections in Assamese take this form. Such single-letter inflections cause ambiguity when predicting the underlying root word. Therefore, we propose a new method that combines a rule-based algorithm for predicting multiple-letter suffixes with an HMM-based algorithm for predicting single-letter suffixes. The combined approach can predict morphologically inflected words with 92% accuracy.
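The two-stage idea in the summary can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the suffix list, the two HMM states (`KEEP`, `STRIP`), all probabilities, and the romanized toy words are assumptions made up for this example, and the record does not say how the paper defines its HMM. Stage one strips a known multiple-letter suffix by rule; stage two runs Viterbi decoding over the final letters of the remaining words to decide whether each final letter is a single-letter suffix.

```python
import math

# Hypothetical multiple-letter suffixes (placeholders, not data from the paper).
MULTI_SUFFIXES = ("bilak", "khon")

# Assumed states: KEEP = final letter belongs to the root,
# STRIP = final letter is a single-letter suffix.
STATES = ("KEEP", "STRIP")

# Toy, hand-set probabilities; a real system would estimate them from tagged text.
START = {"KEEP": 0.6, "STRIP": 0.4}
TRANS = {"KEEP": {"KEEP": 0.5, "STRIP": 0.5},
         "STRIP": {"KEEP": 0.7, "STRIP": 0.3}}
EMIT = {"KEEP": {"a": 0.1, "e": 0.1, "k": 0.2, "OTHER": 0.6},
        "STRIP": {"a": 0.1, "e": 0.5, "k": 0.3, "OTHER": 0.1}}


def strip_multi(word):
    """Rule stage: strip the longest matching multiple-letter suffix."""
    for suffix in sorted(MULTI_SUFFIXES, key=len, reverse=True):
        # Require a stem of at least two letters to avoid over-stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)], True
    return word, False


def emit_prob(state, letter):
    """Emission probability, falling back to the OTHER bucket."""
    return EMIT[state].get(letter, EMIT[state]["OTHER"])


def viterbi(letters):
    """Most likely KEEP/STRIP sequence for a non-empty list of final letters."""
    scores = [{s: math.log(START[s]) + math.log(emit_prob(s, letters[0]))
               for s in STATES}]
    back = []
    for letter in letters[1:]:
        prev, row, ptr = scores[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            row[s] = (prev[best] + math.log(TRANS[best][s])
                      + math.log(emit_prob(s, letter)))
            ptr[s] = best
        scores.append(row)
        back.append(ptr)
    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def stem_sentence(words):
    """Apply the rule stage, then the HMM stage to words the rules left alone."""
    stems, pending = [], []
    for i, word in enumerate(words):
        stem, fired = strip_multi(word)
        stems.append(stem)
        if not fired and len(stem) > 2:
            pending.append(i)
    if pending:
        labels = viterbi([stems[i][-1] for i in pending])
        for i, label in zip(pending, labels):
            if label == "STRIP":
                stems[i] = stems[i][:-1]
    return stems


print(stem_sentence(["manuhbilak", "ghare", "pani"]))
# prints ['manuh', 'ghar', 'pani'] with the toy numbers above
```

Running the HMM only on words the rules did not handle is a simplification chosen for this sketch; it keeps the two stages independent, at the cost of dropping rule-stemmed words from the HMM's sequence context.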
ISBN: 9783642372469
3642372465
ISSN: 0302-9743
1611-3349
DOI: 10.1007/978-3-642-37247-6_14