Tagging Icelandic text: A linguistic rule-based approach


Bibliographic Details
Published in: Nordic journal of linguistics, Vol. 31, no. 1, pp. 47-72
Main Author: Loftsson, Hrafn
Format: Journal Article
Language: English
Published: Cambridge, UK: Cambridge University Press, 01.06.2008
Summary: The Icelandic language is a morphologically complex language, for which a large tagset has been created. This paper describes the design of a linguistic rule-based system for part-of-speech tagging Icelandic text. The system contains two main components: a disambiguator, IceTagger, and an unknown word guesser, IceMorphy. IceTagger uses a small number of local elimination rules along with a global heuristics component. The heuristics guess the functional roles of the words in a sentence, mark prepositional phrases, and use the acquired knowledge to force feature agreement where appropriate. IceMorphy is used for guessing the tag profile for unknown words and for automatically filling tag profile gaps in the lexicon. Evaluation shows that IceTagger achieves 91.54% accuracy, a substantial improvement on the highest accuracy, 90.44%, obtained using three state-of-the-art data-driven taggers. Furthermore, the accuracy increases to 92.95% by using IceTagger along with two data-driven taggers in a simple voting scheme. The development time of the tagging system was only seven man-months, which can be considered a short development period for a linguistic rule-based system.
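
The summary mentions combining IceTagger with two data-driven taggers in a simple voting scheme but does not spell out the procedure. The following is a minimal sketch of one common form of per-token majority voting, not the paper's implementation. The function names `vote` and `combine`, the placeholder tag strings, and the tie-breaking rule (fall back to the tagger at `fallback_index` when no tag wins a majority) are all assumptions made for illustration.

```python
from collections import Counter


def vote(tags, fallback_index=0):
    """Return the tag proposed by a strict majority of taggers.

    If no tag has a strict majority, fall back to the tag from the tagger
    at `fallback_index` (an assumed tie-breaking rule, not from the paper).
    """
    tag, count = Counter(tags).most_common(1)[0]
    if count > len(tags) / 2:
        return tag
    return tags[fallback_index]


def combine(outputs, fallback_index=0):
    """Combine per-token tag sequences from several taggers by voting.

    `outputs` is a list of tag sequences, one per tagger, all covering the
    same tokens in the same order.
    """
    if len({len(seq) for seq in outputs}) != 1:
        raise ValueError("all taggers must tag the same number of tokens")
    return [vote(list(column), fallback_index) for column in zip(*outputs)]


# Placeholder tags for three hypothetical taggers over three tokens.
rule_based = ["N", "V", "P"]
data_driven_a = ["N", "A", "Q"]
data_driven_b = ["X", "A", "R"]
combined = combine([rule_based, data_driven_a, data_driven_b])
```

In this sketch the first token is decided by two agreeing taggers, the second by the two data-driven taggers outvoting the first, and the third falls back to the first tagger because all three disagree.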
ISSN: 0332-5865, 1502-4717
DOI: 10.1017/S0332586508001820