Tagging Icelandic text: A linguistic rule-based approach
The Icelandic language is a morphologically complex language, for which a large tagset has been created. This paper describes the design of a linguistic rule-based system for part-of-speech tagging Icelandic text. The system contains two main components: a disambiguator, IceTagger, and an unknown wo...
Saved in:
Published in | Nordic journal of linguistics Vol. 31; no. 1; pp. 47 - 72 |
---|---|
Main Author | |
Format | Journal Article |
Language | English |
Published |
Cambridge, UK
Cambridge University Press
01.06.2008
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | The Icelandic language is a morphologically complex language, for which a large tagset has been created. This paper describes the design of a linguistic rule-based system for part-of-speech tagging Icelandic text. The system contains two main components: a disambiguator, IceTagger, and an unknown word guesser, IceMorphy. IceTagger uses a small number of local elimination rules along with a global heuristics component. The heuristics guess the functional roles of the words in a sentence, mark prepositional phrases, and use the acquired knowledge to force feature agreement where appropriate. IceMorphy is used for guessing the tag profile for unknown words and for automatically filling tag profile gaps in the lexicon. Evaluation shows that IceTagger achieves 91.54% accuracy, a substantial improvement on the highest accuracy, 90.44%, obtained using three state-of-the-art data-driven taggers. Furthermore, the accuracy increases to 92.95% by using IceTagger along with two data-driven taggers in a simple voting scheme. The development time of the tagging system was only seven man-months, which can be considered a short development period for a linguistic rule-based system. |
---|---|
Bibliography: | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23 |
ISSN: | 0332-5865 1502-4717 |
DOI: | 10.1017/S0332586508001820 |