Language Identification in Multi-lingual Web-Documents

Bibliographic Details
Published in	Lecture Notes in Computer Science, pp. 153-163
Main Authors	Mandl, Thomas; Shramko, Margaryta; Tartakovski, Olga; Womser-Hacker, Christa
Format	Book Chapter; Conference Proceeding
Language	English
Published	Berlin, Heidelberg: Springer Berlin Heidelberg, 2006
Series	Lecture Notes in Computer Science
Subjects	Applied sciences | Artificial intelligence | Computer science; control theory; systems | Exact sciences and technology | Information systems. Data bases | Memory organisation. Data processing | Software | Speech and sound recognition and synthesis. Linguistics | Multilingualism | Electronic document | Linguistics | Information retrieval | Natural language | Text | System identification | Localization

Summary:	Language identification is an important task for web information retrieval. This paper presents the implementation of a tool for language identification in mono- and multi-lingual documents. The tool implements four algorithms for language identification. Furthermore, we present an n-gram approach for the identification of languages in multi-lingual documents. An evaluation for monolingual texts of varied length is presented. Results for eight languages, including Ukrainian and Russian, are shown. The results show that n-gram-based approaches outperform word-based algorithms for short texts. For longer texts, the performance is comparable. The evaluation for multi-lingual documents is based on both short synthetic documents and real-world web documents. Our tool is able to recognize the languages present, as well as the location of the language change, with reasonable accuracy.
ISBN:	9783540346166 3540346163
ISSN:	0302-9743 1611-3349
DOI:	10.1007/11765448_14
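
Illustrative note: the summary refers to n-gram-based language identification without describing the algorithms. The sketch below shows one common variant of the general idea, a rank-order comparison of character trigram profiles ("out-of-place" distance). It is not claimed to be any of the four algorithms implemented in the paper's tool. All names (`build_profile`, `out_of_place`, `identify`, `TRAINING`) and the two tiny training sentences are assumptions made for this example; a usable profile would be built from far more text per language.

```python
from collections import Counter

def char_ngrams(text, n=3):
    # Normalise whitespace and pad with spaces so word boundaries yield n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_profile(text, n=3, top_k=300):
    # Map each of the top_k most frequent n-grams to its rank (0 = most frequent).
    counts = Counter(char_ngrams(text, n))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile, max_penalty):
    # Sum of rank differences; n-grams missing from the language profile
    # receive a fixed maximum penalty.
    total = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            total += abs(rank - lang_profile[gram])
        else:
            total += max_penalty
    return total

def identify(text, lang_profiles, n=3, top_k=300):
    # Return the language whose profile is closest to the document's profile.
    doc = build_profile(text, n, top_k)
    return min(lang_profiles,
               key=lambda lang: out_of_place(doc, lang_profiles[lang], top_k))

# Toy training data (assumption: for illustration only, far too small for real use).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the dog sleeps in the sun while the fox runs away",
    "de": "der schnelle braune fuchs springt in den wald und dann "
          "liegt der hund in der sonne und der fuchs rennt weg",
}
PROFILES = {lang: build_profile(text) for lang, text in TRAINING.items()}

if __name__ == "__main__":
    print(identify("the dog and the fox", PROFILES))
    print(identify("der hund und der fuchs", PROFILES))
```

Character n-grams can be extracted from even a few words of text, which fits the summary's finding that n-gram approaches do better than word-based ones on short texts. Applying a classifier like this to a window sliding over a document is one conceivable way to locate a language change, though the record does not say how the paper's tool does it.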