Linguini: Language Identification for Multilingual Documents

Bibliographic Details
Published in	Journal of management information systems Vol. 16; no. 3; pp. 71 - 101
Main Author	Prager, John M.
Format	Journal Article
Language	English
Published	Abingdon Routledge 01.12.1999 M. E. Sharpe Taylor & Francis Ltd
Subjects	categorization; Comparative analysis; Cosine function; Dictionaries; Document management; Dot product of vectors; Electronic publishing; End users; Information retrieval; Information systems; Language; language identification; Languages; Nonnative languages; Search engines; Special Section: Exploring the Outlands of the MIS Discipline; Statistical analysis; Studies; Term weighting; vector-space models; Weighted averages; Words
ISSN	0742-1222 1557-928X
DOI	10.1080/07421222.1999.11518257

Summary:	Given the vast and still growing availability of electronic documents from around the world, it is becoming increasingly important for managers of the information systems on which these documents are stored to sort or tag these documents so that their end users can most readily access those documents that are of most interest and use to them, which in our context means in a language they can understand. Linguini is a vector-space-based categorizer tailored for high-precision language identification. This paper determines the functional dependencies of Linguini's performance and demonstrates that it can identify the language of documents as short as 5 to 10 percent of the size of average Web documents with 100 percent accuracy. It also describes how to determine if a document is in two or more languages, without incurring any appreciable extra computational overhead. This approach can be applied equally to subject-categorization systems to distinguish between cases where, when the system recommends two or more categories, the document belongs strongly to all or really to none.
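
To make the vector-space idea in the summary concrete, the following is a minimal Python sketch of cosine-based language identification. It is an illustration under stated assumptions, not Linguini's actual method: the character-trigram features, the unweighted counts, the toy training sentences, and the names ngram_vector, cosine, PROFILES, and rank_languages are all hypothetical choices made for this example. The record's subject terms (cosine function, dot product of vectors, term weighting) suggest the general family of techniques, but the paper's own features and weights are not given here.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Count overlapping character n-grams in lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy training text (an assumption for illustration; a real system would
# build each language profile from a much larger corpus).
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "french": "le renard brun saute par-dessus le chien paresseux et le chat dort dans la maison",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
PROFILES = {lang: ngram_vector(text) for lang, text in TRAINING.items()}


def rank_languages(text):
    """Return (language, cosine score) pairs, best match first."""
    doc = ngram_vector(text)
    scores = {lang: cosine(doc, profile) for lang, profile in PROFILES.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for lang, score in rank_languages("le chat dort dans la maison"):
        print(f"{lang}: {score:.3f}")
```

In a sketch like this, one plausible way to flag a document written in two or more languages is to look at the whole ranked list rather than only the top entry, for example by checking whether several languages score highly at once. That heuristic is an assumption of this example and should not be read as the procedure the paper describes.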