Linguini: Language Identification for Multilingual Documents
Published in | Journal of management information systems Vol. 16; no. 3; pp. 71 - 101 |
---|---|
Main Author | Prager, John M. |
Format | Journal Article |
Language | English |
Published | Abingdon: Routledge; M. E. Sharpe; Taylor & Francis Ltd, 01.12.1999 |
Subjects | |
ISSN | 0742-1222 1557-928X |
DOI | 10.1080/07421222.1999.11518257 |
Abstract | Given the vast and still growing availability of electronic documents from around the world, it is becoming increasingly important for managers of the information systems on which these documents are stored to sort or tag these documents so that their end users can most readily access those documents that are of most interest and use to them, which in our context means in a language they can understand. Linguini is a vector-space-based categorizer tailored for high-precision language identification. This paper determines the functional dependencies of Linguini's performance and demonstrates that it can identify the language of documents as short as 5 to 10 percent of the size of average Web documents with 100 percent accuracy. It also describes how to determine if a document is in two or more languages, without incurring any appreciable extra computational overhead. This approach can be applied equally to subject-categorization systems to distinguish between cases where, when the system recommends two or more categories, the document belongs strongly to all or really to none. |
---|---|
Author | Prager, John M. |
Copyright | 2000 by M. E. Sharpe, Inc. All rights reserved. 2000; Copyright 2000 M.E. Sharpe, Inc.; Copyright M. E. Sharpe Inc. Winter 1999/2000 |
Discipline | Engineering |
Genre | Article Feature |
IsPeerReviewed | true |
IsScholarly | true |
PageCount | 31 |
PublicationSubtitle | JMIS |
SubjectTerms | categorization; Comparative analysis; Cosine function; Dictionaries; Document management; Dot product of vectors; Electronic publishing; End users; Information retrieval; Information systems; Language; language identification; Languages; Nonnative languages; Search engines; Special Section: Exploring the Outlands of the MIS Discipline; Statistical analysis; Studies; Term weighting; vector-space models; Weighted averages; Words |
URI | https://www.tandfonline.com/doi/abs/10.1080/07421222.1999.11518257 https://www.jstor.org/stable/40398445 https://www.proquest.com/docview/218909886 |
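
The abstract describes Linguini as a vector-space-based categorizer, and the subject terms list the cosine function and the dot product of vectors. This record does not give the paper's actual features, term weights, or training data, so the following is only a generic sketch of vector-space language identification. The use of character trigrams, raw counts, the toy training sentences, and the names `ngram_vector`, `cosine`, `rank_languages`, and `PROFILES` are all assumptions made for illustration, not Linguini's implementation.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Count overlapping character n-grams (assumed feature type, not the paper's)."""
    cleaned = " ".join(text.lower().split())
    padded = f" {cleaned} "  # pad so word boundaries show up in the n-grams
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_languages(text, profiles):
    """Return (language, score) pairs sorted from best to worst match."""
    doc = ngram_vector(text)
    scores = [(lang, cosine(doc, vec)) for lang, vec in profiles.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy training snippets (hypothetical); a usable system needs far more text per language.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and then "
               "the cat sat on the mat with the other animals",
    "french": "le renard brun saute par dessus le chien paresseux et puis "
              "le chat est sur le tapis avec les autres animaux",
    "german": "der schnelle braune fuchs springt über den faulen hund und dann "
              "sitzt die katze auf der matte mit den anderen tieren",
}
PROFILES = {lang: ngram_vector(sample) for lang, sample in TRAINING.items()}
```

Calling `rank_languages("the dog and the cat", PROFILES)` returns the three languages ordered by cosine score. Comparing the top two scores is one simple way to notice an ambiguous or mixed-language input, although the paper's own multilingual test may work differently.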