Linguini: Language Identification for Multilingual Documents

Bibliographic Details
Published in Journal of Management Information Systems Vol. 16; no. 3; pp. 71-101
Main Author Prager, John M.
Format Journal Article
Language English
Published Abingdon Routledge 01.12.1999
M. E. Sharpe
Taylor & Francis Ltd
ISSN 0742-1222
EISSN 1557-928X
DOI 10.1080/07421222.1999.11518257

Abstract Given the vast and still growing availability of electronic documents from around the world, it is becoming increasingly important for managers of the information systems on which these documents are stored to sort or tag these documents so that their end users can most readily access those documents that are of most interest and use to them, which in our context means in a language they can understand. Linguini is a vector-space-based categorizer tailored for high-precision language identification. This paper determines the functional dependencies of Linguini's performance and demonstrates that it can identify the language of documents as short as 5 to 10 percent of the size of average Web documents with 100 percent accuracy. It also describes how to determine if a document is in two or more languages, without incurring any appreciable extra computational overhead. This approach can be applied equally to subject-categorization systems to distinguish between cases where, when the system recommends two or more categories, the document belongs strongly to all or really to none.
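
The abstract describes Linguini only as a vector-space-based categorizer, so the following is a minimal sketch of that general idea rather than the paper's method. It assumes character trigram counts as features and cosine similarity (normalized dot product) as the match score. The feature choice, the toy training sentences, and the names ngram_vector, cosine, rank_languages, TRAINING and PROFILES are all illustrative assumptions.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Count overlapping character n-grams in lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_languages(text, profiles):
    """Return (language, score) pairs sorted from best to worst match."""
    doc = ngram_vector(text)
    scores = {lang: cosine(doc, vec) for lang, vec in profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy training text, one short sample per language (hypothetical data;
# a real system would build profiles from far larger corpora).
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and the cat "
               "sat on the mat with the other animals",
    "french": "le renard brun saute par dessus le chien paresseux et le chat "
              "est assis sur le tapis avec les autres animaux",
    "german": "der schnelle braune fuchs springt hoch und die katze sitzt auf "
              "der matte mit den anderen tieren neben dem faulen hund",
}
PROFILES = {lang: ngram_vector(sample) for lang, sample in TRAINING.items()}

if __name__ == "__main__":
    for sample in ("the dog and the cat", "le chat et le chien"):
        best, score = rank_languages(sample, PROFILES)[0]
        print(f"{sample!r} -> {best} ({score:.2f})")
```

Because rank_languages returns the full ranking and not just the top match, a caller can also inspect how close the runner-up is. That is one possible starting point for the multi-language question the abstract raises, though the paper's own test for that case is not reproduced here.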
Copyright 2000 by M. E. Sharpe, Inc. All rights reserved.
Discipline Engineering
Subject Terms
categorization
Comparative analysis
Cosine function
Dictionaries
Document management
Dot product of vectors
Electronic publishing
End users
Information retrieval
Information systems
Language
language identification
Languages
Nonnative languages
Search engines
Special Section: Exploring the Outlands of the MIS Discipline
Statistical analysis
Studies
Term weighting
vector-space models
Weighted averages
Words
URI https://www.tandfonline.com/doi/abs/10.1080/07421222.1999.11518257
https://www.jstor.org/stable/40398445
https://www.proquest.com/docview/218909886