MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages

Bibliographic Details
Published in Transactions of the Association for Computational Linguistics, Vol. 11, pp. 1114-1131
Main Authors Zhang, Xinyu; Thakur, Nandan; Ogundepo, Odunayo; Kamalloo, Ehsan; Alfonso-Hermelo, David; Li, Xiaoguang; Liu, Qun; Rezagholizadeh, Mehdi; Lin, Jimmy
Format Journal Article
Language English
Published One Broadway, 12th Floor, Cambridge, Massachusetts 02142, USA: MIT Press, 2023-09-01
ISSN 2307-387X
DOI 10.1162/tacl_a_00595

Abstract MIRACL is a multilingual dataset for ad hoc retrieval across 18 languages that collectively encompass over three billion native speakers around the world. This resource is designed to support monolingual retrieval tasks, where the queries and the corpora are in the same language. In total, we have gathered over 726k high-quality relevance judgments for 78k queries over Wikipedia in these languages, where all annotations have been performed by native speakers hired by our team. MIRACL covers languages that are both typologically close as well as distant from 10 language families and 13 sub-families, associated with varying amounts of publicly available resources. Extensive automatic heuristic verification and manual assessments were performed during the annotation process to control data quality. In total, MIRACL represents an investment of around five person-years of human annotator effort. Our goal is to spur research on improving retrieval across a continuum of languages, thus enhancing information access capabilities for diverse populations around the world, particularly those that have traditionally been underserved. MIRACL is available at http://miracl.ai/.
Author Affiliations
Zhang, Xinyu: David R. Cheriton School of Computer Science, University of Waterloo, Canada
Thakur, Nandan: David R. Cheriton School of Computer Science, University of Waterloo, Canada
Ogundepo, Odunayo: David R. Cheriton School of Computer Science, University of Waterloo, Canada
Kamalloo, Ehsan: David R. Cheriton School of Computer Science, University of Waterloo, Canada
Alfonso-Hermelo, David: Huawei Noah’s Ark Lab, Canada
Li, Xiaoguang: Huawei Noah’s Ark Lab, China
Liu, Qun: Huawei Noah’s Ark Lab, China
Rezagholizadeh, Mehdi: Huawei Noah’s Ark Lab, Canada
Lin, Jimmy: David R. Cheriton School of Computer Science, University of Waterloo, Canada
Copyright 2023. This work is published under https://creativecommons.org/licenses/by/4.0/legalcode (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Open Access Yes
Peer Reviewed Yes
Subjects Annotations; Computational linguistics; Control data (computers); Datasets; Language diversity; Languages; Multilingualism; Native speakers; Queries; Retrieval