MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages

Bibliographic Details
Published in	Transactions of the Association for Computational Linguistics Vol. 11; pp. 1114 - 1131
Main Authors	Zhang, Xinyu; Thakur, Nandan; Ogundepo, Odunayo; Kamalloo, Ehsan; Alfonso-Hermelo, David; Li, Xiaoguang; Liu, Qun; Rezagholizadeh, Mehdi; Lin, Jimmy
Format	Journal Article
Language	English
Published	One Broadway, 12th Floor, Cambridge, Massachusetts 02142, USA: MIT Press, 01.09.2023
Subjects	Annotations; Computational linguistics; Control data (computers); Datasets; Language diversity; Languages; Multilingualism; Native speakers; Queries; Retrieval
ISSN	2307-387X
DOI	10.1162/tacl_a_00595

Abstract	MIRACL is a multilingual dataset for retrieval across 18 languages that collectively encompass over three billion native speakers around the world. This resource is designed to support monolingual retrieval tasks, where the queries and the corpora are in the same language. In total, we have gathered over 726k high-quality relevance judgments for 78k queries over Wikipedia in these languages, where all annotations have been performed by native speakers hired by our team. MIRACL covers languages that are both typologically close as well as distant from 10 language families and 13 sub-families, associated with varying amounts of publicly available resources. Extensive automatic heuristic verification and manual assessments were performed during the annotation process to control data quality. In total, MIRACL represents an investment of around five person-years of human annotator effort. Our goal is to spur research on improving retrieval across a continuum of languages, thus enhancing information access capabilities for diverse populations around the world, particularly those that have traditionally been underserved. MIRACL is available at http://miracl.ai/.
Affiliation	David R. Cheriton School of Computer Science, University of Waterloo, Canada: Zhang, Xinyu; Thakur, Nandan; Ogundepo, Odunayo; Kamalloo, Ehsan; Lin, Jimmy
Affiliation	Huawei Noah’s Ark Lab, Canada: Alfonso-Hermelo, David; Rezagholizadeh, Mehdi
Affiliation	Huawei Noah’s Ark Lab, China: Li, Xiaoguang; Liu, Qun
Copyright	2023. This work is published under https://creativecommons.org/licenses/by/4.0/legalcode (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
URI	https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00595 https://www.proquest.com/docview/2893946678 https://doaj.org/article/f9f3179fdabd48b6a3423c948d637631