Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers

Bibliographic Details
Main Authors: Bueno, Mirelle; de Oliveira, Eduardo Seiti; Nogueira, Rodrigo; Lotufo, Roberto A; Pereira, Jayr Alencar
Format: Journal Article
Language: English
Published: 10.04.2024
Summary: Although Portuguese is one of the most spoken languages in the world, it lacks high-quality information retrieval datasets. We present Quati, a dataset specifically designed for the Brazilian Portuguese language. It comprises a collection of queries formulated by native speakers and a curated set of documents sourced from a selection of high-quality Brazilian Portuguese websites. These websites are more likely to be frequented by real users than randomly scraped ones, which makes the corpus more representative and relevant. To label the query-document pairs, we use a state-of-the-art LLM, which shows inter-annotator agreement levels comparable to human performance in our assessments. We provide a detailed description of our annotation methodology so that others can create similar datasets for other languages; it offers a cost-effective way of creating high-quality IR datasets with an arbitrary number of labeled documents per query. Finally, we evaluate a diverse range of open-source and commercial retrievers to serve as baseline systems. Quati is publicly available at https://huggingface.co/datasets/unicamp-dl/quati and all scripts at https://github.com/unicamp-dl/quati .
DOI: 10.48550/arxiv.2404.06976
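
The summary compares LLM labels with human labels in terms of inter-annotator agreement. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected agreement measure, between two annotators. This record does not state which agreement statistic or relevance scale the paper uses, so the function name `cohen_kappa`, the sample grades, and the choice of kappa itself are assumptions made for this example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Illustrative helper (not taken from the Quati scripts).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    # Both annotators used a single identical label: agreement is trivial.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical relevance grades for eight query-document pairs.
human = [0, 1, 2, 3, 1, 0, 2, 1]
llm = [0, 1, 2, 2, 1, 0, 3, 1]

kappa = cohen_kappa(human, llm)
print(f"Cohen's kappa: {kappa:.3f}")
```

A value near 1 indicates agreement well above chance, a value near 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Comparing the LLM-versus-human kappa with a human-versus-human kappa on the same pairs is one way to read a claim such as "comparable to human performance".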