Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers
Main Authors | , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 10.04.2024 |
Summary: | Despite Portuguese being one of the most spoken languages in the world, there
is a lack of high-quality information retrieval datasets in that language. We
present Quati, a dataset specifically designed for the Brazilian Portuguese
language. It comprises a collection of queries formulated by native speakers
and a curated set of documents sourced from a selection of high-quality
Brazilian Portuguese websites. These websites are more likely to be frequented
by real users than randomly scraped ones, ensuring a more representative
and relevant corpus. To label the query-document pairs, we use a
state-of-the-art LLM, which shows inter-annotator agreement levels comparable
to human performance in our assessments. We provide a detailed description of
our annotation methodology to enable others to create similar datasets for
other languages, providing a cost-effective way of creating high-quality IR
datasets with an arbitrary number of labeled documents per query. Finally, we
evaluate a diverse range of open-source and commercial retrievers to serve as
baseline systems. Quati is publicly available at
https://huggingface.co/datasets/unicamp-dl/quati and all scripts at
https://github.com/unicamp-dl/quati . |
---|---|
DOI: | 10.48550/arxiv.2404.06976 |
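
The summary reports that LLM labels reach inter-annotator agreement comparable to human annotators, but it does not say which agreement statistic was used. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected agreement measure, between two annotators. The function name `cohen_kappa` and the relevance grades in `human` and `llm` are hypothetical and are not taken from Quati.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Illustrative only: the summary does not state which agreement
    statistic the Quati authors report.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    # Both annotators used one identical label everywhere: treat as full agreement.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical relevance grades for eight query-document pairs.
human = [0, 1, 2, 1, 0, 3, 2, 1]
llm = [0, 1, 2, 2, 0, 3, 1, 1]

print(round(cohen_kappa(human, llm), 3))
```

A kappa of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement. Comparing the LLM-versus-human value with a human-versus-human value on the same pairs is one way to judge whether the two are "comparable".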