Utilising a Large Language Model to Annotate Subject Metadata: A Case Study in an Australian National Research Data Catalogue
Main Authors | , , |
---|---|
Format | Journal Article |
Language | English |
Published | 17.10.2023 |
DOI | 10.48550/arxiv.2310.11318 |
Summary: | In support of open and reproducible research, the number of datasets
made available for research is increasing rapidly. As the availability
of datasets increases, it becomes more important to have quality metadata for
discovering and reusing them. Yet datasets often
lack quality metadata because resources for data curation are limited. Meanwhile,
technologies such as artificial intelligence and large language models (LLMs)
are progressing rapidly. Recently, systems based on these technologies, such as
ChatGPT, have demonstrated promising capabilities for certain data curation
tasks. This paper proposes leveraging LLMs for cost-effective annotation of
subject metadata through in-context learning. Our method employs
GPT-3.5 with prompts designed for annotating subject metadata, demonstrating
promising performance in automatic metadata annotation. However, models based
on in-context learning cannot acquire discipline-specific rules, resulting in
lower performance in several categories. This limitation arises from the
limited contextual information available for subject inference. To the best of
our knowledge, this is the first in-context learning method that harnesses
large language models for automated subject metadata annotation. |
---|---|
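
The summary describes annotating subject metadata through in-context learning, that is, showing the model a few labelled examples inside the prompt and asking it to label a new record. The sketch below illustrates one way such a prompt could be assembled and the model's reply checked against an allowed list. It is not the authors' prompt: the function names, the prompt wording, and the category labels are hypothetical, and no model call is made.

```python
from typing import List, Optional, Tuple


def build_subject_prompt(
    categories: List[str],
    examples: List[Tuple[str, str]],
    description: str,
) -> str:
    """Assemble a few-shot prompt: instructions, labelled examples, then the query."""
    lines = [
        "Assign exactly one subject category to the dataset description.",
        "Allowed categories: " + "; ".join(categories),
        "",
    ]
    # Each labelled example is an in-context demonstration for the model.
    for text, label in examples:
        lines.append(f"Description: {text}")
        lines.append(f"Subject: {label}")
        lines.append("")
    # The record to annotate ends with an open "Subject:" slot for the model to fill.
    lines.append(f"Description: {description}")
    lines.append("Subject:")
    return "\n".join(lines)


def parse_subject(completion: str, categories: List[str]) -> Optional[str]:
    """Map the first line of a model reply to an allowed category, or None."""
    stripped = completion.strip()
    answer = stripped.splitlines()[0].strip() if stripped else ""
    for category in categories:
        if answer.lower() == category.lower():
            return category
    # Replies outside the allowed list are rejected rather than guessed at.
    return None


if __name__ == "__main__":
    # Hypothetical categories and example record, for illustration only.
    cats = ["Earth Sciences", "Biological Sciences"]
    shots = [("Seismic readings from coastal stations.", "Earth Sciences")]
    print(build_subject_prompt(cats, shots, "Genome sequences of eucalyptus species."))
```

The resulting string would be sent to a chat model such as GPT-3.5, and the reply passed to `parse_subject`. Rejecting replies that fall outside the allowed list keeps free-text model output from entering the catalogue as an invalid subject term.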