A Survey of LLM Datasets: From Autoregressive Model to AI Chatbot

Bibliographic Details
Published in	Journal of Computer Science and Technology, Vol. 39, no. 3, pp. 542-566
Main Authors	Du, Fei; Ma, Xin-Jian; Yang, Jing-Ru; Liu, Yi; Luo, Chao-Ran; Wang, Xue-Bin; Jiang, Hai-Ou; Jing, Xiang
Format	Journal Article
Language	English
Published	Singapore: Springer Nature Singapore, 01.05.2024; Springer Nature B.V.
Subjects	Artificial Intelligence; Autoregressive models; Chatbots; Computer Science; Data Structures and Information Theory; Datasets; Information Systems Applications (incl. Internet); Large language models; Polls & surveys; Quality; Researchers; Software Engineering; Survey; Theory of Computation; natural language processing (NLP) corpora; AI chatbot; OpenAI; large language model (LLM); autoregressive model
ISSN	1000-9000 1860-4749
DOI	10.1007/s11390-024-3767-3

More Information
Summary:	Since OpenAI opened access to ChatGPT, large language models (LLMs) have become an increasingly popular topic, attracting researchers' attention from many domains. However, public researchers face difficulties when developing LLMs, given that most LLMs are produced by industry and their training details are typically undisclosed. Since datasets are an important part of the LLM setup, this paper presents a holistic survey of the training datasets used in both the pre-training and fine-tuning processes. The paper first summarizes 16 pre-training datasets and 16 fine-tuning datasets used in state-of-the-art LLMs. Second, based on the properties of the pre-training and fine-tuning processes, it comments on pre-training datasets in terms of quality, quantity, and relation to models, and on fine-tuning datasets in terms of quality, quantity, and concerns. The study then critically identifies the problems and research trends in current LLM datasets. It helps public researchers train and investigate LLMs through visual cases and provides useful comments to the research community regarding data development. To the best of our knowledge, this paper is the first to summarize and discuss datasets used in both autoregressive and chat LLMs. The survey offers insights and suggestions to researchers and LLM developers as they build their models, and contributes to LLM research by pointing out existing problems from the perspective of data.