Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models
The quality of training data impacts the performance of pre-trained large language models (LMs). Given a fixed budget of tokens, we study how to best select data that leads to good downstream model performance across tasks. We develop a new framework based on a simple hypothesis: just as humans acqu...
Saved in:
Main Authors | , , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published |
26.07.2023
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | The quality of training data impacts the performance of pre-trained large
language models (LMs). Given a fixed budget of tokens, we study how to best
select data that leads to good downstream model performance across tasks. We
develop a new framework based on a simple hypothesis: just as humans acquire
interdependent skills in a deliberate order, language models also follow a
natural order when learning a set of skills from their training data. If such
an order exists, it can be utilized for improved understanding of LMs and for
data-efficient training. Using this intuition, our framework formalizes the
notion of a skill and of an ordered set of skills in terms of the associated
data. First, using both synthetic and real data, we demonstrate that these
ordered skill sets exist, and that their existence enables more advanced skills
to be learned with less data when we train on their prerequisite skills.
Second, using our proposed framework, we introduce an online data sampling
algorithm, Skill-It, over mixtures of skills for both continual pre-training
and fine-tuning regimes, where the objective is to efficiently learn multiple
skills in the former and an individual skill in the latter. On the LEGO
synthetic in the continual pre-training setting, Skill-It obtains 36.5 points
higher accuracy than random sampling. On the Natural Instructions dataset in
the fine-tuning setting, Skill-It reduces the validation loss on the target
skill by 13.6% versus training on data associated with the target skill itself.
We apply our skills framework on the recent RedPajama dataset to continually
pre-train a 3B-parameter LM, achieving higher accuracy on the LM Evaluation
Harness with 1B tokens than the baseline approach of sampling uniformly over
data sources with 3B tokens. |
---|---|
DOI: | 10.48550/arxiv.2307.14430 |