Instruction Fine-Tuning: Does Prompt Loss Matter?
Format: Journal Article
Language: English
Published: 24.01.2024
Summary: We present a novel study analyzing the effects of various prompt loss token weights (PLW) for supervised instruction fine-tuning (SIFT). While prompt masking (PLW = 0) is common for SIFT, some fine-tuning APIs support fractional PLWs and suggest that using a small non-zero PLW can help stabilize learning when fine-tuning on short-completion data. However, there has never been a study confirming this claim, and OpenAI, a major cloud-based SIFT provider, recently removed this parameter from their fine-tuning API. We found that the performance of models fine-tuned on short-completion data had a statistically significant negative quadratic relationship with PLW. Small PLW values (0.01–0.5) produced better results on multiple-choice and short-generation benchmarks (outperforming models fine-tuned on long-completion data), while large PLW values (~1.0) produced better results on long-generation benchmarks. We explained this effect and verified its importance through additional experiments. This research serves as a warning to API providers about the importance of providing a PLW parameter for SIFT.
DOI: 10.48550/arxiv.2401.13586
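The mechanism the summary studies is easy to state concretely: PLW scales the training loss on prompt tokens relative to completion tokens, with PLW = 0 recovering full prompt masking. Below is a minimal PyTorch sketch of such a per-token weighting; this is not code from the paper, and the function name, tensor shapes, and the normalization by total weight are all illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def weighted_sift_loss(logits, labels, prompt_mask, plw=0.1):
    """Cross-entropy loss with prompt tokens weighted by `plw`.

    Assumes logits and labels are already shifted for causal LM training;
    padding handling is omitted for brevity.

    logits:      (batch, seq_len, vocab_size) model outputs
    labels:      (batch, seq_len) target token ids
    prompt_mask: (batch, seq_len) bool, True where a token belongs to the prompt
    plw:         prompt loss weight; 0 reproduces full prompt masking,
                 1 weights prompt and completion tokens equally
    """
    # Per-token loss, unreduced, so each token can be reweighted individually.
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).reshape(labels.shape)

    # Weight plw on prompt tokens, 1.0 on completion tokens.
    weights = torch.where(
        prompt_mask,
        torch.full_like(per_token, plw),
        torch.ones_like(per_token),
    )

    # One common normalization choice (not necessarily the paper's): divide by
    # the total weight so the loss scale stays comparable across plw values.
    return (per_token * weights).sum() / weights.sum()

# Illustrative usage: one sequence of six tokens, the first three from the prompt.
logits = torch.randn(1, 6, 32000)
labels = torch.randint(0, 32000, (1, 6))
prompt_mask = torch.tensor([[True, True, True, False, False, False]])
loss = weighted_sift_loss(logits, labels, prompt_mask, plw=0.1)
```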