Fast Training Dataset Attribution via In-Context Learning
Format | Journal Article
---|---
Language | English
Published | 14.08.2024
Summary: We investigate the use of in-context learning and prompt engineering to estimate the contributions of training data in the outputs of instruction-tuned large language models (LLMs). We propose two novel approaches: (1) a similarity-based approach that measures the difference between LLM outputs with and without provided context, and (2) a mixture distribution model approach that frames the problem of identifying contribution scores as a matrix factorization task. Our empirical comparison demonstrates that the mixture model approach is more robust to retrieval noise in in-context learning, providing a more reliable estimation of data contributions.
DOI: 10.48550/arxiv.2408.11852
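The abstract's second approach frames contribution estimation as matrix factorization. The sketch below is only an illustration of that general framing, not the paper's actual method: it builds a toy matrix `V` of output-token distributions as a known mixture of two hypothetical "source" distributions, then recovers nonnegative factors `V ≈ W @ H` with plain multiplicative-update NMF, reading normalized rows of `W` as contribution scores. All names and the synthetic data are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_outputs, vocab, n_sources = 6, 20, 2

# Synthetic setup (hypothetical): build V as a known mixture so the
# factorization has something meaningful to recover.
H_true = rng.dirichlet(np.ones(vocab), size=n_sources)      # source token distributions
W_true = rng.dirichlet(np.ones(n_sources), size=n_outputs)  # true mixture weights
V = W_true @ H_true                                         # observed output distributions

# Standard multiplicative-update NMF minimizing ||V - W @ H||_F
# (Lee & Seung updates); small epsilons guard against division by zero.
W = rng.random((n_outputs, n_sources)) + 1e-3
H = rng.random((n_sources, vocab)) + 1e-3
for _ in range(500):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-12)

# Normalize rows of W so each output's scores over sources sum to 1,
# giving per-output "contribution scores" in this toy framing.
scores = W / W.sum(axis=1, keepdims=True)
print(np.round(scores, 2))
```

NMF is used here only because it is the simplest nonnegative factorization; the paper's mixture distribution model may impose different constraints or objectives.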