Fast Training Dataset Attribution via In-Context Learning

Bibliographic Details
Main Authors: Fotouhi, Milad; Bahadori, Mohammad Taha; Feyisetan, Oluwaseyi; Arabshahi, Payman; Heckerman, David
Format: Journal Article
Language: English
Published: 14.08.2024
Summary: We investigate the use of in-context learning and prompt engineering to estimate the contributions of training data in the outputs of instruction-tuned large language models (LLMs). We propose two novel approaches: (1) a similarity-based approach that measures the difference between LLM outputs with and without provided context, and (2) a mixture distribution model approach that frames the problem of identifying contribution scores as a matrix factorization task. Our empirical comparison demonstrates that the mixture model approach is more robust to retrieval noise in in-context learning, providing a more reliable estimation of data contributions.
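The summary describes the mixture-model approach only at a high level. As an illustration of how framing contribution estimation as matrix factorization can work, the following is a minimal sketch using plain non-negative matrix factorization with multiplicative updates; all names, dimensions, and the synthetic data are assumptions for demonstration, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (not from the paper): rows correspond to prompts,
# columns to output features; each prompt's output statistics are a
# mixture over a few latent training-data "sources".
n_prompts, n_features, n_sources = 8, 20, 3

true_scores = rng.dirichlet(np.ones(n_sources), size=n_prompts)  # (8, 3)
source_profiles = rng.random((n_sources, n_features))            # (3, 20)
V = true_scores @ source_profiles                                # observed matrix

# Plain multiplicative-update NMF: V ~= W @ H, where row i of W holds
# the (unnormalized) contribution weights of the sources for prompt i.
W = rng.random((n_prompts, n_sources)) + 1e-3
H = rng.random((n_sources, n_features)) + 1e-3
for _ in range(500):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

# Normalize each row of W so the contribution scores sum to one.
scores = W / W.sum(axis=1, keepdims=True)
print(np.round(scores, 2))
```

The normalized rows of `W` play the role of per-prompt contribution scores; the paper's actual mixture distribution model may differ in objective and constraints.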
DOI: 10.48550/arxiv.2408.11852