Fast Training Dataset Attribution via In-Context Learning

Bibliographic Details
Main Authors: Fotouhi, Milad; Bahadori, Mohammad Taha; Feyisetan, Oluwaseyi; Arabshahi, Payman; Heckerman, David
Format: Journal Article
Language: English
Published: 14.08.2024
Summary: We investigate the use of in-context learning and prompt engineering to estimate the contributions of training data in the outputs of instruction-tuned large language models (LLMs). We propose two novel approaches: (1) a similarity-based approach that measures the difference between LLM outputs with and without provided context, and (2) a mixture distribution model approach that frames the problem of identifying contribution scores as a matrix factorization task. Our empirical comparison demonstrates that the mixture model approach is more robust to retrieval noise in in-context learning, providing a more reliable estimation of data contributions.
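The summary describes the mixture-model approach only at a high level. As an illustration of how framing contribution estimation as matrix factorization can work, the following is a minimal sketch using plain non-negative matrix factorization with multiplicative updates; all names, dimensions, and the synthetic data are assumptions for demonstration, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (not from the paper): rows correspond to prompts,
# columns to output features; each prompt's output statistics are a
# mixture over a few latent training-data "sources".
n_prompts, n_features, n_sources = 8, 20, 3

true_scores = rng.dirichlet(np.ones(n_sources), size=n_prompts)  # (8, 3)
source_profiles = rng.random((n_sources, n_features))            # (3, 20)
V = true_scores @ source_profiles                                # observed matrix

# Plain multiplicative-update NMF: V ~= W @ H, where row i of W holds
# the (unnormalized) contribution weights of the sources for prompt i.
W = rng.random((n_prompts, n_sources)) + 1e-3
H = rng.random((n_sources, n_features)) + 1e-3
for _ in range(500):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

# Normalize each row of W so the contribution scores sum to one.
scores = W / W.sum(axis=1, keepdims=True)
print(np.round(scores, 2))
```

The normalized rows of `W` play the role of per-prompt contribution scores; the paper's actual mixture distribution model may differ in objective and constraints.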
DOI: 10.48550/arxiv.2408.11852