Boosting GPT Models for Genomics Analysis: Generating Trusted Genetic Variant Annotations and Interpretations through RAG and fine-tuning
Published in | bioRxiv |
---|---|
Main Authors | , |
Format | Paper |
Language | English |
Published | Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 15.11.2024; Cold Spring Harbor Laboratory |
Edition | 1.1 |
Summary: | Large language models (LLMs) have acquired a remarkable level of knowledge through their initial training. However, they lack expertise in particular domains such as genomics. Variant annotation data, an important component of genomics, is crucial for interpreting and prioritizing disease-related variants among the millions of variants identified by genetic sequencing. In our project, we aimed to improve LLM performance in genomics by adding variant annotation data to LLMs through retrieval-augmented generation (RAG) and fine-tuning. Using RAG, we successfully integrated 190 million highly accurate variant annotations, curated from 5 major annotation datasets and tools, into GPT-4o. This integration lets users query specific variants and receive accurate variant annotations and interpretations supported by the advanced reasoning and language understanding capabilities of LLMs. Fine-tuning GPT-4 on variant annotation data also improved model performance in some annotation fields, although the accuracy across more fields remains suboptimal. By leveraging LLM capabilities, our model significantly improved the accessibility and efficiency of the variant interpretation process. Our project also revealed that RAG outperforms fine-tuning for factual knowledge injection in terms of data volume, accuracy, and cost-effectiveness. As a pioneering study on adding genomics knowledge to LLMs, our work paves the way for developing more comprehensive and informative genomics AI systems to support clinical diagnosis and research projects, and it demonstrates the potential of LLMs in specialized domains. Competing Interest Statement: Microsoft Health Future genomics team supported this work by hosting S.L. through the 2024 Microsoft Research Intern Program. E.C. is a Senior Data and Applied Scientist at Microsoft Research on the Health Future genomics team. S.L. was a Summer Research Intern on the same team during Summer 2024. We used Microsoft Azure Cloud as the cloud provider for this project. |
---|---|
ISSN: | 2692-8205 |
DOI: | 10.1101/2024.11.12.623275 |