Boosting GPT Models for Genomics Analysis: Generating Trusted Genetic Variant Annotations and Interpretations through RAG and fine-tuning

Bibliographic Details
Published in	bioRxiv
Main Authors	Lu, Shuangjia, Cosgun, Erdal
Format	Paper
Language	English
Published	Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 15.11.2024
Edition	1.1
Subjects	Annotations; Genetic analysis; Genetic diversity; Genetics; Genomic analysis; Genomics; Large language models; variant annotation; retrieval-augmented generation; fine-tuning; large language model
Summary:	Large language models (LLMs) have acquired a remarkable level of knowledge through their initial training. However, they lack expertise in particular domains such as genomics. Variant annotation data, an important component of genomics, is crucial for interpreting and prioritizing disease-related variants among the millions of variants identified by genetic sequencing. In our project, we aimed to improve LLM performance in genomics by adding variant annotation data to LLMs through retrieval-augmented generation (RAG) and fine-tuning. Using RAG, we successfully integrated 190 million highly accurate variant annotations, curated from 5 major annotation datasets and tools, into GPT-4o. This integration lets users query specific variants and receive accurate variant annotations and interpretations, supported by the advanced reasoning and language understanding capabilities of LLMs. Fine-tuning GPT-4 on variant annotation data also improved model performance in some annotation fields, although the accuracy across more fields remains suboptimal. Our model significantly improved the accessibility and efficiency of the variant interpretation process by leveraging LLM capabilities. Our project also revealed that RAG outperforms fine-tuning for factual knowledge injection in terms of data volume, accuracy, and cost-effectiveness. As a pioneering study on adding genomics knowledge to LLMs, our work paves the way for developing more comprehensive and informative genomics AI systems to support clinical diagnosis and research projects, and it demonstrates the potential of LLMs in specialized domains.
Competing Interest Statement:	Microsoft Health Future genomics team supported this work by hosting S.L. through the 2024 Microsoft Research Intern Program. E.C. is a Senior Data and Applied Scientist at Microsoft Research on the Health Future genomics Team. S.L. was a Summer Research Intern on the same team during Summer 2024. We used Microsoft Azure Cloud as the cloud provider for this project.
ISSN:	2692-8205
DOI:	10.1101/2024.11.12.623275
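
Illustrative note: the summary describes using RAG so that a user can query a specific variant and have the model answer from retrieved annotations. The record does not give the authors' implementation, so the following is only a minimal sketch of the general idea, assuming a variant is identified as chrom:pos:ref:alt and that annotations sit in a simple key-value store. The names VariantKey, ANNOTATIONS, parse_variant, retrieve, and build_prompt are hypothetical, the annotation values are placeholders rather than real data, and no model call is made.

```python
from typing import NamedTuple, Optional


class VariantKey(NamedTuple):
    """Hypothetical lookup key for one variant."""
    chrom: str
    pos: int
    ref: str
    alt: str


# Hypothetical in-memory annotation store. The values are placeholders,
# not real annotations; the paper's store holds ~190 million records.
ANNOTATIONS = {
    VariantKey("chr1", 12345, "A", "G"): {
        "gene": "GENE_X",
        "consequence": "missense_variant",
        "clinical_significance": "placeholder",
    },
}


def parse_variant(text: str) -> VariantKey:
    """Parse a 'chrom:pos:ref:alt' string into a VariantKey."""
    chrom, pos, ref, alt = text.strip().split(":")
    return VariantKey(chrom, int(pos), ref.upper(), alt.upper())


def retrieve(key: VariantKey) -> Optional[dict]:
    """Exact-match retrieval step: return the stored record, if any."""
    return ANNOTATIONS.get(key)


def build_prompt(question: str, variant_text: str) -> str:
    """Augmentation step: place the retrieved record into the prompt."""
    key = parse_variant(variant_text)
    record = retrieve(key)
    if record is None:
        context = "No annotation was found for this variant."
    else:
        context = "\n".join(f"{field}: {value}" for field, value in record.items())
    return (
        "Answer using only the annotation context below.\n"
        f"Variant: {key.chrom}:{key.pos}:{key.ref}:{key.alt}\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    # The resulting string would be sent to an LLM as the generation step.
    print(build_prompt("Which gene does this variant affect?", "chr1:12345:A:G"))
```

In this sketch the factual content comes from the lookup and the model is only asked to phrase and interpret it, which is one plausible reading of why the summary reports RAG as the more accurate route for factual knowledge injection.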