Large scale paired antibody language models
Antibodies are proteins produced by the immune system that can identify and neutralise a wide variety of antigens with high specificity and affinity, and constitute the most successful class of biotherapeutics. With the advent of next-generation sequencing, billions of antibody sequences have been c...
Saved in:
Main Authors | , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published |
26.03.2024
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Antibodies are proteins produced by the immune system that can identify and
neutralise a wide variety of antigens with high specificity and affinity, and
constitute the most successful class of biotherapeutics. With the advent of
next-generation sequencing, billions of antibody sequences have been collected
in recent years, though their application in the design of better therapeutics
has been constrained by the sheer volume and complexity of the data. To address
this challenge, we present IgBert and IgT5, the best performing
antibody-specific language models developed to date which can consistently
handle both paired and unpaired variable region sequences as input. These
models are trained comprehensively using the more than two billion unpaired
sequences and two million paired sequences of light and heavy chains present in
the Observed Antibody Space dataset. We show that our models outperform
existing antibody and protein language models on a diverse range of design and
regression tasks relevant to antibody engineering. This advancement marks a
significant leap forward in leveraging machine learning, large scale data sets
and high-performance computing for enhancing antibody design for therapeutic
development. |
---|---|
DOI: | 10.48550/arxiv.2403.17889 |