ProGen2: Exploring the boundaries of protein language models
Published in: Cell Systems, Vol. 14, No. 11, pp. 968–978.e3
Main Authors: Erik Nijkamp, Jeffrey A. Ruffolo, Eli N. Weinstein, Nikhil Naik, Ali Madani
Format: Journal Article
Language: English
Published: Elsevier Inc., United States, November 15, 2023
Summary: Attention-based models trained on protein sequences have demonstrated remarkable success at classification and generation tasks relevant for artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open-sourced for widespread adoption in protein engineering. A record of this paper's Transparent Peer Review process is included in the supplemental information.
Highlights:
• The ProGen2 suite of protein language models is scaled to 6.4B parameters
• Models with increased scale better capture the distribution of protein sequences
• ProGen2 models generate novel protein sequences adopting natural folds
• ProGen2 model likelihoods are effective for zero-shot fitness prediction
In brief: The ProGen2 suite of models is scaled up to 6.4B parameters and trained on over one billion sequences from genomic, metagenomic, and immune repertoire datasets. We explore the impact of scale and data distribution on fitting the evolutionary sequence distribution, generating protein sequences, and estimating protein fitness.
ISSN: 2405-4712 (print), 2405-4720 (electronic)
DOI: 10.1016/j.cels.2023.10.002