ProGen2: Exploring the boundaries of protein language models


Bibliographic Details
Published in: Cell Systems, Vol. 14, no. 11, pp. 968-978.e3
Main Authors: Nijkamp, Erik; Ruffolo, Jeffrey A.; Weinstein, Eli N.; Naik, Nikhil; Madani, Ali
Format: Journal Article
Language: English
Published: United States, Elsevier Inc., 15.11.2023

More Information
Summary: Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open sourced for widespread adoption in protein engineering. A record of this paper’s Transparent Peer Review process is included in the supplemental information.

Highlights:
• The ProGen2 suite of protein language models is scaled to 6.4B parameters
• Models with increased scale better capture the distribution of protein sequences
• ProGen2 models generate novel protein sequences adopting natural folds
• ProGen2 model likelihoods are effective for zero-shot fitness prediction

In brief: The ProGen2 suite of models is scaled up to 6.4B parameters and trained on over one billion sequences from genomic, metagenomic, and immune repertoire datasets. We explore the impact of scale and data distribution on fitting the evolutionary sequence distribution, generating protein sequences, and estimating protein fitness.
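
The following minimal sketch (not the authors' released code) illustrates the kind of zero-shot fitness prediction described in the summary: candidate variants are ranked by their summed next-token log-probabilities under a causal protein language model, with no fine-tuning. The checkpoint identifier, tokenizer loading path, and the toy sequences are assumptions for illustration; substitute the actual ProGen2 release you use.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumed/hypothetical checkpoint identifier; the released ProGen2 weights and
# tokenizer may require a different loading path.
MODEL_ID = "progen2-small"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

@torch.no_grad()
def log_likelihood(sequence: str) -> float:
    """Summed next-token log-probabilities of an amino-acid sequence."""
    ids = tokenizer(sequence, return_tensors="pt").input_ids
    logits = model(ids).logits
    # Standard causal-LM scoring: the prediction at position i is for token i+1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum().item()

# Toy variant sequences (placeholders, not from the paper); a higher model
# likelihood is taken as a proxy for higher fitness.
variants = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MKTAYIVKQRQISFVKSHFSRQLEERLGLIEVQ",
]
ranked = sorted(variants, key=log_likelihood, reverse=True)
print(ranked)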
ISSN: 2405-4712, 2405-4720
DOI: 10.1016/j.cels.2023.10.002