ProGen2: Exploring the boundaries of protein language models


Bibliographic Details
Published in: Cell Systems, Vol. 14, no. 11, pp. 968-978.e3
Main Authors: Nijkamp, Erik; Ruffolo, Jeffrey A.; Weinstein, Eli N.; Naik, Nikhil; Madani, Ali
Format: Journal Article
Language: English
Published: United States, Elsevier Inc., 15.11.2023

More Information
Summary: Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open sourced for widespread adoption in protein engineering. A record of this paper’s Transparent Peer Review process is included in the supplemental information.

Highlights:
• The ProGen2 suite of protein language models is scaled to 6.4B parameters
• Models with increased scale better capture the distribution of protein sequences
• ProGen2 models generate novel protein sequences adopting natural folds
• ProGen2 model likelihoods are effective for zero-shot fitness prediction

In brief: The ProGen2 suite of models is scaled up to 6.4B parameters and trained on over one billion sequences from genomic, metagenomic, and immune repertoire datasets. We explore the impact of scale and data distribution on fitting the evolutionary sequence distribution, generating protein sequences, and estimating protein fitness.
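
The following minimal sketch (not the authors' released code) illustrates the kind of zero-shot fitness prediction described in the summary: candidate variants are ranked by their summed next-token log-probabilities under a causal protein language model, with no fine-tuning. The checkpoint identifier, tokenizer loading path, and the toy sequences are assumptions for illustration; substitute the actual ProGen2 release you use.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumed/hypothetical checkpoint identifier; the released ProGen2 weights and
# tokenizer may require a different loading path.
MODEL_ID = "progen2-small"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

@torch.no_grad()
def log_likelihood(sequence: str) -> float:
    """Summed next-token log-probabilities of an amino-acid sequence."""
    ids = tokenizer(sequence, return_tensors="pt").input_ids
    logits = model(ids).logits
    # Standard causal-LM scoring: the prediction at position i is for token i+1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum().item()

# Toy variant sequences (placeholders, not from the paper); a higher model
# likelihood is taken as a proxy for higher fitness.
variants = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MKTAYIVKQRQISFVKSHFSRQLEERLGLIEVQ",
]
ranked = sorted(variants, key=log_likelihood, reverse=True)
print(ranked)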
ISSN: 2405-4712, 2405-4720
DOI: 10.1016/j.cels.2023.10.002