Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

Bibliographic Details
Published in	Proceedings of the National Academy of Sciences - PNAS Vol. 118; no. 15; p. e2016239118
Main Authors	Rives, Alexander; Meier, Joshua; Sercu, Tom; Goyal, Siddharth; Lin, Zeming; Liu, Jason; Guo, Demi; Ott, Myle; Zitnick, C Lawrence; Ma, Jerry; Fergus, Rob
Format	Journal Article
Language	English
Published	United States: National Academy of Sciences, 13.04.2021
Subjects	Amino acid sequence; Amino acids; Artificial intelligence; Biological properties; Biological Sciences; Generative artificial intelligence; Homology; Language; Learning; Model testing; Physical Sciences; Protein structure; Proteins; Representations; Secondary structure; Sequences; Structure-function relationships; Tertiary structure; Unsupervised learning; deep learning; protein language model; synthetic biology; representation learning; generative biology

Cover

Loading…

More Information
Summary:	In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.
Bibliography:	Edited by David T. Jones, University College London, London, United Kingdom, and accepted by Editorial Board Member William H. Press December 16, 2020 (received for review August 6, 2020). Work performed while at Facebook AI Research. Author contributions: A.R., J. Meier, T.S., S.G., Z.L., M.O., C.L.Z., J. Ma, and R.F. designed research; A.R., J. Meier, T.S., S.G., Z.L., J.L., D.G., and J. Ma performed research; A.R., J. Meier, T.S., S.G., Z.L., J.L., D.G., and J. Ma analyzed data; and A.R., J. Meier, T.S., S.G., Z.L., J.L., D.G., M.O., C.L.Z., J. Ma, and R.F. wrote the paper. A.R., J. Meier, T.S., and S.G. contributed equally to this work.
ISSN:	0027-8424 1091-6490
DOI:	10.1073/pnas.2016239118