Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

Bibliographic Details
Published in Proceedings of the National Academy of Sciences - PNAS Vol. 118; no. 15; p. e2016239118
Main Authors Rives, Alexander, Meier, Joshua, Sercu, Tom, Goyal, Siddharth, Lin, Zeming, Liu, Jason, Guo, Demi, Ott, Myle, Zitnick, C Lawrence, Ma, Jerry, Fergus, Rob
Format Journal Article
Language English
Published United States: National Academy of Sciences, 13.04.2021
Subjects
ISSN 0027-8424
1091-6490
DOI 10.1073/pnas.2016239118

Abstract In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.
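The abstract notes that structural information "can be identified by linear projections" of the learned representations. The following is a minimal, illustrative numpy sketch of that idea only: the per-residue embeddings here are simulated (random), the planted projection and all names are hypothetical, and this is not the authors' code or model.

```python
# Hedged sketch of a linear probe over per-residue representations.
# Assumptions: embeddings H are SIMULATED stand-ins for language-model
# representations; a linear structure is planted so the probe can find it.
import numpy as np

rng = np.random.default_rng(0)
n_residues, dim, n_classes = 2000, 64, 3  # e.g., helix / strand / coil

# Simulated per-residue embeddings (placeholder for real model outputs).
H = rng.normal(size=(n_residues, dim))

# Planted ground-truth linear map: labels are linearly decodable from H.
W_true = rng.normal(size=(dim, n_classes))
labels = (H @ W_true).argmax(axis=1)

# Linear probe: least-squares regression onto one-hot targets.
Y = np.eye(n_classes)[labels]
W_probe, *_ = np.linalg.lstsq(H, Y, rcond=None)

# Decode by taking the argmax of the probe's projection.
pred = (H @ W_probe).argmax(axis=1)
accuracy = (pred == labels).mean()
print(f"linear-probe accuracy: {accuracy:.2f}")
```

Because the labels were planted linearly, the closed-form probe recovers them far above the one-in-three chance level; with real embeddings, probe accuracy instead measures how much structural information the representation linearly encodes.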
Significance Learning biological properties from sequence data is a logical step toward generative and predictive artificial intelligence for biology. Here, we propose scaling a deep contextual language model with unsupervised learning to sequences spanning evolutionary diversity. We find that without prior knowledge, information emerges in the learned representations on fundamental properties of proteins such as secondary structure, contacts, and biological activity. We show the learned representations are useful across benchmarks for remote homology detection, prediction of secondary structure, long-range residue–residue contacts, and mutational effect. Unsupervised representation learning enables state-of-the-art supervised prediction of mutational effect and secondary structure and improves state-of-the-art features for long-range contact prediction.
Author Goyal, Siddharth
Ma, Jerry
Guo, Demi
Rives, Alexander
Lin, Zeming
Fergus, Rob
Liu, Jason
Meier, Joshua
Sercu, Tom
Ott, Myle
Zitnick, C Lawrence
Author_xml – sequence: 1
  givenname: Alexander
  orcidid: 0000-0003-2208-0796
  surname: Rives
  fullname: Rives, Alexander
  email: arives@cs.nyu.edu
  organization: Department of Computer Science, New York University, New York, NY 10012
– sequence: 2
  givenname: Joshua
  surname: Meier
  fullname: Meier, Joshua
  organization: Facebook AI Research, New York, NY 10003
– sequence: 3
  givenname: Tom
  orcidid: 0000-0003-2947-6064
  surname: Sercu
  fullname: Sercu, Tom
  organization: Facebook AI Research, New York, NY 10003
– sequence: 4
  givenname: Siddharth
  surname: Goyal
  fullname: Goyal, Siddharth
  organization: Facebook AI Research, New York, NY 10003
– sequence: 5
  givenname: Zeming
  surname: Lin
  fullname: Lin, Zeming
  organization: Department of Computer Science, New York University, New York, NY 10012
– sequence: 6
  givenname: Jason
  surname: Liu
  fullname: Liu, Jason
  organization: Facebook AI Research, New York, NY 10003
– sequence: 7
  givenname: Demi
  surname: Guo
  fullname: Guo, Demi
  organization: Harvard University, Cambridge, MA 02138
– sequence: 8
  givenname: Myle
  surname: Ott
  fullname: Ott, Myle
  organization: Facebook AI Research, New York, NY 10003
– sequence: 9
  givenname: C Lawrence
  surname: Zitnick
  fullname: Zitnick, C Lawrence
  organization: Facebook AI Research, New York, NY 10003
– sequence: 10
  givenname: Jerry
  surname: Ma
  fullname: Ma, Jerry
  organization: Yale Law School, New Haven, CT 06511
– sequence: 11
  givenname: Rob
  surname: Fergus
  fullname: Fergus, Rob
  organization: Department of Computer Science, New York University, New York, NY 10012
BackLink https://www.ncbi.nlm.nih.gov/pubmed/33876751 (View this record in MEDLINE/PubMed)
ContentType Journal Article
Copyright Copyright © 2021 the Author(s). Published by PNAS.
Copyright National Academy of Sciences Apr 13, 2021
Copyright © 2021 the Author(s). Published by PNAS. 2021
DOI 10.1073/pnas.2016239118
DatabaseName PubMed
Animal Behavior Abstracts
Bacteriology Abstracts (Microbiology B)
Calcium & Calcified Tissue Abstracts
Chemoreception Abstracts
Ecology Abstracts
Entomology Abstracts (Full archive)
Immunology Abstracts
Neurosciences Abstracts
Nucleic Acids Abstracts
Oncogenes and Growth Factors Abstracts
Virology and AIDS Abstracts
Technology Research Database
Environmental Sciences and Pollution Management
Engineering Research Database
AIDS and Cancer Research Abstracts
Algology Mycology and Protozoology Abstracts (Microbiology C)
Biotechnology and BioEngineering Abstracts
Genetics Abstracts
MEDLINE - Academic
PubMed Central (Full Participant titles)
DatabaseTitle PubMed
Virology and AIDS Abstracts
Oncogenes and Growth Factors Abstracts
Technology Research Database
Nucleic Acids Abstracts
Ecology Abstracts
Neurosciences Abstracts
Biotechnology and BioEngineering Abstracts
Environmental Sciences and Pollution Management
Entomology Abstracts
Genetics Abstracts
Animal Behavior Abstracts
Bacteriology Abstracts (Microbiology B)
Algology Mycology and Protozoology Abstracts (Microbiology C)
AIDS and Cancer Research Abstracts
Chemoreception Abstracts
Immunology Abstracts
Engineering Research Database
Calcium & Calcified Tissue Abstracts
MEDLINE - Academic
DatabaseTitleList Virology and AIDS Abstracts
MEDLINE - Academic
PubMed

Database_xml – sequence: 1
  dbid: NPM
  name: PubMed
  url: https://proxy.k.utb.cz/login?url=http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
DeliveryMethod fulltext_linktorsrc
Discipline Sciences (General)
EISSN 1091-6490
ExternalDocumentID PMC8053943
33876751
Genre Journal Article
GrantInformation_xml – fundername: National Science Foundation (NSF)
  grantid: 1339362
ISSN 0027-8424
1091-6490
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 15
Keywords deep learning
protein language model
synthetic biology
representation learning
generative biology
Language English
License Copyright © 2021 the Author(s). Published by PNAS.
This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).
LinkModel OpenURL
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
Edited by David T. Jones, University College London, London, United Kingdom, and accepted by Editorial Board Member William H. Press December 16, 2020 (received for review August 6, 2020)
3 Work performed while at Facebook AI Research.
Author contributions: A.R., J. Meier, T.S., S.G., Z.L., M.O., C.L.Z., J. Ma, and R.F. designed research; A.R., J. Meier, T.S., S.G., Z.L., J.L., D.G., and J. Ma performed research; A.R., J. Meier, T.S., S.G., Z.L., J.L., D.G., and J. Ma analyzed data; and A.R., J. Meier, T.S., S.G., Z.L., J.L., D.G., M.O., C.L.Z., J. Ma, and R.F. wrote the paper.
1 A.R., J. Meier, T.S., and S.G. contributed equally to this work.
ORCID 0000-0003-2208-0796
0000-0003-2947-6064
OpenAccessLink https://pubmed.ncbi.nlm.nih.gov/PMC8053943
PMID 33876751
PQID 2513290173
PQPubID 42026
ParticipantIDs pubmedcentral_primary_oai_pubmedcentral_nih_gov_8053943
proquest_miscellaneous_2515690462
proquest_journals_2513290173
pubmed_primary_33876751
PublicationCentury 2000
PublicationDate 2021-04-13
PublicationDateYYYYMMDD 2021-04-13
PublicationDate_xml – month: 04
  year: 2021
  text: 2021-04-13
  day: 13
PublicationDecade 2020
PublicationPlace United States
PublicationPlace_xml – name: United States
– name: Washington
PublicationTitle Proceedings of the National Academy of Sciences - PNAS
PublicationTitleAlternate Proc Natl Acad Sci U S A
PublicationYear 2021
Publisher National Academy of Sciences
Publisher_xml – name: National Academy of Sciences
SSID ssj0009580
SourceID pubmedcentral
proquest
pubmed
SourceType Open Access Repository
Aggregation Database
Index Database
StartPage e2016239118
SubjectTerms Amino acid sequence
Amino acids
Artificial intelligence
Biological properties
Biological Sciences
Generative artificial intelligence
Homology
Language
Learning
Model testing
Physical Sciences
Protein structure
Proteins
Representations
Secondary structure
Sequences
Structure-function relationships
Tertiary structure
Unsupervised learning
Title Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
URI https://www.ncbi.nlm.nih.gov/pubmed/33876751
https://www.proquest.com/docview/2513290173
https://www.proquest.com/docview/2515690462
https://pubmed.ncbi.nlm.nih.gov/PMC8053943
Volume 118
linkProvider Geneva Foundation for Medical Education and Research