Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity.
Published in | Proceedings of the National Academy of Sciences - PNAS, Vol. 118, no. 15, p. e2016239118 |
Main Authors | Rives, Alexander; Meier, Joshua; Sercu, Tom; Goyal, Siddharth; Lin, Zeming; Liu, Jason; Guo, Demi; Ott, Myle; Zitnick, C Lawrence; Ma, Jerry; Fergus, Rob |
Format | Journal Article |
Language | English |
Published | United States: National Academy of Sciences, 13.04.2021 |
Online Access | Get full text |
ISSN | 0027-8424 1091-6490 |
DOI | 10.1073/pnas.2016239118 |
Abstract | In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction. |
AbstractList | Learning biological properties from sequence data is a logical step toward generative and predictive artificial intelligence for biology. Here, we propose scaling a deep contextual language model with unsupervised learning to sequences spanning evolutionary diversity. We find that without prior knowledge, information emerges in the learned representations on fundamental properties of proteins such as secondary structure, contacts, and biological activity. We show the learned representations are useful across benchmarks for remote homology detection, prediction of secondary structure, long-range residue–residue contacts, and mutational effect. Unsupervised representation learning enables state-of-the-art supervised prediction of mutational effect and secondary structure and improves state-of-the-art features for long-range contact prediction. |
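The abstract's claim that structural information "can be identified by linear projections" amounts to fitting a linear probe on frozen per-residue embeddings. The sketch below illustrates the idea on synthetic stand-in data only; the embedding dimensions, class labels, and planted signal are assumptions for illustration, not the paper's model or code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 1,000 residues with 64-dim "embeddings" and
# 3-class secondary-structure labels (e.g., helix/strand/coil).
n, d, k = 1000, 64, 3
labels = rng.integers(0, k, size=n)

# Plant a linear class signal plus noise so the probe has something
# to recover (in the paper this signal would come from pretraining).
class_means = rng.normal(size=(k, d))
X = class_means[labels] + 0.5 * rng.normal(size=(n, d))

# One-hot targets; the "linear projection" is a least-squares fit W,
# with no fine-tuning of the embeddings themselves.
Y = np.eye(k)[labels]
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Probe accuracy on the training residues (an optimistic estimate; a
# real evaluation would hold out whole sequences).
pred = (X @ W).argmax(axis=1)
acc = (pred == labels).mean()
print(f"linear-probe accuracy: {acc:.2f}")
```

The key design point is that the probe adds no nonlinearity, so any accuracy above chance reflects information already linearly accessible in the representations.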
Author | Goyal, Siddharth Ma, Jerry Guo, Demi Rives, Alexander Lin, Zeming Fergus, Rob Liu, Jason Meier, Joshua Sercu, Tom Ott, Myle Zitnick, C Lawrence |
Author_xml | – sequence: 1 givenname: Alexander orcidid: 0000-0003-2208-0796 surname: Rives fullname: Rives, Alexander email: arives@cs.nyu.edu organization: Department of Computer Science, New York University, New York, NY 10012 – sequence: 2 givenname: Joshua surname: Meier fullname: Meier, Joshua organization: Facebook AI Research, New York, NY 10003 – sequence: 3 givenname: Tom orcidid: 0000-0003-2947-6064 surname: Sercu fullname: Sercu, Tom organization: Facebook AI Research, New York, NY 10003 – sequence: 4 givenname: Siddharth surname: Goyal fullname: Goyal, Siddharth organization: Facebook AI Research, New York, NY 10003 – sequence: 5 givenname: Zeming surname: Lin fullname: Lin, Zeming organization: Department of Computer Science, New York University, New York, NY 10012 – sequence: 6 givenname: Jason surname: Liu fullname: Liu, Jason organization: Facebook AI Research, New York, NY 10003 – sequence: 7 givenname: Demi surname: Guo fullname: Guo, Demi organization: Harvard University, Cambridge, MA 02138 – sequence: 8 givenname: Myle surname: Ott fullname: Ott, Myle organization: Facebook AI Research, New York, NY 10003 – sequence: 9 givenname: C Lawrence surname: Zitnick fullname: Zitnick, C Lawrence organization: Facebook AI Research, New York, NY 10003 – sequence: 10 givenname: Jerry surname: Ma fullname: Ma, Jerry organization: Yale Law School, New Haven, CT 06511 – sequence: 11 givenname: Rob surname: Fergus fullname: Fergus, Rob organization: Department of Computer Science, New York University, New York, NY 10012 |
BackLink | https://www.ncbi.nlm.nih.gov/pubmed/33876751 – View this record in MEDLINE/PubMed |
ContentType | Journal Article |
Copyright | Copyright © 2021 the Author(s). Published by PNAS. Copyright National Academy of Sciences Apr 13, 2021 Copyright © 2021 the Author(s). Published by PNAS. 2021 |
Copyright_xml | – notice: Copyright © 2021 the Author(s). Published by PNAS. – notice: Copyright National Academy of Sciences Apr 13, 2021 – notice: Copyright © 2021 the Author(s). Published by PNAS. 2021 |
DOI | 10.1073/pnas.2016239118 |
DatabaseName | PubMed Animal Behavior Abstracts Bacteriology Abstracts (Microbiology B) Calcium & Calcified Tissue Abstracts Chemoreception Abstracts Ecology Abstracts Entomology Abstracts (Full archive) Immunology Abstracts Neurosciences Abstracts Nucleic Acids Abstracts Oncogenes and Growth Factors Abstracts Virology and AIDS Abstracts Technology Research Database Environmental Sciences and Pollution Management Engineering Research Database AIDS and Cancer Research Abstracts Algology Mycology and Protozoology Abstracts (Microbiology C) Biotechnology and BioEngineering Abstracts Genetics Abstracts MEDLINE - Academic PubMed Central (Full Participant titles) |
DatabaseTitle | PubMed Virology and AIDS Abstracts Oncogenes and Growth Factors Abstracts Technology Research Database Nucleic Acids Abstracts Ecology Abstracts Neurosciences Abstracts Biotechnology and BioEngineering Abstracts Environmental Sciences and Pollution Management Entomology Abstracts Genetics Abstracts Animal Behavior Abstracts Bacteriology Abstracts (Microbiology B) Algology Mycology and Protozoology Abstracts (Microbiology C) AIDS and Cancer Research Abstracts Chemoreception Abstracts Immunology Abstracts Engineering Research Database Calcium & Calcified Tissue Abstracts MEDLINE - Academic |
DatabaseTitleList | Virology and AIDS Abstracts MEDLINE - Academic PubMed |
Database_xml | – sequence: 1 dbid: NPM name: PubMed url: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed sourceTypes: Index Database |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Sciences (General) |
EISSN | 1091-6490 |
ExternalDocumentID | PMC8053943 33876751 |
Genre | Journal Article |
GrantInformation_xml | – fundername: National Science Foundation (NSF) grantid: 1339362 |
ISSN | 0027-8424 1091-6490 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 15 |
Keywords | deep learning protein language model synthetic biology representation learning generative biology |
Language | English |
License | Copyright © 2021 the Author(s). Published by PNAS. This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND). |
LinkModel | OpenURL |
Notes | Edited by David T. Jones, University College London, London, United Kingdom, and accepted by Editorial Board Member William H. Press December 16, 2020 (received for review August 6, 2020). 1A.R., J. Meier, T.S., and S.G. contributed equally to this work. 3Work performed while at Facebook AI Research. Author contributions: A.R., J. Meier, T.S., S.G., Z.L., M.O., C.L.Z., J. Ma, and R.F. designed research; A.R., J. Meier, T.S., S.G., Z.L., J.L., D.G., and J. Ma performed research; A.R., J. Meier, T.S., S.G., Z.L., J.L., D.G., and J. Ma analyzed data; and A.R., J. Meier, T.S., S.G., Z.L., J.L., D.G., M.O., C.L.Z., J. Ma, and R.F. wrote the paper. |
ORCID | 0000-0003-2208-0796 0000-0003-2947-6064 |
OpenAccessLink | https://pubmed.ncbi.nlm.nih.gov/PMC8053943 |
PMID | 33876751 |
PQID | 2513290173 |
PQPubID | 42026 |
PublicationCentury | 2000 |
PublicationDate | 2021-04-13 |
PublicationDateYYYYMMDD | 2021-04-13 |
PublicationDate_xml | – month: 04 year: 2021 text: 2021-04-13 day: 13 |
PublicationDecade | 2020 |
PublicationPlace | United States |
PublicationPlace_xml | – name: United States – name: Washington |
PublicationTitle | Proceedings of the National Academy of Sciences - PNAS |
PublicationTitleAlternate | Proc Natl Acad Sci U S A |
PublicationYear | 2021 |
Publisher | National Academy of Sciences |
Publisher_xml | – name: National Academy of Sciences |
SSID | ssj0009580 |
SourceID | pubmedcentral proquest pubmed |
SourceType | Open Access Repository Aggregation Database Index Database |
StartPage | e2016239118 |
SubjectTerms | Amino acid sequence Amino acids Artificial intelligence Biological properties Biological Sciences Generative artificial intelligence Homology Language Learning Model testing Physical Sciences Protein structure Proteins Representations Secondary structure Sequences Structure-function relationships Tertiary structure Unsupervised learning |
Title | Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences |
URI | https://www.ncbi.nlm.nih.gov/pubmed/33876751 https://www.proquest.com/docview/2513290173 https://www.proquest.com/docview/2515690462 https://pubmed.ncbi.nlm.nih.gov/PMC8053943 |
Volume | 118 |