Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

Bibliographic Details
Published in Proceedings of the National Academy of Sciences - PNAS Vol. 118; no. 15; p. e2016239118
Main Authors Rives, Alexander, Meier, Joshua, Sercu, Tom, Goyal, Siddharth, Lin, Zeming, Liu, Jason, Guo, Demi, Ott, Myle, Zitnick, C Lawrence, Ma, Jerry, Fergus, Rob
Format Journal Article
Language English
Published United States: National Academy of Sciences, 13.04.2021
Subjects
ISSN 0027-8424
1091-6490
DOI 10.1073/pnas.2016239118

Abstract In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.
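The abstract notes that structural information "can be identified by linear projections" of the learned representations. The following is a minimal, illustrative numpy sketch of that idea only: the per-residue embeddings here are simulated (random), the planted projection and all names are hypothetical, and this is not the authors' code or model.

```python
# Hedged sketch of a linear probe over per-residue representations.
# Assumptions: embeddings H are SIMULATED stand-ins for language-model
# representations; a linear structure is planted so the probe can find it.
import numpy as np

rng = np.random.default_rng(0)
n_residues, dim, n_classes = 2000, 64, 3  # e.g., helix / strand / coil

# Simulated per-residue embeddings (placeholder for real model outputs).
H = rng.normal(size=(n_residues, dim))

# Planted ground-truth linear map: labels are linearly decodable from H.
W_true = rng.normal(size=(dim, n_classes))
labels = (H @ W_true).argmax(axis=1)

# Linear probe: least-squares regression onto one-hot targets.
Y = np.eye(n_classes)[labels]
W_probe, *_ = np.linalg.lstsq(H, Y, rcond=None)

# Decode by taking the argmax of the probe's projection.
pred = (H @ W_probe).argmax(axis=1)
accuracy = (pred == labels).mean()
print(f"linear-probe accuracy: {accuracy:.2f}")
```

Because the labels were planted linearly, the closed-form probe recovers them far above the one-in-three chance level; with real embeddings, probe accuracy instead measures how much structural information the representation linearly encodes.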
Significance Learning biological properties from sequence data is a logical step toward generative and predictive artificial intelligence for biology. Here, we propose scaling a deep contextual language model with unsupervised learning to sequences spanning evolutionary diversity. We find that without prior knowledge, information emerges in the learned representations on fundamental properties of proteins such as secondary structure, contacts, and biological activity. We show the learned representations are useful across benchmarks for remote homology detection, prediction of secondary structure, long-range residue–residue contacts, and mutational effect. Unsupervised representation learning enables state-of-the-art supervised prediction of mutational effect and secondary structure and improves state-of-the-art features for long-range contact prediction.
Author Goyal, Siddharth
Ma, Jerry
Guo, Demi
Rives, Alexander
Lin, Zeming
Fergus, Rob
Liu, Jason
Meier, Joshua
Sercu, Tom
Ott, Myle
Zitnick, C Lawrence
Author_xml – sequence: 1
  givenname: Alexander
  orcidid: 0000-0003-2208-0796
  surname: Rives
  fullname: Rives, Alexander
  email: arives@cs.nyu.edu
  organization: Department of Computer Science, New York University, New York, NY 10012
– sequence: 2
  givenname: Joshua
  surname: Meier
  fullname: Meier, Joshua
  organization: Facebook AI Research, New York, NY 10003
– sequence: 3
  givenname: Tom
  orcidid: 0000-0003-2947-6064
  surname: Sercu
  fullname: Sercu, Tom
  organization: Facebook AI Research, New York, NY 10003
– sequence: 4
  givenname: Siddharth
  surname: Goyal
  fullname: Goyal, Siddharth
  organization: Facebook AI Research, New York, NY 10003
– sequence: 5
  givenname: Zeming
  surname: Lin
  fullname: Lin, Zeming
  organization: Department of Computer Science, New York University, New York, NY 10012
– sequence: 6
  givenname: Jason
  surname: Liu
  fullname: Liu, Jason
  organization: Facebook AI Research, New York, NY 10003
– sequence: 7
  givenname: Demi
  surname: Guo
  fullname: Guo, Demi
  organization: Harvard University, Cambridge, MA 02138
– sequence: 8
  givenname: Myle
  surname: Ott
  fullname: Ott, Myle
  organization: Facebook AI Research, New York, NY 10003
– sequence: 9
  givenname: C Lawrence
  surname: Zitnick
  fullname: Zitnick, C Lawrence
  organization: Facebook AI Research, New York, NY 10003
– sequence: 10
  givenname: Jerry
  surname: Ma
  fullname: Ma, Jerry
  organization: Yale Law School, New Haven, CT 06511
– sequence: 11
  givenname: Rob
  surname: Fergus
  fullname: Fergus, Rob
  organization: Department of Computer Science, New York University, New York, NY 10012
BackLink https://www.ncbi.nlm.nih.gov/pubmed/33876751 (View this record in MEDLINE/PubMed)
ContentType Journal Article
Copyright Copyright © 2021 the Author(s). Published by PNAS.
Copyright National Academy of Sciences Apr 13, 2021
Copyright © 2021 the Author(s). Published by PNAS. 2021
DOI 10.1073/pnas.2016239118
DatabaseName PubMed
Animal Behavior Abstracts
Bacteriology Abstracts (Microbiology B)
Calcium & Calcified Tissue Abstracts
Chemoreception Abstracts
Ecology Abstracts
Entomology Abstracts (Full archive)
Immunology Abstracts
Neurosciences Abstracts
Nucleic Acids Abstracts
Oncogenes and Growth Factors Abstracts
Virology and AIDS Abstracts
Technology Research Database
Environmental Sciences and Pollution Management
Engineering Research Database
AIDS and Cancer Research Abstracts
Algology Mycology and Protozoology Abstracts (Microbiology C)
Biotechnology and BioEngineering Abstracts
Genetics Abstracts
MEDLINE - Academic
PubMed Central (Full Participant titles)
DatabaseTitle PubMed
Virology and AIDS Abstracts
Oncogenes and Growth Factors Abstracts
Technology Research Database
Nucleic Acids Abstracts
Ecology Abstracts
Neurosciences Abstracts
Biotechnology and BioEngineering Abstracts
Environmental Sciences and Pollution Management
Entomology Abstracts
Genetics Abstracts
Animal Behavior Abstracts
Bacteriology Abstracts (Microbiology B)
Algology Mycology and Protozoology Abstracts (Microbiology C)
AIDS and Cancer Research Abstracts
Chemoreception Abstracts
Immunology Abstracts
Engineering Research Database
Calcium & Calcified Tissue Abstracts
MEDLINE - Academic
DatabaseTitleList Virology and AIDS Abstracts
MEDLINE - Academic
PubMed

Database_xml – sequence: 1
  dbid: NPM
  name: PubMed
  url: https://proxy.k.utb.cz/login?url=http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
DeliveryMethod fulltext_linktorsrc
Discipline Sciences (General)
EISSN 1091-6490
ExternalDocumentID PMC8053943
33876751
Genre Journal Article
GrantInformation_xml – fundername: National Science Foundation (NSF)
  grantid: 1339362
ISSN 0027-8424
1091-6490
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 15
Keywords deep learning
protein language model
synthetic biology
representation learning
generative biology
Language English
License Copyright © 2021 the Author(s). Published by PNAS.
This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).
LinkModel OpenURL
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
Edited by David T. Jones, University College London, London, United Kingdom, and accepted by Editorial Board Member William H. Press December 16, 2020 (received for review August 6, 2020)
3 Work performed while at Facebook AI Research.
Author contributions: A.R., J. Meier, T.S., S.G., Z.L., M.O., C.L.Z., J. Ma, and R.F. designed research; A.R., J. Meier, T.S., S.G., Z.L., J.L., D.G., and J. Ma performed research; A.R., J. Meier, T.S., S.G., Z.L., J.L., D.G., and J. Ma analyzed data; and A.R., J. Meier, T.S., S.G., Z.L., J.L., D.G., M.O., C.L.Z., J. Ma, and R.F. wrote the paper.
1 A.R., J. Meier, T.S., and S.G. contributed equally to this work.
ORCID 0000-0003-2208-0796
0000-0003-2947-6064
OpenAccessLink https://pubmed.ncbi.nlm.nih.gov/PMC8053943
PMID 33876751
PQID 2513290173
PQPubID 42026
ParticipantIDs pubmedcentral_primary_oai_pubmedcentral_nih_gov_8053943
proquest_miscellaneous_2515690462
proquest_journals_2513290173
pubmed_primary_33876751
PublicationCentury 2000
PublicationDate 2021-04-13
PublicationDateYYYYMMDD 2021-04-13
PublicationDate_xml – month: 04
  year: 2021
  text: 2021-04-13
  day: 13
PublicationDecade 2020
PublicationPlace United States
PublicationPlace_xml – name: United States
– name: Washington
PublicationTitle Proceedings of the National Academy of Sciences - PNAS
PublicationTitleAlternate Proc Natl Acad Sci U S A
PublicationYear 2021
Publisher National Academy of Sciences
Publisher_xml – name: National Academy of Sciences
SSID ssj0009580
SourceID pubmedcentral
proquest
pubmed
SourceType Open Access Repository
Aggregation Database
Index Database
StartPage e2016239118
SubjectTerms Amino acid sequence
Amino acids
Artificial intelligence
Biological properties
Biological Sciences
Generative artificial intelligence
Homology
Language
Learning
Model testing
Physical Sciences
Protein structure
Proteins
Representations
Secondary structure
Sequences
Structure-function relationships
Tertiary structure
Unsupervised learning
Title Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
URI https://www.ncbi.nlm.nih.gov/pubmed/33876751
https://www.proquest.com/docview/2513290173
https://www.proquest.com/docview/2515690462
https://pubmed.ncbi.nlm.nih.gov/PMC8053943
Volume 118
linkProvider Geneva Foundation for Medical Education and Research