Learning the protein language: Evolution, structure, and function

Bibliographic Details
Published in: Cell Systems, Vol. 12, no. 6, pp. 654-669.e3
Main Authors: Bepler, Tristan; Berger, Bonnie
Format: Journal Article
Language: English
Published: United States, Elsevier Inc., 16.06.2021

Summary: Language models have recently emerged as a powerful machine-learning approach for distilling information from massive protein sequence databases. From readily available sequence data alone, these models discover evolutionary, structural, and functional organization across protein space. Using language models, we can encode amino-acid sequences into distributed vector representations that capture their structural and functional properties, as well as evaluate the evolutionary fitness of sequence variants. We discuss recent advances in protein language modeling and their applications to downstream protein property prediction problems. We then consider how these models can be enriched with prior biological knowledge and introduce an approach for encoding protein structural knowledge into the learned representations. The knowledge distilled by these models allows us to improve downstream function prediction through transfer learning. Deep protein language models are revolutionizing protein biology. They suggest new ways to approach protein and therapeutic design. However, further developments are needed to encode strong biological priors into protein language models and to increase their accessibility to the broader community.

• Deep protein language models can learn information from protein sequence
• They capture the structure, function, and evolutionary fitness of sequence variants
• They can be enriched with prior knowledge and inform function predictions
• They can revolutionize protein biology by suggesting new ways to approach design

In this synthesis, Bepler and Berger discuss recent advances in protein language modeling and their applications to downstream protein property prediction problems. They consider how these models can be enriched with prior biological knowledge and introduce an approach for encoding protein structural knowledge into the learned representations.
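A minimal sketch of the embedding idea described above: it encodes an amino-acid sequence into per-residue vectors with a pretrained protein language model and mean-pools them into a fixed-length representation suitable for downstream property prediction. This uses the publicly available fair-esm package with an ESM-2 checkpoint rather than the authors' own model; the example sequence and the pooling choice are illustrative assumptions.

```python
# Sketch: embed a protein sequence with a pretrained protein language model.
# Requires the fair-esm package (pip install fair-esm torch); the checkpoint,
# example sequence, and mean-pooling step are illustrative assumptions, not
# the authors' method.
import torch
import esm

# Load a small pretrained ESM-2 model and its tokenizer ("alphabet").
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# One hypothetical example sequence.
data = [("example_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, tokens = batch_converter(data)

# Forward pass; keep hidden states from the final (6th) layer.
with torch.no_grad():
    out = model(tokens, repr_layers=[6], return_contacts=False)
per_residue = out["representations"][6]  # shape: (1, seq_len + 2, 320)

# Mean-pool over residues (dropping the BOS/EOS tokens) to obtain a single
# fixed-length vector per sequence.
seq_len = len(data[0][1])
embedding = per_residue[0, 1:seq_len + 1].mean(dim=0)
print(embedding.shape)  # torch.Size([320])
```

In the transfer-learning setting the summary alludes to, such fixed-length embeddings would typically be fed to a lightweight classifier or regressor trained on labeled examples of the property of interest.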
ISSN: 2405-4712, 2405-4720
DOI: 10.1016/j.cels.2021.05.017