Learning functional properties of proteins with language models

Bibliographic Details
Published in	Nature Machine Intelligence, Vol. 4, no. 3, pp. 227-245
Main Authors	Unsal, Serbulent; Atas, Heval; Albayrak, Muammer; Turhan, Kemal; Acar, Aybar C.; Doğan, Tunca
Format	Journal Article
Language	English
Published	London: Nature Publishing Group UK, 01.03.2022
Subjects	631/114/1305; 631/114/2403; 631/114/2410; 631/114/2784; 639/705/1042; Algorithms; Analysis; Annotations; Benchmarks; Biomarkers; Deep learning; Engineering; Gene expression; Immunoglobulins; Informatics; Kinases; Ligands; Localization; Machine learning; Mutation; Natural language processing; Neural networks; Performance prediction; Proteins; Proteomics; Representations; West Nile virus
Summary:	Data-centric approaches have been used to develop predictive methods for elucidating uncharacterized properties of proteins; however, studies indicate that these methods should be further improved to effectively solve critical problems in biomedicine and biotechnology, which can be achieved by better representing the data at hand. Novel data representation approaches mostly take inspiration from language models that have yielded ground-breaking improvements in natural language processing. Lately, these approaches have been applied to the field of protein science and have displayed highly promising results in terms of extracting complex sequence–structure–function relationships. In this study we conducted a detailed investigation over protein representation learning by first categorizing/explaining each approach, subsequently benchmarking their performances on predicting: (1) semantic similarities between proteins, (2) ontology-based protein functions, (3) drug target protein families and (4) protein–protein binding affinity changes following mutations. We evaluate and discuss the advantages and disadvantages of each method over the benchmark results, source datasets and algorithms used, in comparison with classical model-driven approaches. Finally, we discuss current challenges and suggest future directions. We believe that the conclusions of this study will help researchers to apply machine/deep learning-based representation techniques to protein data for various predictive tasks, and inspire the development of novel methods. Deep learning methods have in recent years shown promising results in characterizing proteins and extracting complex sequence–structure–function relationships. This Analysis describes a benchmarking study to compare the performances and advantages of recent deep learning approaches in a range of protein prediction tasks.
ISSN:	2522-5839
DOI:	10.1038/s42256-022-00457-9