Learning functional properties of proteins with language models

Bibliographic Details
Published in: Nature Machine Intelligence, Vol. 4, no. 3, pp. 227-245
Main Authors: Unsal, Serbulent; Atas, Heval; Albayrak, Muammer; Turhan, Kemal; Acar, Aybar C.; Doğan, Tunca
Format: Journal Article
Language: English
Published: London, Nature Publishing Group UK, 01.03.2022
Nature Publishing Group
Summary: Data-centric approaches have been used to develop predictive methods for elucidating uncharacterized properties of proteins; however, studies indicate that these methods should be further improved to effectively solve critical problems in biomedicine and biotechnology, which can be achieved by better representing the data at hand. Novel data representation approaches mostly take inspiration from language models that have yielded ground-breaking improvements in natural language processing. Lately, these approaches have been applied to the field of protein science and have displayed highly promising results in terms of extracting complex sequence–structure–function relationships. In this study we conducted a detailed investigation over protein representation learning by first categorizing/explaining each approach, subsequently benchmarking their performances on predicting: (1) semantic similarities between proteins, (2) ontology-based protein functions, (3) drug target protein families and (4) protein–protein binding affinity changes following mutations. We evaluate and discuss the advantages and disadvantages of each method over the benchmark results, source datasets and algorithms used, in comparison with classical model-driven approaches. Finally, we discuss current challenges and suggest future directions. We believe that the conclusions of this study will help researchers to apply machine/deep learning-based representation techniques to protein data for various predictive tasks, and inspire the development of novel methods.

Deep learning methods have in recent years shown promising results in characterizing proteins and extracting complex sequence–structure–function relationships. This Analysis describes a benchmarking study to compare the performances and advantages of recent deep learning approaches in a range of protein prediction tasks.
ISSN: 2522-5839
DOI: 10.1038/s42256-022-00457-9