Protein language models meet reduced amino acid alphabets
| Published in | Bioinformatics (Oxford, England), Vol. 40, no. 2 |
|---|---|
| Main Authors | , , |
| Format | Journal Article |
| Language | English |
| Published | England: Oxford University Press, 01.02.2024 (Oxford Publishing Limited) |
Summary:

Motivation
Protein language models (PLMs), which borrow ideas for modelling and inference from natural language processing, have demonstrated the ability to extract meaningful representations in an unsupervised way, leading to significant performance improvements on several downstream tasks. Clustering amino acids by their physical-chemical properties to obtain reduced alphabets has been of interest in past research, but the application of such alphabets to PLMs or folding models is unexplored.

Results
Here, we investigate the efficacy of PLMs trained on reduced amino acid alphabets in capturing evolutionary information, and we explore how the loss of protein sequence information affects learned representations and downstream task performance. Our empirical work shows that PLMs trained on the full alphabet and a large number of sequences capture fine details that are lost under alphabet reduction. We further show that a structure prediction model (ESMFold) can fold CASP14 protein sequences translated into a reduced alphabet. For 10 of the 50 targets, reduced alphabets improve structural predictions, with LDDT-Cα differences of up to 19%.

Availability and implementation
Trained models and code are available at github.com/Ieremie/reduced-alph-PLM.
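To make the abstract's "translated using a reduced alphabet" step concrete, here is a minimal sketch of reduced-alphabet translation. The five-group physical-chemical clustering and the reduced letters used below are hypothetical illustrations chosen for this example; the paper's actual alphabets and code are in the linked repository.

```python
# Illustrative reduced-alphabet mapping. The grouping (hydrophobic, polar,
# basic, acidic, proline) is a common physical-chemical clustering used here
# only as an example; the paper's actual alphabets may differ.
GROUPS = {
    "H": "AVLIMFWC",  # hydrophobic
    "P": "STNQYG",    # polar, uncharged
    "B": "KRH",       # basic (positively charged)
    "A": "DE",        # acidic (negatively charged)
    "S": "P",         # proline (special backbone geometry)
}

# Invert to a residue -> reduced-letter lookup covering all 20 amino acids.
AA_TO_REDUCED = {aa: letter for letter, members in GROUPS.items() for aa in members}

def translate(seq: str) -> str:
    """Map a protein sequence onto the reduced alphabet; unknown residues become 'X'."""
    return "".join(AA_TO_REDUCED.get(aa, "X") for aa in seq.upper())

print(translate("MKTAYIAKQR"))  # -> HBPHPHHBPB
```

A sequence translated this way can be fed to a folding model in place of the original, which is the comparison the Results section describes for the CASP14 targets.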
Bibliography: | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 |
ISSN: | 1367-4803 1367-4811 |
DOI: | 10.1093/bioinformatics/btae061 |