Detecting Remote Evolutionary Relationships among Proteins by Large-Scale Semantic Embedding

Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods--i.e., measures of similarity between query and target sequences--provide the engine for sequence database sear...

Full description

Saved in:
Bibliographic Details
Published inPLoS computational biology Vol. 7; no. 1; p. e1001047
Main Authors Melvin, Iain, Weston, Jason, Noble, William Stafford, Leslie, Christina
Format Journal Article
LanguageEnglish
Published United States Public Library of Science 01.01.2011
Public Library of Science (PLoS)
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods--i.e., measures of similarity between query and target sequences--provide the engine for sequence database search and have been the subject of 30 years of computational research. For the difficult problem of detecting remote evolutionary relationships between protein sequences, the most successful pairwise comparison methods involve building local models (e.g., profile hidden Markov models) of protein sequences. However, recent work in massive data domains like web search and natural language processing demonstrate the advantage of exploiting the global structure of the data space. Motivated by this work, we present a large-scale algorithm called ProtEmbed, which learns an embedding of protein sequences into a low-dimensional "semantic space." Evolutionarily related proteins are embedded in close proximity, and additional pieces of evidence, such as 3D structural similarity or class labels, can be incorporated into the learning process. We find that ProtEmbed achieves superior accuracy to widely used pairwise sequence methods like PSI-BLAST and HHSearch for remote homology detection; it also outperforms our previous RankProp algorithm, which incorporates global structure in the form of a protein similarity network. Finally, the ProtEmbed embedding space can be visualized, both at the global level and local to a given query, yielding intuition about the structure of protein sequence space.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
Conceived and designed the experiments: JW WSN CL. Performed the experiments: IM. Analyzed the data: IM JW. Wrote the paper: IM JW WSN CL.
ISSN:1553-7358
1553-734X
1553-7358
DOI:10.1371/journal.pcbi.1001047