Learning supervised embeddings for large scale sequence comparisons

Bibliographic Details
Published in	bioRxiv
Main Authors	Kimothi, Dhananjay; Biyani, Pravesh; Hogan, James M; Soni, Akshay; Kelly, Wayne
Format	Paper
Language	English
Published	Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 26.04.2019
Subjects	Bioinformatics; Data collection; Homology; Learning algorithms; Problem solving

Summary:	Similarity-based search of sequence collections is a core task in bioinformatics, one dominated for most of the genomic era by exact and heuristic alignment-based algorithms. However, even efficient heuristics such as BLAST may not scale to the data sets now emerging, motivating a range of alignment-free alternatives exploiting the underlying lexical structure of each sequence. In this paper, we introduce SuperVec, a novel supervised approach to learning sequence embeddings. Our method extends earlier Representation Learning (RL) based methods to jointly include contextual and class-related information for each sequence during training. This ensures that related sequence fragments have proximal representations in the target space, better reflecting the structure of the domain. Such representations may be used for downstream machine learning tasks or employed directly. Here, we apply SuperVec embeddings to a sequence retrieval task, where the goal is to retrieve sequences with the same family label as a given query. The SuperVec approach is extended further through H-SuperVec, a tree-based hierarchical method which learns embeddings across a range of feature spaces based on the class labels and their exclusive and exhaustive subsets. Experiments show that supervised learning of embeddings based on sequence labels using SuperVec and H-SuperVec provides a substantial improvement in retrieval performance over existing (unsupervised) RL-based approaches. Further, the new methods are an order of magnitude faster than BLAST for the database retrieval task, supporting hybrid approaches in which SuperVec rapidly filters the collection so that only potentially relevant records remain, allowing slower, more accurate methods to be executed quickly over a far smaller dataset. Thus, we may achieve faster query processing and higher precision than before.
Finally, for some problems, direct use of embeddings is already sufficient to yield high levels of precision and recall. Extending this work to encompass weaker homology is the subject of ongoing research.
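
The hybrid retrieval idea in the summary (a fast embedding-based filter followed by a slower, more accurate method run only on the surviving records) can be illustrated with a minimal sketch. This is not the SuperVec implementation: the embeddings below are toy vectors, cosine similarity is an assumed proximity measure, and `shared_3mers` is a hypothetical stand-in for a slower alignment-style scorer such as BLAST.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_candidates(query_vec, db_vecs, k):
    """Fast stage: return the ids of the k embeddings closest to the query."""
    ranked = sorted(db_vecs, key=lambda sid: cosine(query_vec, db_vecs[sid]), reverse=True)
    return ranked[:k]


def hybrid_search(query_vec, query_seq, db_vecs, db_seqs, k, slow_score):
    """Slow stage: re-rank only the k filtered candidates with a costlier scorer."""
    candidates = filter_candidates(query_vec, db_vecs, k)
    return sorted(candidates, key=lambda sid: slow_score(query_seq, db_seqs[sid]), reverse=True)


def shared_3mers(a, b):
    """Hypothetical stand-in for a slow scorer: number of distinct 3-mers in common."""
    def kmers(s):
        return {s[i:i + 3] for i in range(len(s) - 2)}
    return len(kmers(a) & kmers(b))


# Toy data (assumed for illustration only): three sequences and 2-d embeddings.
db_vecs = {"s1": [1.0, 0.0], "s2": [0.9, 0.1], "s3": [0.0, 1.0]}
db_seqs = {"s1": "ACGTAC", "s2": "ACGTTT", "s3": "GGGGGG"}

hits = hybrid_search([1.0, 0.05], "ACGTTT", db_vecs, db_seqs, k=2, slow_score=shared_3mers)
print(hits)
```

In this toy run the embedding filter discards `s3` before the slower scorer is ever called, so the expensive comparison touches two records rather than three. On a real collection the filter would typically use an indexed nearest-neighbour search, not a full sort.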
DOI:	10.1101/620153