Efficient Clustering of Metagenomic Sequences using Locality Sensitive Hashing
Published in | Society for Industrial and Applied Mathematics. Proceedings of the SIAM International Conference on Data Mining, p. 1023 |
---|---|
Main Authors | , , |
Format | Conference Proceeding |
Language | English |
Published | Philadelphia: Society for Industrial and Applied Mathematics, 01.01.2012 |
Subjects | |
Summary: | The new generation of genomic technologies has allowed researchers to determine the collective DNA of organisms (e.g., microbes) co-existing as communities across the ecosystem (e.g., within the human host). There is a need for computational approaches to analyze and annotate the large volumes of available sequence data from such microbial communities (metagenomes). In this paper, we develop an efficient and accurate metagenome clustering approach that uses the locality sensitive hashing (LSH) technique to approximate sequence similarity and thereby reduce the computational complexity associated with comparing sequences. We introduce the use of fixed-length, gapless subsequences to improve the sensitivity of the LSH-based similarity function. We evaluate the performance of our algorithm on two metagenome datasets associated with microbes existing across different human skin locations. Our empirical results show the strength of the developed approach in comparison to three state-of-the-art sequence clustering algorithms with regard to computational efficiency and clustering quality. We also demonstrate the practical significance of the developed clustering algorithm by comparing bacterial diversity and structure across different skin locations. [PUBLICATION ABSTRACT] |
---|
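
The summary does not say which LSH family or parameters the paper uses. Purely as an illustration of the general idea (hashing fixed-length, gapless subsequences so that similar sequences collide without all-pairs comparison), the following is a minimal sketch that assumes MinHash signatures over k-mer sets with banding. The function names, `k=5`, `num_hashes=32`, and `bands=8` are hypothetical choices made for this example, not values from the paper.

```python
import hashlib
from collections import defaultdict


def kmers(seq, k=5):
    """Return the set of fixed-length, gapless subsequences (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    # Deterministic 64-bit hash of a k-mer; the seed selects the hash function.
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(seq, k=5, num_hashes=32):
    """MinHash signature: the smallest k-mer hash under each seeded function.

    Assumes len(seq) >= k, so the k-mer set is non-empty.
    """
    ks = kmers(seq, k)
    return tuple(min(_hash(km, seed) for km in ks) for seed in range(num_hashes))


def lsh_candidate_groups(seqs, k=5, num_hashes=32, bands=8):
    """Group indices of sequences whose signatures agree on at least one band.

    Returns one sorted index list per shared bucket, so the same group can
    appear more than once when sequences agree on several bands.
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for idx, seq in enumerate(seqs):
        sig = minhash_signature(seq, k, num_hashes)
        for b in range(bands):
            band = sig[b * rows:(b + 1) * rows]
            buckets[(b, band)].add(idx)
    return [sorted(members) for members in buckets.values() if len(members) > 1]


if __name__ == "__main__":
    reads = ["ACGTACGTACGT", "ACGTACGTACGT", "TTTTTTTTTTTT"]
    print(lsh_candidate_groups(reads))
```

Only sequences that share a bucket need to be compared directly, which is where the savings over all-pairs comparison come from. A clustering step (for example, greedy assignment to a representative sequence) would then run on these candidate groups.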