Improving Metagenome Sequence Clustering Application Performance Using Louvain Algorithm


Bibliographic Details
Published in: Recent Featured Applications of Artificial Intelligence Methods. LSMS 2020 and ICSEE 2020 Workshops, Vol. 1303, pp. 386-400
Main Authors: Lu, Yakang; Deng, Li; Wang, Lili; Li, Kexue; Wu, Jinda
Format: Book Chapter
Language: English
Published: Singapore: Springer Singapore, 2021
Series: Communications in Computer and Information Science
ISBN: 9813363770, 9789813363779
ISSN: 1865-0929, 1865-0937
DOI: 10.1007/978-981-33-6378-6_29

Summary: Metagenomic assembly is challenging because next-generation sequencing (NGS) produces very large volumes of data. Because clustering strategies can handle large amounts of data, they are a natural way to work around memory limitations. SpaRC (Spark Reads Clustering), a scalable sequence clustering tool built on Apache Spark, a distributed big data analysis platform, can cluster hundreds of gigabytes of sequences from different genomes. However, the Label Propagation Algorithm (LPA) used in SpaRC is usually unstable: the clustering results oscillate and contain too many tiny clusters. In this paper, we propose a method for clustering metagenomic sequences based on the distributed Louvain algorithm to obtain more accurate clustering results. We ran experiments on two different datasets with millions of genome sequences, using LPA and Louvain respectively. The experimental results indicate that this approach effectively improves clustering performance. We hope that the method applied in this paper can be widely used in other metagenomic clustering studies.
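The contrast between the two algorithms named in the summary can be illustrated on toy graphs. The sketch below is a single-machine illustration only, not SpaRC's code and not the distributed Louvain method of the chapter. It assumes unweighted, undirected graphs; the function names (`build_adj`, `sync_lpa_step`, `modularity`, `louvain_local_moves`) are hypothetical; and only Louvain's first phase (greedy local moves) is shown, without the community-aggregation phase. The first part shows one well-known way label propagation can oscillate (synchronous updates on a bipartite structure), which may or may not be the mechanism at work in SpaRC. The second part shows how modularity-driven moves settle on a fixed partition.

```python
from collections import Counter, defaultdict

def build_adj(edges):
    """Adjacency sets for an unweighted, undirected edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def sync_lpa_step(adj, labels):
    """One synchronous LPA round: each node takes its neighbours' most
    common label (ties broken by the smallest label, an assumption)."""
    new_labels = {}
    for node, nbrs in adj.items():
        counts = Counter(labels[n] for n in nbrs)
        best = max(counts.values())
        new_labels[node] = min(l for l, c in counts.items() if c == best)
    return new_labels

def modularity(edges, community):
    """Newman modularity Q = sum_c [ L_c / m - (D_c / 2m)^2 ]."""
    m = len(edges)
    internal = Counter()   # L_c: edges with both endpoints in community c
    total_deg = Counter()  # D_c: sum of node degrees in community c
    for u, v in edges:
        total_deg[community[u]] += 1
        total_deg[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (total_deg[c] / (2 * m)) ** 2
               for c in total_deg)

def louvain_local_moves(edges):
    """Louvain phase 1 only: move each node to the neighbouring community
    that most increases modularity, until no move helps. Modularity is
    recomputed from scratch per trial (simple, not efficient)."""
    adj = build_adj(edges)
    nodes = sorted(adj)
    community = {n: n for n in nodes}  # start with singleton communities
    improved = True
    while improved:
        improved = False
        for node in nodes:
            best_c, best_q = community[node], modularity(edges, community)
            for c in sorted({community[n] for n in adj[node]}):
                trial = dict(community)
                trial[node] = c
                q = modularity(edges, trial)
                if q > best_q + 1e-12:  # accept strict improvements only
                    best_c, best_q = c, q
            if best_c != community[node]:
                community[node] = best_c
                improved = True
    return community

# A 4-cycle is bipartite: synchronous LPA flips between two labelings.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
cycle_adj = build_adj(cycle)
state = {n: n for n in cycle_adj}
for _ in range(4):
    state = sync_lpa_step(cycle_adj, state)
    print("LPA labels:", state)

# Two triangles joined by one bridge edge: a clear two-cluster graph.
bridge = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
parts = louvain_local_moves(bridge)
print("Louvain communities:", parts, "Q =", modularity(bridge, parts))
```

Because every accepted move strictly increases modularity, the local-move loop cannot cycle, whereas the synchronous label updates above have no such objective to anchor them.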