A statistical framework for accurate taxonomic assignment of metagenomic sequencing reads

Bibliographic Details
Published in	PLoS ONE, Vol. 7, no. 10, p. e46450
Main Authors	Jiang, Hongmei; An, Lingling; Lin, Simon M; Feng, Gang; Qiu, Yuqing
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 01.10.2012
Subjects	Abundance; Algorithms; Bioinformatics; Biology; Classification - methods; Computational Biology - methods; Computer simulation; Datasets; Deoxyribonucleic acid; DNA; Gene sequencing; Genomes; Genomics; High-Throughput Nucleotide Sequencing - methods; Homology; Identification methods; Informatics; Mathematical models; Mathematics; Metagenomics - methods; Microorganisms; Models, Genetic; Nucleotide sequence; Relative abundance; Software; Species Specificity; Statistical analysis; Studies; Taxonomy; United States > US; Illinois; Wisconsin; Arizona

Summary:	The advent of next-generation sequencing technologies has greatly advanced the field of metagenomics, which studies genetic material recovered directly from an environment. Characterizing the genomic composition of a metagenomic sample is essential for understanding the structure of the microbial community. The multiple genomes contained in a metagenomic sample can be identified and quantified through homology searches of sequence reads against known sequences catalogued in reference databases. Traditionally, reads with hits to multiple genomes are assigned to non-specific or high ranks of the taxonomy tree, which degrades the accuracy of relative abundance estimates for the genomes present in a sample. Instead of assigning reads one by one to the taxonomy tree, as many existing methods do, we propose a statistical framework that models the candidate genomes to which sequence reads have hits. After the proportion of reads generated by each genome is estimated, sequence reads are assigned to the candidate genomes and the taxonomy tree according to an estimated probability that takes into account both sequence alignment scores and estimated genome abundance. The proposed method is comprehensively tested on both simulated datasets and two real datasets, and it assigns reads to the low taxonomic ranks very accurately. Our statistical approach to taxonomic assignment of metagenomic reads, TAMER, is implemented in R and available at http://faculty.wcas.northwestern.edu/hji403/MetaR.htm.
Bibliography:	Conceived and designed the experiments: HJ. Performed the experiments: HJ LA. Analyzed the data: HJ LA YQ. Contributed reagents/materials/analysis tools: SL GF. Wrote the paper: HJ. Competing Interests: The authors have declared that no competing interests exist.
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0046450
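
The two-step idea described in the Summary (first estimate the proportion of reads generated by each candidate genome, then assign each read using both its alignment scores and those proportions) can be illustrated with a generic mixture-model EM procedure. The sketch below is not the TAMER implementation, which is written in R and whose exact likelihood is not given in this record. The function names, the `weights` matrix (one row per read, one column per candidate genome, 0 meaning no hit), and the toy data are all assumptions made for illustration.

```python
# Illustrative sketch only: a generic EM for mixture proportions.
# Names and data are hypothetical; this is not the TAMER code.

def posterior(weights, pi):
    """Per-read probability of each genome, proportional to abundance * weight."""
    rows = []
    for row in weights:
        joint = [p * w for p, w in zip(pi, row)]
        total = sum(joint)  # positive as long as every read has at least one hit
        rows.append([x / total for x in joint])
    return rows


def estimate_abundance(weights, max_iter=500, tol=1e-10):
    """Estimate the proportion of reads generated by each candidate genome."""
    n_reads, n_genomes = len(weights), len(weights[0])
    pi = [1.0 / n_genomes] * n_genomes  # start from uniform abundance
    for _ in range(max_iter):
        resp = posterior(weights, pi)  # E-step: soft assignment of reads
        # M-step: new abundance is the average soft assignment per genome
        new_pi = [sum(r[j] for r in resp) / n_reads for j in range(n_genomes)]
        change = max(abs(a - b) for a, b in zip(pi, new_pi))
        pi = new_pi
        if change < tol:
            break
    return pi


def assign_reads(weights, pi):
    """Assign each read to the genome with the highest posterior probability."""
    return [max(range(len(p)), key=p.__getitem__) for p in posterior(weights, pi)]


# Toy data (assumed): columns are genomes A, B, C.
# Four reads hit only A, one hits only B, two hit A and B equally well.
weights = [[1, 0, 0]] * 4 + [[0, 1, 0]] + [[1, 1, 0]] * 2
pi = estimate_abundance(weights)
assignments = assign_reads(weights, pi)
print(pi, assignments)
```

In this toy case the two ambiguous reads align equally well to A and B, so alignment scores alone cannot separate them. Because the uniquely mapped reads make A look far more abundant than B, the estimated proportions break the tie in favor of A, and the ambiguous reads are assigned to a specific genome rather than pushed to a higher taxonomic rank.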