HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data

Bibliographic Details
Published in	PLoS computational biology Vol. 18; no. 9; p. e1010493
Main Authors	Dimopoulos, Evangelos A; Carmagnini, Alberto; Velsko, Irina M; Warinner, Christina; Larson, Greger; Frantz, Laurent A. F; Irving-Pease, Evan K
Format	Journal Article
Language	English
Published	San Francisco: Public Library of Science (PLoS), 01.09.2022
Subjects	Accuracy; Analysis; Archaeology; Bayesian analysis; Bayesian statistical decision theory; Biology and Life Sciences; Computer applications; Datasets; Dental calculus; Deoxyribonucleic acid; DNA; DNA damage; DNA sequencing; Earth Sciences; Empirical analysis; Gene mapping; Genomes; Genomics; Identification; Identification and classification; Laboratories; Medicine and Health Sciences; Metagenomics; Methods; Next-generation sequencing; Nucleotide sequencing; Pathogens; Research and Analysis Methods; Simulation; Software; Species; Species classification; Statistical analysis; Taxonomy; Tuberculosis; Singapore

More Information
Summary:	Identifying specific species in metagenomic samples is critical for several key applications, yet many available tools require large computational resources and are prone to false-positive identifications. Here we describe High-AccuracY and Scalable Taxonomic Assignment of MetagenomiC data (HAYSTAC), which estimates the probability that a specific taxon is present in a metagenome. HAYSTAC provides a user-friendly tool for constructing databases from publicly available genomes, which are then used for competitive read mapping. It then uses a novel Bayesian framework to infer the abundance of, and statistical support for, each species identification and to provide per-read species classification. Unlike other methods, HAYSTAC is specifically designed to handle both ancient and modern DNA data efficiently, as well as incomplete reference databases. This makes it possible to run highly accurate hypothesis-driven analyses (i.e., assessing the presence of a specific species) on variably sized reference databases while dramatically improving processing speeds. We tested the performance and accuracy of HAYSTAC using simulated Illumina libraries, both with and without ancient DNA damage, and compared the results to other currently available methods (i.e., Kraken2/Bracken, KrakenUniq, MALT/HOPS, and Sigma). HAYSTAC identified fewer false positives than Kraken2/Bracken, KrakenUniq, and MALT in all simulations, and fewer than Sigma in simulations of ancient data. It uses less memory than Kraken2/Bracken, KrakenUniq, and MALT during both database construction and sample analysis. Lastly, we used HAYSTAC to search for specific pathogens in two published ancient metagenomic datasets, demonstrating how it can be applied to empirical datasets. HAYSTAC is available from https://github.com/antonisdim/HAYSTAC.
Bibliography:	These authors co-supervised this work. The authors have declared that no competing interests exist.
ISSN:	1553-734X, 1553-7358
DOI:	10.1371/journal.pcbi.1010493
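
The summary says that a Bayesian framework provides per-read species classification and an abundance estimate. The sketch below is a toy illustration of that general idea, applying Bayes' rule to one read at a time. It is not HAYSTAC's actual model, which this record does not describe. The function names, the taxon labels, and the likelihood values are all hypothetical.

```python
from typing import Dict, List


def read_posteriors(likelihoods: Dict[str, float],
                    priors: Dict[str, float]) -> Dict[str, float]:
    """Bayes' rule for a single read.

    P(taxon | read) is proportional to P(read | taxon) * P(taxon).
    `likelihoods` holds P(read | taxon) for each candidate taxon
    (assumed here to be supplied by some upstream alignment step).
    """
    unnormalised = {t: likelihoods[t] * priors[t] for t in likelihoods}
    evidence = sum(unnormalised.values())
    if evidence == 0:
        raise ValueError("read has zero likelihood under every candidate taxon")
    return {t: v / evidence for t, v in unnormalised.items()}


def mean_posterior(reads: List[Dict[str, float]],
                   priors: Dict[str, float]) -> Dict[str, float]:
    """Crude abundance proxy: average the per-read posteriors for each taxon."""
    posteriors = [read_posteriors(r, priors) for r in reads]
    return {t: sum(p[t] for p in posteriors) / len(posteriors) for t in priors}


# Hypothetical input: two reads, two candidate taxa, uniform prior.
priors = {"taxon_a": 0.5, "taxon_b": 0.5}
reads = [
    {"taxon_a": 0.9, "taxon_b": 0.1},  # read maps much better to taxon_a
    {"taxon_a": 0.5, "taxon_b": 0.5},  # read is equally compatible with both
]
per_read = [read_posteriors(r, priors) for r in reads]
abundance = mean_posterior(reads, priors)
```

With a uniform prior the posterior for each read is simply its normalised likelihood, so an ambiguous read contributes equally to both taxa rather than being forced into one.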