Real-Time Pathogen Detection in the Era of Whole-Genome Sequencing and Big Data: Comparison of k-mer and Site-Based Methods for Inferring the Genetic Distances among Tens of Thousands of Salmonella Samples

Bibliographic Details
Published in	PLoS ONE Vol. 11; no. 11; p. e0166162
Main Authors	Pettengill, James B., Pightling, Arthur W., Baugher, Joseph D., Rand, Hugh, Strain, Errol
Format	Journal Article
Language	English
Published	United States Public Library of Science 10.11.2016 Public Library of Science (PLoS)
Subjects	Analysis; Animals; Archives & records; Bacteria; BASIC BIOLOGICAL SCIENCES; Big data; Bioinformatics; Biological evolution; Biology; Biology and Life Sciences; Computational Biology - methods; Computer applications; Data bases; Data management; Data processing; DNA sequencing; Empirical analysis; Engineering and Technology; Food contamination; Food safety; Gene sequencing; Genetic aspects; Genetic distance; Genome, Bacterial - genetics; Genomes; Genomics; Humans; Listeria; Medicine and Health Sciences; Metadata; Methods; Multilocus sequence typing; Multilocus Sequence Typing - methods; Nucleotide sequence; Nutrition; Outbreaks; Pathogens; Phylogenetics; Phylogeny; Public health; Real time; Reproducibility of Results; Research and Analysis Methods; Salmonella; Salmonella - classification; Salmonella - genetics; Salmonella - physiology; Salmonella enterica; Salmonella Infections - microbiology; Salmonellosis; Sequence Analysis, DNA - methods; Species Specificity; Time Factors
Summary:	The adoption of whole-genome sequencing within the public health realm for molecular characterization of bacterial pathogens has been followed by an increased emphasis on real-time detection of emerging outbreaks (e.g., food-borne salmonellosis). In turn, large databases of whole-genome sequence data are being populated. These databases currently contain tens of thousands of samples and are expected to grow to hundreds of thousands within a few years. For these databases to be of optimal use, one must be able to interrogate them quickly and accurately determine the genetic distances among a set of samples. Doing so is challenging because of both biological issues (evolutionarily diverse samples) and computational issues (petabytes of sequence data). We evaluated seven measures of genetic distance, which were estimated from either k-mer profiles (Jaccard, Euclidean, Manhattan, Mash Jaccard, and Mash distances) or nucleotide sites (NUCmer and an extended multi-locus sequence typing (MLST) scheme). When analyzing empirical data (whole-genome sequence data from 18,997 Salmonella isolates), features of the data (e.g., genomic, assembly, and contamination) cause distances inferred from k-mer profiles, which treat absent data as informative, to fail to accurately capture the distance between samples when compared with distances inferred from differences in nucleotide sites. Thus, site-based distances, such as NUCmer and extended MLST, are superior in performance, but accessing the computing resources necessary to compute them may be challenging when analyzing large databases.
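As an illustration of the k-mer profile distances named in the summary, the following is a minimal sketch and not the authors' pipeline. The function names, the toy sequences, and the choice of k are assumptions made for the example. The Mash distance here is computed from an exact Jaccard similarity using the published Mash formula, whereas Mash itself estimates that similarity from a MinHash sketch. Note how a k-mer present in only one sample contributes to every distance, which is the sense in which these measures treat absent data as informative.

```python
import math
from collections import Counter


def kmer_profile(seq, k):
    """Count every overlapping k-mer in seq (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_distance(p, q):
    """1 - |A ∩ B| / |A ∪ B| over the sets of k-mers present in each profile."""
    a, b = set(p), set(q)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def manhattan_distance(p, q):
    """Sum of absolute count differences; a missing k-mer counts as 0."""
    return sum(abs(p[m] - q[m]) for m in set(p) | set(q))


def euclidean_distance(p, q):
    """Square root of summed squared count differences; missing k-mer = 0."""
    return math.sqrt(sum((p[m] - q[m]) ** 2 for m in set(p) | set(q)))


def mash_distance(jaccard_similarity, k):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)), clamped to [0, 1].

    Assumption: j is supplied directly (exact Jaccard similarity here),
    not estimated from a MinHash sketch as the Mash tool does.
    """
    j = jaccard_similarity
    if j <= 0.0:
        return 1.0
    return min(1.0, max(0.0, -math.log(2.0 * j / (1.0 + j)) / k))
```

For example, with `pa = kmer_profile("ACGTACGT", 3)` and `pb = kmer_profile("ACGTTCGT", 3)`, the single substitution changes three of the k-mers in the second sequence, so all three profile distances are nonzero even though the sequences differ at one site. In practice k is much larger (Mash defaults to k = 21) and the inputs are whole assemblies or read sets.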
Bibliography:	USDOE US Food and Drug Administration (FDA) Competing Interests: The authors have declared that no competing interests exist. Conceptualization: JBP AWP JDB HR. Data curation: JBP. Formal analysis: JBP. Methodology: JBP AWP JDB HR. Project administration: JBP. Resources: JBP AWP JDB HR ES. Software: JBP AWP JDB HR. Supervision: HR ES. Validation: JBP AWP JDB HR. Visualization: JBP. Writing – original draft: JBP. Writing – review & editing: JBP AWP JDB HR ES.
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0166162