CFSAN SNP Pipeline: an automated method for constructing SNP matrices from next-generation sequence data

Bibliographic Details
Published in PeerJ. Computer science Vol. 1; p. e20
Main Authors Davis, Steve; Pettengill, James B.; Luo, Yan; Payne, Justin; Shpuntoff, Al; Rand, Hugh; Strain, Errol
Format Journal Article
Language English
Published San Diego PeerJ. Ltd 26.08.2015
PeerJ, Inc
Summary: The analysis of next-generation sequence (NGS) data is often a fragmented, step-wise process. For example, multiple pieces of software are typically needed to map NGS reads, extract variant sites, and construct a DNA sequence matrix containing only single nucleotide polymorphisms (i.e., a SNP matrix) for a set of individuals. Managing and chaining these software pieces and their outputs can be cumbersome and difficult. Here, we present CFSAN SNP Pipeline, which combines into a single package the mapping of NGS reads to a reference genome with Bowtie2, processing of the resulting mapping (BAM) files using SAMtools, identification of variant sites using VarScan, and production of a SNP matrix using custom Python scripts. We also introduce a Python package (CFSAN SNP Mutator) that, when given a reference genome, generates variants of known position against which we validate our pipeline. We created 1,000 simulated Salmonella enterica sp. enterica Serovar Agona genomes at 100x and 20x coverage, each containing 500 SNPs, 20 single-base insertions and 20 single-base deletions. For the 100x dataset, the CFSAN SNP Pipeline recovered 98.9% of the introduced SNPs and had a false positive rate of 1.04 × 10^-6; for the 20x dataset, 98.8% of SNPs were recovered and the false positive rate was 8.34 × 10^-7. Based on these results, CFSAN SNP Pipeline is a robust and accurate tool that is among the first to combine into a single executable the myriad steps required to produce a SNP matrix from NGS data. Such a tool is useful to those working in an applied setting (e.g., food safety traceback investigations) as well as to those interested in evolutionary questions.
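To illustrate two ideas from the summary, the sketch below builds a small SNP matrix from per-sample variant calls and scores a call set against known introduced positions. It is not code from CFSAN SNP Pipeline or CFSAN SNP Mutator. The function names, the input layout (0-based positions mapped to called bases), and the false positive rate definition (false positives divided by the number of unmutated reference sites) are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def build_snp_matrix(
    reference: str, calls: Dict[str, Dict[int, str]]
) -> Tuple[List[int], Dict[str, str]]:
    """Build a SNP matrix (hypothetical helper, not the pipeline's API).

    `calls` maps sample name -> {0-based position: called base}.
    Returns the sorted positions that vary in at least one sample and,
    per sample, the bases at exactly those positions. A sample with no
    call at a position is assumed to carry the reference base.
    """
    positions = sorted({pos for sample in calls.values() for pos in sample})
    matrix = {
        name: "".join(sample.get(pos, reference[pos]) for pos in positions)
        for name, sample in calls.items()
    }
    return positions, matrix


def recovery_and_fpr(
    truth: Set[int], called: Set[int], genome_length: int
) -> Tuple[float, float]:
    """Score called SNP positions against known introduced positions.

    Recovery is the fraction of introduced SNPs that were called.
    The false positive rate here is an ASSUMED definition: calls at
    unmutated sites divided by the number of unmutated sites.
    """
    true_positives = len(truth & called)
    false_positives = len(called - truth)
    recovery = true_positives / len(truth)
    fpr = false_positives / (genome_length - len(truth))
    return recovery, fpr


# Toy data: a 10-base reference and three samples.
reference = "ACGTACGTAC"
calls = {
    "sample1": {2: "T", 7: "A"},
    "sample2": {2: "T"},
    "sample3": {},
}
positions, matrix = build_snp_matrix(reference, calls)
for name, row in matrix.items():
    print(name, row)

# Toy validation: three introduced SNPs, two recovered, one spurious call.
recovery, fpr = recovery_and_fpr({2, 7, 9}, {2, 7, 4}, len(reference))
print(f"recovery={recovery:.3f} fpr={fpr:.3f}")
```

Each row of the matrix has one column per variable position, so invariant sites never appear in it. With simulated genomes, the set of introduced positions is known exactly, which is what makes a recovery calculation of this kind possible.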
ISSN: 2376-5992
DOI: 10.7717/peerj-cs.20