CFSAN SNP Pipeline: an automated method for constructing SNP matrices from next-generation sequence data
Published in | PeerJ. Computer science Vol. 1; p. e20 |
---|---|
Main Authors | , , , , , , |
Format | Journal Article |
Language | English |
Published | San Diego: PeerJ, Inc., 26.08.2015 |
Summary: | The analysis of next-generation sequence (NGS) data is often a fragmented, step-wise process. For example, multiple pieces of software are typically needed to map NGS reads, extract variant sites, and construct a DNA sequence matrix containing only single nucleotide polymorphisms (i.e., a SNP matrix) for a set of individuals. Managing and chaining these software pieces and their outputs is often cumbersome and difficult. Here, we present CFSAN SNP Pipeline, which combines into a single package the mapping of NGS reads to a reference genome with Bowtie2, processing of the resulting mapping (BAM) files using SAMtools, identification of variant sites using VarScan, and production of a SNP matrix using custom Python scripts. We also introduce a Python package (CFSAN SNP Mutator) that, given a reference genome, generates variants at known positions, against which we validate our pipeline. We created 1,000 simulated Salmonella enterica subsp. enterica serovar Agona genomes at 100x and 20x coverage, each containing 500 SNPs, 20 single-base insertions, and 20 single-base deletions. For the 100x dataset, the CFSAN SNP Pipeline recovered 98.9% of the introduced SNPs and had a false positive rate of 1.04 × 10^-6; for the 20x dataset, 98.8% of SNPs were recovered and the false positive rate was 8.34 × 10^-7. Based on these results, CFSAN SNP Pipeline is a robust and accurate tool that is among the first to combine into a single executable the myriad steps required to produce a SNP matrix from NGS data. Such a tool is useful to those working in an applied setting (e.g., food safety traceback investigations) as well as to those interested in evolutionary questions. |
---|---|
ISSN: | 2376-5992 |
DOI: | 10.7717/peerj-cs.20 |
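
The validation idea described in the summary (introduce SNPs at known positions, build a SNP matrix, then score recovery) can be illustrated with a minimal Python sketch. This is not the CFSAN SNP Pipeline or CFSAN SNP Mutator code. The function names `mutate`, `snp_matrix`, and `recovery_and_fpr` are hypothetical, the sketch works on already-aligned sequences of equal length rather than on mapped reads, it handles substitutions only, and the false positive rate is assumed here to be false calls divided by the number of positions that were not mutated, which may differ from the definition used in the article.

```python
import random


def mutate(reference, n_snps, seed=0):
    """Substitute n_snps random positions; return (sequence, {position: new_base}).

    Assumes the reference contains only A, C, G, T (hypothetical helper).
    """
    rng = random.Random(seed)
    positions = rng.sample(range(len(reference)), n_snps)
    seq = list(reference)
    truth = {}
    for pos in positions:
        # Pick a base that differs from the reference base at this position.
        alt = rng.choice([b for b in "ACGT" if b != reference[pos]])
        seq[pos] = alt
        truth[pos] = alt
    return "".join(seq), truth


def snp_matrix(reference, samples):
    """Keep only the columns where at least one sample differs from the reference.

    `samples` maps a sample name to a sequence of the same length as the reference.
    Returns (variable positions in increasing order, {name: bases at those positions}).
    """
    variable = [
        i for i in range(len(reference))
        if any(seq[i] != reference[i] for seq in samples.values())
    ]
    matrix = {
        name: "".join(seq[i] for i in variable)
        for name, seq in samples.items()
    }
    return variable, matrix


def recovery_and_fpr(truth_positions, called_positions, genome_length):
    """Return (fraction of true SNPs called, false calls / non-SNP positions)."""
    truth, called = set(truth_positions), set(called_positions)
    recovered = len(truth & called) / len(truth)
    false_positive_rate = len(called - truth) / (genome_length - len(truth))
    return recovered, false_positive_rate


if __name__ == "__main__":
    reference = "ACGT" * 25  # toy 100-base reference
    mutated, truth = mutate(reference, n_snps=5, seed=1)
    positions, matrix = snp_matrix(reference, {"sample1": mutated})
    print(positions, matrix)
    print(recovery_and_fpr(truth, positions, len(reference)))
```

In this toy setting every introduced SNP is trivially recovered because the mutated sequence is compared directly with the reference. In the workflow the summary describes, the called positions would instead come from read mapping and variant calling, which is where missed SNPs and false positives arise.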