SAMStat: monitoring biases in next generation sequencing data

Bibliographic Details
Published in Bioinformatics Vol. 27; no. 1; pp. 130-131
Main Authors Lassmann, Timo; Hayashizaki, Yoshihide; Daub, Carsten O.
Format Journal Article
Language English
Published Oxford, Oxford University Press, 01.01.2011
Summary: Motivation: The sequence alignment map format (SAM) is a commonly used format to store the alignments between millions of short reads and a reference genome. Often, certain positions within the reads are inherently more likely to contain errors due to the protocols used to prepare the samples. Such biases can have adverse effects on both mapping rate and accuracy. To understand the relationship between potential protocol biases and poor mapping, we wrote SAMStat, a simple C program that plots nucleotide overrepresentation and other statistics in mapped and unmapped reads in a concise HTML page. Collecting such statistics also makes it easy to highlight problems in the data processing and enables non-experts to track data quality over time. Results: We demonstrate that studying sequence features in mapped data can be used to identify biases particular to one sequencing protocol. Once identified, such biases can be considered in the downstream analysis or even be removed by read trimming or filtering techniques. Availability: SAMStat is open source and freely available as a C program running on all Unix-compatible platforms. The source code is available from http://samstat.sourceforge.net. Contact: timolassmann@gmail.com
Associate Editor: Martin Bishop
ISSN: 1367-4803, 1460-2059, 1367-4811
DOI: 10.1093/bioinformatics/btq614
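
The summary describes collecting nucleotide statistics separately for mapped and unmapped reads. As a rough illustration of that idea only, and not SAMStat's actual C implementation, the Python sketch below counts bases per read position from SAM text lines. The function names, the grouping keys, and the example records are hypothetical; the only facts taken from the SAM format are the 11 mandatory tab-separated fields, the FLAG bits 0x4 (unmapped) and 0x10 (reverse strand), and the `@` prefix of header lines.

```python
from collections import Counter, defaultdict

UNMAPPED_FLAG = 0x4   # SAM FLAG bit: read is unmapped
REVERSE_FLAG = 0x10   # SAM FLAG bit: SEQ is stored reverse-complemented
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def position_counts(sam_lines):
    """Count nucleotides per read position, split into mapped and unmapped reads."""
    counts = {"mapped": defaultdict(Counter), "unmapped": defaultdict(Counter)}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):  # skip blank and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory fields
            continue
        flag, seq = int(fields[1]), fields[9].upper()
        if seq == "*":  # sequence not stored in this record
            continue
        if flag & REVERSE_FLAG:
            # Restore the original read orientation so positions refer to the read.
            seq = seq.translate(COMPLEMENT)[::-1]
        group = "unmapped" if flag & UNMAPPED_FLAG else "mapped"
        for pos, base in enumerate(seq):
            counts[group][pos][base] += 1
    return counts


def position_frequencies(per_position):
    """Convert per-position counts into relative frequencies."""
    freqs = {}
    for pos, counter in per_position.items():
        total = sum(counter.values())
        freqs[pos] = {base: n / total for base, n in counter.items()}
    return freqs


# Hypothetical records, made up for illustration.
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tAGGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
]
counts = position_counts(example)
mapped_freqs = position_frequencies(counts["mapped"])
```

Comparing the per-position frequencies of the two groups is one simple way to spot positions where unmapped reads show an unusual base composition; the overrepresentation statistics SAMStat reports may be computed differently.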