Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly

Bibliographic Details
Published in	Bioinformatics Vol. 28; no. 14; pp. 1838 - 1844
Main Author	Li, Heng
Format	Journal Article
Language	English
Published	Oxford: Oxford University Press, 15.07.2012
Subjects	Algorithms; Assembly; Biological and medical sciences; Computational Biology - methods; Construction; Fundamental and applied biological sciences. Psychology; General aspects; Graphs; Humans; INDEL Mutation; Insertion; Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects); Original Papers; Pipelines; Polymorphism, Single Nucleotide; Sequence Analysis, DNA - methods; Strings; Chromosomal aberration; Sample; Deletion; Single nucleotide polymorphism; Genome; De novo
Summary:	Motivation: Eugene Myers, in his string graph paper, suggested that in a string graph or, equivalently, a unitig graph, any path spells a valid assembly. As a string/unitig graph also encodes every valid assembly of reads, such a graph, provided that it can be constructed correctly, is in fact a lossless representation of reads. In principle, every analysis based on whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion (INDEL) calling, can also be achieved with unitigs. Results: To explore the feasibility of using de novo assembly in the context of resequencing, we developed a de novo assembler, fermi, that assembles Illumina short reads into unitigs while preserving most of the information in the input reads. SNPs and INDELs can be called by mapping the unitigs against a reference genome. By applying the method to 35-fold human resequencing data, we showed that, in comparison to the standard pipeline, our approach yields similar accuracy for SNP calling and better results for INDEL calling. It has higher sensitivity than other de novo assembly-based methods for variant calling. Our work suggests that variant calling with de novo assembly can be a beneficial complement to the standard variant calling pipeline for whole-genome resequencing. On the methodological side, we propose the FMD-index for forward–backward extension of DNA sequences, a fast algorithm for finding all super-maximal exact matches, and one-pass construction of unitigs from an FMD-index. Availability: http://github.com/lh3/fermi Contact: hengli@broadinstitute.org
Bibliography:	Associate Editor: Michael Brudno
ISSN:	1367-4803, 1367-4811, 1460-2059
DOI:	10.1093/bioinformatics/bts280