proovframe: frameshift-correction for long-read (meta)genomics

Bibliographic Details
Published in: bioRxiv
Main Authors: Hackl, Thomas; Trigodet, Florian; Eren, A Murat; Biller, Steven J; Eppley, John M; Luo, Elaine; Burger, Andrew; Delong, Edward F; Fischer, Matthias G
Format: Paper
Language: English
Published: Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 24.08.2021
Cold Spring Harbor Laboratory
Edition: 1.1
Summary: Long-read sequencing technologies hold great promise for the genomic analysis of complex samples such as microbial communities. Yet, despite improving accuracy, basic gene prediction on long-read data is still often impaired by frameshifts resulting from small indels. Consensus polishing using either complementary short reads or, to a lesser extent, the long reads themselves can mitigate this effect but requires universally high sequencing depth, which is difficult to achieve in complex samples where the majority of community members are rare. Here we present proovframe, a software implementing an alternative approach to overcome frameshift errors in long-read assemblies and raw long reads. We utilize protein-to-nucleotide alignments against reference databases to pinpoint indels in contigs or reads and correct them by deleting or inserting 1-2 bases, thereby conservatively restoring reading-frame fidelity in aligned regions. Using simulated and real-world benchmark data, we show that proovframe performs comparably to short-read-based polishing on assembled data, works well with remote protein homologs, and can even be applied to raw reads directly. Together, our results demonstrate that protein-guided frameshift correction significantly improves the analyzability of long-read data both in combination with and as an alternative to common polishing strategies. Proovframe is available from https://github.com/thackl/proovframe.
Competing Interest Statement: The authors have declared no competing interest.
Footnotes:
* https://github.com/thackl/proovframe
* http://github.com/thackl/proovframe-benchmark
* https://doi.org/10.5281/zenodo.5164669
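The correction step described in the summary (deleting or inserting 1-2 bases where an alignment indicates a frameshift) can be illustrated with a minimal Python sketch. This is not proovframe's implementation: the function name `correct_frameshifts`, the `(position, delta)` edit format, and the use of `N` as a placeholder for inserted bases are all assumptions made for illustration. In practice the edit positions would have to be derived from a protein-to-nucleotide alignment, which is not shown here.

```python
def correct_frameshifts(seq, edits):
    """Apply 1-2 base corrections to a nucleotide sequence.

    Hypothetical edit format (an assumption, not proovframe's):
    each edit is (pos, delta), with pos a 0-based index into the
    ORIGINAL sequence.
      delta > 0: insert that many placeholder 'N' bases before pos
                 (the read is missing bases there).
      delta < 0: delete that many bases starting at pos
                 (the read carries extra bases there).
    """
    out = []
    prev = 0  # index of the first original base not yet copied
    for pos, delta in sorted(edits):
        if not 1 <= abs(delta) <= 2:
            raise ValueError("only corrections of 1-2 bases are supported")
        if pos < prev or pos > len(seq):
            raise ValueError("edit overlaps a previous edit or is out of range")
        out.append(seq[prev:pos])      # copy the untouched stretch
        if delta > 0:
            out.append("N" * delta)    # restore the frame with placeholders
            prev = pos
        else:
            prev = pos - delta         # skip the surplus bases
    out.append(seq[prev:])             # copy the remainder
    return "".join(out)
```

For example, under these assumptions the read `ATGAAAACCCGGGTAA`, which carries one surplus `A`, is restored to `ATGAAACCCGGGTAA` by the edit `(6, -1)`, and the read `ATGAAACCGGGTAA`, which is missing one base, becomes `ATGAAACCNGGGTAA` with the edit `(8, 1)`. Both results have a length divisible by three, so the downstream codons are back in frame.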
ISSN: 2692-8205
DOI: 10.1101/2021.08.23.457338