RegCloser: a robust regression approach to closing genome gaps

Closing gaps in draft genomes leads to more complete and continuous genome assemblies. The ubiquitous genomic repeats are challenges to the existing gap-closing methods, based on either the k-mer representation by the de Bruijn graph or the overlap-layout-consensus paradigm. Besides, chimeric reads...

Full description

Saved in:
Bibliographic Details
Published inBMC bioinformatics Vol. 24; no. 1; p. 249
Main Authors Cao, Shenghao, Li, Mengtian, Li, Lei M
Format Journal Article
LanguageEnglish
Published England BioMed Central Ltd 13.06.2023
BioMed Central
BMC
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Closing gaps in draft genomes leads to more complete and continuous genome assemblies. The ubiquitous genomic repeats are challenges to the existing gap-closing methods, based on either the k-mer representation by the de Bruijn graph or the overlap-layout-consensus paradigm. Besides, chimeric reads will cause erroneous k-mers in the former and false overlaps of reads in the latter. We propose a novel local assembly approach to gap closing, called RegCloser. It represents read coordinates and their overlaps respectively by parameters and observations in a linear regression model. The optimal overlap is searched only in the restricted range consistent with insert sizes. Under this linear regression framework, the local DNA assembly becomes a robust parameter estimation problem. We solved the problem by a customized robust regression procedure that resists the influence of false overlaps by optimizing a convex global Huber loss function. The global optimum is obtained by iteratively solving the sparse system of linear equations. On both simulated and real datasets, RegCloser outperformed other popular methods in accurately resolving the copy number of tandem repeats, and achieved superior completeness and contiguity. Applying RegCloser to a plateau zokor draft genome that had been improved by long reads further increased contig N50 to 3-fold long. We also tested the robust regression approach on layout generation of long reads. RegCloser is a competitive gap-closing tool. The software is available at https://github.com/csh3/RegCloser . The robust regression approach has a prospect to be incorporated into the layout module of long read assemblers.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
ISSN:1471-2105
1471-2105
DOI:10.1186/s12859-023-05367-0