Correcting a Single Indel/Edit for DNA-Based Data Storage: Linear-Time Encoders and Order-Optimality


Bibliographic Details
Published in IEEE Transactions on Information Theory Vol. 67; no. 6; pp. 3438-3451
Main Authors Cai, Kui; Chee, Yeow Meng; Gabrys, Ryan; Kiah, Han Mao; Nguyen, Tuan Thanh
Format Journal Article
Language English
Published New York IEEE 01.06.2021
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN 0018-9448
1557-9654
DOI 10.1109/TIT.2021.3049627

Summary: An indel refers to a single insertion or deletion, while an edit refers to a single insertion, deletion, or substitution. In this article, we investigate codes that correct either a single indel or a single edit and provide linear-time algorithms that encode binary messages into these codes of length $n$. Over the quaternary alphabet, we provide two linear-time encoders. One corrects a single edit with $\lceil \log n \rceil + O(\log \log n)$ redundant bits, while the other corrects a single indel with $\lceil \log n \rceil + 2$ redundant bits. These two encoders are order-optimal. The former encoder is the first known order-optimal encoder that corrects a single edit, while the latter encoder (that corrects a single indel) reduces the redundancy of the best known encoder of Tenengolts (1984) by at least four bits. Over the DNA alphabet, we impose an additional constraint, the $\mathtt{GC}$-balanced constraint, which requires that exactly half of the symbols of any DNA codeword be either $\mathtt{C}$ or $\mathtt{G}$. In particular, via a modification of Knuth's balancing technique, we provide a linear-time map that translates binary messages into $\mathtt{GC}$-balanced codewords, and the resulting codebook is able to correct a single indel or a single edit. These are the first known constructions of $\mathtt{GC}$-balanced codes that correct a single indel or a single edit.
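
For readers unfamiliar with the balancing idea named in the summary, the following is a minimal Python sketch of the classical (unmodified) Knuth balancing step on a binary word, together with a check of the GC-balanced constraint. It is not the article's encoder: the function names and the bit-pair-to-nucleotide map in `to_dna` are assumptions made for illustration, and the sketch neither stores the flip index inside the codeword nor adds any error-correction redundancy.

```python
# Illustrative sketch only; names and the nucleotide map are assumptions,
# not taken from the article.

def knuth_balance(bits):
    """Flip a prefix of an even-length 0/1 list so that exactly half the entries are 1.

    Returns (k, balanced), where k is the number of leading bits flipped.
    Flipping one more bit changes the weight by exactly 1, and the weight
    moves from w to n - w as k goes from 0 to n, so some k yields n / 2.
    """
    n = len(bits)
    if n % 2 != 0:
        raise ValueError("length must be even")
    weight = sum(bits)
    for k in range(n + 1):
        if weight == n // 2:
            return k, [1 - b for b in bits[:k]] + list(bits[k:])
        if k < n:
            # Flipping bits[k]: a 0 becomes 1 (+1), a 1 becomes 0 (-1).
            weight += 1 - 2 * bits[k]
    raise AssertionError("unreachable for even-length input")


def knuth_unbalance(k, balanced):
    """Invert knuth_balance given the flip index k."""
    return [1 - b for b in balanced[:k]] + list(balanced[k:])


def is_gc_balanced(dna):
    """True if exactly half of the symbols are G or C."""
    return 2 * sum(ch in "GC" for ch in dna) == len(dna)


# Hypothetical map: the first bit of each pair decides GC (1) versus AT (0).
PAIR_TO_BASE = {(0, 0): "A", (0, 1): "T", (1, 0): "C", (1, 1): "G"}


def to_dna(upper, lower):
    """Combine two equal-length bit lists into a DNA string, one base per bit pair."""
    return "".join(PAIR_TO_BASE[(u, l)] for u, l in zip(upper, lower))


if __name__ == "__main__":
    message = [1, 1, 1, 1, 0, 1]
    k, balanced = knuth_balance(message)
    dna = to_dna(balanced, [0, 1, 0, 1, 1, 0])
    print(k, balanced, dna, is_gc_balanced(dna))
```

Under the assumed map, a DNA string is GC-balanced exactly when its first-bit sequence is balanced, so only that sequence needs the prefix flip. A decoder needs the index `k` to undo the flip, so a practical scheme must also transmit `k` in a balanced, protected form.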