Correcting a Single Indel/Edit for DNA-Based Data Storage: Linear-Time Encoders and Order-Optimality
Published in | IEEE transactions on information theory Vol. 67; no. 6; pp. 3438 - 3451 |
---|---|
Main Authors | , , , , |
Format | Journal Article |
Language | English |
Published | New York, IEEE, 01.06.2021, The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
ISSN | 0018-9448 1557-9654 |
DOI | 10.1109/TIT.2021.3049627 |
Summary: | An indel refers to a single insertion or deletion, while an edit refers to a single insertion, deletion or substitution. In this article, we investigate codes that correct either a single indel or a single edit and provide linear-time algorithms that encode binary messages into these codes of length $n$. Over the quaternary alphabet, we provide two linear-time encoders. One corrects a single edit with $\lceil \log n \rceil + O(\log \log n)$ redundant bits, while the other corrects a single indel with $\lceil \log n \rceil + 2$ redundant bits. These two encoders are order-optimal. The former encoder is the first known order-optimal encoder that corrects a single edit, while the latter encoder (that corrects a single indel) reduces the redundancy of the best known encoder of Tenengolts (1984) by at least four bits. Over the DNA alphabet, we impose an additional constraint, the $\mathtt{GC}$-balanced constraint, and require that exactly half of the symbols of any DNA codeword be either $\mathtt{C}$ or $\mathtt{G}$. In particular, via a modification of Knuth's balancing technique, we provide a linear-time map that translates binary messages into $\mathtt{GC}$-balanced codewords, and the resulting codebook is able to correct a single indel or a single edit. These are the first known constructions of $\mathtt{GC}$-balanced codes that correct a single indel or a single edit. |
---|---|
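The notions named in the summary can be illustrated with a minimal Python sketch. It is not the article's encoder: it uses the classical binary Varshamov-Tenengolts (VT) syndrome and the basic form of Knuth's prefix-flipping idea as stand-ins, and all function names (`is_gc_balanced`, `vt_syndrome`, `deletion_ball`, `knuth_balance_index`) are hypothetical, chosen only for illustration.

```python
from itertools import product


def is_gc_balanced(word: str) -> bool:
    """True if exactly half of the symbols of a DNA word are C or G."""
    gc = sum(ch in "CG" for ch in word)
    return 2 * gc == len(word)


def vt_syndrome(bits) -> int:
    """Classical binary VT syndrome: sum of i * x_i (1-indexed) mod (n + 1).

    Assumption: this is the textbook binary construction, used here only
    as a stand-in for the quaternary codes discussed in the article.
    """
    n = len(bits)
    return sum(i * b for i, b in enumerate(bits, start=1)) % (n + 1)


def deletion_ball(bits):
    """All words obtained from `bits` (a tuple) by deleting one symbol."""
    return {bits[:i] + bits[i + 1:] for i in range(len(bits))}


def knuth_balance_index(bits) -> int:
    """Smallest k such that flipping the first k bits balances the word.

    Basic Knuth balancing for even length: the weight changes by one per
    step and moves from w to n - w, so it must pass through n / 2.
    """
    n = len(bits)
    if n % 2 != 0:
        raise ValueError("length must be even")
    for k in range(n + 1):
        flipped = tuple(1 - b for b in bits[:k]) + tuple(bits[k:])
        if 2 * sum(flipped) == n:
            return k
    raise AssertionError("unreachable for even length")


# Brute-force check for a small length: words sharing VT syndrome 0 have
# pairwise disjoint single-deletion balls, so one deletion is correctable.
n = 6
code = [w for w in product((0, 1), repeat=n) if vt_syndrome(w) == 0]
single_deletion_correctable = all(
    deletion_ball(u).isdisjoint(deletion_ball(v))
    for u in code
    for v in code
    if u != v
)
```

The brute-force check only confirms the single-deletion property for length 6. The linear-time encoders and the modified balancing map described in the summary are not reproduced here.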