Natural-Language Text Compression Using Reverse Multi-Delimiter Codes

Bibliographic Details
Published in Cybernetics and Systems Analysis, Vol. 60, no. 1, pp. 1-12
Main Authors Anisimov, A. V., Zavadskyi, I. O., Chudakov, T. S.
Format Journal Article
Language English
Published New York: Springer US, 01.01.2024
Springer Nature B.V.
ISSN 1060-0396
1573-8337
DOI 10.1007/s10559-024-00641-2

Summary: This paper studies binary reverse multi-delimiter (RMD) data compression codes. RMD codes have a range of useful properties, such as unique decodability, completeness, universality, synchronizability, recognition using a finite automaton, and the ability for rapid data retrieval within an encoded file. The authors have constructed a simple monotonic mapping from the set of non-negative integers to the codeword set. Based on this mapping, they have developed a fast byte-aligned decoding algorithm. Computer experiments demonstrate that RMD codes can be decoded nearly as quickly as the SCDC code and several times faster than the Fibonacci code. Compared to known codes of a similar type, RMD codes exhibit a better compression ratio for natural language texts (more than four times closer to the entropy bound than SCDC). Additionally, the paper describes a technology for preprocessing natural language texts, which, combined with encoding using RMD codes, enhances the efficiency of powerful modern archivers.