Natural-Language Text Compression Using Reverse Multi-Delimiter Codes
Published in | Cybernetics and Systems Analysis, Vol. 60, no. 1, pp. 1-12 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published | New York, Springer US, 01.01.2024, Springer Nature B.V |
ISSN | 1060-0396 1573-8337 |
DOI | 10.1007/s10559-024-00641-2 |
Summary: | This paper studies binary reverse multi-delimiter (RMD) data compression codes. RMD codes have a range of useful properties, such as unique decodability, completeness, universality, synchronizability, recognition using a finite automaton, and the ability for rapid data retrieval within an encoded file. The authors have constructed a simple monotonic mapping from the set of non-negative integers to the codeword set. Based on this mapping, they have developed a fast byte-aligned decoding algorithm. Computer experiments demonstrate that RMD codes can be decoded nearly as quickly as the SCDC code and several times faster than the Fibonacci code. Compared to known codes of a similar type, RMD codes exhibit a better compression ratio for natural language texts (more than four times closer to the entropy bound than SCDC). Additionally, the paper describes a technology for preprocessing natural language texts, which, combined with encoding using RMD codes, enhances the efficiency of powerful modern archivers. |
---|---|