A New String Edit Distance and Applications
Main Authors | , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 11.03.2022 |
Subjects | |
Summary: | String edit distances have been used for decades in applications ranging from
spelling correction and web search suggestions to DNA analysis. Most string
edit distances are variations of the Levenshtein distance and consider only
single-character edits. Forensic applications use polymorphic genetic markers
such as short tandem repeats (STRs). At these repetitive motifs, DNA copying
errors consist of more than just single-base differences. More often, the
phenomenon of "stutter" is observed, where the number of repeated units
differs (by whole units) from the template. To adapt the Levenshtein
distance to be suitable for forensic applications where DNA sequence similarity
is of interest, a generalized string edit distance is defined that accommodates
the addition or deletion of whole motifs in addition to single-nucleotide
edits. A dynamic programming implementation is developed for computing this
distance between sequences. The novelty of this algorithm is in handling the
complex interactions that arise between multiple- and single-character edits.
Forensic examples illustrate the purpose and use of the Restricted Forensic
Levenshtein (RFL) distance measure, but applications extend to sequence
alignment and string similarity in other biological areas, as well as dynamic
programming algorithms more broadly. |
---|---|
DOI: | 10.48550/arxiv.2203.06138 |
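
The idea in the summary, a Levenshtein-style dynamic program that also allows whole motifs to be added or removed, can be illustrated with a minimal sketch. This is not the RFL algorithm from the paper: it is a simplified variant that lets any occurrence of a listed motif be inserted or deleted as one edit, and it ignores the interactions between multi- and single-character edits that the paper addresses. The function name `motif_edit_distance`, its parameters, and the unit costs are all assumptions made for illustration.

```python
def motif_edit_distance(a, b, motifs=(), motif_cost=1, indel_cost=1, sub_cost=1):
    """Levenshtein-style distance with an extra whole-motif indel operation.

    Hypothetical sketch: a contiguous copy of any string in `motifs` may be
    inserted or deleted at cost `motif_cost`, alongside the usual
    single-character insertions, deletions, and substitutions.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] = cheapest way to turn a[:i] into b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = inf
            if i > 0:  # delete one character of a
                best = min(best, d[i - 1][j] + indel_cost)
            if j > 0:  # insert one character of b
                best = min(best, d[i][j - 1] + indel_cost)
            if i > 0 and j > 0:  # match or substitute
                step = 0 if a[i - 1] == b[j - 1] else sub_cost
                best = min(best, d[i - 1][j - 1] + step)
            for motif in motifs:
                k = len(motif)
                if k == 0:
                    continue
                # delete a whole motif copy that ends at position i in a
                if i >= k and a[i - k:i] == motif:
                    best = min(best, d[i - k][j] + motif_cost)
                # insert a whole motif copy that ends at position j in b
                if j >= k and b[j - k:j] == motif:
                    best = min(best, d[i][j - k] + motif_cost)
            d[i][j] = best
    return d[n][m]


# Usage: two STR-like alleles that differ by one repeat unit of "AGAT".
template = "AGATAGATAGAT"
stutter = "AGATAGAT"
plain = motif_edit_distance(template, stutter)
with_motif = motif_edit_distance(template, stutter, motifs=("AGAT",))
```

With an empty `motifs` tuple the function reduces to the ordinary Levenshtein distance, so the two calls in the usage lines show how a one-unit stutter difference collapses from four single-base deletions to a single motif edit under the assumed costs.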