Analyzing DNA Pattern Matching through String Similarity Measurements in Cancer Sequence Data

Bibliographic Details
Published in	2023 International Conference on Sustainable Communication Networks and Application (ICSCNA), pp. 1373-1381
Main Authors	A, Lincy; Ebenezer, V.; V M, Arul Xavier; Isaac, A Joshua; Jenefa, A.; Naveen, Edward
Format	Conference Proceeding
Language	English
Published	IEEE 15.11.2023
Subjects	Bioinformatics; Biology; Cancer sequence data; DNA; DNA pattern matching; DNA sequencing; Gene expression; Genomic analysis; Memory management; Sequence alignment; Sequential analysis; String similarity; Time measurement

Summary:	Cancer is a deadly disease whose exact cause remains unknown. Classifying the different kinds of tumors is essential for diagnosing cancer and discovering new treatments. The majority of prior cancer classification studies, however, are clinically oriented, which limits their diagnostic utility. Classifying cancer using gene expression data is crucial to cancer diagnosis and drug discovery. Genes carry information about the causes of cancer, so searching human DNA sequences for patterns associated with cancer is important. Some short stretches of human DNA include cancer-causing gene sequences. This work applies similarity measures from string matching to bioinformatics and evaluates which method is best suited to cancer pattern searching. Ten distance measures are tested on 15 human DNA sequence datasets, searching for eight cancer patterns. Each distance measure is defined by a mathematical formula whose value indicates whether a cancer pattern is found exactly or approximately. Of the ten distance measures, only seven yield an exact match, and among them Levenshtein is very computationally intensive. Closest matches are found using the Jaro-Winkler, Jaccard, and Cosine distances. Six distance measures (Euclidean, Manhattan, Minkowski, Canberra, Hamming, and Jaro) are identified as the best in terms of accuracy, time, and space complexity, and are applicable to pattern search in biological sequences.
DOI:	10.1109/ICSCNA58489.2023.10370648
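
The summary describes searching DNA sequences for cancer patterns by computing a string distance and reading a distance of zero as an exact match and a small non-zero distance as an approximate one. The record does not say how the paper implements this, so the following is only a minimal sketch under stated assumptions: a sliding window the same length as the pattern is assumed, the function names (`hamming`, `levenshtein`, `scan`, `best_matches`) are hypothetical, and the toy sequence is invented for illustration and is not data from the paper.

```python
from typing import Callable, List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute), computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def scan(sequence: str, pattern: str,
         distance: Callable[[str, str], int]) -> List[Tuple[int, int]]:
    """Distance between the pattern and every same-length window.

    Assumption: a fixed-width sliding window; the paper may do otherwise.
    """
    m = len(pattern)
    return [(i, distance(sequence[i:i + m], pattern))
            for i in range(len(sequence) - m + 1)]


def best_matches(sequence: str, pattern: str,
                 distance: Callable[[str, str], int]) -> List[Tuple[int, int]]:
    """Windows with the smallest distance; distance 0 means an exact match."""
    scores = scan(sequence, pattern, distance)
    if not scores:
        return []
    best = min(d for _, d in scores)
    return [(i, d) for i, d in scores if d == best]


# Toy data, invented for illustration only.
sequence = "ACGTTGCAACGT"
print(best_matches(sequence, "TGCA", hamming))      # exact match expected
print(best_matches(sequence, "TGGA", hamming))      # closest match only
print(best_matches(sequence, "TGCA", levenshtein))
```

In this sketch the Hamming scan does work proportional to the pattern length per window, while the Levenshtein scan does work proportional to its square, which is in line with the summary's remark that Levenshtein is computationally intensive.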