Analyzing DNA Pattern Matching through String Similarity Measurements in Cancer Sequence Data
Published in | 2023 International Conference on Sustainable Communication Networks and Application (ICSCNA) pp. 1373 - 1381 |
---|---|
Format | Conference Proceeding |
Language | English |
Published | IEEE, 15.11.2023 |
Summary: | Cancer is a deadly disease whose actual cause remains unknown. Classifying the different kinds of tumors is essential for diagnosing cancer and discovering new treatments. The majority of prior cancer classification studies, however, are clinically oriented, which limits their diagnostic utility. Classifying cancer using gene expression data is crucial to cancer diagnosis and medication discovery. Genes carry information about the causes of cancer, so searching for cancer-associated patterns in human DNA sequences is an important task. Some small stretches of human DNA include cancer-causing gene sequences. The proposed work applies similarity measures from string matching to bioinformatics and evaluates which method is best suited to cancer pattern searching. Ten distance measures are considered and tested on 15 human DNA sequence datasets, searching for eight cancer patterns. Each distance measure relies on its own mathematical formulation to determine whether a cancer pattern is found exactly or approximately. Of the ten distances, only seven yield an exact match, and among them Levenshtein is very computationally intensive. The remaining three, Jaro-Winkler, Jaccard and Cosine, find only the closest matches. Six distance measures (Euclidean, Manhattan, Minkowski, Canberra, Hamming and Jaro) are identified as the best metrics in terms of accuracy, time and space complexity, and are applicable to pattern search in biological sequences. |
---|---|
DOI: | 10.1109/ICSCNA58489.2023.10370648 |
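
The summary does not spell out how the distance measures are applied to the sequences. As a rough illustration only, the sketch below assumes a sliding-window search: every window of the sequence that has the pattern's length is compared with the pattern, a distance of 0 counts as an exact match, and a small non-zero distance counts as an approximate one. The function names, the toy sequence and patterns, and the `max_dist` threshold are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence.

    Costs O(len(a) * len(b)) per comparison, which is why it is much
    heavier than Hamming when evaluated for every window.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def window_search(sequence: str,
                  pattern: str,
                  distance: Callable[[str, str], int],
                  max_dist: int = 0) -> List[Tuple[int, int]]:
    """Return (start index, distance) for each window within max_dist.

    Assumption: fixed-width windows of len(pattern); max_dist=0 keeps
    exact matches only.
    """
    k = len(pattern)
    hits = []
    for i in range(len(sequence) - k + 1):
        d = distance(sequence[i:i + k], pattern)
        if d <= max_dist:
            hits.append((i, d))
    return hits


# Toy data, invented for illustration (not a real cancer pattern).
sequence = "ACGTTGCAACGT"
exact_hits = window_search(sequence, "TGCA", hamming)               # exact match
approx_hits = window_search(sequence, "TGGA", hamming, max_dist=1)  # one mismatch allowed
```

Passing `levenshtein` instead of `hamming` reuses the same search loop at a higher per-window cost, which is consistent with the summary's remark that Levenshtein is computationally intensive.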