Utilizing Machine Learning Techniques for Cancer Prediction and Classification based on Gene Expression Data
Published in | UHD Journal of Science and Technology Vol. 9; no. 1; pp. 135 - 148 |
---|---|
Main Authors | , |
Format | Journal Article |
Language | English |
Published | University of Human Development, 02.06.2025 |
Summary: | Cancer classification based on genetic analysis has become an active research topic. It promises systematic, precise, and scientifically grounded diagnoses for different types of cancer. Several recent studies have approached cancer classification with data mining techniques, machine learning algorithms, and statistical methods for analyzing high-dimensional datasets. Detecting cancer early by examining gene expression data is vital for effective patient care. Each sample in a gene expression dataset usually includes a range of features, each representing a specific gene. In this paper, we propose an approach that uses DistilBERT, a distilled version of Bidirectional Encoder Representations from Transformers (BERT), for cancer classification and prediction. In addition, our model integrates a self-attention mechanism in the transformer layers to sharpen its focus on key features and employs an embedding layer for dimensionality reduction, which improves the processing of gene expression data, helps prevent overfitting, and boosts generalization. We used datasets from two major sources: the Gene Expression Omnibus (GEO), which provided microarray data for lung and ovarian cancers, and The Cancer Genome Atlas (TCGA), which provided RNA-Seq data covering multiple cancer types (breast invasive carcinoma, kidney renal clear cell carcinoma, colon adenocarcinoma, lung adenocarcinoma, and prostate adenocarcinoma). Our approach achieved high accuracy across all datasets, showing substantial improvements in overall model performance over existing methods in the field. The results underscore the potential of transformer-based architectures for robust cancer-type prediction and classification.
Our approach improved on the accuracy reported in previous studies, reaching 97.56% on DS1 (lung cancer), 100% on DS2 (ovarian cancer), and 99.504% on DS3 (the TCGA dataset). |
---|---|
ISSN: | 2521-4209 2521-4217 |
DOI: | 10.21928/uhdjst.v9n1y2025.pp135-148 |
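
The summary mentions an embedding layer for dimensionality reduction followed by self-attention over gene features. The record does not say how expression values are turned into tokens, so the following is only a minimal sketch of that general idea, not the authors' DistilBERT model. The chunking scheme, the sizes (`n_genes`, `chunk_size`, `d_model`), the random untrained weights, and all function names are assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Untrained placeholder weights; a real model would learn these."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def embed_sample(expression, chunk_size, w_embed):
    """Assumed tokenization: split the gene vector into fixed-size chunks
    and project each chunk down to d_model dimensions."""
    tokens = []
    for start in range(0, len(expression), chunk_size):
        chunk = expression[start:start + chunk_size]
        chunk = chunk + [0.0] * (chunk_size - len(chunk))  # zero-pad last chunk
        tokens.append(matvec(w_embed, chunk))
    return tokens


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the tokens."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append(
            [sum(a * v[j] for a, v in zip(attn, values)) for j in range(len(values[0]))]
        )
    return outputs, weights


# Hypothetical toy sizes; real expression profiles have thousands of genes.
rng = random.Random(0)
n_genes, chunk_size, d_model = 20, 4, 3
sample = [rng.random() for _ in range(n_genes)]  # synthetic expression values

w_embed = random_matrix(d_model, chunk_size, rng)
w_q, w_k, w_v = (random_matrix(d_model, d_model, rng) for _ in range(3))

tokens = embed_sample(sample, chunk_size, w_embed)
attended, attn_weights = self_attention(tokens, w_q, w_k, w_v)

# Mean-pool the attended tokens into one vector a classifier head could consume.
pooled = [sum(col) / len(attended) for col in zip(*attended)]
```

Here 20 synthetic gene values become 5 tokens of 3 dimensions each, and each row of `attn_weights` shows how strongly one token attends to the others. A trained model would learn the projection matrices and stack several such layers before a classification head.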