Utilizing Machine Learning Techniques for Cancer Prediction and Classification based on Gene Expression Data

Bibliographic Details
Published in	UHD Journal of Science and Technology Vol. 9; no. 1; pp. 135 - 148
Main Authors	Hama Aziz, Mariwan Mahmood; Mahmood, Sozan Abdullah
Format	Journal Article
Language	English
Published	University of Human Development 02.06.2025
Subjects	bidirectional encoder representations from transformers model; cancer classification; DistilBERT; DNA microarray; gene expression data; machine learning; pan-cancer; RNA-Seq; The Cancer Genome Atlas
Summary:	Cancer classification through genetic analysis has become an active research topic. It promises systematic, precise, and scientifically grounded diagnoses for different types of cancer. Several recent studies have approached cancer classification with data mining techniques, machine learning algorithms, and statistical methods suited to high-dimensional datasets. Detecting cancer early by examining gene expression data is vital for providing effective patient care. Each sample in a gene expression dataset typically comprises many features, each representing a specific gene. In this paper, we propose an approach that uses DistilBERT, a distilled version of Bidirectional Encoder Representations from Transformers (BERT), for cancer classification and prediction. Our model integrates a self-attention mechanism in the transformer layers to sharpen its focus on key features and employs an embedding layer for dimensionality reduction, which improves the processing of gene expression data, helps prevent overfitting, and boosts generalization. We used datasets from two resources: the Gene Expression Omnibus, which provided microarray data for lung and ovarian cancers, and The Cancer Genome Atlas (TCGA), which provided RNA-Seq data covering multiple cancer types (breast invasive carcinoma, kidney renal clear cell carcinoma, colon adenocarcinoma, lung adenocarcinoma, and prostate adenocarcinoma). Our approach achieved high accuracy across all datasets, showing substantial improvements in overall model performance over existing methods in the field. The results underscore the potential of transformer-based architectures for robust cancer-type prediction and classification.
Our approach achieved higher accuracy than previous studies: 97.56% for lung cancer (DS1), 100% for ovarian cancer (DS2), and 99.504% for the TCGA dataset (DS3).
ISSN:	2521-4209 2521-4217
DOI:	10.21928/uhdjst.v9n1y2025.pp135-148
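
The pipeline outlined in the summary (an embedding layer that shortens the gene expression vector, self-attention over the resulting tokens, then a classification step) can be illustrated with a toy sketch in plain Python. This is not the authors' implementation and not DistilBERT itself: the chunk-wise linear embedding, the single attention head, the mean pooling, the random untrained weights, and every name and size below are assumptions chosen for illustration only. The five output classes mirror the five TCGA cancer types listed in the summary.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def random_matrix(rows, cols, rng):
    """Untrained placeholder weights; a real model would learn these."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def embed(expression, w_embed, chunk):
    """Project each chunk of `chunk` gene values to a shorter token vector
    (hypothetical stand-in for the embedding layer in the summary)."""
    chunks = [expression[i:i + chunk] for i in range(0, len(expression), chunk)]
    return [matvec(w_embed, c) for c in chunks]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the tokens."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        attn = softmax(scores)  # how strongly this token attends to each token
        weights.append(attn)
        outputs.append([sum(a * v[d] for a, v in zip(attn, values))
                        for d in range(len(values[0]))])
    return outputs, weights

def classify(tokens, w_out):
    """Mean-pool the tokens and map them to class probabilities."""
    dim = len(tokens[0])
    pooled = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return softmax(matvec(w_out, pooled))

# Illustrative sizes only: 12 genes -> 3 tokens of width 2 -> 5 cancer types.
N_GENES, CHUNK, DIM, N_CLASSES = 12, 4, 2, 5
rng = random.Random(0)
expression = [rng.random() for _ in range(N_GENES)]  # synthetic sample, not real data
w_embed = random_matrix(DIM, CHUNK, rng)
w_q, w_k, w_v = (random_matrix(DIM, DIM, rng) for _ in range(3))
w_out = random_matrix(N_CLASSES, DIM, rng)

tokens = embed(expression, w_embed, CHUNK)
attended, weights = self_attention(tokens, w_q, w_k, w_v)
probs = classify(attended, w_out)
print([round(p, 3) for p in probs])
```

Because the weights are random, the printed probabilities carry no diagnostic meaning; the sketch only shows how the data changes shape at each stage. A practical version would rely on a trained transformer from an established deep learning library rather than hand-written loops.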