Analysis of Gene Expression Cancer Data Set: Classification of TCGA Pan-cancer HiSeq Data
Published in | 2021 IEEE International Conference on Big Data (Big Data), pp. 4745-4752 |
---|---|
Main Authors | , , |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 15.12.2021 |
DOI | 10.1109/BigData52589.2021.9671793 |
Summary: | In our research, supervised machine learning algorithms were applied to cancer classification and their performance was compared. We used eight machine learning algorithms: Decision Tree, Gradient Boosting, K-Nearest Neighbors, Logistic Regression, Naïve Bayes, Neural Network, Random Forest, and Support Vector Machine. Models were generated by training each algorithm on the TCGA Pan-cancer HiSeq data set, an RNA sequencing (RNA-seq) data set covering five cancer types: breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), kidney renal clear cell carcinoma (KIRC), lung adenocarcinoma (LUAD), and prostate adenocarcinoma (PRAD). The data set was preprocessed with feature selection, oversampling, and normalization: keeping only the best features and discarding inconsequential ones, balancing the sample size of each cancer type, and rescaling the values of numeric attributes. Our goal was to determine which algorithm produces the classification model that performs best at categorizing cancer types, using the following evaluation measures: accuracy, precision, recall, area under the curve (AUC) score, F1 score, and processing time. |
---|---|
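
The summary names oversampling, normalization, and several evaluation measures without saying which variants were used. The following is a minimal standard-library sketch of what such steps can look like, not the authors' implementation. It assumes random oversampling (duplicating minority-class samples), min-max rescaling, and macro-averaged precision, recall, and F1. The function names `oversample`, `min_max_scale`, and `macro_scores` are hypothetical, and feature selection, AUC, and timing are left out.

```python
import random
from collections import Counter


def oversample(rows, labels, seed=0):
    """Random oversampling (an assumed variant): duplicate rows of smaller
    classes until every class has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def min_max_scale(rows):
    """Min-max normalization (an assumed variant): rescale each column
    to [0, 1]. Constant columns are mapped to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        [(v - lo) / sp if sp else 0.0 for v, lo, sp in zip(row, lows, spans)]
        for row in rows
    ]


def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 score."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

In a typical workflow, oversampling would be applied to the training split only, so that duplicated samples do not leak into the evaluation set; whether the paper did so is not stated in the summary.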