Analysis of Gene Expression Cancer Data Set: Classification of TCGA Pan-cancer HiSeq Data

Bibliographic Details
Published in: 2021 IEEE International Conference on Big Data (Big Data), pp. 4745-4752
Main Authors: Nitta, Yusaku; Borders, Mitchell; Ludwig, Simone A.
Format: Conference Proceeding
Language: English
Published: IEEE, 15.12.2021
DOI: 10.1109/BigData52589.2021.9671793

Summary: In our research, supervised machine learning algorithms were applied to cancer classification and their performance was compared. We used eight machine learning algorithms: Decision Tree, Gradient Boosting, K-Nearest Neighbors, Logistic Regression, Naïve Bayes, Neural Network, Random Forest, and Support Vector Machine. Models were generated by training the algorithms on the TCGA Pan-cancer HiSeq data set, an RNA sequencing (RNA-seq) data set covering five cancer types: breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), kidney renal clear cell carcinoma (KIRC), lung adenocarcinoma (LUAD), and prostate adenocarcinoma (PRAD). The data set was preprocessed with feature selection, oversampling, and normalization: feature selection kept only the best features and removed the inconsequential ones, oversampling balanced the sample size of each cancer type, and normalization rescaled the values of the numeric attributes. Our goal was to determine which algorithm generates the classification model that performs best at categorizing cancer types, using the following evaluation measures: accuracy, precision, recall, area under the curve (AUC) score, F1 score, and processing time.
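The preprocessing and scoring steps named in the summary can be illustrated with a minimal, standard-library-only sketch. This record does not say which feature selector, oversampler, or scaler the authors used, so the choices here (variance ranking, random duplication of minority samples, min-max scaling, macro averaging) are assumptions for illustration, and every function name and the toy data are hypothetical. AUC and processing time are left out for brevity.

```python
import random
from collections import Counter
from statistics import pvariance


def select_top_variance(rows, k):
    """Keep the k feature columns with the highest variance (assumed selector)."""
    n_features = len(rows[0])
    variances = [pvariance([r[j] for r in rows]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])
    return [[r[j] for j in keep] for r in rows], keep


def oversample(rows, labels, seed=0):
    """Randomly duplicate minority-class samples until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def min_max_scale(rows):
    """Rescale each feature column to [0, 1]; constant columns map to 0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(r, lo, hi)]
        for r in rows
    ]


def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    classes = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy stand-in for an expression matrix: 3 features, imbalanced labels.
X = [[1.0, 5.0, 0.2], [2.0, 5.0, 0.1], [9.0, 5.0, 0.3], [8.0, 5.0, 0.2], [9.5, 5.0, 0.1]]
y = ["BRCA", "BRCA", "KIRC", "KIRC", "KIRC"]

X_sel, kept = select_top_variance(X, k=2)   # drops the constant middle column
X_bal, y_bal = oversample(X_sel, y)         # adds one duplicated BRCA sample
X_scaled = min_max_scale(X_bal)             # every value now lies in [0, 1]

# Hypothetical predictions, only to exercise the metric helper.
scores = macro_scores(["BRCA", "BRCA", "KIRC", "KIRC"], ["BRCA", "KIRC", "KIRC", "KIRC"])
```

In a real evaluation the oversampling and scaling parameters would normally be fitted on the training split only, so that duplicated samples and test-set statistics do not leak into the held-out scores.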