Analysis of Gene Expression Cancer Data Set: Classification of TCGA Pan-cancer HiSeq Data

Bibliographic Details
Published in	2021 IEEE International Conference on Big Data (Big Data), pp. 4745-4752
Main Authors	Nitta, Yusaku; Borders, Mitchell; Ludwig, Simone A.
Format	Conference Proceeding
Language	English
Published	IEEE 15.12.2021
Subjects	Breast invasive carcinoma; Colon adenocarcinoma; Data models; Feature extraction; Kidney renal clear cell carcinoma; Lung adenocarcinoma; Machine learning algorithms; Prediction algorithms; Predictive models; Prostate adenocarcinoma; RNA; RNA sequencing; Sequential analysis; TCGA Pan-cancer HiSeq data
DOI	10.1109/BigData52589.2021.9671793

Summary:	In our research, supervised machine learning algorithms were applied to cancer classification and their capabilities compared. Our research used eight machine learning algorithms: Decision Tree, Gradient Boosting, K-Nearest Neighbors, Logistic Regression, Naïve Bayes, Neural Network, Random Forest, and Support Vector Machine. Machine learning models were generated by training the algorithms on the TCGA Pan-cancer HiSeq data set. This is an RNA sequencing (RNA-seq) data set covering five cancer types: breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), kidney renal clear cell carcinoma (KIRC), lung adenocarcinoma (LUAD), and prostate adenocarcinoma (PRAD). The data set was preprocessed with feature selection, oversampling, and normalization techniques. These preprocessing steps kept only the best features and removed the inconsequential ones, balanced the sample size of each cancer type, and rescaled the values of numeric attributes. Our goal was to determine which algorithm generates the best-performing classification model for categorizing cancer types, judged by the following evaluation measures: accuracy, precision, recall, area under the curve (AUC) score, F1 score, and processing time.
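
The summary names the preprocessing steps and evaluation measures but not the specific techniques behind them. As a rough illustration only, the sketch below assumes random oversampling (duplicating minority-class samples) and min-max normalization, and computes accuracy with macro-averaged precision, recall, and F1. These choices, the function names, and the toy data are assumptions made for the example, not the authors' method; feature selection, AUC, and timing are left out for brevity.

```python
import random
from collections import defaultdict


def oversample(rows, labels, seed=0):
    """Random oversampling (an assumed technique): duplicate rows of smaller
    classes until every class matches the size of the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def min_max_scale(rows):
    """Rescale each column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


if __name__ == "__main__":
    # Toy expression values (hypothetical), with an imbalanced label set.
    rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 10.0]]
    labels = ["BRCA", "BRCA", "BRCA", "COAD"]
    balanced_rows, balanced_labels = oversample(rows, labels)
    scaled_rows = min_max_scale(balanced_rows)
    print(len(scaled_rows), sorted(set(balanced_labels)))
    print(macro_scores(["BRCA", "BRCA", "COAD", "COAD"],
                       ["BRCA", "COAD", "COAD", "COAD"]))
```

In practice the oversampling and scaling would be fitted on the training split only, so that duplicated samples and scaling ranges do not leak into the held-out data used for the evaluation measures.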