Analysis of Gene Expression Cancer Data Set: Classification of TCGA Pan-cancer HiSeq Data

Bibliographic Details
Published in 2021 IEEE International Conference on Big Data (Big Data), pp. 4745-4752
Main Authors Nitta, Yusaku; Borders, Mitchell; Ludwig, Simone A.
Format Conference Proceeding
Language English
Published IEEE 15.12.2021
DOI 10.1109/BigData52589.2021.9671793

Abstract In this research, we applied supervised machine learning algorithms to cancer classification and compared their performance. We used eight algorithms: Decision Tree, Gradient Boosting, K-Nearest Neighbors, Logistic Regression, Naïve Bayes, Neural Network, Random Forest, and Support Vector Machine. Models were generated by training each algorithm on the TCGA Pan-cancer HiSeq data set, an RNA sequencing (RNA-seq) data set covering five cancer types: breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), kidney renal clear cell carcinoma (KIRC), lung adenocarcinoma (LUAD), and prostate adenocarcinoma (PRAD). The data set was preprocessed with feature selection, oversampling, and normalization: only the best features were retained and inconsequential ones removed, the sample size of each cancer type was balanced, and the values of numeric attributes were rescaled. Our goal was to determine which algorithm produces the classification model with the best performance in categorizing cancer types, using the following evaluation measures: accuracy, precision, recall, area under the curve (AUC) score, F1 score, and processing time.
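The abstract names its preprocessing steps and evaluation measures without giving details, so the following is only a minimal illustrative sketch and not the authors' implementation. It assumes random oversampling (duplicating minority-class samples) and min-max rescaling, neither of which the record confirms, and it computes accuracy plus macro-averaged precision, recall, and F1. Feature selection, AUC, timing, and the eight classifiers are left out; the function names and the toy values are hypothetical.

```python
import random
from collections import Counter


def oversample(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def min_max_scale(rows):
    """Rescale every column to [0, 1]; a constant column maps to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(2 * precision * recall / denom if denom else 0.0)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


if __name__ == "__main__":
    # Made-up expression values for two of the five cancer types.
    toy_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    toy_labels = ["BRCA", "BRCA", "COAD"]
    balanced_rows, balanced_labels = oversample(toy_rows, toy_labels)
    print(Counter(balanced_labels))
    print(min_max_scale(balanced_rows))
    print(macro_scores(["BRCA", "BRCA", "COAD", "COAD"],
                       ["BRCA", "COAD", "COAD", "COAD"]))
```

In practice a library such as scikit-learn would likely supply the scaling, metrics, and classifiers; the hand-written versions here only make each step explicit.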
Author_xml – sequence: 1
  givenname: Yusaku
  surname: Nitta
  fullname: Nitta, Yusaku
  email: yusaku.nitta@ndsu.edu
  organization: North Dakota State University, Department of Computer Science, Fargo, USA
– sequence: 2
  givenname: Mitchell
  surname: Borders
  fullname: Borders, Mitchell
  email: mitchell.borders@ndsu.edu
  organization: North Dakota State University, Department of Computer Science, Fargo, USA
– sequence: 3
  givenname: Simone A.
  surname: Ludwig
  fullname: Ludwig, Simone A.
  email: simone.ludwig@ndsu.edu
  organization: North Dakota State University, Department of Computer Science, Fargo, USA
ContentType Conference Proceeding
EISBN 1665439025
9781665439022
EndPage 4752
ExternalDocumentID 9671793
Genre orig-research
IsPeerReviewed false
IsScholarly false
Language English
PageCount 8
ParticipantIDs ieee_primary_9671793
PublicationCentury 2000
PublicationDate 2021-Dec.-15
PublicationDateYYYYMMDD 2021-12-15
PublicationDate_xml – month: 12
  year: 2021
  text: 2021-Dec.-15
  day: 15
PublicationDecade 2020
PublicationTitle 2021 IEEE International Conference on Big Data (Big Data)
PublicationTitleAbbrev Big Data
PublicationYear 2021
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 4745
SubjectTerms Prostate adenocarcinoma
breast invasive carcinoma
colon adenocarcinoma
Data models
Feature extraction
kidney renal clear cell carcinoma
lung adenocarcinoma
Machine learning algorithms
Prediction algorithms
Predictive models
RNA
RNA Sequencing
Sequential analysis
TCGA Pan-cancer HiSeq data
Title Analysis of Gene Expression Cancer Data Set: Classification of TCGA Pan-cancer HiSeq Data
URI https://ieeexplore.ieee.org/document/9671793