Analysis of Gene Expression Cancer Data Set: Classification of TCGA Pan-cancer HiSeq Data
In our research, supervised machine learning algorithms were applied to analyze and compare their capability of cancer classification. Our research used eight machine learning algorithms: Decision Tree, Gradient Boosting, K-Nearest Neighbors, Logistic Regression, Naïve Bayes, Neural Network, Random...
Published in | 2021 IEEE International Conference on Big Data (Big Data) pp. 4745 - 4752 |
---|---|
Main Authors | Nitta, Yusaku; Borders, Mitchell; Ludwig, Simone A. |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 15.12.2021 |
DOI | 10.1109/BigData52589.2021.9671793 |
Abstract | In our research, supervised machine learning algorithms were applied to cancer classification and their performance was compared. We used eight machine learning algorithms: Decision Tree, Gradient Boosting, K-Nearest Neighbors, Logistic Regression, Naïve Bayes, Neural Network, Random Forest, and Support Vector Machine. Machine learning models were generated by training the algorithms on the TCGA Pan-cancer HiSeq data set, an RNA sequencing (RNA-seq) data set covering five cancer types: breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), kidney renal clear cell carcinoma (KIRC), lung adenocarcinoma (LUAD), and prostate adenocarcinoma (PRAD). The data set was preprocessed with feature selection, oversampling, and normalization: only the best features were retained and inconsequential ones removed, the sample size of each cancer type was balanced, and the values of numeric attributes were rescaled. Our goal was to determine which algorithm generates the classification model that performs best at categorizing cancer types, judged by the following evaluation measures: accuracy, precision, recall, area under the curve (AUC) score, F1 score, and processing time. |
---|---|
Author | Nitta, Yusaku; Ludwig, Simone A.; Borders, Mitchell |
Author_xml | – sequence: 1 givenname: Yusaku surname: Nitta fullname: Nitta, Yusaku email: yusaku.nitta@ndsu.edu organization: North Dakota State University, Department of Computer Science, Fargo, USA – sequence: 2 givenname: Mitchell surname: Borders fullname: Borders, Mitchell email: mitchell.borders@ndsu.edu organization: North Dakota State University, Department of Computer Science, Fargo, USA – sequence: 3 givenname: Simone A. surname: Ludwig fullname: Ludwig, Simone A. email: simone.ludwig@ndsu.edu organization: North Dakota State University, Department of Computer Science, Fargo, USA |
ContentType | Conference Proceeding |
DBID | 6IE 6IL CBEJK RIE RIL |
DOI | 10.1109/BigData52589.2021.9671793 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Xplore POP ALL IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: RIE name: IEEE/IET Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
EISBN | 1665439025 9781665439022 |
EndPage | 4752 |
ExternalDocumentID | 9671793 |
Genre | orig-research |
GroupedDBID | 6IE 6IL CBEJK RIE RIL |
IEDL.DBID | RIE |
IngestDate | Thu Jun 29 18:37:39 EDT 2023 |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
LinkModel | DirectLink |
PageCount | 8 |
ParticipantIDs | ieee_primary_9671793 |
PublicationCentury | 2000 |
PublicationDate | 2021-Dec.-15 |
PublicationDateYYYYMMDD | 2021-12-15 |
PublicationDate_xml | – month: 12 year: 2021 text: 2021-Dec.-15 day: 15 |
PublicationDecade | 2020 |
PublicationTitle | 2021 IEEE International Conference on Big Data (Big Data) |
PublicationTitleAbbrev | Big Data |
PublicationYear | 2021 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
SourceID | ieee |
SourceType | Publisher |
StartPage | 4745 |
SubjectTerms | Prostate adenocarcinoma; breast invasive carcinoma; colon adenocarcinoma; Data models; Feature extraction; kidney renal clear cell carcinoma; lung adenocarcinoma; Machine learning algorithms; Prediction algorithms; Predictive models; RNA; RNA Sequencing; Sequential analysis; TCGA Pan-cancer HiSeq data |
Title | Analysis of Gene Expression Cancer Data Set: Classification of TCGA Pan-cancer HiSeq Data |
URI | https://ieeexplore.ieee.org/document/9671793 |
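As a rough illustration of the workflow the abstract describes (feature selection, oversampling, normalization, then classification and scoring), the following is a minimal sketch in plain Python. The abstract does not say which feature selector, oversampler, or scaler the authors used, so variance ranking, random duplication, and min-max scaling are assumptions made here. The toy rows are synthetic stand-ins rather than TCGA data, and all function names are hypothetical. Only one of the eight algorithms (K-Nearest Neighbors) is shown, and AUC is omitted for brevity.

```python
import random
import time
from collections import Counter
from statistics import pvariance


def select_top_variance(X, k):
    """Return indices of the k highest-variance columns (assumed selector)."""
    cols = list(zip(*X))
    ranked = sorted(range(len(cols)), key=lambda j: pvariance(cols[j]), reverse=True)
    return sorted(ranked[:k])


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def fit_min_max(X):
    """Learn per-column minimum and range; constant columns get a range of 1."""
    cols = list(zip(*X))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return lows, spans


def apply_min_max(X, lows, spans):
    """Rescale rows using parameters learned on the training data."""
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in X]


def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), lab)
        for row, lab in zip(X_train, y_train)
    )
    return Counter(lab for _, lab in dists[:k]).most_common(1)[0][0]


def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true))
    prec, rec, f1 = [], [], []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        p_ = tp / (tp + fp) if tp + fp else 0.0
        r_ = tp / (tp + fn) if tp + fn else 0.0
        prec.append(p_)
        rec.append(r_)
        f1.append(2 * p_ * r_ / (p_ + r_) if p_ + r_ else 0.0)
    n = len(labels)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"accuracy": accuracy, "precision": sum(prec) / n,
            "recall": sum(rec) / n, "f1": sum(f1) / n}


# Synthetic, imbalanced toy data (NOT TCGA): 3 "genes", the middle one constant.
X_train = [[1.0, 5.0, 0.5], [1.2, 5.0, 0.4], [0.9, 5.0, 0.6], [1.1, 5.0, 0.5],
           [3.0, 5.0, 2.5], [3.2, 5.0, 2.4]]
y_train = ["BRCA"] * 4 + ["KIRC"] * 2
X_test = [[1.0, 5.0, 0.5], [3.1, 5.0, 2.5]]
y_test = ["BRCA", "KIRC"]

start = time.perf_counter()
keep = select_top_variance(X_train, k=2)              # 1. feature selection
project = lambda X: [[row[j] for j in keep] for row in X]
X_bal, y_bal = random_oversample(project(X_train), y_train)  # 2. oversampling
lows, spans = fit_min_max(X_bal)                      # 3. normalization
X_bal = apply_min_max(X_bal, lows, spans)
X_new = apply_min_max(project(X_test), lows, spans)
y_pred = [knn_predict(X_bal, y_bal, x) for x in X_new]
elapsed = time.perf_counter() - start                 # processing time
scores = macro_scores(y_test, y_pred)
print(keep, dict(Counter(y_bal)), y_pred, scores, f"{elapsed:.4f}s")
```

The selector and scaler are fitted on the training rows only and then applied to the test rows, and oversampling touches only the training set, which keeps duplicated samples out of the evaluation. In practice a library such as scikit-learn would likely replace these hand-written helpers.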