CanDLE: Illuminating Biases in Transcriptomic Pan-Cancer Diagnosis

Bibliographic Details
Published in	Computational Mathematics Modeling in Cancer Analysis, Vol. 13574, pp. 68-77
Main Authors	Mejía, Gabriel; Bloch, Natasha; Arbelaez, Pablo
Format	Book Chapter
Language	English
Published	Switzerland: Springer, 2022 (Springer Nature Switzerland)
Series	Lecture Notes in Computer Science
Subjects	Cancer classification; Cancer detection; GTEx; Machine learning; Multinomial logistic regression; TCGA

More Information
Summary:	Automatic cancer diagnosis based on RNA-Seq profiles is at the intersection of transcriptome analysis and machine learning. Methods developed for this task could be a valuable support in clinical practice and provide insights into the causal mechanisms of cancer. To correctly approach this problem, the largest existing resource (The Cancer Genome Atlas) must be complemented with healthy tissue samples from the Genotype-Tissue Expression project. In this work, we empirically prove that previous approaches to joining these databases suffer from translation biases and correct them using batch z-score normalization. Moreover, we propose CanDLE, a multinomial logistic regression model that achieves state-of-the-art performance in multilabel cancer/healthy tissue type classification (94.1% balanced accuracy) and all-vs-one cancer type detection (78.0% average $\max F_1$).
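The summary names batch z-score normalization and balanced accuracy without defining them. The following is a minimal sketch of the usual meaning of both terms, not the chapter's own implementation: it assumes "batch" means the source database of each sample, and the function names, toy data, and `eps` parameter are hypothetical choices made for illustration.

```python
import numpy as np

def batch_zscore(expr, batch, eps=1e-8):
    """Standardize each gene (column) separately within each batch.

    expr:  (n_samples, n_genes) expression matrix
    batch: (n_samples,) batch label per sample (assumed: source database)
    eps:   small constant guarding against zero variance (hypothetical)
    """
    expr = np.asarray(expr, dtype=float)
    batch = np.asarray(batch)
    out = np.empty_like(expr)
    for b in np.unique(batch):
        rows = batch == b
        mu = expr[rows].mean(axis=0)   # per-gene mean inside this batch
        sd = expr[rows].std(axis=0)    # per-gene std inside this batch
        out[rows] = (expr[rows] - mu) / (sd + eps)
    return out

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class weighs the same."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy data only: two synthetic "sources" with different offsets and scales.
rng = np.random.default_rng(0)
source_a = rng.normal(5.0, 2.0, size=(50, 4))
source_b = rng.normal(8.0, 1.0, size=(60, 4))
expr = np.vstack([source_a, source_b])
batch = np.array(["TCGA"] * 50 + ["GTEx"] * 60)

normed = batch_zscore(expr, batch)
```

After this step each gene has approximately zero mean and unit variance within each source, which removes a constant per-source shift in expression values. It would also remove genuine biological differences between sources, so whether it is appropriate depends on the study design.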
ISBN:	9783031172656; 3031172655
ISSN:	0302-9743; 1611-3349
DOI:	10.1007/978-3-031-17266-3_7