Logistic regression for disease classification using microarray data: model selection in a large p and small n case

Bibliographic Details
Published in	Bioinformatics Vol. 23; no. 15; pp. 1945 - 1951
Main Authors	Liao, J.G.; Chin, Khew-Voon
Format	Journal Article
Language	English
Published	Oxford: Oxford University Press, 01.08.2007; Oxford Publishing Limited (England)
Subjects	Algorithms; Biological and medical sciences; Biomarkers, Tumor - analysis; Data Interpretation, Statistical; Diagnosis, Computer-Assisted - methods; Fundamental and applied biological sciences. Psychology; General aspects; Humans; Leukemia; Logistic Models; Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects); Models, Biological; Neoplasm Proteins - analysis; Neoplasms - classification; Neoplasms - diagnosis; Neoplasms - metabolism; Oligonucleotide Array Sequence Analysis - methods; Prediction models; Regression Analysis; Reproducibility of Results; Sample Size; Sensitivity and Specificity; Error estimation; False positive; Disease; DNA chip; Malignant tumor; Gene expression; Microarray; Original document; Logistic regression; Computer program; Classification; Bootstrap; Models; Bioinformatics; Cancer

Summary:	Motivation: Logistic regression is a standard method for building prediction models for a binary outcome and has been extended to disease classification with microarray data by many authors. A feature (gene) selection step, however, must be added to penalized logistic modeling because of the large number of genes and the small number of subjects. Model selection for this two-step approach requires new statistical tools because prediction error estimates that ignore the feature selection step can be severely biased downward. Generic methods such as cross-validation and the non-parametric bootstrap can be very ineffective because of the high variability of the prediction error estimate. Results: We propose a parametric bootstrap model for more accurate estimation of the prediction error, tailored to microarray data by borrowing from the extensive research on identifying differentially expressed genes, especially the local false discovery rate. The proposed method provides guidance on the two critical issues in model selection: the number of genes to include in the model and the optimal shrinkage for the penalized logistic regression. We show that selecting more than 20 genes usually helps little in further reducing the prediction error. Application to Golub's leukemia data and our own cervical cancer data leads to highly accurate prediction models. Availability: R library GeneLogit at http://geocities.com/jg_liao Contact: jl544@drexel.edu
Bibliography:	ark:/67375/HXZ-GVFP4455-S; istex:A4D248DBC4A4B128B84BDE81E0F874CEBD402849; Associate Editor: Trey Ideker
ISSN:	1367-4803, 1367-4811, 1460-2059
DOI:	10.1093/bioinformatics/btm287
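
Illustration (not from the article): the summary states that prediction error estimates which ignore the feature selection step can be severely biased downward. The sketch below is one way to see that effect on pure-noise data, where the true error of any classifier is 50%. It is not the authors' parametric bootstrap and does not use the GeneLogit library. It uses plain cross-validation with a small ridge-penalized logistic regression written in NumPy. All function names, the t-like ranking statistic, and the settings (40 subjects, 2000 genes, 20 selected genes, penalty 1.0, 10 folds) are assumptions chosen for illustration.

```python
import numpy as np

def top_genes(X, y, k):
    """Return indices of the k genes with the largest absolute two-sample t-like statistic."""
    a, b = X[y == 1], X[y == 0]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = (a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)
    return np.argsort(-np.abs(t))[:k]

def fit_ridge_logistic(X, y, lam, n_iter=25):
    """Newton iterations for logistic regression with an L2 penalty (intercept unpenalized)."""
    Z = np.hstack([np.ones((len(X), 1)), X])
    P = lam * np.eye(Z.shape[1])
    P[0, 0] = 1e-8  # tiny jitter only; the intercept is effectively unpenalized
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(Z @ beta, -30, 30)))
        w = p * (1.0 - p)
        grad = Z.T @ (y - p) - P @ beta
        hess = Z.T @ (Z * w[:, None]) + P
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def cv_error(X, y, k, lam, folds, select_inside):
    """Cross-validated misclassification rate.

    select_inside=False: genes are chosen once using ALL subjects (selection ignored).
    select_inside=True:  genes are re-chosen on each training fold (selection included).
    """
    idx_all = top_genes(X, y, k)
    wrong = 0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        idx = top_genes(X[train], y[train], k) if select_inside else idx_all
        beta = fit_ridge_logistic(X[train][:, idx], y[train], lam)
        score = beta[0] + X[test][:, idx] @ beta[1:]
        wrong += int(np.sum((score > 0).astype(int) != y[test]))
    return wrong / len(y)

rng = np.random.default_rng(0)
n, p, k, lam = 40, 2000, 20, 1.0
X = rng.standard_normal((n, p))              # pure noise: no gene carries signal
y = rng.permutation(np.repeat([0, 1], n // 2))
folds = np.array_split(rng.permutation(n), 10)

biased = cv_error(X, y, k, lam, folds, select_inside=False)
honest = cv_error(X, y, k, lam, folds, select_inside=True)
print(f"error ignoring selection step:  {biased:.2f}")
print(f"error including selection step: {honest:.2f}")
```

Because the labels are independent of every gene, an estimate well below 0.5 in the first line reflects only the selection bias. Re-running gene selection inside each fold removes that bias, but, as the summary notes, such generic estimates can still be highly variable with so few subjects.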