Advanced colorectal neoplasia risk stratification by penalized logistic regression

Colorectal cancer is the second leading cause of death from cancer in the United States. To facilitate the efficiency of colorectal cancer screening, there is a need to stratify risk for colorectal cancer among the 90% of US residents who are considered "average risk." In this article, we...

Full description

Saved in:
Bibliographic Details
Published inStatistical methods in medical research Vol. 25; no. 4; p. 1677
Main Authors Lin, Yunzhi, Yu, Menggang, Wang, Sijian, Chappell, Richard, Imperiale, Thomas F
Format Journal Article
LanguageEnglish
Published England 01.08.2016
Subjects
Online AccessGet more information
ISSN1477-0334
DOI10.1177/0962280213497432

Cover

Loading…
More Information
Summary:Colorectal cancer is the second leading cause of death from cancer in the United States. To facilitate the efficiency of colorectal cancer screening, there is a need to stratify risk for colorectal cancer among the 90% of US residents who are considered "average risk." In this article, we investigate such risk stratification rules for advanced colorectal neoplasia (colorectal cancer and advanced, precancerous polyps). We use a recently completed large cohort study of subjects who underwent a first screening colonoscopy. Logistic regression models have been used in the literature to estimate the risk of advanced colorectal neoplasia based on quantifiable risk factors. However, logistic regression may be prone to overfitting and instability in variable selection. Since most of the risk factors in our study have several categories, it was tempting to collapse these categories into fewer risk groups. We propose a penalized logistic regression method that automatically and simultaneously selects variables, groups categories, and estimates their coefficients by penalizing the [Formula: see text]-norm of both the coefficients and their differences. Hence, it encourages sparsity in the categories, i.e. grouping of the categories, and sparsity in the variables, i.e. variable selection. We apply the penalized logistic regression method to our data. The important variables are selected, with close categories simultaneously grouped, by penalized regression models with and without the interactions terms. The models are validated with 10-fold cross-validation. The receiver operating characteristic curves of the penalized regression models dominate the receiver operating characteristic curve of naive logistic regressions, indicating a superior discriminative performance.
ISSN:1477-0334
DOI:10.1177/0962280213497432