Sure independence screening in the presence of missing data

Bibliographic Details
Published in	Statistical Papers (Berlin, Germany), Vol. 62, no. 2, pp. 817-845
Main Authors	Zambom, Adriano Zanin; Matthews, Gregory J.
Format	Journal Article
Language	English
Published	Berlin/Heidelberg: Springer Berlin Heidelberg, 01.04.2021; Springer Nature B.V.
Subjects	Correlation coefficients; Economic Theory/Quantitative Economics/Mathematical Methods; Economics; Finance; Gene expression; Insurance; Management; Mathematics and Statistics; Maximum likelihood estimation; Missing data; Operations Research/Decision Theory; Probability Theory and Stochastic Processes; Prostate; Prostate cancer; Regular Article; Screening; Statistics; Statistics for Business; Maximum likelihood estimator; Correlation coefficient; Ultrahigh dimensionality; Missing at random; EM algorithm
Summary:	Variable selection in ultra-high dimensional data sets is an increasingly prevalent problem, given the readily available data arising from, for example, genome-wide association studies or gene expression studies. When the dimension of the feature space is exponentially larger than the sample size, it is desirable to screen out unimportant predictors in order to bring the dimension down to a moderate scale. In this paper we consider the case when observations of the predictors are missing at random. We propose screening with the marginal linear correlation coefficient between each predictor and the response variable, estimated by maximum likelihood to account for the missing data. This method is shown to have the sure screening property. Moreover, a novel screening method that uses additional predictors when estimating the correlation coefficient is proposed. Simulations show that screening based on pairwise complete observations is outperformed by both proposed methods and is not recommended. Finally, the proposed methods are applied to a gene expression study on prostate cancer.
ISSN:	0932-5026; 1613-9798
DOI:	10.1007/s00362-019-01115-w
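
Illustration:	The screening idea in the summary can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the response is fully observed, each predictor is missing at random given the response, and each (predictor, response) pair is bivariate normal, in which case the maximum likelihood estimate of the correlation has a well-known closed form from the factored likelihood. The function names `ml_corr_monotone` and `screen` are hypothetical, and the paper's second method (using additional predictors) is not covered.

```python
import numpy as np


def ml_corr_monotone(x, y):
    """ML estimate of corr(x, y) when y is complete and x contains NaNs.

    Assumes bivariate normality and x missing at random given y.
    Needs at least two observed values of x; otherwise the result is NaN.
    """
    obs = ~np.isnan(x)
    xc, yc = x[obs], y[obs]

    # Complete-case moments (ML divisors) for the regression of x on y.
    syy_c = np.mean((yc - yc.mean()) ** 2)
    sxy_c = np.mean((xc - xc.mean()) * (yc - yc.mean()))
    sxx_c = np.mean((xc - xc.mean()) ** 2)
    beta = sxy_c / syy_c                 # slope of x on y
    resid_var = sxx_c - beta * sxy_c     # residual variance of x given y

    # Marginal variance of y uses all n observations, not only complete cases.
    syy = np.mean((y - y.mean()) ** 2)

    # Recombine into the joint covariance parameters.
    sxy = beta * syy
    sxx = resid_var + beta ** 2 * syy
    return sxy / np.sqrt(sxx * syy)


def screen(X, y, d):
    """Return indices of the d predictors with the largest |ML correlation|,
    together with the full vector of absolute correlations."""
    scores = np.array(
        [abs(ml_corr_monotone(X[:, j], y)) for j in range(X.shape[1])]
    )
    order = np.argsort(scores)[::-1]     # descending by score
    return order[:d], scores


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 500))
    y = 2.0 * X[:, 3] + rng.normal(size=100)
    # Make predictor 3 missing more often when the response is large.
    X[rng.random(100) < np.where(y > 0, 0.4, 0.1), 3] = np.nan
    kept, _ = screen(X, y, d=20)
    print(sorted(kept.tolist()))
```

With no missing values the estimate reduces to the ordinary sample correlation. With missing values it differs from the pairwise-complete estimate because the variance of the response is taken from all observations.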