Large-Scale Survey Data Analysis with Penalized Regression: A Monte Carlo Simulation on Missing Categorical Predictors

With the advent of the big data era, machine learning methods have evolved and proliferated. This study focused on penalized regression, a procedure that builds interpretive prediction models among machine learning methods. In particular, penalized regression coupled with large-scale data can explor...

Full description

Saved in:
Bibliographic Details
Published inMultivariate behavioral research Vol. 57; no. 4; pp. 642 - 657
Main Authors Yoo, Jin Eun, Rho, Minjeong
Format Journal Article
LanguageEnglish
Published United States Routledge 29.07.2022
Taylor & Francis Ltd
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:With the advent of the big data era, machine learning methods have evolved and proliferated. This study focused on penalized regression, a procedure that builds interpretive prediction models among machine learning methods. In particular, penalized regression coupled with large-scale data can explore hundreds or thousands of variables in one statistical model without convergence problems and identify yet uninvestigated important predictors. As one of the first Monte Carlo simulation studies to investigate predictive modeling with missing categorical predictors in the context of social science research, this study endeavored to emulate real social science large-scale data. Likert-scaled variables were simulated as well as multiple-category and count variables. Due to the inclusion of the categorical predictors in modeling, penalized regression methods that consider the grouping effect were employed such as group Mnet. We also examined the applicability of the simulation conditions with a real large-scale dataset that the simulation study referenced. Particularly, the study presented selection counts of variables after multiple iterations of modeling in order to consider the bias resulting from data-splitting in model validation. Selection counts turned out to be a necessary tool when variable selection is of research interest. Efforts to utilize large-scale data to the fullest appear to offer a valid approach to mitigate the effect of nonignorable missingness. Overall, penalized regression which assumes linearity is a viable method to analyze social science large-scale survey data.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
ISSN:0027-3171
1532-7906
DOI:10.1080/00273171.2021.1891856