Optimal subsample selection for massive logistic regression with distributed data

Bibliographic Details
Published in	Computational statistics Vol. 36; no. 4; pp. 2535 - 2562
Main Authors	Zuo, Lulu; Zhang, Haixiang; Wang, HaiYing; Sun, Liuquan
Format	Journal Article
Language	English
Published	Berlin/Heidelberg: Springer Berlin Heidelberg, 01.12.2021; Springer Nature B.V.
Subjects	Algorithms; Big Data; Business operations; Data collection; Datasets; Economic Theory/Quantitative Economics/Mathematical Methods; Generalized linear models; Mathematics and Statistics; Maximum likelihood estimators; Normality; Original Paper; Parameter estimation; Probability and Statistics in Computer Science; Probability Theory and Stochastic Processes; Statistics; Subsampling probabilities; Subsample estimator; Allocation size; Distributed and massive data

More Information
Summary:	With the emergence of big data, it is increasingly common for data to be distributed, i.e., stored at many distributed sites (machines or nodes) owing to data collection, business operations, etc. We propose a distributed subsampling procedure in such a setting to efficiently approximate the maximum likelihood estimator for logistic regression. We establish the consistency and asymptotic normality of the subsample estimator given the full data. The optimal subsampling probabilities and optimal allocation sizes are explicitly obtained. We develop a two-step algorithm to approximate the optimal subsampling procedure. Numerical simulations and an application to airline data are presented to evaluate the performance of our subsampling method.
ISSN:	0943-4062 1613-9658
DOI:	10.1007/s00180-021-01089-0