Optimal subsample selection for massive logistic regression with distributed data
With the emergence of big data, it is increasingly common that the data are distributed. i.e., the data are stored at many distributed sites (machines or nodes) owing to data collection or business operations, etc. We propose a distributed subsampling procedure in such a setting to efficiently appro...
Saved in:
Published in | Computational statistics Vol. 36; no. 4; pp. 2535 - 2562 |
---|---|
Main Authors | , , , |
Format | Journal Article |
Language | English |
Published |
Berlin/Heidelberg
Springer Berlin Heidelberg
01.12.2021
Springer Nature B.V |
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | With the emergence of big data, it is increasingly common that the data are distributed. i.e., the data are stored at many distributed sites (machines or nodes) owing to data collection or business operations, etc. We propose a distributed subsampling procedure in such a setting to efficiently approximate the maximum likelihood estimator for the logistic regression. We establish the consistency and asymptotic normality of the subsample estimator given the full data. The optimal subsampling probabilities and optimal allocation sizes are explicitly obtained. We develop a two-step algorithm to approximate the optimal subsampling procedure. Numerical simulations and an application to airline data are presented to evaluate the performance of our subsampling method. |
---|---|
ISSN: | 0943-4062 1613-9658 |
DOI: | 10.1007/s00180-021-01089-0 |