Nonparametric Distributed Learning Architecture for Big Data: Algorithm and Applications



Bibliographic Details
Published in	IEEE Transactions on Big Data, Vol. 5, no. 2, pp. 166-179
Main Authors	Bruce, Scott; Li, Zeda; Yang, Hsiang-Chieh; Mukhopadhyay, Subhadeep
Format	Journal Article
Language	English
Published	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.06.2019
Subjects	Algorithms; Analytical models; Big Data; Computational modeling; Computer architecture; Data analysis; Data management; Data models; data-parallelism; Datasets; Distributed databases; distributed statistical learning; heterogeneity; Inference algorithms; LP transformation; Machine learning; Meta-analysis; Nonparametric mixed data modeling; Nonparametric statistics; Parallel processing; Statistical inference; Statistical models

Summary:	Dramatic increases in the size and complexity of modern datasets have made traditional “centralized” statistical inference prohibitive. In addition to computational challenges associated with big data learning, the presence of numerous data types (e.g., discrete, continuous, categorical, etc.) makes automation and scalability difficult. A question of immediate concern is how to design a data-intensive statistical inference architecture without changing the basic statistical modeling principles developed for “small” data over the last century. To address this problem, we present MetaLP, a flexible, distributed statistical modeling framework suitable for large-scale data analysis, where statistical inference meets big data computing. This framework consists of three key components that work together to provide a holistic solution for big data learning: (i) partitioning massive data into smaller datasets for parallel processing and efficient computation, (ii) modern nonparametric learning based on a specially designed, orthonormal data transformation leading to mixed data algorithms, and finally (iii) combining heterogeneous “local” inferences from partitioned data using meta-analysis techniques to arrive at the “global” inference for the original big data. We present an application of this general theory in the context of a nonparametric two-sample inference algorithm for Expedia personalized hotel recommendations based on 10 million search result records.
ISSN:	2332-7790, 2372-2096
DOI:	10.1109/TBDATA.2018.2810187
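
The three components named in the summary (partition, local inference, meta-analytic combination) can be illustrated with a minimal Python sketch. This is not the MetaLP implementation: the summary does not specify the orthonormal data transformation, so the local step below uses a plain difference in group means as a stand-in, and the combination step uses a standard DerSimonian-Laird random-effects estimator as one common meta-analysis technique. The function names `partition`, `local_estimate`, and `combine`, the synthetic two-group data, and the choice of 10 partitions are all assumptions made for illustration.

```python
import random
import statistics


def partition(records, k, seed=0):
    """Step (i): randomly split the records into k roughly equal subsets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def local_estimate(subset):
    """Step (ii), stand-in only: difference in group means and its variance.

    The paper uses a nonparametric, transformation-based statistic here;
    this simple estimator is an assumption made to keep the sketch short.
    """
    a = [x for x, g in subset if g == 0]
    b = [x for x, g in subset if g == 1]
    est = statistics.fmean(b) - statistics.fmean(a)
    var = statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    return est, var


def combine(local_results):
    """Step (iii): DerSimonian-Laird random-effects combination.

    Returns the pooled estimate, its variance, and the estimated
    between-partition heterogeneity tau^2.
    """
    est = [e for e, _ in local_results]
    w = [1.0 / v for _, v in local_results]
    sw = sum(w)
    fixed = sum(wi * ei for wi, ei in zip(w, est)) / sw
    # Cochran's Q measures disagreement among the local estimates.
    q = sum(wi * (ei - fixed) ** 2 for wi, ei in zip(w, est))
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - (len(est) - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for _, v in local_results]
    pooled = sum(wi * ei for wi, ei in zip(w_star, est)) / sum(w_star)
    return pooled, 1.0 / sum(w_star), tau2


# Synthetic two-sample data: group 1 is shifted by 0.5 relative to group 0.
rng = random.Random(42)
records = []
for i in range(10_000):
    group = i % 2
    records.append((rng.gauss(0.5 * group, 1.0), group))

parts = partition(records, 10)
pooled, pooled_var, tau2 = combine([local_estimate(p) for p in parts])
print(f"pooled estimate={pooled:.3f}, variance={pooled_var:.5f}, tau^2={tau2:.5f}")
```

Each call to `local_estimate` touches only its own subset, so the local step could run in parallel across partitions; only the small (estimate, variance) pairs need to reach `combine`.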