Self-Adaptive Framework for Efficient Stream Data Classification on Storm

Bibliographic Details
Published in IEEE Transactions on Systems, Man, and Cybernetics: Systems, Vol. 50, no. 1, pp. 123-136
Main Authors Deng, Shizhuo, Wang, Botao, Huang, Shan, Yue, Chuncheng, Zhou, Jianpeng, Wang, Guoren
Format Journal Article
Language English
Published New York IEEE 01.01.2020
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Summary: In this era of big data, stream data classification, a typical data stream application, has become increasingly important and challenging. In such applications, classification is far more frequent than model training. The rate at which stream data arrive for classification is high and time-varying, so classifying the stream data efficiently with high throughput is an important problem. In this paper, we first analyze and categorize current data stream machine learning algorithms according to their data structures. We then propose a stream data classification topology (SDC-Topology) on Storm. For matrix-based classification algorithms, we propose a self-adaptive stream data classification framework (SASDC-Framework) for efficient stream data classification on Storm. In SASDC-Framework, all the data arriving in the same unit of time are partitioned into subsets of a near-optimal partition size and processed in parallel. To select the near-optimal partition size efficiently, we adopt a bisection method strategy and an inverse distance weighted strategy. Extreme learning machine, a fast and accurate machine learning method based on matrix computation, is used to evaluate the efficiency of our proposals. According to the evaluation results, the throughput of SASDC-Framework is 8 to 35 times higher than that of SDC-Topology, and the best throughput exceeds 40000 prediction requests per second in our environment.
ISSN: 2168-2216, 2168-2232
DOI:10.1109/TSMC.2017.2757029
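
The summary mentions partitioning each unit-time batch into subsets and choosing a near-optimal partition size with a bisection strategy and an inverse distance weighted (IDW) strategy. The record does not describe how the paper implements these, so the following is only a minimal generic sketch in Python, not the authors' method. The function names (`partition`, `bisect_best_size`, `idw_estimate`), the toy cost model, and the assumption that processing cost is unimodal in the partition size are all hypothetical and introduced here for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def partition(batch: Sequence, size: int) -> List[Sequence]:
    """Split one unit-time batch into consecutive subsets of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [batch[i:i + size] for i in range(0, len(batch), size)]


def bisect_best_size(cost: Callable[[int], float], lo: int, hi: int) -> int:
    """Return the size in [lo, hi] that minimizes `cost`.

    Assumption (hypothetical): `cost` is strictly decreasing up to its
    minimum and strictly increasing afterwards, so comparing cost(mid)
    with cost(mid + 1) tells us on which side the minimum lies.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if cost(mid) < cost(mid + 1):
            hi = mid        # cost is rising here: minimum is at mid or to the left
        else:
            lo = mid + 1    # cost is still falling: minimum is to the right
    return lo


def idw_estimate(samples: Sequence[Tuple[float, float]],
                 rate: float, power: float = 2.0) -> float:
    """Inverse distance weighted estimate of the best partition size at `rate`.

    `samples` holds (arrival_rate, best_size) pairs measured earlier;
    closer arrival rates receive larger weights 1 / distance**power.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    weighted_sum = 0.0
    weight_total = 0.0
    for known_rate, best_size in samples:
        distance = abs(rate - known_rate)
        if distance == 0:
            return best_size  # exact match: reuse the measured value
        weight = 1.0 / distance ** power
        weighted_sum += weight * best_size
        weight_total += weight
    return weighted_sum / weight_total


if __name__ == "__main__":
    # Toy cost model (an assumption): fixed overhead per subset plus a
    # per-subset cost that grows with the square of the subset size.
    n_items = 10_000

    def toy_cost(size: int) -> float:
        n_subsets = -(-n_items // size)  # ceiling division
        return n_subsets * (50.0 + 0.01 * size * size)

    best = bisect_best_size(toy_cost, 1, n_items)
    print("partition size found by bisection:", best)
    print("number of subsets:", len(partition(list(range(n_items)), best)))
    print("IDW estimate:", idw_estimate([(1000.0, 40.0), (5000.0, 90.0)], 2000.0))
```

In this sketch, bisection finds a good size for one measured arrival rate, while the IDW estimate reuses earlier measurements to guess a size for a new arrival rate without searching again. The toy cost model is not guaranteed to be strictly unimodal, so the size it reports should be read as illustrative only.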