Self-Adaptive Framework for Efficient Stream Data Classification on Storm
In this era of big data, stream data classification which is one of typical data stream applications has become more and more significant and challengeable. In these applications, it is obvious that data classification is much more frequent than model training. The ratio of stream data to be classif...
Saved in:
Published in | IEEE transactions on systems, man, and cybernetics. Systems Vol. 50; no. 1; pp. 123 - 136 |
---|---|
Main Authors | , , , , , |
Format | Journal Article |
Language | English |
Published |
New York
IEEE
01.01.2020
The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | In this era of big data, stream data classification which is one of typical data stream applications has become more and more significant and challengeable. In these applications, it is obvious that data classification is much more frequent than model training. The ratio of stream data to be classified is rapid and time-varying, so it is an important problem to classify the stream data efficiently with high throughput. In this paper, we first analyze and categorize the current data stream machine learning algorithms according to their data structures. Then, we propose stream data classification topology (SDC-Topology) on Storm. For the classification algorithms based on the matrix, we propose self-adaptive stream data classification framework (SASDC-Framework) for efficient stream data classification on Storm. In SASDC-Framework, all the data sets arriving at the same unit time are partitioned into subsets with the nearly best partition size and processed in parallel. To select the nearly best partition size for the stream data sets efficiently, we adopt bisection method strategy and inverse distance weighted strategy. Extreme learning machine, which is a fast and accurate machine learning method based on matrix calculating, is used to test the efficiency of our proposals. According to evaluation results, the throughputs based on SASDC-Framework are 8-35 times higher than those based on SDC-Topology and the best throughput is more than 40000 prediction requests per second in our environment. |
---|---|
ISSN: | 2168-2216 2168-2232 |
DOI: | 10.1109/TSMC.2017.2757029 |