Memory-adaptive high utility sequential pattern mining over data streams

Bibliographic Details
Published in	Machine Learning, Vol. 106, no. 6, pp. 799-836
Main Authors	Zihayat, Morteza; Chen, Yan; An, Aijun
Format	Journal Article
Language	English
Published	New York: Springer US, 01.06.2017; Springer Nature B.V.
Subjects	Adaptive algorithms; Algorithms; Artificial Intelligence; Computer memory; Computer Science; Control; Data mining; Data transmission; Gene expression; Mechatronics; Memory management; Natural Language Processing (NLP); Pattern analysis; Robotics; Simulation and Modeling; Approximation algorithms; Data streams; High utility sequential pattern mining

More Information
Summary:	High utility sequential pattern (HUSP) mining has emerged as an important topic in data mining. A number of studies have been conducted on mining HUSPs, but they are mainly intended for non-streaming data and thus do not take data stream characteristics into consideration. Streaming data are fast changing, continuously generated, and unbounded in quantity. Such data can easily exhaust computer resources (e.g., memory) unless proper resource-aware mining is performed. In this study, we explore the fundamental problem of how limited memory can be best utilized to produce high-quality HUSPs over a data stream. We design an approximation algorithm, called MAHUSP, that employs memory-adaptive mechanisms to use a bounded portion of memory in order to efficiently discover HUSPs over data streams. An efficient tree structure, called MAS-Tree, is proposed to store potential HUSPs over a data stream. MAHUSP guarantees that all HUSPs are discovered in certain circumstances. Our experimental study shows that our algorithm can not only discover HUSPs over data streams efficiently, but also adapt to memory allocation with limited sacrifices in the quality of discovered HUSPs. Furthermore, to show the effectiveness and efficiency of MAHUSP in real-life applications, we apply our proposed algorithm to a web clickstream dataset obtained from a Canadian news portal to showcase users’ reading behavior, and to a real biosequence database to identify disease-related gene regulation sequential patterns. The results show that MAHUSP effectively discovers useful and meaningful patterns in both cases.
ISSN:	0885-6125 1573-0565
DOI:	10.1007/s10994-016-5617-1