What's hot and what's not: tracking most frequent items dynamically

Bibliographic Details
Published in: ACM Transactions on Database Systems, Vol. 30, no. 1, pp. 249-278
Main Authors: Cormode, Graham; Muthukrishnan, S.
Format: Journal Article; Conference Proceeding
Language: English
Published: New York, NY: Association for Computing Machinery, 01.03.2005
ISSN: 0362-5915, 1557-4644
DOI: 10.1145/1061318.1061325

Summary: Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the “hot items” in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in many applications. We present new methods for dynamically determining the hot items at any time in a relation which is undergoing deletion operations as well as inserts. Our methods maintain small space data structures that monitor the transactions on the relation, and, when required, quickly output all hot items without rescanning the relation in the database. With user-specified probability, all hot items are correctly reported. Our methods rely on ideas from “group testing.” They are simple to implement, and have provable quality, space, and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees cannot handle deletions, and those that handle deletions cannot make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithms are accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.
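
Illustrative note: the summary does not spell out the data structure, so the following is only a minimal sketch of the group-testing idea in its simplest special case, a single hot item that accounts for more than half of the relation. It is not the authors' full method. The class name `MajorityTracker`, its methods, and the assumption that items are non-negative integers of a fixed bit width are all hypothetical choices made for illustration. The sketch keeps one total counter plus one counter per bit position; each counter acts as a "test" over the group of items with that bit set, and both inserts and deletes simply adjust the counters.

```python
class MajorityTracker:
    """Sketch: recover an item holding more than half of the current
    multiset, under inserts and deletes, using bits + 1 counters.

    Assumptions (not from the source record): items are integers in
    [0, 2**bits), and an item is never deleted more often than it was
    inserted.
    """

    def __init__(self, bits=32):
        self.bits = bits
        self.total = 0                # net number of items present
        self.bit_counts = [0] * bits  # net count of items with bit j set

    def update(self, item, delta):
        """Apply `delta` copies of `item` (+1 insert, -1 delete)."""
        if not 0 <= item < (1 << self.bits):
            raise ValueError("item out of range")
        self.total += delta
        for j in range(self.bits):
            if (item >> j) & 1:
                self.bit_counts[j] += delta

    def insert(self, item):
        self.update(item, 1)

    def delete(self, item):
        self.update(item, -1)

    def candidate(self):
        """Return the only item that could be a majority, or None if empty.

        If some item has more than half the total, bit j of that item is
        set exactly when the group "bit j set" holds more than half the
        total, so the item is decoded bit by bit. If no majority exists,
        the returned value is arbitrary and needs separate verification.
        """
        if self.total <= 0:
            return None
        item = 0
        for j, count in enumerate(self.bit_counts):
            if 2 * count > self.total:
                item |= 1 << j
        return item


tracker = MajorityTracker(bits=8)
for x in [5, 5, 5, 9, 12]:
    tracker.insert(x)
print(tracker.candidate())  # 5 holds 3 of 5 items

tracker.delete(5)
tracker.delete(5)
for _ in range(3):
    tracker.insert(9)
print(tracker.candidate())  # 9 now holds 4 of 6 items
```

The space used depends only on the bit width, not on the size of the relation, and no rescan is needed after deletions. Handling several hot items above a smaller threshold would presumably require splitting items into multiple groups (for example by hashing) and running a test like this inside each group, which is where the probabilistic guarantee mentioned in the summary would enter; that extension is not shown here.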