Efficient Algorithms for Finding Richer Subgroup Descriptions in Numeric and Nominal Data

Subgroup discovery systems are concerned with finding interesting patterns in labeled data. How these systems deal with numeric and nominal data has a large impact on the quality of their results. In this paper, we consider two ways to extend the standard pattern language of subgroup discovery: usin...

Full description

Saved in:
Bibliographic Details
Published in2012 IEEE 12th International Conference on Data Mining pp. 499 - 508
Main Authors Mampaey, M., Nijssen, S., Feelders, A., Knobbe, A.
Format Conference Proceeding
LanguageEnglish
Published IEEE 01.12.2012
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Subgroup discovery systems are concerned with finding interesting patterns in labeled data. How these systems deal with numeric and nominal data has a large impact on the quality of their results. In this paper, we consider two ways to extend the standard pattern language of subgroup discovery: using conditions that test for interval membership for numeric attributes, and value set membership for nominal attributes. We assume a greedy search setting, that is, iteratively refining a given subgroup, with respect to a (convex) quality measure. For numeric attributes, we propose an algorithm that finds the optimal interval in linear (rather than quadratic) time, with respect to the number of examples and split points. Similarly, for nominal attributes, we show that finding the optimal set of values can be achieved in linear (rather than exponential) time, with respect to the number of examples and the size of the domain of the attribute. These algorithms operate by only considering subgroup refinements that lie on a convex hull in ROC space, thus significantly narrowing down the search space. We further provide efficient algorithms specifically for the popular Weighted Relative Accuracy quality measure, taking advantage of some of its properties. Our algorithms are shown to perform well in practice, and furthermore provide additional expressive power leading to higher-quality results.
ISBN:1467346497
9781467346498
ISSN:1550-4786
2374-8486
DOI:10.1109/ICDM.2012.117