Information-theoretic Feature Selection via Tensor Decomposition and Submodularity

Bibliographic Details
Published in	arXiv.org
Main Authors	Amiridi, Magda; Kargas, Nikos; Sidiropoulos, Nicholas D
Format	Paper; Journal Article
Language	English
Published	Ithaca: Cornell University Library, arXiv.org, 30.10.2020
Subjects	Combinatorial analysis; Computer Science - Learning; Decomposition; Feature selection; Greedy algorithms; Information theory; Maximization; Optimization; Prediction models; Statistics - Machine Learning; Tensors
ISSN	2331-8422
DOI	10.48550/arxiv.2010.16181

Summary:	Selecting features by maximizing the high-order mutual information between the selected feature vector and a target variable is the gold standard for finding the subset of relevant features that maximizes the performance of prediction models. However, this approach typically requires knowledge of the multivariate probability distribution of all features and the target, and it involves a challenging combinatorial optimization problem. Recent work has shown that any joint Probability Mass Function (PMF) can be represented as a naive Bayes model via Canonical Polyadic (tensor rank) Decomposition. In this paper, we introduce a low-rank tensor model of the joint PMF of all variables, together with indirect targeting, as a way of mitigating complexity and maximizing classification performance for a given number of features. Low-rank modeling of the joint PMF makes it possible to circumvent the curse of dimensionality by learning principal components of the joint distribution. By indirectly aiming to predict the latent variable of the naive Bayes model instead of the original target variable, the feature selection problem can be formulated as maximization of a monotone submodular function subject to a cardinality constraint, which can be tackled using a greedy algorithm that comes with performance guarantees. Numerical experiments with several standard datasets suggest that the proposed approach compares favorably to the state of the art for this important problem.
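
The greedy step described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the naive Bayes factors have already been obtained (the tensor decomposition itself is not shown): `prior[h]` stands for the latent-variable PMF and `factors[n][x, h]` for the conditional PMF of feature n given the latent state. The function names and the brute-force enumeration of feature states are illustrative choices that only suit small alphabets and small subsets.

```python
import itertools

import numpy as np


def mutual_information_with_latent(prior, factors, subset):
    """Return I(X_S; H) in nats for the feature indices in `subset`.

    Assumes the naive Bayes form p(x_S, h) = prior[h] * prod_n factors[n][x_n, h].
    Enumerates every joint state of X_S, so the cost grows exponentially in len(subset).
    """
    if not subset:
        return 0.0
    mi = 0.0
    state_ranges = [range(factors[n].shape[0]) for n in subset]
    for states in itertools.product(*state_ranges):
        # joint[h] = p(X_S = states, H = h)
        joint = prior.copy()
        for n, s in zip(subset, states):
            joint = joint * factors[n][s, :]
        marginal = joint.sum()  # p(X_S = states)
        nz = joint > 0  # skip zero-probability terms (0 * log 0 = 0)
        mi += np.sum(joint[nz] * np.log(joint[nz] / (marginal * prior[nz])))
    return float(mi)


def greedy_select(prior, factors, k):
    """Greedily pick up to k features, each time adding the one that
    maximizes I(X_S; H) for the enlarged subset."""
    selected = []
    remaining = list(range(len(factors)))
    for _ in range(min(k, len(factors))):
        best = max(
            remaining,
            key=lambda n: mutual_information_with_latent(
                prior, factors, selected + [n]
            ),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy model: rank-2 latent variable, three binary features.
prior = np.array([0.5, 0.5])
factors = [
    np.array([[0.5, 0.5], [0.5, 0.5]]),  # carries no information about H
    np.array([[0.9, 0.1], [0.1, 0.9]]),  # strongly informative
    np.array([[0.8, 0.3], [0.2, 0.7]]),  # weakly informative
]
print(greedy_select(prior, factors, 2))
```

Because the features are conditionally independent given the latent variable in this model, the objective is monotone submodular, and the standard greedy guarantee for cardinality-constrained maximization (a factor of 1 - 1/e of the optimum) applies to the sketch above.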