The Variational Garrote

Bibliographic Details
Published in: Machine Learning, Vol. 96, No. 3, pp. 269–294
Main Authors: Kappen, Hilbert J.; Gómez, Vicenç
Format: Journal Article
Language: English
Published: New York: Springer US, 01.09.2014

Summary: We analyze the variational method for sparse regression using ℓ0 regularization. The variational approximation results in a model that is similar to Breiman's Garrote model; we refer to this method as the Variational Garrote (VG). The VG has the effect of making the problem effectively of maximal rank even when the number of samples is small compared to the number of variables. We propose a naive mean-field approximation combined with a maximum a posteriori (MAP) approach to estimate the model parameters, and use an annealing and reheating schedule for the sparsity hyper-parameter to avoid local minima. The hyper-parameter is set by cross-validation. We compare the VG with the lasso, ridge regression, and the recently introduced Bayesian paired mean field method (PMF) (Titsias and Lázaro-Gredilla, Advances in Neural Information Processing Systems, vol. 24, pp. 2339–2347, 2011). For a fair comparison, we implemented a similar annealing-reheating schedule for the PMF sparsity parameter. Numerical results show that the VG and PMF yield more accurate predictions and reconstruct the true model more accurately than the other methods. The VG finds correct solutions when the lasso solution is inconsistent due to large input correlations. In the experiments we consider, the VG, although based on a simpler approximation than the PMF, yields qualitatively similar or better results and is computationally more efficient. A naive implementation of the VG scales cubically with the number of features. By introducing Lagrange multipliers we obtain a dual formulation of the problem that scales cubically in the number of samples, but close to linearly in the number of features.
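To make the structure described in the abstract concrete, the following is a minimal sketch in Python (NumPy) of an alternating scheme of this kind: a naive mean-field posterior over binary inclusion variables s_i in {0, 1} for the model y ≈ X(s ∘ w), combined with MAP updates for the weights w, while a sparsity parameter gamma is annealed. The function name vg_fit, the damping factor, the fixed noise variance sigma2, the specific update equations, and the toy data are illustrative assumptions for this sketch, not the paper's exact algorithm (which also estimates the noise variance, uses a reheating schedule, and offers a dual formulation for the many-features regime).

    import numpy as np

    def sigmoid(z):
        # Numerically safe logistic function
        return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

    def vg_fit(X, y, gammas, sigma2=1.0, n_inner=50, damping=0.5):
        # Illustrative Variational-Garrote-style fit (not the paper's exact updates).
        # Model: y ~ N(X (s * w), sigma2 I), s_i in {0, 1}, prior p(s_i) prop. exp(gamma s_i).
        # Naive mean field: q(s) = prod_i m_i^{s_i} (1 - m_i)^{1 - s_i}.
        n, p = X.shape
        G = X.T @ X                        # Gram matrix of the features
        Xy = X.T @ y
        col_sq = np.diag(G).copy()         # squared column norms ||x_i||^2
        m = np.full(p, 0.5)                # inclusion probabilities E_q[s_i]
        w = np.zeros(p)
        for gamma in gammas:               # anneal the sparsity parameter
            for _ in range(n_inner):
                # MAP update for w given m: minimize E_q ||y - X (s * w)||^2,
                # a linear system built from E_q[s s^T] = m m^T + diag(m (1 - m)).
                Ess = np.outer(m, m)
                Ess[np.diag_indices(p)] = m
                A = G * Ess + 1e-6 * np.eye(p)   # small jitter for stability
                w = np.linalg.solve(A, m * Xy)
                # Mean-field fixed point for m given w: a sigmoid of gamma minus
                # the derivative of the expected squared error w.r.t. m_i.
                r = y - X @ (m * w)              # residual at the mean
                logit = (gamma
                         + w * (X.T @ r) / sigma2
                         - (1.0 - 2.0 * m) * w**2 * col_sq / (2.0 * sigma2))
                m = (1.0 - damping) * m + damping * sigmoid(logit)
        return m, w

    # Toy usage: recover a sparse signal with fewer samples than features,
    # annealing gamma from a flat prior (0) toward strong sparsity (negative).
    rng = np.random.default_rng(0)
    n, p = 50, 200
    X = rng.standard_normal((n, p))
    w_true = np.zeros(p)
    w_true[:5] = 2.0
    y = X @ w_true + 0.1 * rng.standard_normal(n)
    m, w = vg_fit(X, y, gammas=np.linspace(0.0, -8.0, 20))
    print(np.flatnonzero(m > 0.5))       # features the mean-field posterior keeps

In this sketch the predictive mean is X @ (m * w), and thresholding m at 0.5 is one possible hard selection rule; the paper's cross-validation of the sparsity hyper-parameter and its dual formulation for the case of many more features than samples are omitted for brevity.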
ISSN: 0885-6125
EISSN: 1573-0565
DOI: 10.1007/s10994-013-5427-7