The Variational Garrote
Published in | Machine Learning, Vol. 96, No. 3, pp. 269–294 |
---|---|
Main Authors | Kappen, Hilbert J.; Gómez, Vicenç |
Format | Journal Article |
Language | English |
Published | New York: Springer US, 01.09.2014 |
Summary: We analyze the variational method for sparse regression using ℓ0 regularization. The variational approximation results in a model that is similar to Breiman’s Garrote model. We refer to this method as the Variational Garrote (VG). The VG has the effect of making the problem effectively of maximal rank even when the number of samples is small compared to the number of variables. We propose a naive mean field approximation combined with a maximum a posteriori (MAP) approach to estimate the model parameters, and we use an annealing and reheating schedule for the sparsity hyper-parameter to avoid local minima. The hyper-parameter is set by cross-validation. We compare the VG with the lasso, ridge regression, and the recently introduced Bayesian paired mean field method (PMF) (Titsias and Lázaro-Gredilla, Advances in Neural Information Processing Systems, vol. 24, pp. 2339–2347, 2011). For a fair comparison, we implemented a similar annealing-reheating schedule for the PMF sparsity parameter. Numerical results show that the VG and PMF yield more accurate predictions and reconstruct the true model more accurately than the other methods. The VG finds correct solutions when the lasso solution is inconsistent due to large input correlations. In the experiments we consider, the VG, although based on a simpler approximation than the PMF, yields qualitatively similar or better results and is computationally more efficient. A naive implementation of the VG scales cubically with the number of features. By introducing Lagrange multipliers we obtain a dual formulation of the problem that scales cubically in the number of samples but close to linearly in the number of features.
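The summary describes the fitting scheme only at a high level, so a sketch may help. Below is a minimal Python/NumPy reconstruction, under stated assumptions, of a naive mean field plus MAP fit for a spike model y ≈ X(s ∘ w) with s_i ∈ {0, 1}: a factorized posterior q(s_i = 1) = m_i turns the expected squared error into ||y − X(m ∘ w)||² + Σ_i m_i(1 − m_i) w_i² ||x_i||², and the variance term acts as an adaptive ridge, which is one way to read the "maximal rank" remark above. The free energy, the update equations, the function name `vg_fit`, and all numerical choices are reconstructions from the summary, not the authors' exact derivation; in particular, β is treated as a fixed noise precision here, whereas the paper estimates the model parameters.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-np.clip(h, -40.0, 40.0)))

def vg_fit(X, y, gamma, beta=1.0, n_iter=200, damping=0.5, m0=None, w0=None):
    """Hypothetical naive mean-field / MAP fit for y ~ X @ (s * w), s_i in {0, 1}.

    gamma: sparsity hyper-parameter (log-odds prior on s_i = 1).
    beta:  noise precision, assumed fixed here (the paper estimates parameters).
    Returns variational inclusion probabilities m and weights w.
    """
    n, p = X.shape
    G = X.T @ X                    # (p, p) Gram matrix of the features
    c = X.T @ y                    # correlations of features with the target
    S = np.diag(G).copy()          # squared feature norms ||x_i||^2
    m = np.full(p, 0.5) if m0 is None else m0.copy()
    w = np.zeros(p) if w0 is None else w0.copy()
    for _ in range(n_iter):
        # w-step: minimise the expected squared error at fixed m, i.e. solve
        # (G diag(m) + diag((1 - m) * S)) w = c.  The diagonal term is the
        # adaptive ridge that keeps the system full rank even for p >> n.
        A = G * m[None, :]
        A[np.diag_indices(p)] += (1.0 - m) * S
        w = np.linalg.solve(A, c)
        # m-step: damped fixed point of dF/dm_i = 0 for this free energy,
        # logit(m_i) = gamma + beta * (w_i <x_i, r> - 0.5 (1 - 2 m_i) w_i^2 S_i).
        r = y - X @ (m * w)        # residual at the mean-field mean
        h = gamma + beta * (w * (X.T @ r) - 0.5 * (1.0 - 2.0 * m) * w**2 * S)
        m = np.clip((1.0 - damping) * m + damping * sigmoid(h), 1e-12, 1.0 - 1e-12)
    return m, w

if __name__ == "__main__":
    # Toy p >> n problem; sizes and schedule are illustrative only.
    rng = np.random.default_rng(0)
    n, p = 50, 200
    X = rng.standard_normal((n, p))
    w_true = np.zeros(p)
    w_true[:5] = 1.0
    y = X @ w_true + 0.1 * rng.standard_normal(n)
    # Annealing-reheating sketch: sweep gamma and warm-start each stage from
    # the previous solution to avoid local minima.  In practice the final
    # gamma would be chosen by cross-validation, per the summary.
    m, w = None, None
    for gamma in (-12.0, -8.0, -4.0, -2.0, -4.0):
        m, w = vg_fit(X, y, gamma, m0=m, w0=w)
    print((m > 0.5).sum(), "features selected")
```

The gamma sweep at the bottom imitates the annealing-reheating idea from the summary: each stage tracks the previous fixed point rather than restarting, but the direction and grid of the schedule here are illustrative, not taken from the paper.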
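The closing claim about scaling can be made concrete. The paper reaches its cheaper dual through Lagrange multipliers; as an illustration of the same complexity gain (and explicitly not the paper's derivation), the p × p solve in the w-step above can be rewritten with the Woodbury identity so that only an n × n system is formed, at cost O(n²p + n³): cubic in the number of samples and close to linear in the number of features. The helper name `vg_w_step` is hypothetical.

```python
import numpy as np

def vg_w_step(X, y, m, S):
    """Woodbury rewrite of the w-step: solve
        (X.T @ X @ diag(m) + diag(v)) w = X.T @ y,   v = (1 - m) * S,
    using (V + X.T X D)^{-1} = V^{-1} - V^{-1} X.T (I + X D V^{-1} X.T)^{-1} X D V^{-1},
    so only an (n, n) system is formed instead of a (p, p) one.
    Requires m < 1 elementwise so that v > 0 (guaranteed by the clipping above).
    """
    n, p = X.shape
    v = (1.0 - m) * S                  # strictly positive adaptive ridge terms
    c = X.T @ y
    XDVi = X * (m / v)[None, :]        # X @ diag(m / v), shape (n, p)
    K = np.eye(n) + XDVi @ X.T         # (n, n) inner matrix, O(n^2 p) to form
    t = np.linalg.solve(K, XDVi @ c)   # O(n^3) solve
    return (c - X.T @ t) / v           # w = V^{-1} (c - X.T @ t)
```

Substituting this for the `np.linalg.solve` call in `vg_fit` leaves the fixed point unchanged while avoiding any p × p factorization.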
ISSN: 0885-6125, 1573-0565
DOI: 10.1007/s10994-013-5427-7