Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning Ticket's Mask?
Main Authors | Mansheej Paul, Feng Chen, Brett W. Larsen, Jonathan Frankle, Surya Ganguli, Gintare Karolina Dziugaite |
Format | Journal Article |
Language | English |
Published | 06.10.2022 |
Subjects | |
Summary: | Modern deep learning involves training costly, highly overparameterized
networks, thus motivating the search for sparser networks that can still be
trained to the same accuracy as the full network (i.e. matching). Iterative
magnitude pruning (IMP) is a state-of-the-art algorithm that can find such
highly sparse matching subnetworks, known as winning tickets. IMP operates by
iterative cycles of training, masking the smallest-magnitude weights, rewinding
back to an early training point, and repeating. Despite its simplicity, the
underlying principles for when and how IMP finds winning tickets remain
elusive. In particular, what useful information does an IMP mask found at the
end of training convey to a rewound network near the beginning of training? How
does SGD allow the network to extract this information? And why is iterative
pruning needed? We develop answers in terms of the geometry of the error
landscape. First, we find that, at higher sparsities, pairs of pruned networks at successive pruning
iterations are connected by a linear path with zero error barrier if and only
if they are matching. This indicates that masks found at the end of training
convey the identity of an axial subspace that intersects a desired linearly
connected mode of a matching sublevel set. Second, we show SGD can exploit this
information due to a strong form of robustness: it can return to this mode
despite strong perturbations early in training. Third, we show how the flatness
of the error landscape at the end of training determines a limit on the
fraction of weights that can be pruned at each iteration of IMP. Finally, we
show that the role of retraining in IMP is to find a network with new small
weights to prune. Overall, these results make progress toward demystifying the
existence of winning tickets by revealing the fundamental role of error
landscape geometry. |
DOI: | 10.48550/arxiv.2210.03044 |
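
For readers who want the algorithm in concrete form, below is a minimal PyTorch-style sketch of the IMP loop described in the summary: train, prune the smallest-magnitude surviving weights, rewind to an early checkpoint, and repeat. The names `train_fn`, `prune_frac`, and `rewind_step` are illustrative assumptions, not the paper's implementation; `train_fn` is assumed to keep masked weights at zero during optimization.

```python
import copy
import torch

def imp(model, train_fn, rounds=10, prune_frac=0.2, rewind_step=1000):
    """Sketch of iterative magnitude pruning (IMP) with weight rewinding."""
    # The mask starts all-ones: every weight survives.
    mask = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    # Train briefly, then record the early-training "rewind" point.
    train_fn(model, mask, steps=rewind_step)
    rewind_state = copy.deepcopy(model.state_dict())

    for _ in range(rounds):
        # Train the masked subnetwork to the end of training.
        train_fn(model, mask, steps=None)  # None = train to completion

        with torch.no_grad():
            for name, param in model.named_parameters():
                # Rank this layer's surviving weights by magnitude.
                alive = param[mask[name].bool()].abs()
                k = int(prune_frac * alive.numel())
                if k == 0:
                    continue
                # Prune the k smallest-magnitude surviving weights.
                threshold = alive.sort().values[k - 1]
                mask[name] *= (param.abs() > threshold).float()

            # Rewind surviving weights to their early-training values.
            model.load_state_dict(rewind_state)
            for name, param in model.named_parameters():
                param *= mask[name]

    return mask
```

Whether pruning is applied per layer or globally across the network is a design choice; the sketch prunes per layer for simplicity.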
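The summary's first result concerns the error barrier on the straight line between two pruned networks. Below is a small sketch of how such a barrier can be measured, again assuming PyTorch; `evaluate_error` is a hypothetical helper returning the network's test error, and this is one common formulation of the barrier rather than the paper's exact protocol.

```python
import torch

def error_barrier(model, state_a, state_b, evaluate_error, n_points=11):
    """Peak error along the linear path between two weight settings,
    measured above the worse of the two endpoints."""
    errs = []
    for alpha in torch.linspace(0.0, 1.0, n_points):
        # Interpolate every floating-point tensor; copy integer buffers
        # (e.g. batch-norm step counters) straight from one endpoint.
        interp = {
            k: ((1 - alpha) * state_a[k] + alpha * state_b[k])
            if state_a[k].is_floating_point() else state_a[k]
            for k in state_a
        }
        model.load_state_dict(interp)
        errs.append(evaluate_error(model))
    # Zero barrier means the two networks are linearly mode connected.
    return max(errs) - max(errs[0], errs[-1])
```

Applied to the pruned networks at successive IMP iterations, a barrier of approximately zero coincides with the subnetworks being matching, per the summary.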