Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win
Format: Journal Article
Language: English
Published: 07.10.2020
Summary: Sparse Neural Networks (NNs) can match the generalization of dense NNs using a fraction of the compute/storage for inference, and also have the potential to enable efficient training. However, naively training unstructured sparse NNs from random initialization results in significantly worse generalization, with the notable exceptions of Lottery Tickets (LTs) and Dynamic Sparse Training (DST). Through our analysis of gradient flow during training we attempt to answer: (1) why does training unstructured sparse networks from random initialization perform poorly, and (2) what makes LTs and DST the exceptions? We show that sparse NNs have poor gradient flow at initialization and demonstrate the importance of using sparsity-aware initialization. Furthermore, we find that DST methods significantly improve gradient flow during training over traditional sparse training methods. Finally, we show that LTs do not improve gradient flow; rather, their success lies in re-learning the pruning solution they are derived from; however, this comes at the cost of learning novel solutions.
DOI: 10.48550/arxiv.2010.03533
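
The abstract's two main technical points, that masked networks have poor gradient flow under a dense-style initialization and that a sparsity-aware initialization helps, can be illustrated with a short sketch. The code below is not the authors' implementation: `MaskedLinear`, `sparse_init`, and `gradient_flow` are hypothetical helpers written for this example, and the per-unit scaling rule is an assumption based on applying standard He initialization to the non-zero (post-mask) fan-in.

```python
# Minimal sketch (PyTorch), assuming an unstructured binary mask on a linear layer.
# It compares the gradient norm at initialization under a dense-style init versus
# a sparsity-aware init that scales each unit by its actual non-zero fan-in.

import torch
import torch.nn as nn


class MaskedLinear(nn.Module):
    """Linear layer whose weights are element-wise masked (unstructured sparsity)."""

    def __init__(self, in_features, out_features, density=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Fixed random binary mask with the requested density.
        self.register_buffer(
            "mask", (torch.rand(out_features, in_features) < density).float()
        )
        self.dense_init()

    def dense_init(self):
        # Standard (dense) Kaiming init: variance based on the dense fan-in.
        nn.init.kaiming_normal_(self.weight, nonlinearity="relu")

    def sparse_init(self):
        # Sparsity-aware init (assumed rule): He scaling using each output unit's
        # non-zero fan-in instead of the dense fan-in.
        fan_in = self.mask.sum(dim=1).clamp(min=1)          # non-zeros per unit
        std = (2.0 / fan_in).sqrt().unsqueeze(1)            # per-unit He std
        with torch.no_grad():
            self.weight.copy_(torch.randn_like(self.weight) * std)

    def forward(self, x):
        return nn.functional.linear(x, self.weight * self.mask, self.bias)


def gradient_flow(model, loss_fn, x, y):
    """Squared L2 norm of the loss gradient over all parameters."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return sum((p.grad ** 2).sum().item() for p in model.parameters() if p.grad is not None)


if __name__ == "__main__":
    torch.manual_seed(0)
    x, y = torch.randn(256, 784), torch.randint(0, 10, (256,))
    net = nn.Sequential(MaskedLinear(784, 300, density=0.1), nn.ReLU(), nn.Linear(300, 10))
    loss = nn.CrossEntropyLoss()

    print("dense-style init :", gradient_flow(net, loss, x, y))
    net[0].sparse_init()        # re-initialize the masked layer with mask-aware scaling
    print("sparsity-aware   :", gradient_flow(net, loss, x, y))
```

In this toy setup the masked layer initialized with the dense fan-in produces a much smaller gradient norm than the mask-aware variant, which is the qualitative effect the abstract attributes to sparsity-aware initialization; the paper itself analyzes gradient flow in full training runs rather than a single forward/backward pass.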