Depth-Width Trade-offs for ReLU Networks via Sharkovsky's Theorem
Main Authors | , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 09.12.2019 |
Summary: | Understanding the representational power of Deep Neural Networks (DNNs) and
how their structural properties (e.g., depth, width, type of activation unit)
affect the functions they can compute has been an important yet challenging
question in deep learning and approximation theory. In a seminal paper,
Telgarsky highlighted the benefits of depth by presenting a family of functions
(based on simple triangular waves) for which DNNs achieve zero classification
error, whereas shallow networks with fewer than exponentially many nodes incur
constant error. Even though Telgarsky's work reveals the limitations of shallow
neural networks, it does not explain why these functions are difficult to
represent; in fact, he poses as a tantalizing open question the problem of
characterizing those functions that cannot be well approximated by networks of
smaller depth.
In this work, we point to a new connection between DNN expressivity and
Sharkovsky's Theorem from dynamical systems, which enables us to characterize
the depth-width trade-offs of ReLU networks for representing functions based on
the presence of a generalized notion of fixed points, called periodic points (a
fixed point is a point of period 1). Motivated by our observation that the
triangle waves used in Telgarsky's work contain points of period 3 (a period
that is special in that it implies chaotic behavior, by the celebrated
result of Li and Yorke), we proceed to give general lower bounds for the width
needed to represent periodic functions as a function of the depth. Technically,
the crux of our approach is based on an eigenvalue analysis of the dynamical
system associated with such functions. |
---|---|
DOI: | 10.48550/arxiv.1912.04378 |
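
The summary's remark that triangle waves contain points of period 3 can be checked by hand. The sketch below is illustrative only and is not taken from the paper: it assumes the triangle wave is the standard tent map on [0, 1], and the names `tent`, `tent_relu`, `orbit` and `monotone_pieces` are hypothetical. It verifies that 2/7 is a point of period 3, that the map can be written with two ReLU units, and that composing the map t times yields 2^t monotone pieces on a dyadic grid, which is the kind of oscillation growth that makes shallow approximation expensive.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def tent(x):
    """Tent (triangle) map on [0, 1]; assumed form of the triangle wave."""
    return 2 * x if x <= HALF else 2 * (1 - x)

def tent_relu(x):
    """The same map on [0, 1] written with two ReLU units."""
    relu = lambda z: max(z, 0)
    return 2 * relu(x) - 4 * relu(x - HALF)

def orbit(f, x, n):
    """Return [x, f(x), f(f(x)), ...] with n applications of f."""
    points = [x]
    for _ in range(n):
        points.append(f(points[-1]))
    return points

def monotone_pieces(t):
    """Count monotone pieces of the t-fold composition of tent,
    sampled on the dyadic grid k / 2**t (where its breakpoints lie)."""
    grid = [Fraction(k, 2**t) for k in range(2**t + 1)]
    vals = []
    for x in grid:
        for _ in range(t):
            x = tent(x)
        vals.append(x)
    pieces = 1
    for a, b, c in zip(vals, vals[1:], vals[2:]):
        if (b - a) * (c - b) < 0:  # direction change at the middle sample
            pieces += 1
    return pieces

# 2/7 -> 4/7 -> 6/7 -> 2/7, so 2/7 has period 3 under the tent map.
print(orbit(tent, Fraction(2, 7), 3))
print([monotone_pieces(t) for t in range(1, 6)])
```

Exact rational arithmetic via `fractions.Fraction` is used so that the period-3 orbit closes exactly rather than up to floating-point error.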