The Depth-to-Width Interplay in Self-Attention
Format: Journal Article
Language: English
Published: 22.06.2020
Summary: Self-attention architectures, which are rapidly pushing the frontier in natural language processing, demonstrate a surprising depth-inefficient behavior: previous works indicate that increasing the internal representation (network width) is just as useful as increasing the number of self-attention layers (network depth). We theoretically predict a width-dependent transition between depth-efficiency and depth-inefficiency in self-attention. We conduct systematic empirical ablations on networks of depths 6 to 48 that clearly reveal the theoretically predicted behaviors, and provide explicit quantitative suggestions regarding the optimal depth-to-width allocation for a given self-attention network size. The race towards beyond 1-Trillion parameter language models renders informed guidelines for increasing self-attention depth and width in tandem an essential ingredient. Our guidelines elucidate the depth-to-width trade-off in self-attention networks of sizes up to the scale of GPT3 (which we project to be too deep for its size), and beyond, marking an unprecedented width of 30K as optimal for a 1-Trillion parameter network.
DOI: 10.48550/arxiv.2006.12467
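To make the summary's quantitative claim concrete, the minimal Python sketch below estimates how a fixed parameter budget splits between depth and width. It assumes the common 12 * depth * width^2 rule of thumb for transformer parameter counts and GPT3's published configuration (96 layers, width 12288); neither the approximation nor the helper names come from the paper itself.

# Illustrative sketch, assuming the standard 12 * depth * width^2
# approximation for transformer parameter counts (not taken from the paper).

def transformer_params(depth: int, width: int) -> float:
    # Approximate parameter count of a transformer stack.
    return 12 * depth * width ** 2

def depth_for_budget(total_params: float, width: int) -> float:
    # Depth that exhausts a given parameter budget at a fixed width.
    return total_params / (12 * width ** 2)

if __name__ == "__main__":
    # GPT3's published configuration: 96 layers, width 12288 (~175B parameters).
    print(f"GPT3-like budget: {transformer_params(96, 12_288):.3e} parameters")

    # The summary marks a width of ~30K as optimal for a 1-Trillion parameter
    # network; under this approximation that budget corresponds to roughly 93
    # layers, i.e. much wider but hardly deeper than GPT3.
    print(f"1T params at width 30K -> depth ~ {depth_for_budget(1e12, 30_000):.1f}")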