Rethinking Query-Key Pairwise Interactions in Vision Transformers
Format | Journal Article |
Language | English |
Published | 30.06.2022 |
---|---|
Summary: | Vision Transformers have achieved state-of-the-art performance in many visual
tasks. Due to the quadratic computational and memory complexity of
self-attention, recent works either apply attention only to low-resolution
inputs or restrict the receptive field to a small local region. To overcome
these limitations, we propose key-only attention, which excludes query-key
pairwise interactions and uses a compute-efficient saliency gate to obtain
attention weights, modeling local-global interactions in all stages. Key-only
attention has linear computational and memory complexity w.r.t. input size.
Instead of the grafting suggested by previous works, we use an alternating
layout to hybridize convolution and attention layers, so that all stages can
benefit from both spatial attention and convolutions. We leverage these
improvements to develop a new self-attention model family, LinGlos, which
reach state-of-the-art accuracies in the parameter-limited setting of the
ImageNet classification benchmark and significantly outperform baselines in
downstream tasks, e.g., COCO object detection and ADE20K semantic segmentation. |
---|---|
DOI: | 10.48550/arxiv.2207.00188 |
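
The abstract describes attention weights derived from keys alone, with no N x N query-key score matrix. The sketch below illustrates one plausible reading of that idea, not the paper's exact module: the saliency gate is assumed here to be a single linear layer producing a per-token score, and the class name `KeyOnlyAttention` and the residual/projection choices are illustrative assumptions.

```python
import torch
import torch.nn as nn


class KeyOnlyAttention(nn.Module):
    """Sketch of key-only attention with a per-token saliency gate.

    Attention weights come from a scalar saliency score per key, so the cost
    is linear in the number of tokens (no quadratic query-key score matrix).
    The single-linear-layer gate is an illustrative assumption.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.saliency_gate = nn.Linear(dim, 1)  # per-token scalar saliency
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        k = self.to_key(x)
        v = self.to_value(x)
        # Softmax over tokens replaces pairwise query-key scores: O(N), not O(N^2).
        weights = torch.softmax(self.saliency_gate(k), dim=1)    # (B, N, 1)
        global_context = (weights * v).sum(dim=1, keepdim=True)  # (B, 1, dim)
        # Broadcast the pooled global context back to every token.
        return self.proj(x + global_context)


if __name__ == "__main__":
    x = torch.randn(2, 196, 64)       # e.g. 14x14 patch tokens, dim 64
    out = KeyOnlyAttention(64)(x)
    print(out.shape)                  # torch.Size([2, 196, 64])
```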