X-Linear Attention Networks for Image Captioning

Bibliographic Details
Published in	2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 10968 - 10977
Main Authors	Pan, Yingwei; Yao, Ting; Li, Yehao; Mei, Tao
Format	Conference Proceeding
Language	English
Published	IEEE 01.06.2020
Subjects	Aggregates; Cognition; Decoding; Knowledge discovery; Task analysis; Visualization; Weight measurement

More Information
Summary:	Recent progress on fine-grained visual recognition and visual question answering has featured Bilinear Pooling, which effectively models the 2nd-order interactions across multi-modal inputs. Nevertheless, there has been no evidence in support of building such interactions concurrently with an attention mechanism for image captioning. In this paper, we introduce a unified attention block, the X-Linear attention block, that fully employs bilinear pooling to selectively capitalize on visual information or perform multi-modal reasoning. Technically, the X-Linear attention block simultaneously exploits both the spatial and channel-wise bilinear attention distributions to capture the 2nd-order interactions between the input single-modal or multi-modal features. Higher-order and even infinite-order feature interactions are readily modeled by stacking multiple X-Linear attention blocks and by equipping the block with an Exponential Linear Unit (ELU) in a parameter-free fashion, respectively. Furthermore, we present X-Linear Attention Networks (dubbed X-LAN), which integrate X-Linear attention block(s) into the image encoder and sentence decoder of an image captioning model to leverage higher-order intra- and inter-modal interactions. Experiments on the COCO benchmark demonstrate that X-LAN obtains the best published CIDEr performance to date, 132.0%, on the COCO Karpathy test split. When the Transformer is further endowed with X-Linear attention blocks, CIDEr is boosted to 132.8%. Source code is available at https://github.com/Panda-Peter/image-captioning.
ISSN:	2575-7075
DOI:	10.1109/CVPR42600.2020.01098
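
Illustration (not part of the catalog record): the summary describes an attention block that combines a spatial and a channel-wise attention distribution over bilinear (element-wise product) query-key features. The NumPy sketch below is one simplified reading of that description, not the authors' implementation. The function names, parameter names (`Wk`, `Wqk`, `Wv`, `Wqv`, `Wb`, `ws`, `We`), layer sizes, and random initialisation are all assumptions made for illustration; the linked repository is the authoritative reference.

```python
import numpy as np


def elu(x):
    # Exponential Linear Unit; the minimum() guards exp() against overflow.
    return np.where(x > 0, x, np.exp(np.minimum(x, 0)) - 1)


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def init_params(rng, d_q, d_k, d_v, d, d_c, scale=0.1):
    # Hypothetical parameter set; shapes chosen only so the sketch runs.
    return {
        "Wk": rng.standard_normal((d_k, d)) * scale,   # key projection
        "Wqk": rng.standard_normal((d_q, d)) * scale,  # query projection (key side)
        "Wv": rng.standard_normal((d_v, d)) * scale,   # value projection
        "Wqv": rng.standard_normal((d_q, d)) * scale,  # query projection (value side)
        "Wb": rng.standard_normal((d, d_c)) * scale,   # squeeze bilinear features
        "ws": rng.standard_normal(d_c) * scale,        # spatial attention scorer
        "We": rng.standard_normal((d_c, d)) * scale,   # channel attention excitation
    }


def x_linear_attention(q, K, V, p, act=elu):
    """q: (d_q,), K: (N, d_k), V: (N, d_v) -> attended feature of shape (d,)."""
    # 2nd-order interactions: element-wise product of projected query and keys/values.
    Bk = act(K @ p["Wk"]) * act(q @ p["Wqk"])   # (N, d)
    Bv = act(V @ p["Wv"]) * act(q @ p["Wqv"])   # (N, d)
    Bk_emb = np.maximum(Bk @ p["Wb"], 0.0)       # (N, d_c)
    # Spatial attention: one weight per region, normalised with softmax.
    beta_s = softmax(Bk_emb @ p["ws"])           # (N,)
    # Channel-wise attention: pool over regions, then a sigmoid gate per channel.
    beta_c = 1.0 / (1.0 + np.exp(-(Bk_emb.mean(axis=0) @ p["We"])))  # (d,)
    attended = beta_c * (beta_s @ Bv)            # (d,)
    return attended, beta_s, beta_c


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(rng, d_q=8, d_k=6, d_v=6, d=16, d_c=4)
    q = rng.standard_normal(8)
    K = rng.standard_normal((5, 6))
    V = rng.standard_normal((5, 6))
    out, beta_s, beta_c = x_linear_attention(q, K, V, params)
    print(out.shape, beta_s.shape, beta_c.shape)
```

In this sketch, stacking the function (feeding `attended` back in as the next query) would correspond to the higher-order interactions mentioned in the summary, and the `act` argument marks where the summary's ELU is applied.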