DiffCap: Exploring Continuous Diffusion on Image Captioning
Format: Journal Article
Language: English
Published: 20.05.2023
Summary: Current image captioning works usually generate descriptions in an
autoregressive manner; few works generate descriptions non-autoregressively,
even though doing so brings more decoding diversity. Inspired by the success
of diffusion models in generating natural-looking images, we propose DiffCap,
a novel method that applies continuous diffusion to image captioning. Unlike
image generation, where the output is fixed-size and continuous, image
descriptions vary in length and consist of discrete tokens. Our method maps
discrete tokens into a continuous space in a natural way and applies
continuous diffusion to them, fusing extracted image features to generate
captions. Our experiments on the COCO dataset demonstrate that our method uses
a much simpler structure to achieve results comparable to previous
non-autoregressive works. Beyond quality, an intriguing property of DiffCap is
its high diversity during generation, which many autoregressive models lack.
We believe our approach to fusing multimodal features in diffusion language
generation will inspire more research on multimodal language generation tasks,
owing to its simplicity and decoding flexibility.
DOI: 10.48550/arxiv.2305.12144
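The summary describes the approach only at a high level, so the following is a rough, assumption-heavy sketch of what "continuous diffusion over token embeddings, conditioned on image features" can look like in PyTorch. The module layout, dimensions, additive conditioning, noise schedule, sampler, and nearest-neighbour rounding are all simplified placeholders for illustration; none of them are taken from the DiffCap paper.

```python
# Illustrative sketch: embed discrete caption tokens into a continuous space,
# diffuse them, and denoise conditioned on image features. All names and
# design choices here are hypothetical simplifications, not DiffCap itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, MAX_LEN, T = 10000, 128, 32, 1000

# Standard DDPM-style linear noise schedule and its cumulative products.
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class Denoiser(nn.Module):
    """Predicts clean embeddings x0 from noisy ones, fusing image features
    and a timestep embedding by simple addition (a hypothetical design)."""
    def __init__(self, img_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, DIM)
        self.time_emb = nn.Embedding(T, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x_t, img_feat, t):
        # Broadcast image and timestep conditioning over all token positions.
        cond = self.img_proj(img_feat).unsqueeze(1) + self.time_emb(t).unsqueeze(1)
        return self.encoder(x_t + cond)

embed = nn.Embedding(VOCAB, DIM)   # maps discrete tokens to continuous vectors
model = Denoiser()

def training_loss(tokens, img_feat):
    """One training step: noise the clean token embeddings, predict them back,
    and regress the prediction against the clean embeddings."""
    x0 = embed(tokens)                                    # (B, L, DIM)
    t = torch.randint(0, T, (tokens.size(0),))
    a = alphas_bar[t].view(-1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * torch.randn_like(x0)  # forward diffusion
    return F.mse_loss(model(x_t, img_feat, t), x0)

@torch.no_grad()
def generate(img_feat, steps=50):
    """Start from Gaussian noise and iteratively denoise; finally round each
    position to its nearest token embedding to recover discrete words."""
    B = img_feat.size(0)
    x = torch.randn(B, MAX_LEN, DIM)
    ts = torch.linspace(T - 1, 0, steps).long()
    for i, t in enumerate(ts):
        x0_hat = model(x, img_feat, t.repeat(B))
        if i + 1 < len(ts):
            # Crude sampler: re-noise the x0 estimate to the next noise level.
            a_next = alphas_bar[ts[i + 1]]
            x = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * torch.randn_like(x)
        else:
            x = x0_hat
    # Nearest-neighbour rounding from continuous vectors back to the vocabulary.
    dists = torch.cdist(x.reshape(-1, DIM), embed.weight)        # (B*L, VOCAB)
    return dists.argmin(dim=-1).view(B, MAX_LEN)                 # token ids
```

Predicting the clean embeddings (rather than the added noise) is a common choice in text diffusion because it makes the final rounding back to discrete tokens straightforward; whether DiffCap parameterizes its model this way is not stated in the record, so treat it as an assumption of this sketch.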