VMamba: Visual State Space Model


Bibliographic Details
Main Authors: Liu, Yue; Tian, Yunjie; Zhao, Yuzhong; Yu, Hongtian; Xie, Lingxi; Wang, Yaowei; Ye, Qixiang; Liu, Yunfan
Format: Journal Article
Language: English
Published: 18.01.2024

Summary: Designing computationally efficient network architectures persists as an ongoing necessity in computer vision. In this paper, we transplant Mamba, a state-space language model, into VMamba, a vision backbone that works in linear time complexity. At the core of VMamba lies a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module. By traversing along four scanning routes, SS2D helps bridge the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the gathering of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments showcase VMamba's promising performance across diverse visual perception tasks, highlighting its advantages in input scaling efficiency compared to existing benchmark models. Source code is available at https://github.com/MzeroMiko/VMamba.
DOI: 10.48550/arxiv.2401.10166
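
Note on SS2D (illustrative): the summary above describes SS2D as traversing four scanning routes to reconcile the ordered 1D selective scan with the non-sequential structure of 2D vision data. The following is a minimal PyTorch-style sketch of that cross-scan / cross-merge idea under stated assumptions; the function names and the selective_scan placeholder are hypothetical and do not reflect the actual API of the linked repository, where the scan is performed by the S6 selective-scan kernel.

    import torch

    def cross_scan(x: torch.Tensor) -> torch.Tensor:
        # Unfold a (B, C, H, W) feature map into four 1D sequences of length H*W:
        # row-major, column-major, and their reversed counterparts.
        row = x.flatten(2)                       # (B, C, H*W), row-major order
        col = x.transpose(2, 3).flatten(2)       # (B, C, H*W), column-major order
        return torch.stack([row, col, row.flip(-1), col.flip(-1)], dim=1)  # (B, 4, C, L)

    def cross_merge(seqs: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # Fold the four scanned sequences back into a (B, C, H, W) map by summation.
        B, _, C, _ = seqs.shape
        row, col, row_r, col_r = seqs.unbind(dim=1)
        out = row + row_r.flip(-1)
        col_sum = (col + col_r.flip(-1)).view(B, C, W, H).transpose(2, 3).flatten(2)
        return (out + col_sum).view(B, C, H, W)

    def ss2d(x: torch.Tensor, selective_scan) -> torch.Tensor:
        # Apply a 1D selective scan along each of the four routes, then merge.
        # `selective_scan` is a stand-in for the actual S6 kernel (assumption).
        B, C, H, W = x.shape
        seqs = cross_scan(x)                                   # (B, 4, C, H*W)
        scanned = selective_scan(seqs.view(B * 4, C, H * W)).view(B, 4, C, H * W)
        return cross_merge(scanned, H, W)

    # Usage with an identity stand-in for the selective-scan kernel:
    x = torch.randn(2, 8, 4, 4)
    y = ss2d(x, selective_scan=lambda s: s)
    print(y.shape)  # torch.Size([2, 8, 4, 4])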