Mamba State-Space Models Are Lyapunov-Stable Learners
Main Authors |  |
---|---|
Format | Journal Article |
Language | English |
Published | 31.05.2024 |
Summary: | Mamba state-space models (SSMs) were recently shown to outperform state-of-the-art (SOTA) Transformer large language models (LLMs) across various tasks. Despite subsequent widespread adoption, little work has examined whether Mamba LLMs are amenable to the fine-tuning frameworks ubiquitously used for Transformer-based LLMs, e.g., mixed-precision fine-tuning (MPFT) and parameter-efficient fine-tuning (PEFT). For the former, it remains an open question whether Mamba's recurrent dynamics are robust to small input changes, such as those encountered during MPFT. Using dynamical systems theory (in particular, Lyapunov exponents), we answer this question in the affirmative. We empirically validate this result through several experiments, showing that Mamba SSMs are significantly more stable to changes introduced by mixed precision than comparable Transformers, even when MPFT and PEFT are combined. For PEFT, we show how targeting specific memory buffers in Mamba's customized CUDA kernels for low-rank adaptation regularizes SSM parameters, thus providing both parameter-efficient learning and computational savings. Finally, with both MPFT and PEFT enabled, we explore the impact of instruction tuning Mamba SSMs for in-context learning (ICL) on natural language tasks. While pretrained Mamba and Mamba-2 models achieve only 38% and 82% (respectively) of the ICL improvements of comparable Transformer-based LLMs, we show that instruction tuning allows Mamba models to narrow this gap to 81% and Mamba-2 models to exceed it, reaching 132%. |
---|---|
DOI: | 10.48550/arxiv.2406.00209 |
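
The summary's stability argument rests on Lyapunov exponents: if the largest exponent of the recurrent dynamics is non-positive, a small perturbation of the hidden state (such as mixed-precision rounding noise) shrinks on average rather than compounding. The sketch below is a minimal illustration of how that exponent can be estimated numerically for a toy diagonal linear recurrence h_t = a * h_{t-1} + b * x_t; the toy dynamics and all names are assumptions for illustration, not the paper's code or Mamba's actual recurrence.

```python
# Minimal sketch (not the paper's code): estimate the largest Lyapunov exponent
# of a toy diagonal linear recurrence h_t = a * h_{t-1} + b * x_t.
# A negative estimate means an infinitesimal hidden-state perturbation
# (e.g., mixed-precision rounding error) decays on average over time.
import numpy as np

def largest_lyapunov_exponent(a, b, inputs, eps=1e-6, seed=0):
    """Average log growth rate of a small hidden-state perturbation."""
    rng = np.random.default_rng(seed)
    h = np.zeros_like(a)
    h_pert = h + eps * rng.standard_normal(h.shape)
    log_growth = 0.0
    for x in inputs:
        h = a * h + b * x              # nominal trajectory
        h_pert = a * h_pert + b * x    # perturbed trajectory, same inputs
        delta = np.linalg.norm(h_pert - h)
        log_growth += np.log(delta / eps)
        # renormalize so the perturbation stays infinitesimal
        h_pert = h + eps * (h_pert - h) / delta
    return log_growth / len(inputs)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    dim = 16
    a = np.exp(-np.abs(rng.standard_normal(dim)))   # entries in (0, 1): contractive
    b = np.ones(dim)
    xs = rng.standard_normal(200)
    print("largest Lyapunov exponent ~", largest_lyapunov_exponent(a, b, xs))
```

With contractive diagonal entries the estimate comes out negative, which is the sense in which perturbations like low-precision rounding are damped rather than amplified by the recurrence.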
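The summary also describes PEFT via low-rank adaptation applied to specific memory buffers in Mamba's customized CUDA kernels. That kernel-level targeting is not reproduced here; the sketch below only illustrates the generic low-rank-update idea on a stand-alone linear projection, assuming PyTorch, with all class and variable names hypothetical.

```python
# Minimal LoRA-style sketch (an assumption-laden illustration, not the paper's
# kernel-level method): a frozen base projection plus a trainable low-rank
# update scale * (B @ A). Restricting updates to a rank-r factor regularizes
# the adapted parameters and shrinks the number of trainable weights.
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        out_features, in_features = base.weight.shape
        self.lora_A = nn.Parameter(0.01 * torch.randn(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# Usage: wrap a projection analogous to an SSM input projection.
proj = nn.Linear(256, 512)
adapted = LowRankAdapter(proj, rank=8)
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")   # 8*256 + 512*8 = 6144
```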