Understanding, Uncovering, and Mitigating the Causes of Inference Slowdown for Language Models
Published in: 2024 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 723–740
Format: Conference Proceeding
Language: English
Published: IEEE, 09.04.2024
Summary: Dynamic neural networks (DyNNs) have shown promise for alleviating the high computational costs of pre-trained language models (PLMs), such as BERT and GPT. Emerging slowdown attacks have been shown to inhibit the ability of DyNNs to omit computation, e.g., by skipping layers that are deemed unnecessary. As a result, these attacks can cause significant delays in inference speed for DyNNs and may erase their cost savings altogether. Most research on slowdown attacks has been in the image domain, despite the ever-growing computational costs, and relevance of DyNNs, in the language domain. Unfortunately, it is still not understood what language artifacts trigger extra processing in a PLM or what causes this behavior. We aim to fill this gap through an empirical exploration of the slowdown effect on language models. Specifically, we uncover a crucial difference between the slowdown effect in the image and language domains, illuminate the efficacy of pre-existing and novel techniques for causing slowdown, and report circumstances where slowdown does not occur. Building on these observations, we propose the first approach for mitigating the slowdown effect. Our results suggest that slowdown attacks can provide new insights that can inform the development of more efficient PLMs.
DOI: 10.1109/SaTML59370.2024.00042
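The "skipping layers" mechanism the summary refers to can be sketched as a confidence-based early-exit loop: each layer is followed by a small exit head, and inference stops as soon as a prediction clears a confidence threshold. A slowdown attack, in these terms, crafts inputs that keep every exit head below the threshold so the model runs to full depth. This is a minimal illustrative sketch with invented weights, sizes, and threshold values; it is not code or an architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_LAYERS = 12   # illustrative depth, not the paper's model
HIDDEN = 16
NUM_CLASSES = 2

# Random stand-ins for trained layer weights and per-layer exit heads.
layer_weights = [rng.normal(size=(HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)
                 for _ in range(NUM_LAYERS)]
exit_heads = [rng.normal(size=(HIDDEN, NUM_CLASSES))
              for _ in range(NUM_LAYERS)]

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def early_exit_infer(x, threshold=0.9):
    """Return (predicted class, number of layers actually executed)."""
    h = x
    for i in range(NUM_LAYERS):
        h = np.tanh(h @ layer_weights[i])      # one layer of computation
        probs = softmax(h @ exit_heads[i])     # this layer's exit head
        if probs.max() >= threshold:           # confident enough: exit early
            return int(probs.argmax()), i + 1
    return int(probs.argmax()), NUM_LAYERS     # no early exit: full depth

pred, depth = early_exit_infer(rng.normal(size=HIDDEN))
print(pred, depth)
```

An input that exits after a few layers is cheap; an adversarial input that suppresses confidence at every exit head forces all `NUM_LAYERS` layers to run, which is the slowdown effect the paper studies.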