Understanding, Uncovering, and Mitigating the Causes of Inference Slowdown for Language Models
Published in: 2024 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 723–740
Format: Conference Proceeding
Language: English
Published: IEEE, 09.04.2024
Summary: Dynamic neural networks (DyNNs) have shown promise for alleviating the high computational costs of pre-trained language models (PLMs), such as BERT and GPT. Emerging slowdown attacks have been shown to inhibit the ability of DyNNs to omit computation, e.g., by skipping layers that are deemed unnecessary. As a result, these attacks can cause significant delays in inference speed for DyNNs and may erase their cost savings altogether. Most research on slowdown attacks has been in the image domain, despite the ever-growing computational costs, and relevance of DyNNs, in the language domain. Unfortunately, it is still not understood what language artifacts trigger extra processing in a PLM or what causes this behavior. We aim to fill this gap through an empirical exploration of the slowdown effect on language models. Specifically, we uncover a crucial difference between the slowdown effect in the image and language domains, illuminate the efficacy of pre-existing and novel techniques for causing slowdown, and report circumstances where slowdown does not occur. Building on these observations, we propose the first approach for mitigating the slowdown effect. Our results suggest that slowdown attacks can provide new insights that can inform the development of more efficient PLMs.
DOI: 10.1109/SaTML59370.2024.00042
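The "skipping layers" mechanism the summary refers to can be sketched as a confidence-based early-exit loop: each layer is followed by a small exit head, and inference stops as soon as a prediction clears a confidence threshold. A slowdown attack, in these terms, crafts inputs that keep every exit head below the threshold so the model runs to full depth. This is a minimal illustrative sketch with invented weights, sizes, and threshold values; it is not code or an architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_LAYERS = 12   # illustrative depth, not the paper's model
HIDDEN = 16
NUM_CLASSES = 2

# Random stand-ins for trained layer weights and per-layer exit heads.
layer_weights = [rng.normal(size=(HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)
                 for _ in range(NUM_LAYERS)]
exit_heads = [rng.normal(size=(HIDDEN, NUM_CLASSES))
              for _ in range(NUM_LAYERS)]

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def early_exit_infer(x, threshold=0.9):
    """Return (predicted class, number of layers actually executed)."""
    h = x
    for i in range(NUM_LAYERS):
        h = np.tanh(h @ layer_weights[i])      # one layer of computation
        probs = softmax(h @ exit_heads[i])     # this layer's exit head
        if probs.max() >= threshold:           # confident enough: exit early
            return int(probs.argmax()), i + 1
    return int(probs.argmax()), NUM_LAYERS     # no early exit: full depth

pred, depth = early_exit_infer(rng.normal(size=HIDDEN))
print(pred, depth)
```

An input that exits after a few layers is cheap; an adversarial input that suppresses confidence at every exit head forces all `NUM_LAYERS` layers to run, which is the slowdown effect the paper studies.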