Query-efficient model extraction for text classification model in a hard label setting
| Field | Value |
|---|---|
| Published in | Journal of King Saud University – Computer and Information Sciences, Vol. 35, No. 4, pp. 10–20 |
| Format | Journal Article |
| Language | English |
| Publisher | Elsevier B.V.; Springer |
| Published | 01.04.2023 |
Summary: Designing a query-efficient model extraction strategy to steal models from cloud-based platforms under black-box constraints remains a challenge, especially for language models. In a more realistic setting, the lack of information about the target model's internal parameters, gradients, training data, or even confidence scores prevents attackers from easily copying the target model. Selecting informative and useful examples to train a substitute model is critical to query-efficient model stealing. We propose a novel model extraction framework that fine-tunes a pretrained model based on bidirectional encoder representations from transformers (BERT) while improving query efficiency through an active learning selection strategy. The strategy, which combines semantic-based diversity sampling with class-balanced uncertainty sampling, builds an informative subset of a public unannotated dataset as the input for fine-tuning. We apply our method to extract deep classifiers whose architectures are either identical to or mismatched with the substitute model, under both tight and moderate query budgets. Furthermore, we evaluate the transferability of adversarial examples constructed with the help of the models extracted by our method. The results show that our method achieves higher accuracy with fewer queries than existing baselines, and the resulting models exhibit a high transferability success rate for adversarial examples.
ISSN: 1319-1578, 2213-1248
DOI: 10.1016/j.jksuci.2023.02.019
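The selection strategy described in the summary — semantic-based diversity sampling followed by class-balanced uncertainty sampling — could be sketched roughly as below. This is an illustrative reconstruction, not the authors' code: the greedy farthest-point diversity filter, prediction entropy as the uncertainty measure, and the oversample-then-trim budget split are all assumptions. In the hard-label setting the target model returns only labels, so the probabilities used for uncertainty come from the attacker's own substitute model, and the embeddings would in practice come from a BERT encoder rather than the random vectors used here.

```python
import numpy as np

def farthest_point_candidates(emb, m, seed=0):
    """Semantic-diversity filter: greedy farthest-point sampling over the
    pool's embedding vectors (stand-ins for BERT sentence embeddings)."""
    rng = np.random.default_rng(seed)
    n = len(emb)
    m = min(m, n)
    chosen = [int(rng.integers(n))]
    dist = np.linalg.norm(emb - emb[chosen[0]], axis=1)
    while len(chosen) < m:
        nxt = int(dist.argmax())  # pool point farthest from everything chosen
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(emb - emb[nxt], axis=1))
    return np.array(chosen)

def class_balanced_uncertain(cand, probs, budget):
    """Class-balanced uncertainty sampling: among the diverse candidates,
    keep the highest-entropy examples, spread evenly across the substitute
    model's predicted classes."""
    ent = -(probs * np.log(probs + 1e-12)).sum(1)  # prediction entropy
    pred = probs.argmax(1)
    classes = np.unique(pred[cand])
    quota = int(np.ceil(budget / len(classes)))    # even per-class share
    picked = []
    for c in classes:
        idx = cand[pred[cand] == c]
        picked.extend(idx[np.argsort(-ent[idx])][:quota])
    picked = np.array(picked, dtype=int)
    if len(picked) < budget:  # top up from remaining candidates if a class ran short
        rest = np.setdiff1d(cand, picked)
        picked = np.concatenate(
            [picked, rest[np.argsort(-ent[rest])][: budget - len(picked)]]
        )
    # trim back to the budget, keeping the globally most uncertain picks
    return picked[np.argsort(-ent[picked])][:budget]

# Hypothetical usage with random data standing in for BERT embeddings and
# substitute-model probabilities over a 500-example unannotated pool:
rng = np.random.default_rng(1)
emb = rng.normal(size=(500, 16))
probs = rng.dirichlet(np.ones(4), size=500)
cand = farthest_point_candidates(emb, 3 * 20)  # oversample for diversity
queries = class_balanced_uncertain(cand, probs, budget=20)
# `queries` holds 20 pool indices to send to the target model for labels
```

The two-stage split keeps the expensive uncertainty bookkeeping on a small, already-diverse candidate pool, which is one plausible way to stay within a tight query budget.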