An Automatic and Efficient BERT Pruning for Edge AI Systems
Main Authors | , , , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 21.06.2022 |
Subjects | |
Summary: | With the yearning for deep learning democratization, there are increasing
demands to implement Transformer-based natural language processing (NLP) models
on resource-constrained devices with low latency and high accuracy. Existing
BERT pruning methods require domain experts to heuristically handcraft
hyperparameters to strike a balance among model size, latency, and accuracy. In
this work, we propose AE-BERT, an automatic and efficient BERT pruning
framework with efficient evaluation to select a "good" sub-network candidate
(with high accuracy) given the overall pruning ratio constraints. Our proposed
method requires no human expert experience and achieves better accuracy
on many NLP tasks. Our experimental results on the General Language
Understanding Evaluation (GLUE) benchmark show that AE-BERT outperforms the
state-of-the-art (SOTA) hand-crafted pruning methods on BERT$_{\mathrm{BASE}}$.
On QNLI and RTE, we obtain 75\% and 42.8\% higher overall pruning ratios,
respectively, while achieving higher accuracy. On MRPC, we obtain a score 4.6 higher than the SOTA
at the same overall pruning ratio of 0.5. On STS-B, we can achieve a 40\%
higher pruning ratio with a very small loss in Spearman correlation compared to
SOTA hand-crafted pruning methods. Experimental results also show that, after
model compression, inference of a single BERT$_{\mathrm{BASE}}$
encoder on a Xilinx Alveo U200 FPGA board is 1.83$\times$ faster than on an
Intel(R) Xeon(R) Gold 5218 (2.30GHz) CPU, which indicates that it is reasonable
to deploy the BERT$_{\mathrm{BASE}}$ sub-networks generated by the proposed
method on computation-restricted devices. |
---|---|
DOI: | 10.48550/arxiv.2206.10461 |
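
The summary describes selecting a "good" sub-network candidate under an overall pruning ratio constraint, but this record does not say how candidates are generated or scored. The Python sketch below only illustrates the general idea under stated assumptions: per-layer pruning ratios define a candidate, magnitude pruning stands in for the actual pruning step, and `score_fn` is a hypothetical placeholder for whatever fast evaluation is used. None of these names come from AE-BERT itself.

```python
from typing import Callable, List, Sequence


def overall_pruning_ratio(layer_sizes: Sequence[int],
                          layer_ratios: Sequence[float]) -> float:
    """Fraction of all weights removed when layer i is pruned by layer_ratios[i]."""
    pruned = sum(size * ratio for size, ratio in zip(layer_sizes, layer_ratios))
    return pruned / sum(layer_sizes)


def magnitude_prune(weights: Sequence[float], ratio: float) -> List[float]:
    """Zero out the `ratio` fraction of weights with the smallest magnitude.

    Assumption: magnitude pruning is used here only as a simple stand-in.
    """
    n_prune = int(len(weights) * ratio)
    # Indices sorted by ascending magnitude; the first n_prune are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def select_candidate(candidates: Sequence[Sequence[float]],
                     layer_sizes: Sequence[int],
                     target_ratio: float,
                     score_fn: Callable[[Sequence[float]], float]) -> Sequence[float]:
    """Return the best-scoring candidate that meets the overall ratio target.

    `score_fn` is hypothetical: in practice it would be a cheap accuracy
    estimate of the pruned sub-network.
    """
    feasible = [c for c in candidates
                if overall_pruning_ratio(layer_sizes, c) >= target_ratio - 1e-9]
    if not feasible:
        raise ValueError("no candidate satisfies the overall pruning ratio")
    return max(feasible, key=score_fn)


if __name__ == "__main__":
    sizes = [100, 300]  # toy layer sizes, not BERT dimensions
    cands = [[0.5, 0.5], [0.8, 0.4], [0.2, 0.6], [0.1, 0.3]]
    # Toy score that prefers candidates without one heavily pruned layer.
    best = select_candidate(cands, sizes, 0.5, lambda c: -max(c))
    print(best, overall_pruning_ratio(sizes, best))
```

With a real model, `score_fn` would be replaced by an evaluation of the pruned network on held-out data; the toy score here exists only to make the selection step concrete.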