Grow and Prune Compact, Fast, and Accurate LSTMs

Bibliographic Details
Published in: IEEE Transactions on Computers, Vol. 69, No. 3, pp. 441-452
Main Authors: Dai, Xiaoliang; Yin, Hongxu; Jha, Niraj K.
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.03.2020

Summary: Long short-term memory (LSTM) has been widely used for sequential data modeling. Researchers have increased LSTM depth by stacking LSTM cells to improve performance. This incurs model redundancy, increases run-time delay, and makes the LSTMs more prone to overfitting. To address these problems, we propose a hidden-layer LSTM (H-LSTM) that adds hidden layers to LSTM's original one-level nonlinear control gates. H-LSTM increases accuracy while employing fewer external stacked layers, thus reducing the number of parameters and run-time latency significantly. We employ grow-and-prune (GP) training to iteratively adjust the hidden layers through gradient-based growth and magnitude-based pruning of connections. This learns both the weights and the compact architecture of H-LSTM control gates. We have GP-trained H-LSTMs for image captioning, speech recognition, and neural machine translation applications. For the NeuralTalk architecture on the MSCOCO dataset, our three models reduce the number of parameters by 38.7× [floating-point operations (FLOPs) by 45.5×], run-time latency by 4.5×, and improve the CIDEr-D score by 2.8 percent, respectively. For the DeepSpeech2 architecture on the AN4 dataset, the first model we generated reduces the number of parameters by 19.4× and run-time latency by 37.4 percent. The second model reduces the word error rate (WER) from 12.9 to 8.7 percent. For the encoder-decoder sequence-to-sequence network on the IWSLT 2014 German-English dataset, the first model we generated reduces the number of parameters by 10.8× and run-time latency by 14.2 percent. The second model increases the BLEU score from 30.02 to 30.98. Thus, GP-trained H-LSTMs can be seen to be compact, fast, and accurate.
ISSN: 0018-9340, 1557-9956
DOI: 10.1109/TC.2019.2954495
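
The summary above describes two ideas: control gates that each contain their own hidden layer (H-LSTM), and grow-and-prune (GP) training that alternates gradient-based growth with magnitude-based pruning of connections. The sketch below is only an illustration of those ideas, not the authors' implementation; the use of PyTorch, the ReLU gate activation, the gate_hidden width, the grow/prune fractions, and the gradient-scaled initialization of newly grown weights are all assumptions.

```python
# Minimal sketch of an H-LSTM cell and one grow-and-prune (GP) step.
# Not the authors' code; widths, activations, fractions, and the
# initialization of grown weights are assumptions.

import torch
import torch.nn as nn


class HLSTMCell(nn.Module):
    """LSTM cell whose four control gates are small MLPs (one hidden
    layer each) rather than single affine transforms."""

    def __init__(self, input_size, hidden_size, gate_hidden=64):
        super().__init__()

        def gate():
            # One hidden layer per gate; ReLU is an assumed activation.
            return nn.Sequential(
                nn.Linear(input_size + hidden_size, gate_hidden),
                nn.ReLU(),
                nn.Linear(gate_hidden, hidden_size),
            )

        self.f_gate, self.i_gate, self.o_gate, self.g_gate = (
            gate(), gate(), gate(), gate()
        )

    def forward(self, x, state):
        h, c = state
        z = torch.cat([x, h], dim=-1)
        f = torch.sigmoid(self.f_gate(z))  # forget gate
        i = torch.sigmoid(self.i_gate(z))  # input gate
        o = torch.sigmoid(self.o_gate(z))  # output gate
        g = torch.tanh(self.g_gate(z))     # candidate cell update
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, (h, c)


def grow_and_prune_step(model, lr=0.01, grow_frac=0.05, prune_frac=0.05):
    """One illustrative GP iteration over the gate weight matrices.

    Growth revives dormant weights with the largest gradient magnitude;
    pruning zeroes out the smallest-magnitude active weights. A zero
    entry plays the role of a pruned connection. In a full GP run, the
    growth and pruning phases would be interleaved with training."""
    for p in model.parameters():
        if p.dim() != 2 or p.grad is None:
            continue
        mask = (p.data != 0).float()
        flat_mask = mask.view(-1)
        flat_w = p.data.view(-1)
        flat_g = p.grad.reshape(-1)

        # Gradient-based growth: revive dormant connections with large
        # gradients; seeding them from the gradient is an assumption.
        grad_mag = flat_g.abs() * (1.0 - flat_mask)
        dormant = int((flat_mask == 0).sum())
        k_grow = min(int(grow_frac * grad_mag.numel()), dormant)
        if k_grow > 0:
            idx = torch.topk(grad_mag, k_grow).indices
            flat_mask[idx] = 1.0
            flat_w[idx] = -lr * flat_g[idx]

        # Magnitude-based pruning: drop the smallest active weights.
        w_mag = flat_w.abs()
        w_mag[flat_mask == 0] = float("inf")  # ignore dormant slots
        k_prune = int(prune_frac * w_mag.numel())
        if k_prune > 0:
            idx = torch.topk(w_mag, k_prune, largest=False).indices
            flat_mask[idx] = 0.0

        p.data.mul_(mask)  # apply the updated connectivity mask


if __name__ == "__main__":
    # Tiny smoke test with random data (sizes are arbitrary).
    cell = HLSTMCell(input_size=32, hidden_size=64)
    x = torch.randn(8, 32)
    state = (torch.zeros(8, 64), torch.zeros(8, 64))
    h, state = cell(x, state)
    h.sum().backward()
    grow_and_prune_step(cell)
    print(h.shape)  # torch.Size([8, 64])
```

In a full GP run as described in the summary, steps like the one above would alternate with ordinary training epochs, and the binary masks would eventually encode the compact gate architecture.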