FloatSD: A New Weight Representation and Associated Update Method for Efficient Convolutional Neural Network Training

Bibliographic Details
Published in: IEEE Journal on Emerging and Selected Topics in Circuits and Systems, Vol. 9, No. 2, pp. 267-279
Main Authors: Lin, Po-Chen; Sun, Mu-Kai; Kung, Chuking; Chiueh, Tzi-Dar
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.06.2019
Summary: In this paper, we propose a floating-point signed digit (FloatSD) format for convolutional neural network (CNN) weight representation, together with an update method for use during CNN training. A weight contains as few as two non-zero digits during the forward and backward passes of training, reducing each convolution multiplication to the addition of two shifted multiplicands (partial products). Furthermore, the mantissa and exponent fields of neuron activations and gradients are also quantized during training, yielding floating-point numbers represented by eight bits. We tested the FloatSD method on three popular CNN applications, namely MNIST, CIFAR-10, and ImageNet. The three CNNs were trained from scratch both with conventional 32-bit floating-point arithmetic and with the proposed scheme: FloatSD weight representation, 8-bit floating-point activations and gradients, and half-precision 16-bit floating-point accumulation. The FloatSD-trained networks achieved accuracy very close to, and in some cases better than, that of their 32-bit floating-point counterparts. Finally, the proposed method not only significantly reduces the computational complexity of CNN training but also saves about three quarters of memory capacity and bandwidth, demonstrating its effectiveness for low-complexity implementation of CNN training.
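
To illustrate the core idea described in the summary, the sketch below shows, in Python, how a weight restricted to two signed power-of-two digits plus a shared exponent turns multiplication into the sum of two shifted copies of the activation. This is only an illustrative reading of the abstract, not the authors' implementation; the field names, digit positions, and example values are assumptions chosen for clarity.

```python
# Illustrative sketch of the FloatSD multiply-as-shift-add idea (not the paper's exact format).
from dataclasses import dataclass

@dataclass
class FloatSDWeight:
    exponent: int   # shared exponent of the weight (assumed field)
    s1: int         # sign of the first non-zero digit: +1 or -1
    p1: int         # bit position of the first non-zero digit
    s2: int         # sign of the second non-zero digit: +1, -1, or 0 if absent
    p2: int         # bit position of the second non-zero digit

    def value(self) -> float:
        """Decode the weight to an ordinary floating-point value."""
        mantissa = self.s1 * 2.0 ** self.p1 + self.s2 * 2.0 ** self.p2
        return mantissa * 2.0 ** self.exponent

def multiply(activation: float, w: FloatSDWeight) -> float:
    """Multiply an activation by a FloatSD weight.

    Because the weight has only two non-zero signed digits, the product is the
    sum of two partial products, each a power-of-two-scaled (i.e., shifted)
    copy of the activation, so no general multiplier is needed.
    """
    pp1 = w.s1 * activation * 2.0 ** (w.exponent + w.p1)  # first partial product
    pp2 = w.s2 * activation * 2.0 ** (w.exponent + w.p2)  # second partial product
    return pp1 + pp2

# Example: weight = (+2^0 - 2^-3) * 2^-2 = 0.21875
w = FloatSDWeight(exponent=-2, s1=+1, p1=0, s2=-1, p2=-3)
assert abs(multiply(1.5, w) - 1.5 * w.value()) < 1e-12
```

In a hardware datapath, the two scalings by powers of two would be bit shifts selected by the stored digit positions, which is what allows the convolution multipliers to be replaced by shift-and-add units.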
ISSN: 2156-3357, 2156-3365
DOI: 10.1109/JETCAS.2019.2911999