Improved Linear Convergence of Training CNNs With Generalizability Guarantees: A One-Hidden-Layer Case

Bibliographic Details
Published in	IEEE Transactions on Neural Networks and Learning Systems, Vol. 32, no. 6, pp. 2622-2635
Main Authors	Zhang, Shuai; Wang, Meng; Xiong, Jinjun; Liu, Sijia; Chen, Pin-Yu
Format	Journal Article
Language	English
Published	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 1 June 2021
Subjects	Accelerated gradient descent (GD); Algorithms; Artificial neural networks; Complexity; Complexity theory; Convergence; Convolutional neural networks; Generalizability; Global optimality; Linear convergence; Machine learning; Mathematical models; Neural networks; Noise; Noise levels; Normal distribution; Parameter estimation; Sociology; Tensors; Testing; Training; Training data
Summary:	We analyze the learning problem of one-hidden-layer nonoverlapping convolutional neural networks with the rectified linear unit (ReLU) activation function from the perspective of model estimation. The training outputs are assumed to be generated by the neural network with the unknown ground-truth parameters plus some additive noise, and the objective is to estimate the model parameters by minimizing a nonconvex squared loss function of the training data. Assuming that the training set contains a finite number of samples generated from the Gaussian distribution, we prove that the accelerated gradient descent (GD) algorithm with a proper initialization converges to the ground-truth parameters (up to the noise level) at a linear rate even though the learning problem is nonconvex. Moreover, the convergence rate is proved to be faster than that of vanilla GD. The initialization can be achieved by the existing tensor initialization method. In contrast to existing works that assume an infinite number of samples, we theoretically establish the sample complexity, that is, the number of training samples required. Although the neural network considered here is not deep, this is the first work to show that accelerated GD algorithms can find the global optimizer of the nonconvex learning problem of neural networks. This is also the first work that characterizes the sample complexity of gradient-based methods in learning convolutional neural networks with the nonsmooth ReLU activation function. This work also provides the tightest bound so far on the estimation error with respect to the output noise.
ISSN:	2162-237X, 2162-2388
DOI:	10.1109/TNNLS.2020.3007399
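
Illustration:	The setting described in the Summary can be sketched in a few lines of Python. This is only a toy illustration and not the authors' implementation. Several choices here are assumptions not stated in this record: a single filter shared across the nonoverlapping patches with averaged ReLU outputs, heavy-ball momentum as the accelerated GD variant, a small random perturbation of the ground truth standing in for the tensor initialization, and all function names and numeric settings (step size, momentum, sample count).

```python
import random


def relu(z):
    return z if z > 0.0 else 0.0


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def predict(w, patches):
    # Assumed model: one shared filter w applied to each nonoverlapping patch,
    # ReLU responses averaged over the patches.
    return sum(relu(dot(w, p)) for p in patches) / len(patches)


def loss_and_grad(w, data):
    # Empirical squared loss (1 / 2N) * sum_n (prediction_n - y_n)^2
    # and a subgradient with respect to w.
    n = len(data)
    grad = [0.0] * len(w)
    loss = 0.0
    for patches, y in data:
        residual = predict(w, patches) - y
        loss += 0.5 * residual * residual / n
        for p in patches:
            if dot(w, p) > 0.0:  # ReLU subgradient is 1 on the active side
                for i, p_i in enumerate(p):
                    grad[i] += residual * p_i / (len(patches) * n)
    return loss, grad


def heavy_ball(w0, data, step=1.0, momentum=0.3, iters=100):
    # Heavy-ball update (an assumed choice of accelerated GD):
    # w_next = w - step * grad(w) + momentum * (w - w_prev)
    w_prev, w = list(w0), list(w0)
    for _ in range(iters):
        _, g = loss_and_grad(w, data)
        w_next = [wi - step * gi + momentum * (wi - wpi)
                  for wi, gi, wpi in zip(w, g, w_prev)]
        w_prev, w = w, w_next
    return w


def distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5


def run_demo(seed=0, n_samples=300, n_patches=3, patch_dim=4, noise_std=0.0):
    rng = random.Random(seed)
    w_true = [rng.gauss(0.0, 1.0) for _ in range(patch_dim)]
    data = []
    for _ in range(n_samples):
        # Gaussian inputs, split into nonoverlapping patches.
        patches = [[rng.gauss(0.0, 1.0) for _ in range(patch_dim)]
                   for _ in range(n_patches)]
        y = predict(w_true, patches) + noise_std * rng.gauss(0.0, 1.0)
        data.append((patches, y))
    # Stand-in for the tensor initialization: start near the ground truth.
    w0 = [wi + 0.1 * rng.gauss(0.0, 1.0) for wi in w_true]
    w_hat = heavy_ball(w0, data)
    return distance(w0, w_true), distance(w_hat, w_true)


if __name__ == "__main__":
    init_err, final_err = run_demo()
    print("initial error:", init_err)
    print("final error:", final_err)
```

Setting `momentum=0.0` in `heavy_ball` reduces the update to vanilla GD, which gives a simple way to compare the two on the same synthetic data. Passing a positive `noise_std` to `run_demo` adds the additive output noise mentioned in the Summary.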