Convergence of Two-Layer Regression with Nonlinear Units

Bibliographic Details
Published in: arXiv.org
Main Authors: Deng, Yichuan; Song, Zhao; Xie, Shenghao
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 16.08.2023

More Information
Summary: Large language models (LLMs), such as ChatGPT and GPT-4, have shown outstanding performance on many tasks in human life. Attention computation plays an important role in training LLMs. The softmax unit and the ReLU unit are the key structures in attention computation. Inspired by them, we put forward a softmax ReLU regression problem. Generally speaking, our goal is to find an optimal solution to the regression problem involving the ReLU unit. In this work, we derive a closed-form representation for the Hessian of the loss function. Under certain assumptions, we prove the Lipschitz continuity and the positive semidefiniteness (PSD) of the Hessian. Then, we introduce a greedy algorithm based on an approximate Newton method, which converges in the sense of the distance to the optimal solution. Finally, we relax the Lipschitz condition and prove convergence in the sense of the loss value.
ISSN: 2331-8422
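
As a rough illustration of the kind of method described in the summary, the Python sketch below runs a regularized Gauss-Newton style approximate-Newton iteration on a stand-in ReLU least-squares loss L(x) = 0.5 * ||relu(Ax) - b||^2. It is not the paper's algorithm: the paper's softmax ReLU objective, its Hessian analysis, and its convergence guarantees are not reproduced here, and all function names and the regularization parameter lam are illustrative assumptions.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def approx_newton_relu_regression(A, b, x0, lam=1e-3, steps=20):
    """Approximate-Newton iteration for the stand-in loss
    L(x) = 0.5 * ||relu(A x) - b||_2^2.
    Each step solves (H_tilde) dx = grad, where H_tilde is a
    regularized Gauss-Newton approximation of the Hessian.
    A nonzero start x0 is needed: at x = 0 every ReLU unit is
    inactive and the gradient vanishes."""
    n, d = A.shape
    x = x0.copy()
    for _ in range(steps):
        z = A @ x
        active = (z > 0).astype(float)      # ReLU activation pattern
        resid = relu(z) - b                 # residual vector
        grad = A.T @ (active * resid)       # gradient of L at x
        # Approximate Hessian: A^T diag(active) A + lam * I (PSD by construction)
        H = A.T @ (active[:, None] * A) + lam * np.eye(d)
        x -= np.linalg.solve(H, grad)       # approximate Newton step
    return x

# Tiny usage example on synthetic, realizable data
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))
b = relu(A @ rng.standard_normal(5))
x_hat = approx_newton_relu_regression(A, b, x0=rng.standard_normal(5))
print("final loss:", 0.5 * np.linalg.norm(relu(A @ x_hat) - b) ** 2)

The diag(active) Gauss-Newton term plays the role of an approximate Hessian that stays positive semidefinite, while the lam * I term keeps each linear solve well posed; this mirrors, only loosely, the role the PSD and Lipschitz conditions on the Hessian play in the paper's convergence analysis.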