Apparatus and method for adaptable and efficient lane-wise tensor processing

Bibliographic Details
Main Authors: Burns, Steven; Cook, Jeffrey; Marr, Deborah; Srinivasan, Srikanth; Davare, Abhijit; Nurvitadhi, Eriko; Ayupov, Andrey; Pearce, Jonathan; Sheffield, David; Mishra, Asit; Kirkpatrick, Desmond; Sorokin, Anton Alexandrovich
Format: Patent
Language: English
Published: 15.09.2020

Summary: An apparatus and method for performing efficient, adaptable tensor operations. For example, one embodiment of a processor comprises: front end circuitry to schedule a plurality of matrix operations responsive to a tensor matrix multiplication instruction; a plurality of lanes to perform parallel execution of the matrix operations, each lane comprising: first, second, and third tile registers to store blocks of a first matrix (A), second matrix (B), and third matrix (C), respectively; at least one tensor arithmetic logic unit (TALU) to multiply a block of elements of the first matrix with a block of elements of the second matrix to generate a product and to accumulate the product with a block of elements of the third matrix, wherein each lane is to multiply one or more different blocks of the first and second matrices and to accumulate the resulting one or more products with one or more different blocks of the third matrix; and broadcast circuitry to broadcast one or more invariant matrix blocks to different tile registers within a lane and/or different tile registers across different lanes.
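The data flow in the summary can be modelled in software. The following is a hypothetical Python sketch, not the patented circuit. It splits C += A·B into bs × bs blocks and hands up to num_lanes distinct output blocks of C to the lanes in each round. Within each k step, it fetches each distinct A or B block once and shares it with every lane that uses it, as a stand-in for the broadcast circuitry. The function names (get_block, put_block, talu_mac, lane_matmul), the round-based lane assignment, and the fetch counter are illustrative assumptions. Matrix dimensions are assumed to be multiples of bs.

```python
# Hypothetical software model of lane-wise blocked C += A @ B.
# Names and scheduling are illustrative assumptions, not taken from the patent.

def get_block(m, bi, bj, bs):
    """Copy the bs x bs block at block coordinates (bi, bj) out of matrix m."""
    return [row[bj * bs:(bj + 1) * bs] for row in m[bi * bs:(bi + 1) * bs]]

def put_block(m, bi, bj, bs, blk):
    """Write a bs x bs block back into matrix m at block coordinates (bi, bj)."""
    for r, row in enumerate(blk):
        m[bi * bs + r][bj * bs:(bj + 1) * bs] = row

def talu_mac(a, b, c):
    """Model of one TALU step on square tiles: return c + a @ b."""
    n = len(a)
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def lane_matmul(A, B, C, bs, num_lanes):
    """Compute C += A @ B in place; return the number of A/B block fetches."""
    mb, kb, nb = len(A) // bs, len(A[0]) // bs, len(B[0]) // bs
    out_blocks = [(i, j) for i in range(mb) for j in range(nb)]
    fetches = 0
    # Each round, up to num_lanes lanes each own one distinct output block of C.
    for start in range(0, len(out_blocks), num_lanes):
        active = out_blocks[start:start + num_lanes]
        # Each lane's C tile register is loaded once per round.
        c_tiles = {ij: get_block(C, ij[0], ij[1], bs) for ij in active}
        for k in range(kb):
            # Fetch each distinct A row-block and B column-block once, then
            # share it with all lanes that need it (models the broadcast).
            rows = {i for i, _ in active}
            cols = {j for _, j in active}
            a_tiles = {i: get_block(A, i, k, bs) for i in rows}
            b_tiles = {j: get_block(B, k, j, bs) for j in cols}
            fetches += len(a_tiles) + len(b_tiles)
            for (i, j) in active:
                c_tiles[(i, j)] = talu_mac(a_tiles[i], b_tiles[j], c_tiles[(i, j)])
        # Write the accumulated C tiles back to memory.
        for (i, j), tile in c_tiles.items():
            put_block(C, i, j, bs, tile)
    return fetches

if __name__ == "__main__":
    A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    B = [[1 if r == c else 0 for c in range(4)] for r in range(4)]  # identity
    for lanes in (1, 4):
        C = [[0] * 4 for _ in range(4)]
        fetches = lane_matmul(A, B, C, bs=2, num_lanes=lanes)
        print(lanes, fetches, C == A)  # prints "1 16 True", then "4 8 True"
```

On a 4 × 4 problem split into 2 × 2 blocks, four lanes that share a block row or column receive the same A or B block by broadcast. The model therefore performs 8 block fetches instead of the 16 a single lane needs, while producing the same C.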
Bibliography: Application Number: US201816147696