A certain examination on heterogeneous systolic array (HSA) design for deep learning accelerations with low power computations

Bibliographic Details
Published in Sustainable computing informatics and systems Vol. 44; p. 101042
Main Authors Jayaraman Rajanediran, Dinesh Kumar, Ganesh Babu, C., Priyadharsini, K.
Format Journal Article
Language English
Published Elsevier Inc 01.12.2024
Summary: Acceleration techniques play a crucial role in enhancing the performance of modern high-speed computations, especially in Deep Learning (DL) applications, where speed is of utmost importance. One essential component in this context is the Systolic Array (SA), which handles computational tasks and data processing in a rhythmic manner. Google's Tensor Processing Unit (TPU) leverages the SA for neural networks. The core of the SA's functionality and performance lies in the Computation Element (CE), which facilitates parallel data flow. In this article, we introduce a novel approach called the Proposed Systolic Array (PSA), which is implemented on the CE and further enhanced with a modified Hybrid Kogge Stone adder (MHA). This design, referred to as PSA-MHA, incorporates principles to expedite computations by rounding and extracting the data model in the SA. The PSA, utilizing a data flow model with the MHA, significantly accelerates data shifts and control passes in execution cycles. We validated our approach through simulations on the Cadence Virtuoso platform using 65 nm process technology, comparing it to the General Matrix Multiplication (GMMN) benchmark. The results showed remarkable improvements in the CE, with a 30.29 % reduction in delay, a 23.07 % reduction in area, and an 11.87 % reduction in power consumption. The PSA, in turn, achieved a 46.38 % reduction in delay, a 7.58 % reduction in area, and a 48.23 % decrease in Area Delay Product (ADP). To further substantiate our findings, we applied the PSA-based approach to pre-trained hybrid Convolutional and Recurrent (CNN-RNN) neural models. The PSA-based hybrid model incorporates 189 million Multiply-Accumulate (MAC) units, resulting in a weighted mean architecture value of 784.80 for the RNN component.
We also explored variations in bit width, which led to delay reductions ranging from 20.17 % to 30.29 %, area variations between 13.08 % and 32.16 %, and power consumption changes spanning from 11.88 % to 20.42 %.
• The proposed model introduces a distinctive data flow architecture. This data flow-based design is a new paradigm, and the models are computed with different algorithms.
• The CE and PSA are based on hybrid adder structures.
• The estimated hardware blocks showed that the CE achieved notable reductions in the primary parameters of delay, area, and power, by 30.29 %, 23.07 %, and 11.87 %, respectively.
• The FPGA-based PSA with the modified architecture exhibited considerable changes in delay, area, and ADP of 46.38 %, 7.58 %, and 48.23 %, respectively.
• These models have been tested and simulated on CNN and RNN models for performance evaluation and validation. The results showed better performance of the PSA than the existing state-of-the-art models and underscored the significant role of the CE in accelerating DL models in parallel computations.
ISSN:2210-5379
DOI:10.1016/j.suscom.2024.101042
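
Illustrative note: the summary describes a systolic array whose computation elements perform multiply-accumulate steps with a Kogge Stone style adder while operands shift rhythmically between neighbours. The Python sketch below is only a behavioural illustration of that general idea, not the authors' PSA-MHA design. It uses a textbook Kogge Stone prefix adder (not the modified hybrid adder from the article) and an output-stationary dataflow, and the names `kogge_stone_add` and `systolic_matmul`, the 32-bit width, and the assumption of non-negative integer inputs are all choices made here for illustration.

```python
def kogge_stone_add(a: int, b: int, width: int = 32) -> int:
    """Textbook Kogge Stone addition of unsigned ints, modulo 2**width."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    g = a & b            # per-bit generate
    p = a ^ b            # per-bit propagate
    p0 = p               # original propagate bits, needed for the final sum
    d = 1
    while d < width:     # log2(width) prefix stages with doubling span
        g = (g | (p & (g << d))) & mask
        p = (p & (p << d)) & mask
        d <<= 1
    carries = (g << 1) & mask   # carry into bit i comes from bits 0..i-1
    return p0 ^ carries


def systolic_matmul(A, B, width: int = 32):
    """Simulate an n x m output-stationary systolic array computing A @ B.

    Assumes non-negative integer entries; results are taken modulo 2**width.
    Rows of A enter from the left edge and columns of B from the top edge,
    each skewed by one cycle per row or column.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]     # one accumulator per element
    a_reg = [[0] * m for _ in range(n)]   # operands travelling rightwards
    b_reg = [[0] * m for _ in range(n)]   # operands travelling downwards
    for t in range(n + m + k - 2):        # cycles until the last operand lands
        # Visit elements from the far corner so that each one still sees its
        # neighbour's value from the previous cycle.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:
                    s = t - i
                    a_in = A[i][s] if 0 <= s < k else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    s = t - j
                    b_in = B[s][j] if 0 <= s < k else 0
                else:
                    b_in = b_reg[i - 1][j]
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
                # Multiply-accumulate; the addition uses the prefix adder.
                acc[i][j] = kogge_stone_add(acc[i][j], a_in * b_in, width)
    return acc


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In this sketch each element only ever reads from its left and upper neighbours, which is the property that lets a hardware array pass data in lockstep without a shared bus. The multiplication is left as a plain Python product for brevity.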