A Reschedulable Dataflow-SIMD Execution for Increased Utilization in CGRA Cross-Domain Acceleration


Bibliographic Details
Published in	IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 42, no. 3, pp. 874-886
Main Authors	Yin, Chen; Jing, Naifeng; Jiang, Jianfei; Wang, Qin; Mao, Zhigang
Format	Journal Article
Language	English
Published	New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.03.2023
Subjects	Access execute decoupling; Arrays; Bandwidth; coarse-grained reconfigurable array (CGRA); Codes; Computer architecture; dataflow decoupling; Domains; Graph theory; Hardware; Kernel; Pipeline processing; Plasticine; Run time (computers); Scheduling; subgraph scheduling; System-on-chip; Utilization

Summary:	When a coarse-grained reconfigurable array (CGRA) architecture shifts toward cross-domain acceleration, control flow and memory accesses often degrade processing element (PE) utilization and array efficiency by breaking the intact dataflow graph (DFG) into regions with mismatched pipelining rates and access-execution stages. In this article, we propose a reschedulable dataflow and SIMD execution, which decouples a DFG with mismatched dataflow into multiple independent subgraphs. We map only one subgraph at a time, fully unrolled, and reschedule the different subgraphs serially at runtime. Each subgraph therefore works in its own way without interfering with the others. At the same time, an individual subgraph can execute its dataflow as a stream to improve utilization, while the unrolled instances, composed as SIMD, facilitate request coalescing for efficient memory access. With lightweight hardware modification, our design can be integrated into a general CGRA architecture. The experimental results show that our proposal improves performance and energy efficiency over a statically scheduled stream-dataflow CGRA (Plasticine) by 1.6× and 1.8×, over a dynamically scheduled one (TIA) by 1.5× and 2.7×, and outperforms Plasticine organized as vector-SIMD by 1.2× and 1.4×.
ISSN:	0278-0070, 1937-4151
DOI:	10.1109/TCAD.2022.3185544
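
The execution scheme described in the Summary (run one subgraph at a time, replicate it across SIMD lanes, then reschedule the next subgraph serially) can be illustrated with a small software model. This is only a toy sketch under stated assumptions, not the authors' hardware design: the names `run_rescheduled` and `coalesce`, the `lanes` parameter, and the 64-byte line size are all hypothetical and chosen for illustration.

```python
from typing import Callable, List

# Hypothetical toy model: a "subgraph" is a function applied to one loop iteration.
Subgraph = Callable[[int], int]


def run_rescheduled(subgraphs: List[Subgraph], inputs: List[int], lanes: int) -> List[int]:
    """Run subgraphs one at a time; each is replicated across `lanes` SIMD lanes."""
    data = list(inputs)
    for subgraph in subgraphs:                    # serial rescheduling of subgraphs
        results = []
        for base in range(0, len(data), lanes):   # one SIMD batch per step
            batch = data[base:base + lanes]
            # Every lane executes the same subgraph on its own iteration.
            results.extend(subgraph(x) for x in batch)
        data = results                            # buffered output feeds the next subgraph
    return data


def coalesce(addresses: List[int], line_bytes: int = 64) -> List[int]:
    """Merge per-lane byte addresses into the distinct memory lines they touch."""
    return sorted({addr // line_bytes for addr in addresses})


if __name__ == "__main__":
    stages = [lambda x: x + 1, lambda x: x * 2]   # two assumed, independent subgraphs
    print(run_rescheduled(stages, [0, 1, 2, 3, 4], lanes=4))
    print(coalesce([0, 4, 8, 64, 68]))            # five lane requests, two memory lines
```

In this model, each subgraph sees only its own batch of iterations, so stages with different rates never stall each other, and lanes that touch the same memory line collapse into a single request.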