Scalable embedded computing through reconfigurable hardware: Comparing DF-Threads, cilk, openmpi and jump
Data-Flow Threads (DF-Threads) is a new execution model that permits to seamlessly distribute the workload across several cores (in a multi-core) and several nodes (in a multi-node/multi-board configuration). In this paper, the advance in deploying this execution model is shown while developing it b...
Saved in:
Published in | Microprocessors and microsystems Vol. 63; pp. 66 - 74 |
---|---|
Main Author | |
Format | Journal Article |
Language | English |
Published |
Kidlington
Elsevier B.V
01.11.2018
Elsevier BV |
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Data-Flow Threads (DF-Threads) is a new execution model that permits to seamlessly distribute the workload across several cores (in a multi-core) and several nodes (in a multi-node/multi-board configuration).
In this paper, the advance in deploying this execution model is shown while developing it by using a combination of a simulator model (i.e., the COTSon framework) and a reconfigurable hardware platform (i.e., the AXIOM-board). The AXIOM platform consists of a custom board based on the Xilinx Zynq Ultrascale+ ZU9EG, which incorporates the largest FPGA available on that System-on-Chip at the moment, four 64-bit ARM cores and two 32-bit ARM cores, up to 32GiB of main memory and several 16Gbit/s transceivers.
While a complete DF-Threads system is still under development, but is already capable of running a full Linux OS and simple applications, so some initial results are presented here. In particular, well-known programming models that are used to exploit the Thread-Level Parallelism such as Cilk, OpenMPI and Jump are compared with DF-thread execution. Cilk is good for multi-cores, but it is not suitable for multi-nodes systems. In the latter cases, the distribution of the workload could be managed partly by the programmer when using programming models such as message-passing (OpenMPI has been chosen for reference) or distributed shared-memory (Jump in our case).
The obtained results show that a DF-Thread execution on a cluster of eight 4-core boards can provide a speed-up of more than 14x compared to the same configuration when using OpenMPI and more than 80x when compared with a OpenMPI single core, single node execution. |
---|---|
ISSN: | 0141-9331 1872-9436 |
DOI: | 10.1016/j.micpro.2018.08.005 |