A hardware-based multi-objective thread mapper for tiled manycore architectures

Bibliographic Details
Published in 2015 33rd IEEE International Conference on Computer Design (ICCD), pp. 459-462
Main Authors Pujari, Ravi Kumar; Wild, Thomas; Herkersdorf, Andreas
Format Conference Proceeding
Language English
Published IEEE 01.10.2015
Summary: Thread mapping is typically performed as an integral part of cooperative or pre-emptive operating system (OS) scheduling in order to share the processor core(s) among competing applications. Schedulers usually follow a single-objective performance optimization, such as maximizing core utilization or satisfying deadlines by prioritizing threads. Meeting multiple orthogonal objectives, such as performance vs. power or thermal resilience, is a challenge in the era of manycore processors because of the associated scalability and thread management overhead. We tackle these challenges by employing a two-stage thread management strategy. In the first stage (not covered in this short paper), threads are assigned to regions or compute tiles. For the second stage, we introduce in this paper the TCU (Thread Control Unit), a configurable, low-latency, low-overhead hardware thread mapper that takes various runtime sensor parameters into account. It can map threads within a small and bounded number of clock cycles in a round-robin, single-objective, or multi-objective manner. The TCU is designed to consider not just load balancing or performance criteria but also physical constraints such as power budgets, temperature limits, and reliability aspects. The TCU macro achieves 150K thread mappings per second on a tiled MPSoC FPGA prototype while operating at a moderate 50 MHz. Evaluations of different mapping policies show that multi-objective thread mapping provides about 10 to 40% lower mapping latency for periodic and bursty traffic compared to single-objective or round-robin schemes. FPGA and ASIC syntheses reveal a 9% hardware overhead for the TCU on a four-core compute tile.
DOI:10.1109/ICCD.2015.7357148
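
The summary names three mapping modes (round robin, single-objective, multi-objective) but does not say how the TCU combines its sensor inputs. The following Python sketch is only an illustration of how such policies could differ in software terms. It is not the paper's hardware design: the `CoreState` fields, the weighted-sum cost, the weights, and the temperature and power limits are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class CoreState:
    """Hypothetical per-core sensor snapshot (field names are illustrative)."""
    load: float         # utilization in the range 0.0 to 1.0
    temperature: float  # degrees Celsius
    power: float        # watts


def map_round_robin(cores, last_index):
    """Pick the next core in cyclic order, ignoring all sensor data."""
    return (last_index + 1) % len(cores)


def map_single_objective(cores):
    """Pick the least-loaded core (load balancing as the only objective)."""
    return min(range(len(cores)), key=lambda i: cores[i].load)


def map_multi_objective(cores, weights=(0.5, 0.3, 0.2),
                        temp_limit=85.0, power_budget=2.0):
    """Assumed policy: hard limits filter cores, a weighted sum ranks the rest."""
    w_load, w_temp, w_power = weights

    # Exclude cores that violate the (assumed) physical constraints.
    eligible = [i for i, c in enumerate(cores)
                if c.temperature < temp_limit and c.power < power_budget]
    if not eligible:
        # Fall back to all cores so that a mapping decision is always made.
        eligible = list(range(len(cores)))

    def cost(i):
        c = cores[i]
        # Normalize temperature and power by their limits before weighting.
        return (w_load * c.load
                + w_temp * c.temperature / temp_limit
                + w_power * c.power / power_budget)

    return min(eligible, key=cost)


# A four-core tile with made-up sensor readings.
cores = [
    CoreState(load=0.20, temperature=90.0, power=1.5),  # idle but too hot
    CoreState(load=0.60, temperature=60.0, power=1.0),
    CoreState(load=0.40, temperature=70.0, power=1.8),
    CoreState(load=0.90, temperature=55.0, power=0.8),
]

rr_choice = map_round_robin(cores, last_index=3)
single_choice = map_single_objective(cores)
multi_choice = map_multi_objective(cores)
```

With these invented readings, the load-only policy selects core 0 even though it exceeds the assumed temperature limit, while the multi-objective policy skips it and selects core 1. As a rough consistency check on the reported figures, 150K mappings per second at 50 MHz corresponds to about 333 clock cycles per mapping on average (50,000,000 / 150,000); this is a throughput average and not necessarily the latency of a single mapping decision.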