HetSpark: A Framework that Provides Heterogeneous Executors to Apache Spark

Bibliographic Details
Published in Procedia Computer Science Vol. 136; pp. 118-127
Main Authors Hidri, Klodjan Klodi; Bilas, Angelos; Kozanitis, Christos
Format Journal Article
Language English
Published Elsevier B.V. 2018
Summary: The increasing computational complexity of Big Data software requires scaling up the nodes of the commodity-hardware clusters that have been widely used for Big Data workloads. Thus, FPGA-based accelerators and GPU devices have recently become first-class citizens in data centers. Utilizing these devices is not a trivial task from an engineering perspective, however, since developers versed in distributed computing frameworks such as Apache Spark are used to developing in higher-level languages and APIs, like Python and Scala, while accelerators require the use of low-level APIs like CUDA and OpenCL. Recent developments in accelerator virtualization, such as VineTalk [6], a software layer that handles the complex communication between applications and FPGAs or GPU devices, have simplified software development using accelerators. This paper presents HetSpark, a heterogeneous modification of Apache Spark. HetSpark enables Apache Spark to operate with two classes of executors: an accelerated class and a commodity class. HetSpark applications are expected to use VineTalk for their entire interaction with accelerators. The schedulers of HetSpark detect the presence of VineTalk routines in the Java binary code; they thereby decide which tasks require accelerators and send those tasks only to executors of the accelerated class. Finally, we thoroughly evaluated the performance of HetSpark with different mixes of executors of the two classes. When applications run linear tasks, we observed that the use of CPU-only executors is preferable to GPU-enhanced executors, while for applications with computationally challenging tasks, the time savings from the use of GPUs compensate for data transfers between commodity and accelerated executors.
ISSN: 1877-0509
DOI: 10.1016/j.procs.2018.08.244