Comparing application performance on HPC-based Hadoop platforms with local storage and dedicated storage
Many high-performance computing (HPC) sites extend their clusters to support Hadoop MapReduce for a variety of applications. However, HPC cluster differs from Hadoop cluster on the configurations of storage resources. In the Hadoop Distributed File System (HDFS), data resides on the compute nodes, w...
Saved in:
Published in | 2016 IEEE International Conference on Big Data (Big Data) pp. 233 - 242 |
---|---|
Main Authors | , , , |
Format | Conference Proceeding |
Language | English |
Published |
IEEE
01.12.2016
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Many high-performance computing (HPC) sites extend their clusters to support Hadoop MapReduce for a variety of applications. However, HPC cluster differs from Hadoop cluster on the configurations of storage resources. In the Hadoop Distributed File System (HDFS), data resides on the compute nodes, while in the HPC cluster, data is stored on separate nodes dedicated to storage. Dedicated storage offloads I/O load from the compute nodes and provides more powerful storage. Local storage provides better locality and avoids contention for shared storage resources. To gain an insight of the two platforms, in this paper, we investigate the performance and resource utilization of different types (i.e., I/O-intensive, data-intensive and CPU-intensive) of applications on the HPC-based Hadoop platforms with local storage and dedicated storage. We find that the I/O-intensive and data-intensive applications with large input file size can benefit more from the dedicated storage, while these applications with small input file size can benefit more from the local storage. CPU-intensive applications with a large number of small-size input files benefit more from the local storage, while these applications with large-size input files benefit approximately equally from the two platforms. We verify our findings by trace-driven experiments on different types of jobs from the Facebook synthesized trace. This work provides guidance on choosing the best platform to optimize the performance of different types of applications and reduce system overhead. |
---|---|
DOI: | 10.1109/BigData.2016.7840609 |