Comparing application performance on HPC-based Hadoop platforms with local storage and dedicated storage

Bibliographic Details
Published in	2016 IEEE International Conference on Big Data (Big Data), pp. 233-242
Main Authors	Zhuozhao Li, Haiying Shen, Jeffrey Denton, Walter Ligon
Format	Conference Proceeding
Language	English
Published	IEEE 01.12.2016
Subjects	Big data; Distributed databases; Measurement; Metadata; Resource management; Servers; Throughput

More Information
Summary:	Many high-performance computing (HPC) sites extend their clusters to support Hadoop MapReduce for a variety of applications. However, an HPC cluster differs from a Hadoop cluster in how its storage resources are configured. In the Hadoop Distributed File System (HDFS), data resides on the compute nodes, while in an HPC cluster, data is stored on separate nodes dedicated to storage. Dedicated storage offloads I/O from the compute nodes and provides more powerful storage. Local storage provides better locality and avoids contention for shared storage resources. To gain insight into the two platforms, in this paper we investigate the performance and resource utilization of different types of applications (I/O-intensive, data-intensive, and CPU-intensive) on HPC-based Hadoop platforms with local storage and with dedicated storage. We find that I/O-intensive and data-intensive applications with large input files benefit more from dedicated storage, while the same applications with small input files benefit more from local storage. CPU-intensive applications with a large number of small input files benefit more from local storage, while those with large input files benefit approximately equally from the two platforms. We verify our findings with trace-driven experiments on different types of jobs from the Facebook synthesized trace. This work provides guidance on choosing the best platform to optimize the performance of different types of applications and reduce system overhead.
DOI:	10.1109/BigData.2016.7840609