Predicting the performance of big data applications on the cloud

Bibliographic Details
Published in	The Journal of Supercomputing, Vol. 77, no. 2, pp. 1321-1353
Main Authors	Ardagna, D., Barbierato, E., Gianniti, E., Gribaudo, M., Pinto, T. B. M., da Silva, A. P. C., Almeida, J. M.
Format	Journal Article
Language	English
Published	New York: Springer US (Springer Nature B.V.), 01.02.2021
Subjects	Accuracy; Analytical and simulation models; Apache Spark; Big data; Cloud computing; Compilers; Computer Science; Data science; Infrastructure; Interpreters; Mathematical analysis; Mathematical models; Modelling; Parallel computing; Performance prediction; Processor Architectures; Programming Languages
Summary:	Data science applications have become widespread as a means to extract knowledge from large datasets. Such applications are often characterized by highly heterogeneous and irregular data access patterns, and are therefore often referred to as big data applications. These characteristics make it challenging for existing software and hardware infrastructures to meet the applications’ resource demands. The cloud computing paradigm, in turn, offers a natural hosting solution for such applications, since its on-demand pricing model allows computing resources to be allocated effectively according to the application’s needs. However, these properties pose an extra challenge to the accurate performance prediction of cloud-based applications, which is a key step toward adequate capacity planning and management of the hosting infrastructure. In this article, we tackle this challenge by exploring three modeling approaches for predicting the performance of big data applications running on the cloud. We evaluate two queuing-based analytical models and dagSim, a fast ad-hoc simulator, in various scenarios based on different applications and infrastructure setups. The considered approaches are compared in terms of prediction accuracy and execution time. Our results indicate that our two best approaches, one analytical model and dagSim, can predict average application execution times with a relative error of at most 7%, on average. Moreover, a comparison with the widely used event-based simulator available with the Java Modeling Tool (JMT) suite demonstrates that both the analytical model and dagSim run very fast, with execution times at least two orders of magnitude lower than JMT’s and slightly better accuracy, which makes them practical for online prediction.
ISSN:	0920-8542 1573-0484
DOI:	10.1007/s11227-020-03307-w