ML-NA: A Machine Learning Based Node Performance Analyzer Utilizing Straggler Statistics

Current Cloud clusters often consist of heterogeneous machine nodes, which can trigger performance challenges such as the task straggler problem, whereby a small subset of parallel tasks running abnormally slower than the other sibling ones. The straggler problem leads to extended job response and d...

Full description

Saved in:
Bibliographic Details
Published in2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS) pp. 73 - 80
Main Authors Ouyang, Xue, Wang, Changjian, Yang, Renyu, Yang, Guogui, Townend, Paul, Xu, Jie
Format Conference Proceeding
LanguageEnglish
Published IEEE 01.12.2017
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Current Cloud clusters often consist of heterogeneous machine nodes, which can trigger performance challenges such as the task straggler problem, whereby a small subset of parallel tasks running abnormally slower than the other sibling ones. The straggler problem leads to extended job response and deteriorates system throughput. Poor performance nodes are more likely to engender stragglers, and can undermine straggler mitigation effectiveness. For example, as the dominant mechanism for straggler alleviation, speculative execution functions by creating redundant task replicas on other machine nodes as soon as a straggler is detected. When speculative copies are assigned onto the poor performance nodes, it is hard for them to catch up with the stragglers compared to replicas run on fast nodes. And due to the fact that the performance heterogeneity is caused not only by static attribute variations such as physical capacity, but also dynamic characteristic uctuations such as contention level, analyzing node performance is important yet challenging. In this paper we develop ML-NA, a Machine Learning based Node performance Analyzer. By leveraging historical parallel tasks execution log data, ML-NA classies cluster nodes into different categories and predicts their performance in the near future as a scheduling guide to improve speculation effectiveness and minimize task straggler generation. We consider MapReduce as a representative framework to perform our analysis, and use the published OpenCloud trace as a case study to train and to evaluate our model. Results show that ML-NA can predict node performance categories with an average accuracy up to 92.86%.
ISSN:1521-9097
2690-5965
DOI:10.1109/ICPADS.2017.00021