TerrierTail: Mitigating Tail Latency of Cloud Virtual Machines

Bibliographic Details
Published in: IEEE Transactions on Parallel and Distributed Systems, Vol. 29, No. 10, pp. 2346-2359
Main Authors: Asyabi, Esmail; SanaeeKohroudi, SeyedAlireza; Sharifi, Mohsen; Bestavros, Azer
Format: Journal Article
Language: English
Published: New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.10.2018
Summary: Large-scale online services parallelize sub-operations of a user's request across a large number of physical machines (service components) to improve responsiveness. Even a temporary latency spike in any service component can notably inflate the end-to-end delay; the tail of the latency distribution of service components has therefore become a subject of intensive research. Key characteristics of clouds, such as elasticity and on-demand resource provisioning, have made them attractive for hosting large-scale online services, with VMs serving as the building blocks of those services. However, adherence to traditional hypervisor scheduling policies leads to unpredictable CPU access latencies for the virtual CPUs (vCPUs) that perform network IO processing. The result is poor, unpredictable network IO performance, which exacerbates VMs' long tail latencies and discourages hosting large-scale parallel web services on virtualized clouds. This paper presents TerrierTail, a hypervisor CPU scheduler whose primary goal is to trim the tail of the latency distribution of individual VMs in virtualized clouds. In TerrierTail, we modify the network driver to identify the vCPUs that perform network IO processing. Leveraging this information, the TerrierTail scheduler reduces the CPU access latencies of such vCPUs through novel scheduling policies, yielding higher and more predictable network IO performance and therefore lower tail latency. TerrierTail's gains come with no measurable negative impact on other performance attributes (e.g., fairness) or on the performance of VMs running other types of workloads (e.g., CPU-intensive VMs). A prototype implementation of TerrierTail in the Xen hypervisor substantially outperforms Xen's default Credit scheduler. For example, TerrierTail reduces the tail latency of a Memcached server by up to 53 percent and of an RPC server by up to 50 percent at the 99.9th percentile.
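
To make the idea concrete, the following is a minimal C sketch of the kind of prioritization the abstract describes: the network driver flags vCPUs that handle network IO, and the scheduler prefers those vCPUs when choosing what to run next so their CPU access latency stays low. This is not the actual TerrierTail or Xen Credit-scheduler code; all structure and function names (struct vcpu, handles_net_io, pick_next) are hypothetical and simplified for illustration.

/* Toy run queue: vCPUs flagged as network-IO handlers are picked before
 * compute-only vCPUs, while a credit check stands in for fairness.
 * Hypothetical sketch only, not the TerrierTail implementation. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct vcpu {
    int  id;
    bool handles_net_io;   /* would be set by the modified network driver */
    int  credits;          /* stand-in for Credit-scheduler accounting */
};

/* Pick the next vCPU: prefer IO-handling vCPUs so their CPU access
 * latency (and hence the VM's tail latency) stays low and predictable. */
static struct vcpu *pick_next(struct vcpu *rq, size_t n)
{
    struct vcpu *best = NULL;
    for (size_t i = 0; i < n; i++) {
        struct vcpu *v = &rq[i];
        if (v->credits <= 0)
            continue;                              /* out of credits: skip for fairness */
        if (best == NULL ||
            (v->handles_net_io && !best->handles_net_io))
            best = v;                              /* boost IO-handling vCPUs */
    }
    return best;
}

int main(void)
{
    struct vcpu rq[] = {
        { .id = 0, .handles_net_io = false, .credits = 30 },
        { .id = 1, .handles_net_io = true,  .credits = 30 },
        { .id = 2, .handles_net_io = false, .credits = 30 },
    };
    struct vcpu *next = pick_next(rq, sizeof rq / sizeof rq[0]);
    if (next)
        printf("next vCPU: %d (net IO: %s)\n",
               next->id, next->handles_net_io ? "yes" : "no");
    return 0;
}

Run against the example queue, this picks vCPU 1 because it is flagged as handling network IO, even though vCPU 0 is also runnable; the credit check keeps a flagged vCPU from monopolizing the CPU, which mirrors the paper's claim that the gains come without sacrificing fairness.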
ISSN: 1045-9219, 1558-2183
DOI: 10.1109/TPDS.2018.2827075