MQL: ML-Assisted Queuing Latency Analysis for Data Center Networks

Bibliographic Details
Published in 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 50-60
Main Authors Narayana, Shruti Yadav; Tong, Jie; Krishnakumar, Anish; Yildirim, Nuriye; Shriver, Emily; Ketkar, Mahesh; Ogras, Umit Y.
Format Conference Proceeding
Language English
Published IEEE 01.04.2023
DOI 10.1109/ISPASS57527.2023.00014

Summary: Data center network (DCN) performance analysis is becoming increasingly critical due to the growing data center scale and proliferation of latency-critical applications. Packet-level simulators, the de-facto performance evaluation tools, allow accurate modeling of the network and protocols, but they are extremely slow. Simulation of large-scale DCNs with thousands of nodes can take days, making meaningful design space exploration impractical. Analytical techniques, such as queuing theory, can mitigate the scalability problem and offer high accuracy when specific workload assumptions are satisfied. However, their accuracy may decline as these assumptions break, and execution times explode unless designed carefully. To address these challenges, we propose a novel and scalable performance analysis methodology that combines two powerful techniques. First, it uses queuing theory and the maximum entropy (ME) principle to approximate the waiting time in each queue in a DCN. It then finds the end-to-end latency of each flow using traffic input, routing algorithm, and network parameters. This ME-based queuing model can approximate the latency under generalized exponential input traffic and general service distributions. Since its accuracy can degrade as traffic diverges from input and service time assumptions, the second step of the proposed methodology learns and corrects the systematic errors using a regression tree. The resulting ML-assisted technique achieves less than 3% modeling error on average compared to ns-3 simulations. Moreover, the speedup over ns-3 ranges from 100× to 9000× on DCNs with 128 to 1024 nodes.
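
The two-step idea in the summary (an analytical per-queue estimate summed along a flow's route, followed by a learned correction of the systematic error) can be sketched roughly as follows. This is not the authors' model: the record does not give the ME-based formula, so Kingman's well-known G/G/1 approximation is swapped in as a stand-in for the per-queue waiting time, and a single-split regression tree (a "stump") stands in for the regression tree. All names (`Queue`, `kingman_wait`, `flow_latency`, `fit_stump`) and the choice of feature are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Queue:
    arrival_rate: float  # packets per second (lambda)
    service_rate: float  # packets per second (mu)
    ca2: float = 1.0     # squared coefficient of variation, inter-arrival times
    cs2: float = 1.0     # squared coefficient of variation, service times


def kingman_wait(q: Queue) -> float:
    """Mean waiting time in one queue (Kingman's G/G/1 approximation).

    Stand-in for the ME-based estimate named in the summary (assumption).
    """
    rho = q.arrival_rate / q.service_rate
    if not 0.0 <= rho < 1.0:
        raise ValueError("queue must be stable (utilization < 1)")
    return (rho / (1.0 - rho)) * ((q.ca2 + q.cs2) / 2.0) / q.service_rate


def flow_latency(path: Sequence[Queue]) -> float:
    """End-to-end latency of a flow: waiting plus service time at every hop."""
    return sum(kingman_wait(q) + 1.0 / q.service_rate for q in path)


def _mean(vals: Sequence[float]) -> float:
    return sum(vals) / len(vals)


def _sse(vals: Sequence[float]) -> float:
    m = _mean(vals)
    return sum((v - m) ** 2 for v in vals)


def fit_stump(xs: Sequence[float],
              residuals: Sequence[float]) -> Callable[[float], float]:
    """Fit a depth-1 regression tree on residuals (reference minus analytical).

    Tries every split between distinct feature values and keeps the one with
    the lowest total squared error; each leaf predicts its mean residual.
    """
    pairs = sorted(zip(xs, residuals))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        left = [r for _, r in pairs[:i]]
        right = [r for _, r in pairs[i:]]
        cost = _sse(left) + _sse(right)
        if best is None or cost < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best = (cost, threshold, _mean(left), _mean(right))
    if best is None:
        raise ValueError("need at least two distinct feature values")
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean
```

A corrected estimate would then be `flow_latency(path) + correction(feature)`, where `correction = fit_stump(features, residuals)` is trained on residuals against a reference simulator such as ns-3 and `feature` might be, for example, the highest link utilization on the path. A practical regression tree would use several features and more than one split.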