Enhancing Load Balancing With In-Network Recirculation to Prevent Packet Reordering in Lossless Data Centers

Bibliographic Details
Published in IEEE/ACM Transactions on Networking Vol. 32; no. 5; pp. 4114-4127
Main Authors Hu, Jinbin; He, Yi; Luo, Wangqing; Huang, Jiawei; Wang, Jin
Format Journal Article
Language English
Published IEEE 01.10.2024
Summary: Many existing load balancing mechanisms work effectively in lossy datacenter networks (DCNs), but they suffer from serious packet reordering in lossless Ethernet DCNs deployed with hop-by-hop Priority-based Flow Control (PFC). The key reason is that prior solutions cannot perceive PFC triggering correctly and in a timely manner when making load balancing decisions. Once a forwarding path pauses transmission due to PFC triggering, the packets allocated to it are blocked, inevitably leading to out-of-order packets and retransmission. In this paper, we present a Reordering-robust Load Balancing (RLB) scheme with PFC prediction for lossless DCNs. At its heart, RLB leverages the derivative of the ingress queue length to predict PFC triggering and proactively notifies the upstream switches, which then choose an appropriate rerouting path or perform packet recirculation to avoid reordering. Furthermore, under switch failure scenarios, RLB adjusts the recirculation threshold adaptively to mitigate the risk of excessive packet recirculation. We have implemented RLB on a hardware programmable switch. As a building block for existing load balancing mechanisms, RLB has been integrated into Presto, LetFlow, Hermes and DRILL. The evaluation results show that the RLB-enhanced solutions deliver significant performance gains by avoiding packet reordering. For example, RLB reduces the 99th percentile flow completion time (FCT) by up to 72%, 67%, 58% and 54% over DRILL, Presto, LetFlow and Hermes, respectively.
ISSN: 1063-6692, 1558-2566
DOI: 10.1109/TNET.2024.3403671
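
Illustrative sketch: The summary says RLB uses the derivative of the ingress queue length to predict PFC triggering and warn upstream switches. The abstract does not give the actual algorithm, so the Python below is only a minimal sketch of the general idea and not the authors' method. It assumes a linear extrapolation from two consecutive queue-length samples. The names time_to_pfc, should_warn_upstream, xoff_threshold and horizon are hypothetical and do not come from the paper.

```python
from typing import Optional


def time_to_pfc(prev_len: float, curr_len: float, dt: float,
                xoff_threshold: float) -> Optional[float]:
    """Estimate the time until the ingress queue reaches the PFC pause threshold.

    Assumption (not from the paper): the queue keeps growing at the rate
    observed between the last two samples, taken dt time units apart.
    Returns 0.0 if the queue is already at or above the threshold, and
    None if the queue is not growing.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    if curr_len >= xoff_threshold:
        return 0.0
    # Finite-difference approximation of the queue-length derivative.
    slope = (curr_len - prev_len) / dt
    if slope <= 0:
        return None
    return (xoff_threshold - curr_len) / slope


def should_warn_upstream(prev_len: float, curr_len: float, dt: float,
                         xoff_threshold: float, horizon: float) -> bool:
    """Return True if a pause is predicted within `horizon` time units.

    In this sketch a True result stands in for notifying the upstream
    switch, which could then reroute or recirculate packets.
    """
    eta = time_to_pfc(prev_len, curr_len, dt, xoff_threshold)
    return eta is not None and eta <= horizon
```

For instance, with samples of 40 and 60 queue units taken one time unit apart and a hypothetical threshold of 100, the extrapolated growth rate is 20 units per time unit, so the threshold would be reached in about 2 time units. A warning horizon of 3 would trigger a notification, while a horizon of 1 would not.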