Hopper: Predictive Load Balancing for RDMA Traffic

Erfan Nosrati & Majid Ghaderi, University of Calgary | arXiv:2506.08132 | June 2025


Problem

ML training workloads (LLMs, DLRMs) on GPU clusters use RoCEv2 (RDMA over Converged Ethernet) in leaf-spine topologies. Existing load balancing approaches all fail for RDMA:

No existing approach performs informed, host-only path selection for RDMA while keeping out-of-order (OOO) delivery under control.


Core Insight

Modern RNICs provide per-packet RTT measurement (hardware timestamps) and limited OOO buffering (IRN). Hopper exploits both jointly: it operates at RTT granularity and always probes a candidate path before switching, never switching blindly.


System Design

Three modules running once per RTT epoch:

Module | What it does
Congestion Detection | Tracks RTT moving average; triggers probing at th_probe = 1.5× base RTT
Path Probing | Sends 2 random-probe QPs (different UDP src port → different ECMP hash); power-of-two-choices selection
Path Switching | Switches only if best probe RTT < 80% of current; delays switch by estimated RTT delta to minimize OOO bursts
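
A minimal sketch of how the detection and switching rules from the three modules above could fit together. The 1.5× and 80% thresholds come from the summary; everything else is an assumption for illustration: the names PathState and choose_switch are hypothetical, the moving average is modeled as an EWMA with an assumed weight of 0.125, and the switch delay is taken to be simply the RTT gap between the current path and the best probe.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

PROBE_FACTOR = 1.5   # th_probe = 1.5 x base RTT (from the summary)
SWITCH_FACTOR = 0.8  # switch only if best probe RTT < 80% of current
EWMA_ALPHA = 0.125   # assumed smoothing weight, not taken from the paper


@dataclass
class PathState:
    """Hypothetical per-connection state for the current path."""
    base_rtt_us: float
    avg_rtt_us: float

    def observe(self, sample_us: float) -> None:
        # Fold one RTT sample into the moving average (EWMA assumed).
        self.avg_rtt_us = (1 - EWMA_ALPHA) * self.avg_rtt_us + EWMA_ALPHA * sample_us

    def congested(self) -> bool:
        # Congestion detection: average RTT exceeds the probing threshold.
        return self.avg_rtt_us > PROBE_FACTOR * self.base_rtt_us


def choose_switch(
    current_rtt_us: float, probe_rtts_us: Sequence[float]
) -> Optional[Tuple[int, float]]:
    """Return (index of best probe, switch delay in us), or None to stay."""
    if not probe_rtts_us:
        return None
    best = min(range(len(probe_rtts_us)), key=lambda i: probe_rtts_us[i])
    best_rtt = probe_rtts_us[best]
    if best_rtt >= SWITCH_FACTOR * current_rtt_us:
        return None  # improvement too small to justify reordering risk
    # Delay the switch by the RTT gap so packets already in flight on the
    # slower path are less likely to be overtaken (assumed interpretation).
    return best, current_rtt_us - best_rtt
```

With a current RTT of 30 us, probes at 28 us and 26 us would not trigger a switch (both are above the 24 us cutoff), while probes at 20 us and 12 us would select the second probe with an 18 us delay.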

Path selection is entirely host-side — only standard ECMP needed on switches, no P4 or custom firmware.
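
Why changing only the UDP source port is enough to move a connection to another path can be sketched as follows. This is a toy model: real switches use vendor-specific ECMP hash functions, so CRC32 over a string-encoded 5-tuple is a stand-in, and ecmp_path and ports_for_distinct_paths are hypothetical helper names. The only fixed fact used is that RoCEv2 traffic is carried over UDP destination port 4791.

```python
import zlib

ROCEV2_UDP_DST_PORT = 4791  # well-known RoCEv2 UDP destination port
UDP_PROTO = 17


def ecmp_path(src_ip: str, dst_ip: str, src_port: int, n_paths: int) -> int:
    """Toy ECMP: hash the 5-tuple and pick one of n_paths uplinks.

    CRC32 is a stand-in for the switch's (vendor-specific) hash.
    """
    key = f"{src_ip}|{dst_ip}|{UDP_PROTO}|{src_port}|{ROCEV2_UDP_DST_PORT}"
    return zlib.crc32(key.encode()) % n_paths


def ports_for_distinct_paths(
    src_ip: str, dst_ip: str, n_paths: int, start_port: int = 49152
) -> dict:
    """Scan source ports and record the first one that lands on each path."""
    found = {}
    for port in range(start_port, 65536):
        found.setdefault(ecmp_path(src_ip, dst_ip, port, n_paths), port)
        if len(found) == n_paths:
            break
    return found


if __name__ == "__main__":
    # Same endpoints, different source ports, different uplinks.
    print(ports_for_distinct_paths("10.0.0.1", "10.0.1.1", n_paths=4))
```

Because the host cannot see the switch's hash, it cannot target a specific path this way; it can only draw a different one, which is why measuring the probe's RTT before committing matters.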


Evaluation

Simulation (ns-3): 128-server leaf-spine, DCQCN transport, Meta AI ML training traces + AliCloud/Hadoop workloads.

Testbed: NVIDIA CX-5 RNICs, Dell SONiC switches, deliberate link asymmetry (10 Gbps vs 1 Gbps paths), GPT-3 AllReduce workload.

Key results vs FlowBender:

Limitation: At 80% load, switch-based ConWeave (with global visibility) still outperforms Hopper.


Key Takeaways

  1. Random path selection is the root problem — FlowBender/RPS/NCCL's internal spraying all suffer under topology asymmetry
  2. Power-of-two-choices probing — probe 2 candidates, pick the better one; O(1) overhead, dramatically better than random
  3. Switch-timing control — delaying path switch by the RTT-delta estimate substantially reduces OOO packets
  4. Host-only has a ceiling — under saturation (80% load), globally-aware switch-based techniques win; Hopper is the best practical option when switch reprogramming isn't available
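
The intuition behind takeaways 1 and 2 can be illustrated with the classic balls-into-bins experiment. This is not a model of Hopper or of RTTs: it only assumes flows are placed on paths one at a time, either on one random path or on the less loaded of two random paths, and compares the most loaded path in each case. The function name max_load is hypothetical.

```python
import random


def max_load(n_flows: int, n_paths: int, choices: int, seed: int = 0) -> int:
    """Place flows on paths and return the load of the busiest path.

    choices=1 is purely random placement; choices=2 samples two random
    paths per flow and takes the less loaded one (power of two choices).
    """
    rng = random.Random(seed)
    load = [0] * n_paths
    for _ in range(n_flows):
        candidates = [rng.randrange(n_paths) for _ in range(choices)]
        target = min(candidates, key=load.__getitem__)
        load[target] += 1
    return max(load)


if __name__ == "__main__":
    n = 10_000
    print("random placement, busiest path:", max_load(n, n, choices=1))
    print("best of two,      busiest path:", max_load(n, n, choices=2))
```

The per-flow cost stays constant (two samples instead of one), which is the sense in which the overhead is O(1).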

Future work explicitly called out: Embed Hopper logic directly into NCCL or other collective communication libraries.


Connection to NCCL / DynamICCL