Hopper: Predictive Load Balancing for RDMA Traffic
Erfan Nosrati & Majid Ghaderi, University of Calgary | arXiv:2506.08132 | June 2025
Problem
ML training workloads (LLMs, DLRMs) on GPU clusters use RoCEv2 (RDMA over Ethernet) in leaf-spine topologies. Existing load balancing approaches all fail for RDMA:
- ECMP — static hash routing, no adaptation to skewed flow-size distributions
- Random Packet Spraying (RPS) — floods RNIC on-chip SRAM with out-of-order (OOO) packets; breaks under topology asymmetry
- Flowlet switching (CONGA, SilkRoad) — relies on inter-packet gaps; RDMA sends packets back-to-back with no gaps
- FlowBender — reacts to ECN but picks new paths randomly, often landing on another congested path
No existing approach does informed, host-only path selection for RDMA while controlling OOO delivery.
Core Insight
Modern RNICs provide per-packet RTT measurement (hardware timestamps) and limited OOO buffering (IRN). Hopper exploits both jointly — operating at RTT granularity and always probing before switching, never blindly.
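A minimal sketch of the timestamp half of this insight, using the standard libibverbs extended-CQ API from rdma-core. This is not Hopper's code; the helper names are invented, the timestamps come back in raw device ticks, and turning them into an RTT still requires matching send/completion events and the device clock rate.

```cpp
// Sketch: reading per-completion hardware timestamps from an RNIC via the
// libibverbs extended-CQ API. Illustrates the capability Hopper builds on,
// not the paper's implementation.
#include <infiniband/verbs.h>
#include <cstdint>
#include <cstdio>

// Create a CQ that reports hardware completion timestamps.
ibv_cq_ex *create_timestamped_cq(ibv_context *ctx) {
    ibv_cq_init_attr_ex attr = {};
    attr.cqe = 256;                                    // CQ depth (arbitrary)
    attr.wc_flags = IBV_WC_EX_WITH_COMPLETION_TIMESTAMP;
    return ibv_create_cq_ex(ctx, &attr);
}

// Drain completions and print the raw device-clock timestamp of each.
// Converting ticks to nanoseconds is device-specific (clock rate via
// ibv_query_device_ex), so it is left out here.
void poll_with_timestamps(ibv_cq_ex *cq) {
    ibv_poll_cq_attr poll_attr = {};
    if (ibv_start_poll(cq, &poll_attr))                // CQ empty or error
        return;
    do {
        uint64_t ts = ibv_wc_read_completion_ts(cq);   // HW timestamp (ticks)
        std::printf("wr_id=%llu status=%d ts=%llu\n",
                    (unsigned long long)cq->wr_id, (int)cq->status,
                    (unsigned long long)ts);
    } while (ibv_next_poll(cq) == 0);
    ibv_end_poll(cq);
}
```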
System Design
Three modules running once per RTT epoch:
| Module | What it does |
|---|---|
| Congestion Detection | Tracks RTT moving average; triggers probing at th_probe = 1.5x base RTT |
| Path Probing | Sends probes on 2 randomly chosen QPs (different UDP src port → different ECMP hash); power-of-two-choices selection |
| Path Switching | Switches only if best probe RTT < 80% of current; delays switch by estimated RTT delta to minimize OOO bursts |
Path selection is entirely host-side — only standard ECMP needed on switches, no P4 or custom firmware.
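A sketch of how the three modules might compose into one per-epoch loop, based only on the table above. The 1.5x probe threshold and 0.8 switch threshold are from the paper; PathController, send_probe, reroute_after, the EWMA weight, and the UDP source-port range are assumptions.

```cpp
// Sketch of Hopper's per-RTT-epoch control loop (not the paper's code).
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>
#include <random>

using namespace std::chrono;

struct PathController {
    static constexpr double kProbeThreshold  = 1.5;    // probe when RTT > 1.5 * base RTT (paper)
    static constexpr double kSwitchThreshold = 0.8;    // switch only if probe < 0.8 * current (paper)
    static constexpr double kAlpha           = 0.125;  // EWMA weight (assumption)

    microseconds base_rtt{};       // RTT measured on an unloaded path
    double ewma_rtt_us = 0.0;      // moving average of per-packet RTT samples

    // send_probe(udp_src_port): issue one probe on a QP whose UDP source
    // port steers it onto a (likely) different ECMP path; returns probe RTT.
    std::function<microseconds(uint16_t)> send_probe;
    // reroute_after(delay, udp_src_port): move traffic to the chosen path
    // after `delay`, so packets in flight on the old, slower path are not
    // overtaken and OOO arrivals stay bounded.
    std::function<void(microseconds, uint16_t)> reroute_after;

    std::mt19937 rng{std::random_device{}()};

    // Fed from per-packet hardware timestamps.
    void on_rtt_sample(microseconds rtt) {
        ewma_rtt_us = (1 - kAlpha) * ewma_rtt_us + kAlpha * rtt.count();
    }

    // Called once per RTT epoch.
    void epoch() {
        if (ewma_rtt_us < kProbeThreshold * base_rtt.count())
            return;                                     // path still healthy, do nothing

        // Power-of-two-choices: probe two random candidate paths.
        std::uniform_int_distribution<int> port(49152, 65535);
        uint16_t p1 = static_cast<uint16_t>(port(rng));
        uint16_t p2 = static_cast<uint16_t>(port(rng));
        microseconds r1 = send_probe(p1), r2 = send_probe(p2);
        microseconds best_rtt = std::min(r1, r2);
        uint16_t best_port    = (r1 <= r2) ? p1 : p2;

        if (best_rtt.count() >= kSwitchThreshold * ewma_rtt_us)
            return;                                     // not clearly better: stay put

        // Delay the switch by the estimated RTT difference so the faster
        // new path does not deliver ahead of packets already in flight.
        auto delta = microseconds(
            static_cast<microseconds::rep>(ewma_rtt_us - best_rtt.count()));
        reroute_after(delta, best_port);
    }
};
```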
Evaluation
Simulation (ns-3): 128-server leaf-spine, DCQCN transport, Meta AI ML training traces + AliCloud/Hadoop workloads.
Testbed: NVIDIA CX-5 RNICs, Dell SONiC switches, deliberate link asymmetry (10 Gbps vs 1 Gbps paths), GPT-3 AllReduce workload.
Key results vs FlowBender:
- Up to 20% avg FCT / 14% p99 FCT improvement on ML training (simulation)
- Up to 45% avg / 77% p99 FCT improvement on testbed
- 51% reduction in total ML training time (1 MB chunks)
- Hopper even beats the switch-based CONGA at moderate load (50%)
Limitation: At 80% load, switch-based ConWeave (with global visibility) still wins.
Key Takeaways
- Random path selection is the root problem — FlowBender/RPS/NCCL's internal spraying all suffer under topology asymmetry
- Power-of-two-choices probing — probe 2 candidates, pick the better one; O(1) overhead, dramatically better than random (toy example after this list)
- Switch-timing control — delaying path switch by the RTT-delta estimate substantially reduces OOO packets
- Host-only has a ceiling — under saturation (80% load), globally-aware switch-based techniques win; Hopper is the best practical option when switch reprogramming isn't available
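A toy Monte Carlo of the power-of-two-choices point above: paths carry random background load, and we compare the load of the path picked by a single random choice (FlowBender-style) against the better of two random choices (Hopper-style). The model and constants are illustrative only, not from the paper.

```cpp
// Toy Monte Carlo: single random path pick vs best-of-two probing.
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    constexpr int kPaths = 8, kTrials = 100000;
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> load(0.0, 1.0);
    std::uniform_int_distribution<int> pick(0, kPaths - 1);

    double random_sum = 0, two_choice_sum = 0;
    for (int t = 0; t < kTrials; ++t) {
        std::vector<double> util(kPaths);
        for (auto &u : util) u = load(rng);            // background utilization per path

        random_sum     += util[pick(rng)];             // one blind random pick
        int a = pick(rng), b = pick(rng);              // probe two, keep the better
        two_choice_sum += std::min(util[a], util[b]);
    }
    std::printf("avg utilization of chosen path: random=%.3f  best-of-2=%.3f\n",
                random_sum / kTrials, two_choice_sum / kTrials);
    return 0;
}
```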
Future work explicitly called out: Embed Hopper logic directly into NCCL or other collective communication libraries.
Connection to NCCL / DynamICCL
- NCCL's internal QP-per-channel random spraying has the same vulnerability as RPS; the paper calls this out explicitly
- Hopper's chunk-size knob maps directly to NCCL's chunkSize/numPipeOps
- Per-QP RTT is a low-overhead, switch-independent congestion signal directly usable as RL state in a DynamICCL agent
- The plugin architecture of DynamICCL is the natural integration point the authors suggest for embedding Hopper-style logic
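A purely hypothetical sketch of what that RL state could look like. None of these names or the feature layout come from the paper or from NCCL; it only shows how per-QP RTT plus the chunk-size knob might be flattened into a feature vector.

```cpp
// Hypothetical: packaging per-QP RTT statistics as RL state for a
// DynamICCL-style agent. QpRttStats, RlState, and build_state are
// invented names.
#include <cstddef>
#include <vector>

struct QpRttStats {
    double ewma_rtt_us;    // smoothed per-QP RTT from RNIC timestamps
    double base_rtt_us;    // unloaded RTT on this QP's current path
    double last_probe_us;  // RTT of the most recent probe on this QP
};

struct RlState {
    std::vector<double> features;  // flat vector fed to the policy
};

// One feature triple per QP/channel: congestion level, probe headroom,
// and the current chunk size (Hopper's knob, corresponding to NCCL's
// chunkSize/numPipeOps tuning).
RlState build_state(const std::vector<QpRttStats> &qps, std::size_t chunk_bytes) {
    RlState s;
    s.features.reserve(qps.size() * 3);
    for (const auto &q : qps) {
        s.features.push_back(q.ewma_rtt_us / q.base_rtt_us);    // >1.5 => congested
        s.features.push_back(q.last_probe_us / q.ewma_rtt_us);  // <0.8 => better path exists
        s.features.push_back(static_cast<double>(chunk_bytes));
    }
    return s;
}
```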