Hopper: Predictive Load Balancing for RDMA Traffic
Erfan Nosrati & Majid Ghaderi, University of Calgary | arXiv:2506.08132 | June 2025
Problem
ML training workloads (LLMs, DLRMs) on GPU clusters use RoCEv2 (RDMA over Ethernet) in leaf-spine topologies. Existing load balancing approaches all fail for RDMA:
- ECMP — static hash routing, no adaptation to skewed flow-size distributions
- Random Packet Spraying (RPS) — floods RNIC on-chip SRAM with out-of-order (OOO) packets; breaks under topology asymmetry
- Flowlet switching (CONGA, SilkRoad) — relies on inter-packet gaps; RDMA sends packets back-to-back with no gaps
- FlowBender — reacts to ECN but picks new paths randomly, often landing on another congested path
No existing approach does informed, host-only path selection for RDMA while controlling OOO delivery.
Core Insight
Modern RNICs provide per-packet RTT measurement (hardware timestamps) and limited OOO buffering (IRN). Hopper exploits both jointly — operating at RTT granularity and always probing before switching, never blindly.
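A minimal sketch of the timestamp half of this insight, using the standard libibverbs extended-CQ API from rdma-core. This is not Hopper's code; the helper names are invented, the timestamps come back in raw device ticks, and turning them into an RTT still requires matching send/completion events and the device clock rate.

```cpp
// Sketch: reading per-completion hardware timestamps from an RNIC via the
// libibverbs extended-CQ API. Illustrates the capability Hopper builds on,
// not the paper's implementation.
#include <infiniband/verbs.h>
#include <cstdint>
#include <cstdio>

// Create a CQ that reports hardware completion timestamps.
ibv_cq_ex *create_timestamped_cq(ibv_context *ctx) {
    ibv_cq_init_attr_ex attr = {};
    attr.cqe = 256;                                    // CQ depth (arbitrary)
    attr.wc_flags = IBV_WC_EX_WITH_COMPLETION_TIMESTAMP;
    return ibv_create_cq_ex(ctx, &attr);
}

// Drain completions and print the raw device-clock timestamp of each.
// Converting ticks to nanoseconds is device-specific (clock rate via
// ibv_query_device_ex), so it is left out here.
void poll_with_timestamps(ibv_cq_ex *cq) {
    ibv_poll_cq_attr poll_attr = {};
    if (ibv_start_poll(cq, &poll_attr))                // CQ empty or error
        return;
    do {
        uint64_t ts = ibv_wc_read_completion_ts(cq);   // HW timestamp (ticks)
        std::printf("wr_id=%llu status=%d ts=%llu\n",
                    (unsigned long long)cq->wr_id, (int)cq->status,
                    (unsigned long long)ts);
    } while (ibv_next_poll(cq) == 0);
    ibv_end_poll(cq);
}
```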
System Design
Three modules running once per RTT epoch:
| Module | What it does |
|---|---|
| Congestion Detection | Tracks RTT moving average; triggers probing at th_probe = 1.5x base RTT |
| Path Probing | Sends probes on 2 randomly chosen QPs (different UDP src port → different ECMP hash); power-of-two-choices selection |
| Path Switching | Switches only if best probe RTT < 80% of current; delays switch by estimated RTT delta to minimize OOO bursts |
Path selection is entirely host-side — only standard ECMP needed on switches, no P4 or custom firmware.
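A sketch of how the three modules might compose into one per-epoch loop, based only on the table above. The 1.5x probe threshold and 0.8 switch threshold are from the paper; PathController, send_probe, reroute_after, the EWMA weight, and the UDP source-port range are assumptions.

```cpp
// Sketch of Hopper's per-RTT-epoch control loop (not the paper's code).
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>
#include <random>

using namespace std::chrono;

struct PathController {
    static constexpr double kProbeThreshold  = 1.5;    // probe when RTT > 1.5 * base RTT (paper)
    static constexpr double kSwitchThreshold = 0.8;    // switch only if probe < 0.8 * current (paper)
    static constexpr double kAlpha           = 0.125;  // EWMA weight (assumption)

    microseconds base_rtt{};       // RTT measured on an unloaded path
    double ewma_rtt_us = 0.0;      // moving average of per-packet RTT samples

    // send_probe(udp_src_port): issue one probe on a QP whose UDP source
    // port steers it onto a (likely) different ECMP path; returns probe RTT.
    std::function<microseconds(uint16_t)> send_probe;
    // reroute_after(delay, udp_src_port): move traffic to the chosen path
    // after `delay`, so packets in flight on the old, slower path are not
    // overtaken and OOO arrivals stay bounded.
    std::function<void(microseconds, uint16_t)> reroute_after;

    std::mt19937 rng{std::random_device{}()};

    // Fed from per-packet hardware timestamps.
    void on_rtt_sample(microseconds rtt) {
        ewma_rtt_us = (1 - kAlpha) * ewma_rtt_us + kAlpha * rtt.count();
    }

    // Called once per RTT epoch.
    void epoch() {
        if (ewma_rtt_us < kProbeThreshold * base_rtt.count())
            return;                                     // path still healthy, do nothing

        // Power-of-two-choices: probe two random candidate paths.
        std::uniform_int_distribution<int> port(49152, 65535);
        uint16_t p1 = static_cast<uint16_t>(port(rng));
        uint16_t p2 = static_cast<uint16_t>(port(rng));
        microseconds r1 = send_probe(p1), r2 = send_probe(p2);
        microseconds best_rtt = std::min(r1, r2);
        uint16_t best_port    = (r1 <= r2) ? p1 : p2;

        if (best_rtt.count() >= kSwitchThreshold * ewma_rtt_us)
            return;                                     // not clearly better: stay put

        // Delay the switch by the estimated RTT difference so the faster
        // new path does not deliver ahead of packets already in flight.
        auto delta = microseconds(
            static_cast<microseconds::rep>(ewma_rtt_us - best_rtt.count()));
        reroute_after(delta, best_port);
    }
};
```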
Evaluation
Simulation (ns-3): 128-server leaf-spine, DCQCN transport, Meta AI ML training traces + AliCloud/Hadoop workloads.
Testbed: NVIDIA CX-5 RNICs, Dell SONiC switches, deliberate link asymmetry (10 Gbps vs 1 Gbps paths), GPT-3 AllReduce workload.
Key results vs FlowBender:
- Up to 20% avg FCT / 14% p99 FCT improvement on ML training (simulation)
- Up to 45% avg / 77% p99 FCT improvement on testbed
- 51% reduction in total ML training time (1 MB chunks)
- Hopper even beats the switch-based CONGA at moderate load (50%)
Limitation: At 80% load, switch-based ConWeave (with global visibility) still wins.
Key Takeaways
- Random path selection is the root problem — FlowBender/RPS/NCCL's internal spraying all suffer under topology asymmetry
- Power-of-two-choices probing — probe 2 candidates, pick the better one; O(1) overhead, dramatically better than random (toy example after this list)
- Switch-timing control — delaying path switch by the RTT-delta estimate substantially reduces OOO packets
- Host-only has a ceiling — under saturation (80% load), globally-aware switch-based techniques win; Hopper is the best practical option when switch reprogramming isn't available
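A toy Monte Carlo of the power-of-two-choices point above: paths carry random background load, and we compare the load of the path picked by a single random choice (FlowBender-style) against the better of two random choices (Hopper-style). The model and constants are illustrative only, not from the paper.

```cpp
// Toy Monte Carlo: single random path pick vs best-of-two probing.
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    constexpr int kPaths = 8, kTrials = 100000;
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> load(0.0, 1.0);
    std::uniform_int_distribution<int> pick(0, kPaths - 1);

    double random_sum = 0, two_choice_sum = 0;
    for (int t = 0; t < kTrials; ++t) {
        std::vector<double> util(kPaths);
        for (auto &u : util) u = load(rng);            // background utilization per path

        random_sum     += util[pick(rng)];             // one blind random pick
        int a = pick(rng), b = pick(rng);              // probe two, keep the better
        two_choice_sum += std::min(util[a], util[b]);
    }
    std::printf("avg utilization of chosen path: random=%.3f  best-of-2=%.3f\n",
                random_sum / kTrials, two_choice_sum / kTrials);
    return 0;
}
```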
Future work explicitly called out: Embed Hopper logic directly into NCCL or other collective communication libraries.
Connection to NCCL / DynamICCL
- NCCL's internal QP-per-channel random spraying has the same vulnerability as RPS; the paper calls this out explicitly
- Hopper's chunk-size knob maps directly to NCCL's chunkSize/numPipeOps
- Per-QP RTT is a low-overhead, switch-independent congestion signal directly usable as RL state in a DynamICCL agent
- The plugin architecture of DynamICCL is the natural integration point the authors suggest for embedding Hopper-style logic
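A purely hypothetical sketch of what that RL state could look like. None of these names or the feature layout come from the paper or from NCCL; it only shows how per-QP RTT plus the chunk-size knob might be flattened into a feature vector.

```cpp
// Hypothetical: packaging per-QP RTT statistics as RL state for a
// DynamICCL-style agent. QpRttStats, RlState, and build_state are
// invented names.
#include <cstddef>
#include <vector>

struct QpRttStats {
    double ewma_rtt_us;    // smoothed per-QP RTT from RNIC timestamps
    double base_rtt_us;    // unloaded RTT on this QP's current path
    double last_probe_us;  // RTT of the most recent probe on this QP
};

struct RlState {
    std::vector<double> features;  // flat vector fed to the policy
};

// One feature triple per QP/channel: congestion level, probe headroom,
// and the current chunk size (Hopper's knob, corresponding to NCCL's
// chunkSize/numPipeOps tuning).
RlState build_state(const std::vector<QpRttStats> &qps, std::size_t chunk_bytes) {
    RlState s;
    s.features.reserve(qps.size() * 3);
    for (const auto &q : qps) {
        s.features.push_back(q.ewma_rtt_us / q.base_rtt_us);    // >1.5 => congested
        s.features.push_back(q.last_probe_us / q.ewma_rtt_us);  // <0.8 => better path exists
        s.features.push_back(static_cast<double>(chunk_bytes));
    }
    return s;
}
```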