Hopper: Predictive Load Balancing for RDMA Traffic — Detailed Summary
Erfan Nosrati & Majid Ghaderi, University of Calgary | arXiv:2506.08132 | June 2025
Per-paragraph summary (2 points each), organized by section and subsection headings.
Abstract
Paragraph 1:
- Distributed ML training on large AI clusters requires efficient network interconnects built on RDMA with leaf-spine topologies, and load balancing across redundant paths is essential to maximizing bandwidth utilization.
- Existing load-balancing techniques are either optimized for TCP or require specialized hardware switches, limiting applicability in AI clusters running GPU-generated RDMA traffic.
Paragraph 2:
- Hopper is a host-only RDMA load balancer that continuously monitors RTT on the active path, probes two alternative paths, and switches traffic to a less congested path while controlling switch timing to minimize out-of-order (OOO) packets.
- ns-3 simulations and hardware testbed experiments show Hopper reduces average FCT slowdown by up to 20% and p99 tail FCT by up to 14% compared to the best prior host-based technique, FlowBender.
1. Introduction
Paragraph 1:
- Modern ML models (LLMs, DLRMs) require distributed training on thousands of GPUs, and GPU synchronization communication can dominate total training time in large clusters.
- Optimizing the cluster network to minimize communication overhead is therefore the key lever for accelerating distributed training.
Paragraph 2:
- Large AI clusters use RDMA (typically RoCEv2 over standard Ethernet) and hierarchical leaf-spine topologies that provide multiple redundant paths for fault tolerance and scalability.
- Effective load balancing across multiple paths is critical to maximize bandwidth utilization, yet existing work predominantly targets CPU-generated TCP traffic and does not perform well with hardware-generated RDMA traffic.
Paragraph 3:
- ECMP is widely deployed but fails to distribute load evenly when flow-size distributions are skewed — a small number of large flows carry a disproportionate share of traffic.
- Distributed ML training workloads are particularly problematic for ECMP because they generate fewer but far larger flows than conventional datacenter workloads, leading to worse load imbalance.
Paragraph 4:
- Existing alternatives each have a fatal flaw: RPS generates massive OOO packets; flowlet switching depends on inter-packet gaps absent in hardware-generated RDMA traffic; congestion-triggered approaches (FlowBender, MP-RDMA, ConWeave) are either blind in path selection, require complex host logic, or need programmable P4 switches.
- There is a clear gap for a practical, deployable host-based solution that selects paths intelligently for RDMA in AI clusters.
Paragraph 5:
- Hopper is a host-based, single-path RDMA load balancer that uses measured RTT as a congestion signal, probes alternative paths before committing to a switch, and controls switch timing to minimize OOO packets — requiring only standard ECMP on switches.
- Unlike RPS, Hopper is robust to topology asymmetries; it outperforms FlowBender (host-based) and CONGA (switch-based) while remaining simpler and more deployable than ConWeave (P4 switch-based).
Paragraph 6:
- Hopper is directly motivated by FlowBender's design for TCP, but FlowBender's random path selection causes two problems: the newly chosen path may also be congested, and because the new path is chosen blindly, the uncontrolled RTT difference between the old and new paths can create a large OOO burst at high RDMA transmission rates.
- Hopper avoids both by (a) probing two randomly selected alternatives and choosing the better one only if it is substantially less congested, and (b) delaying transmission on the new path in proportion to the RTT difference to smooth OOO arrivals.
Paragraph 7 (Contributions):
- Hopper is designed as a host-based load balancer operating at near-RTT granularity without excessive OOO packets, exploiting RNICs' existing RTT measurement and limited packet reordering capabilities.
- Evaluation in ns-3 simulation and hardware testbed demonstrates up to 20% average and 14% p99 FCT improvement over FlowBender for ML training workloads.
2. Background and Motivation
Paragraph 1:
- The central motivation is that random path selection — used by RPS, FlowBender, and PLB — is inefficient, and modern RNICs now support hardware timestamping and limited packet reordering sufficient to implement congestion-aware path selection entirely at the host.
- This section provides RDMA background and elaborates the motivation for moving beyond random path selection.
Paragraph 2 (RDMA):
- RDMA delivers low latency and high throughput by implementing all transport functions (congestion control, loss recovery) entirely in NIC hardware, enabling direct memory access over the network without CPU involvement.
- An RDMA connection is identified by a Queue Pair (QP), which is the user-space communication interface to the RDMA transport residing on the RNIC hardware.
Paragraph 3 (RoCEv2):
- RoCEv2 encapsulates RDMA messages in UDP/IP to run RDMA over standard Ethernet; it relies on Priority Flow Control (PFC) to create a lossless fabric, but PFC introduces congestion spreading and deadlocks.
- To address PFC's drawbacks, RoCEv2 stacks are augmented with DCQCN (ECN-based congestion control) and IRN (selective repeat loss recovery for limited OOO packet handling).
Paragraph 4 (Problem with OOO Packets):
- Commercial RNICs have only limited on-chip SRAM (e.g., NVIDIA CX-5 has ~2 MB), so per-connection reordering buffers consume ~0.1 MB each, severely capping the number of concurrent connections that can handle OOO packets.
- Generating excessive OOO packets overwhelms RNIC resources and forces expensive NACK-based loss recovery, so OOO packet rate must be carefully controlled.
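A back-of-the-envelope check of the cap implied above, using only the figures quoted in the text:

```python
# Back-of-the-envelope: how few connections can keep private reordering
# buffers, given the SRAM and per-connection figures quoted above.
SRAM_MB = 2.0               # ~2 MB on-chip SRAM on an NVIDIA CX-5
BUFFER_PER_CONN_MB = 0.1    # ~0.1 MB reordering buffer per connection

print(round(SRAM_MB / BUFFER_PER_CONN_MB))  # ~20 concurrent OOO-capable connections
```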
Paragraph 5 (Problem with Random Packet Spraying):
- RPS distributes packets uniformly across paths but creates massive OOO packets and is fundamentally vulnerable to topology asymmetries — device failures and transient congestion rerouting are common in large clusters, making uniform spreading harmful.
- For distributed ML training, this is particularly damaging because training progress is gated by the slowest flow in a collective operation, so inflated tail FCTs directly inflate training time.
Paragraph 6 (Summary):
- Evaluating per-packet path congestion for every send decision is prohibitive in overhead; a practical middle ground is to amortize path assessment over one RTT, achieving faster response than per-flow while avoiding per-packet overhead.
- Hopper operates at RTT granularity — slower than per-packet (RPS) but faster than per-flow (ECMP) — accepting imperfect load distribution in exchange for far fewer OOO packets and robustness to topology changes.
3. Design
Paragraph 1 (Design Overview):
- Hopper leverages modern RNICs' limited packet reordering and precise timestamping; it operates over control epochs of one RTT each, with three modules: congestion detection, path probing, and path switching — all acting only on packet headers to influence ECMP hashing.
- Algorithm 1 shows the high-level control logic: maintain a moving-average RTT, trigger probing when avg_rtt > th_probe, and trigger switching when avg_rtt > th_cong.
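A minimal Python sketch of this per-epoch control loop, using the defaults reported in §4 (with the 8 µs simulated base RTT, th_probe = 12 µs and th_cong = 20 µs); the State class and both helper stubs are hypothetical placeholders, not the paper's API:

```python
# Minimal, self-contained sketch of Hopper's per-epoch control logic
# (Algorithm 1 as summarized above). Helper names are hypothetical.
from dataclasses import dataclass

@dataclass
class State:
    avg_rtt: float  # EWMA of per-packet RTT, in microseconds

def probe_two_paths(state: State) -> None:
    print("probe: power-of-two-choices over two alternative paths")

def maybe_switch_path(state: State) -> None:
    print("switch: move to a probed path if it clears the NLM margin")

def control_epoch(state, rtt_samples, M=1.0, th_probe=12.0, th_cong=20.0):
    # Fold this epoch's per-packet RTT samples into the moving average
    # (EWMA with coefficient M; larger M reacts faster to spikes).
    for rtt in rtt_samples:
        state.avg_rtt = M * rtt + (1 - M) * state.avg_rtt
    if state.avg_rtt > th_probe:   # lower threshold: start probing
        probe_two_paths(state)
    if state.avg_rtt > th_cong:    # higher threshold: attempt a switch
        maybe_switch_path(state)

control_epoch(State(avg_rtt=8.0), rtt_samples=[14.0, 15.0])  # triggers probing only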
3.1 Congestion Detection Module
Paragraph 1 (Congestion Signal):
- The two canonical congestion signals for RDMA are ECN (used by DCQCN) and RTT (used by Timely and Swift); both are readily available at the RDMA transport layer with low overhead.
- Hopper uses RTT because it is supported by commodity NICs without any switch support and is especially well-suited for path probing — ECN may not reliably fire during a short probe window, while RTT always reflects queuing delay on the probed path.
Paragraph 2 (Detection Mechanism):
- Hopper measures per-packet RTT using ACK timestamps from IRN and maintains an EWMA with coefficient M over each RTT epoch; the path is considered congested when the average RTT exceeds the threshold th_cong.
- Both th_cong and M are tunable: increasing M makes Hopper more responsive to RTT spikes; decreasing M smooths noise at the cost of slower reaction.
3.2 Path Probing Module
Paragraph 1 (Probe Mechanism):
- Modern RNICs' limited OOO support makes it theoretically possible to embed probing directly into data transmission by routing a few data packets on alternate paths; however, the current Hopper design uses out-of-band probe packets to avoid complicating in-flight data management.
- Out-of-band probing must be carefully rate-controlled because it adds load to an already-congested network, and each probed path requires a new QP, straining RNIC scalability if not bounded.
Paragraph 2 (Probe Initiation):
- When avg_rtt exceeds th_probe (a lower threshold than th_cong), Hopper applies the power-of-two-choices strategy: it creates two QPs bound to distinct UDP source ports (distinct ECMP hash buckets, effectively distinct paths) and records the measured RTT on each.
- To prevent redundant probing, Hopper timestamps recently explored paths and skips re-probing any path visited within a ttl_probe interval (set to 4× base RTT by default).
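A sketch of this probe-selection step; the port numbers and helper names are illustrative, and probing here is reduced to bookkeeping:

```python
# Sketch of probing: pick two candidate paths (distinct UDP source ports,
# hence distinct ECMP hash buckets), skipping any path probed within the
# last ttl_probe, per the text (4x base RTT by default).
import random
import time

BASE_RTT_US = 8.0
TTL_PROBE_S = 4 * BASE_RTT_US * 1e-6   # 4x base RTT, in seconds

recently_probed = {}   # udp_src_port -> timestamp of last probe

def pick_probe_ports(candidate_ports, k=2):
    now = time.monotonic()
    # Filter out paths explored within the ttl_probe window.
    fresh = [p for p in candidate_ports
             if now - recently_probed.get(p, 0.0) > TTL_PROBE_S]
    chosen = random.sample(fresh, k=min(k, len(fresh)))  # power-of-two-choices
    for p in chosen:
        recently_probed[p] = now       # timestamp the explored paths
    return chosen

print(pick_probe_ports(candidate_ports=range(49152, 49160)))
```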
3.3 Path Switching Module
Paragraph 1 (Path Selection):
- Hopper only commits to a path switch when the current path is congested (avg_rtt > th_cong) and a probed alternative's RTT clears a configurable improvement margin NLM, set by default to 80% of the current RTT (i.e., the new path's RTT must fall below 80% of the current path's); switches that do not clear this margin are suppressed.
- This conservatism prevents "thrashing" between paths of similar congestion levels and reduces unnecessary OOO packet generation from gratuitous switches.
Paragraph 2 (Path Switching Mechanics):
- To execute a switch, Hopper promotes the QP used for probing to become the active data QP and releases the QP of the old path; if neither probed path passes the NLM threshold, both probe QPs are freed but their records are retained to prevent repeated probing of the same congested paths.
- Retaining recent probe records throttles probing frequency and prevents multiple flows from stampeding onto the same newly-discovered alternative path, which would immediately re-congest it.
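A minimal sketch of the switch decision and QP handling under this reading of the NLM margin (new RTT must be at most 80% of the current RTT); the QP handles are stand-in strings, not rdma-core objects:

```python
# Sketch of the switch decision: promote the better probe QP only if its
# RTT clears the NLM margin; otherwise suppress the switch (probe records
# are retained elsewhere to throttle re-probing).
NLM = 0.8   # default from the text: 80% of the current RTT

def decide_switch(current_rtt, probes):
    """probes: dict mapping probe QP handle -> measured RTT (microseconds)."""
    best_qp, best_rtt = min(probes.items(), key=lambda kv: kv[1])
    if best_rtt <= NLM * current_rtt:
        return best_qp          # promote this probe QP to the active data QP
    return None                 # thrash guard: paths too similar, stay put

print(decide_switch(14.0, {"qp_p1": 8.0, "qp_p4": 10.0}))  # -> "qp_p1"
print(decide_switch(14.0, {"qp_a": 13.0, "qp_b": 13.5}))   # -> None
```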
Paragraph 3 (Reducing Out-of-Order Packets):
- Because both current-path and alternative-path RTTs are known, Hopper delays the first packet on the new path by a duration proportional to the RTT difference, reducing the OOO arrival window at the receiver RNIC.
- To set the delay precisely, Hopper fits a linear regression to RTT measurements within the current epoch, extrapolating the RTT of the last in-flight packet on the congested path as a conservative upper bound for the switching delay — minimizing NACK-triggered retransmissions.
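A sketch of this delay computation, with a plain least-squares fit standing in for the paper's linear regression; the sample data mirror the Appendix A example (old path extrapolates to 14 µs, new path measures 8 µs, so the first packet on the new path is held for 6 µs):

```python
# Sketch: fit a line to this epoch's (send_time, rtt) samples, extrapolate
# the RTT the last in-flight packet will see on the old path, and delay the
# first packet on the new path by the predicted gap. Data are illustrative.

def predict_switch_delay(samples, last_inflight_t, new_path_rtt):
    """samples: list of (send_time_us, rtt_us) from the current epoch."""
    n = len(samples)
    sx = sum(t for t, _ in samples)
    sy = sum(r for _, r in samples)
    sxx = sum(t * t for t, _ in samples)
    sxy = sum(t * r for t, r in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    old_path_rtt = slope * last_inflight_t + intercept  # conservative upper bound
    return max(0.0, old_path_rtt - new_path_rtt)        # hold-off on the new path

samples = [(0, 10.0), (2, 11.0), (4, 12.0), (6, 13.0)]  # RTT rising within epoch
print(predict_switch_delay(samples, last_inflight_t=8, new_path_rtt=8.0))  # 6.0 us
```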
4. Evaluation
Paragraph 1:
- Hopper is evaluated via ns-3 simulations and a hardware testbed, addressing three questions: (1) congestion-aware vs. random path selection (Hopper vs. FlowBender), (2) dynamic load balancing vs. flowlet switching for RDMA (Hopper vs. CONGA), and (3) performance gap between host-based and in-network load balancing (Hopper vs. ConWeave).
- Default Hopper parameters are: M=1, th_probe=1.5× base RTT, th_cong=2.5× base RTT, ttl_probe=4× base RTT, NLM=80%.
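The same defaults written out against the 8 µs simulated base RTT from §4.1.1 (the dict layout is illustrative, not the paper's configuration format):

```python
# Hopper's default parameters from the text, made concrete for the
# simulated base RTT of 8 microseconds.
BASE_RTT_US = 8.0
HOPPER_DEFAULTS = {
    "M": 1.0,                           # EWMA coefficient
    "th_probe_us": 1.5 * BASE_RTT_US,   # 12 us: start probing
    "th_cong_us": 2.5 * BASE_RTT_US,    # 20 us: declare congestion
    "ttl_probe_us": 4.0 * BASE_RTT_US,  # 32 us: skip recently probed paths
    "NLM": 0.8,                         # new RTT must fall below 80% of current
}
```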
4.1 Simulation Experiments
4.1.1 Setup and Methodology
Paragraph 1 (Network Topology):
- The simulated topology is a leaf-spine network with 128 servers and 16 switches: 8 leaf switches (each connected to 16 servers) and 8 spine switches, fully interconnected; all links are 100 Gbps with 1 µs latency, giving a base RTT of 8 µs.
- This topology is identical to ConWeave's simulation setup, enabling fair comparison.
Paragraph 2 (Transport):
- DCQCN is used for congestion control, matching the algorithm in NVIDIA CX-5 RNICs; parameters are the same as ConWeave's paper to ensure comparability.
- An extended IRN implementation is used for packet reordering: OOO packets within 30 sequence numbers of the expected sequence are buffered normally; beyond that threshold, the receiver sends a NACK with SACK to trigger loss recovery.
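A simplified model of this receiver-side rule (not the actual ns-3 implementation):

```python
# Extended IRN receiver logic as described above: OOO packets within 30
# sequence numbers of the expected one are buffered; anything farther
# ahead triggers a NACK carrying SACK information.
OOO_WINDOW = 30

def on_packet(expected_seq, pkt_seq):
    if pkt_seq < expected_seq:
        return "duplicate"           # already delivered; ignore
    if pkt_seq == expected_seq:
        return "deliver"             # in order
    if pkt_seq <= expected_seq + OOO_WINDOW:
        return "buffer"              # tolerable reordering, hold on the RNIC
    return "nack_with_sack"          # too far ahead: trigger loss recovery

print(on_packet(100, 100))  # deliver
print(on_packet(100, 120))  # buffer
print(on_packet(100, 140))  # nack_with_sack
```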
Paragraph 3 (Workloads):
- Three workloads are evaluated: AliCloud Storage, Meta Hadoop (standard datacenter benchmarks), and a distributed ML training workload derived from collective message-size distributions of real Meta AI cluster jobs covering AllReduce, AllGather, and ReduceScatter operations.
- Flow arrivals follow a Poisson process; two load levels are studied: 50% (moderate) and 80% (high) network utilization.
Paragraph 4 (Baselines & Metric):
- Baselines are FlowBender, CONGA, and ConWeave; PLB is excluded because it is designed for TCP and behaves like ECMP for RDMA.
- The primary metric is FCT slowdown — a flow's actual FCT divided by its FCT in an unloaded network; both average and p99 tail are reported.
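The metric, computed on illustrative numbers (a nearest-rank percentile keeps the sketch dependency-free):

```python
# FCT slowdown as defined above: a flow's actual FCT divided by its FCT in
# an unloaded network; the average and the p99 tail are the reported stats.
import statistics

def slowdowns(actual_fcts, ideal_fcts):
    return [a / i for a, i in zip(actual_fcts, ideal_fcts)]

def percentile(values, p):
    vs = sorted(values)
    rank = max(0, min(len(vs) - 1, round(p / 100 * (len(vs) - 1))))
    return vs[rank]

s = slowdowns(actual_fcts=[12.0, 30.0, 9.0, 80.0], ideal_fcts=[10.0, 10.0, 9.0, 20.0])
print(statistics.fmean(s))   # average slowdown
print(percentile(s, 99))     # p99 tail slowdown
```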
4.1.2 Results and Discussion
Paragraph 1 (Datacenter Workloads — Hadoop):
- Hopper consistently outperforms FlowBender for both average FCT (up to 7.8% improvement) and p99 tail FCT (up to 19.6% improvement) across all load levels and flow sizes; better large-flow balancing indirectly relieves congestion for short flows too.
- Compared to ConWeave, Hopper is within 8.3% for small flows at moderate load, showing host-only techniques are competitive with P4-based approaches when load is not extreme.
Paragraph 2 (ML Workload):
- For ML training workloads (dominated by large flows from collective operations), Hopper achieves up to 20% average and 14% p99 FCT improvement over FlowBender across all scenarios, and at 50% load also outperforms CONGA by up to 10% average and 23% p99 FCT.
- At 80% load, switch-based techniques (CONGA, ConWeave) generally outperform host-based ones for the largest flows, confirming that in-network visibility is advantageous under heavy congestion — the primary limitation of any host-based approach.
4.2 Testbed Experiments
Paragraph 1:
- A software prototype of Hopper is implemented on a hardware testbed using Dell PowerEdge R740 servers with NVIDIA CX-5 RNICs and a Dell PowerSwitch S5248F-ON running SONiC OS; the control logic runs in user space using the Linux rdma-core library, while data transfers use standard RDMA verbs.
- The testbed deliberately introduces path asymmetry — leaf-to-spine links at 10 Gbps vs. 1 Gbps — creating a realistic scenario where load balancers must avoid slow paths.
4.2.1 Setup and Methodology
Paragraph 1 (Implementation):
- Because CX-5 RNICs are black-box hardware, Hopper cannot be implemented inside the NIC; instead, flows are divided into chunks and sent via RDMA verbs, with path-switching decisions applied at chunk boundaries; RTT is measured as time between initiating an RDMA send and receiving the completion event.
- Two chunk sizes are evaluated: 10 MB and 1 MB; chunks below 100 KB degrade performance due to excessive system call and context-switch overhead; 1 MB is identified as the practical sweet spot.
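A sketch of this chunking loop; post_send and wait_completion are hypothetical stand-ins for the rdma-core verbs calls, passed in so the sketch stays self-contained:

```python
# Sketch of the testbed prototype's chunking: a flow is split into chunks,
# RTT is taken as the time from posting an RDMA send to seeing its
# completion event, and path decisions apply only at chunk boundaries.
import time

CHUNK_BYTES = 1 << 20   # 1 MB, the practical sweet spot per the text

def send_flow(flow_bytes, post_send, wait_completion, on_chunk_rtt):
    offset = 0
    while offset < flow_bytes:
        size = min(CHUNK_BYTES, flow_bytes - offset)
        t0 = time.monotonic()
        post_send(offset, size)              # e.g. a verbs send on the active QP
        wait_completion()                    # e.g. poll the completion queue
        on_chunk_rtt(time.monotonic() - t0)  # feed Hopper's control logic
        offset += size

send_flow(3 * CHUNK_BYTES, post_send=lambda o, s: None,
          wait_completion=lambda: None, on_chunk_rtt=print)
```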
Paragraph 2 (Network Topology):
- The physical testbed has two leaf switches and six spine switches (created using VRF on the physical switch); deliberately asymmetric path bandwidths (10 Gbps vs. 1 Gbps links) allow 1G link utilization to serve as a direct indicator of poor load balancing.
- Four clients connect to each leaf; each RNIC has two 25 Gbps ports, each assigned to one host.
Paragraph 3 (Transport & Workload):
- All testbed experiments use DCQCN with NVIDIA's recommended default parameters for CX-5 RNICs.
- The workload is an AllReduce collective pattern with flow sizes and frequencies derived from GPT-3 training traces; 204 flows are generated, experiments are repeated five times, and each round synchronizes via a centralized completion server before starting the next.
Paragraph 4 (Baselines & Metrics):
- Testbed baselines are limited to ECMP and a software FlowBender using RTT-based (instead of ECN-based) congestion detection; ConWeave and CONGA are excluded because both require switch modifications unavailable on the testbed.
- Three performance metrics are used: (1) average and tail FCT, (2) average total training time per round, and (3) average utilization of 1G vs. 10G links.
4.2.2 Results and Discussion
Paragraph 1 (Avoiding Congested Paths):
- Both FlowBender and Hopper eventually move flows off 1G paths, but Hopper consistently utilizes 10G links more than FlowBender by up to 35%, demonstrating superior ability to discover less congested paths through informed probing.
- As chunk size increases from 1 MB to 10 MB, 1G link utilization rises for both algorithms, but FlowBender's degradation is nearly twice that of Hopper because random path selection gives FlowBender a higher probability of re-selecting slow paths.
Paragraph 2 (FCT Slowdown):
- Hopper outperforms FlowBender in all scenarios for both average and tail slowdown, except at p99 with 10 MB chunks where they are equivalent; smaller chunk sizes improve load balancing granularity but increase system call overhead.
- With 1 MB chunks, Hopper improves average FCT slowdown by 45%, p95 by 61.9%, and p99 by 77.2% compared to FlowBender — reflecting how well informed probing outperforms random selection under real asymmetric conditions.
Paragraph 3 (Training Time):
- With 1 MB chunks, FlowBender's average total ML training time is 51% higher than Hopper's (1080 s vs. 715 s); with 10 MB chunks the gap is 48.3% (1320 s vs. 890 s).
- These improvements compound because AllReduce training is synchronization-bound — a round completes only when the slowest flow finishes — so reducing tail FCT has multiplicative impact on training throughput.
5. Related Work
Paragraph 1 (Flowlets in RDMA):
- RDMA traffic lacks sufficient inter-packet gaps for flowlet detection (confirmed by real measurements in the MP-RDMA and ConWeave papers); one workaround (HF2T) artificially inserts gaps to create flowlets, but the artificial delays prolong FCTs when the network is uncongested.
- This gap is a core reason Hopper operates at RTT granularity rather than flowlet granularity.
Paragraph 2 (Multipath RDMA):
- MP-RDMA breaks a flow into subflows distributed across paths in proportion to congestion, but requires complex host-side logic for multi-path synchronization that Hopper deliberately avoids for simplicity and deployability.
- NCCL internally uses multiple QPs per message with random spraying at chunk granularity — effectively chunk-level RPS — making it vulnerable to topology asymmetry; Hopper's integration into collective communication libraries such as NCCL is proposed as future work.
Paragraph 3 (Path Probing):
- Both switch-based (CONGA: passive probing; ConWeave: active probing via P4) and host-based (Clove, MP-RDMA, Hermes: power-of-two-choices) path probing approaches have been studied extensively.
- Hopper's design is closest in spirit to Clove (pre-profiled port-to-path mapping) while adopting Hermes's power-of-two-choices strategy; combining both ideas for the RDMA context is Hopper's specific contribution.
Paragraph 4 (RTT Measurement):
- Hardware timestamping in modern NICs (Intel E810, NVIDIA CX-5/CX-6) enables high-precision RTT measurement; Timely (SIGCOMM 2015) showed RTT can be reliably measured with NIC timestamps; Swift improves precision further by excluding local NIC processing delays.
- NVIDIA's ZTR-RTT congestion control also relies on hardware-timestamped RTT measurement, confirming RTT is a mature, commodity-available signal — exactly what Hopper depends on.
6. Conclusion
Paragraph 1:
- Hopper is a host-based RDMA load balancer that switches flow paths every RTT but only when the current path is congested and a proactive probe has identified a substantially less congested alternative; evaluated under both datacenter and ML training workloads, it outperforms state-of-the-art host-based techniques and approaches switch-based techniques.
- The key insight is that congestion-aware informed probing (power-of-two-choices) combined with proportional-delay switching (linear RTT extrapolation) eliminates the two main weaknesses of random path selection: suboptimal target path quality and excessive OOO packet bursts.
Paragraph 2:
- The black-box nature of commercial RNICs (CX-5) was the primary implementation challenge, necessitating a user-space control-plane implementation via rdma-core rather than in-NIC deployment.
- The most promising future direction is integrating Hopper directly into a collective communication library such as NCCL, since inter-GPU communication in distributed ML training is exclusively collective-operation-based, making the library the natural integration point.
Appendix A: Hopper's Workflow
Paragraph 1:
- The workflow illustrates four states: (a) congestion detection — active path P3 RTT reaches th_probe (12 µs); (b) path probing — two alternatives P1 and P4 are probed via QPs on distinct source ports; (c) delay computation — RTTs compared (P3: 14 µs, P4: 10 µs, P1: 8 µs) and switch delayed by RTT difference (14−8 = 6 µs) to minimize OOO; (d) path switch — flow moves to P1 (8 µs).
- This four-phase example concretely shows how Hopper's three modules interact sequentially within a single control epoch, and why the delay in phase (c) is critical for OOO avoidance at the receiver RNIC.
Appendix B: AliCloud Storage Workload
Paragraph 1:
- At 50% load, Hopper improves average FCT by up to 6% for smaller flows and up to 19% for large flows compared to CONGA and FlowBender, while performing up to 6% worse than ConWeave for small flows but up to 16% better in p99 FCT than CONGA.
- At 80% load, Hopper continues to outperform FlowBender by approximately 10% in average FCT, confirming results generalize across workloads beyond Hadoop.
Appendix C: Meta Hadoop Workload (Full Distribution)
Paragraph 1:
- The full flow-size distribution for Meta Hadoop is presented, including flow sizes between 2 KB and 49 KB omitted from the main text for presentational clarity.
- Inclusion of the intermediate flow-size range confirms that Hopper's advantages over FlowBender hold across the full distribution, with no anomalies hidden by the two-region presentation in §4.1.2.