Communication-Efficient Data Parallel Distributed Deep Learning: A Comprehensive Survey — Detailed Summary
Zhenheng Tang (HKBU), Shaohuai Shi (HIT-Shenzhen), Wei Wang, Bo Li (HKUST), Xiaowen Chu (HKUST-GZ) | arXiv:2003.06307v2 | Sep 2023 | 35 pages | ACM CSUR-style submission
This summary is structured around the survey's central artifact: the four-dimension taxonomy of communication-efficient distributed DL. Within each branch I label each concept as either:
- [KNOB] a parameter that an external optimizer (such as DynamICCL's RL agent) could plausibly set at runtime;
- [DESIGN] a system-design choice baked into the framework / library that DynamICCL inherits and cannot change;
- [INFO] a theoretical or empirical observation that informs the agent but is not actuated.
The final section maps the entire taxonomy to DynamICCL.
0. Paper-Level Summary
0.1 Abstract (condensed paraphrase)
Distributed deep learning is bottlenecked by communication. The survey proposes a four-dimension taxonomy — communication synchronization, system architectures, compression techniques, and parallelism of communication and computing — investigates state-of-the-art works in each, compares convergence rates analytically and empirically, and offers extrapolated future directions.
0.2 What is new vs. prior surveys (Section 1.1)
- Peteiro-Barral et al. (2013): general distributed ML for big data.
- Xing et al. (2016): synchronization, scheduling, balancing, topologies.
- Ben-Nun & Hoefler (2019): DNN operators and parallelism (concurrency-focused).
- Guo (2018): quantized neural networks (quantization-only).
- Zhang et al. (2020): brief overview of large-scale DDL systems.
This survey claims novelty in:
- Demystifying communication compression in depth (largely missing in priors).
- Comparing convergence bounds across all four taxonomy dimensions in a single table (Table 9).
- Running uniform empirical benchmarks (FedML + MPI framework) over many algorithms with 4–32 workers.
0.3 Benchmark framework and configuration (Section 1.3)
- Built on FedML and MPI for Python.
- Tasks: CIFAR-10 / ResNet-20 (image classification) and Shakespeare / stacked character-level LSTM (NLP).
- Worker count: 4, 8, 16, or 32 (RTX 2080 Ti GPUs); software: PyTorch v1.7.
1. Background: The Optimization Problem (Section 2)
The base optimization is:
min_x E_xi~D [ F(x; xi) ]
solved by mini-batch SGD:
G_t(x_t) = grad F_t(x_t; xi_t)
x_{t+1} = x_t - gamma * G_t(x_t)
BSP-SGD distributes the computation:
G_{i,t}(x_t) = grad F_{i,t}(x_t; xi_{i,t})
x_{t+1} = x_t - (gamma / n) * sum_{i=1..n} G_{i,t}(x_t)
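As a minimal sketch of the BSP-SGD update above, the following pure-Python toy averages the gradients of n simulated workers before every step. The quadratic loss F(x; xi) = 0.5 * (x - xi)^2, the Gaussian data, and the names `grad` and `bsp_sgd` are illustrative assumptions, not part of the survey's benchmark.

```python
import random

def grad(x, xi):
    """Stochastic gradient of the assumed loss F(x; xi) = 0.5 * (x - xi)^2."""
    return x - xi

def bsp_sgd(n_workers=4, steps=200, gamma=0.1, seed=0):
    rng = random.Random(seed)
    x = 5.0  # shared model, identical on every worker under BSP
    for _ in range(steps):
        # Each worker i draws its own sample xi_{i,t} and computes G_{i,t}(x_t).
        grads = [grad(x, rng.gauss(1.0, 0.5)) for _ in range(n_workers)]
        # Barrier + aggregation: x_{t+1} = x_t - (gamma / n) * sum_i G_{i,t}(x_t)
        x -= (gamma / n_workers) * sum(grads)
    return x

x_final = bsp_sgd()
print(f"x after BSP-SGD: {x_final:.3f}")  # expected to settle near the data mean 1.0
```

Because all workers evaluate the gradient at the same x_t, the loop keeps a single copy of the model; the other synchronization schemes relax exactly this property.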
Every algorithm in the survey is a deviation from this BSP-SGD baseline along one or more taxonomy dimensions.
2. Dimension 1 — Communication Synchronization (Section 3)
2.1 Taxonomy and timeline
| -- Computation -- | -- Communication -- | Update |
BSP : All workers wait at a barrier; identical models everywhere.
SSP : Bounded staleness s; faster workers may run ahead by up to s steps.
ASP : No barrier; PS updates whenever any worker arrives.
Local : Each worker runs tau local steps, then averages.
(Reproducing Fig. 3 of the paper.)
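The bounded-staleness rule can be illustrated with a small tick-based simulation. This is a sketch under simplifying assumptions (constant per-step compute time, no communication cost); `simulate_ssp` and its arguments are hypothetical names. Setting s = 0 behaves like BSP, and a very large s behaves like ASP.

```python
def simulate_ssp(periods, s, ticks):
    """periods[i] = ticks worker i needs per step (assumed constant).
    Returns the number of completed steps per worker after `ticks` ticks."""
    n = len(periods)
    done = [0] * n      # completed steps per worker
    progress = [0] * n  # ticks already spent on the current step
    for _ in range(ticks):
        slowest = min(done)
        for i in range(n):
            # SSP gate: the step worker i is on may be at most s steps
            # ahead of the step the slowest worker is still on.
            if done[i] - slowest > s:
                continue  # blocked, waiting for stragglers
            progress[i] += 1
            if progress[i] == periods[i]:
                done[i] += 1
                progress[i] = 0
    return done

bsp_like = simulate_ssp([1, 3], s=0, ticks=30)
ssp = simulate_ssp([1, 3], s=2, ticks=30)
asp_like = simulate_ssp([1, 3], s=10**9, ticks=30)
print(bsp_like, ssp, asp_like)
```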
2.2 Knobs vs. design choices in this dimension
| Concept | Type | Rationale |
|---|---|---|
| Choice of BSP / SSP / ASP / Local | [DESIGN] | Set in the training script (e.g., DDP vs. async PS) — fixed before NCCL is invoked. |
| Staleness bound s in SSP | [KNOB] | Could be tuned at runtime (Chen et al. R2SP, backup workers). |
| Local steps tau in Local-SGD | [KNOB] | Survey evaluates tau in {2,4,8,16}: final accuracy is roughly invariant, while communication frequency scales as 1/tau. Tunable. |
| Backup-worker count (Chen et al. [31]) | [KNOB] | Number of stragglers to drop. |
| FedAvg client-sampling fraction | [KNOB] | Federated-only setting; not relevant to HPC training. |
2.3 Mathematical formulations
SSP update (Eq. 6):
x_{i,t+1} = x_0 - gamma * [ pre-window updates ] - gamma * [ in-window updates ] - gamma * [ read-my-writes ]
ASP update (Eq. 7):
x_{t+1} = x_t - gamma * sum_i G_{i, t-tau_{k,i}}(x_{i, t-tau_{k,i}})
Local-SGD (Eq. 8):
x_{i,t+1} = x_{i,t} - gamma * G_{i,t}(x_{i,t}) if t+1 not in I_T
x_{i,t} - gamma * (1/n) * sum_i G_{i,t}(x_{i,t}) if t+1 in I_T
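A minimal Local-SGD sketch, assuming the synchronization step averages the local models (the "runs tau local steps, then averages" reading from Section 2.1) and reusing a toy quadratic loss. The function name, loss, and data distribution are assumptions for illustration only.

```python
import random

def local_sgd(n_workers=4, steps=240, tau=8, gamma=0.1, seed=0):
    rng = random.Random(seed)
    xs = [5.0] * n_workers  # one local model per worker
    comm_rounds = 0
    for t in range(steps):
        # Local step on every worker: x_i <- x_i - gamma * G_i(x_i),
        # with G_i(x) = x - xi for the assumed loss 0.5 * (x - xi)^2.
        xs = [x - gamma * (x - rng.gauss(1.0, 0.5)) for x in xs]
        if (t + 1) % tau == 0:         # t+1 belongs to the sync set I_T
            avg = sum(xs) / n_workers  # one communication round
            xs = [avg] * n_workers
            comm_rounds += 1
    return xs, comm_rounds

models, rounds = local_sgd(tau=8)
print(rounds, models[0])
```

With a fixed number of steps, the number of communication rounds is steps / tau, which is the quantity the tau knob trades against model consistency.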
2.4 Empirical observations (Tables 2, 3, 4, 5)
- BSP-SGD test accuracy on ResNet-20 (gamma=0.1): 4 workers 91.25 -> 32 workers 89.21 (mild large-batch generalization gap).
- ASP-SGD diverges at 32 workers under most learning rates (drops to 0.00%).
- Local-SGD with tau=2..16 is essentially indistinguishable in final accuracy.
- FedAvg lags noticeably (62.41 -> 40.23 for gamma=0.001 / 4 -> 32 workers).
Table 5 (relative-level summary):
| Architecture | Sync | Model Consistency | Comm. Frequency | Comm. Congestion | Convergence |
|---|---|---|---|---|---|
| PS | BSP | high | high | high | stable |
| PS | SSP | normal | high | normal | normal |
| PS | ASP | low | high | low | unstable |
| PS | Local | normal | low | high | unstable |
| All-Reduce | BSP | high | high | low | easy |
| All-Reduce | Local | normal | low | low | stable |
| Gossip | BSP | low | high | low | stable |
| Gossip | ASP | low | high | low | unstable |
| Gossip | Local | low | low | low | stable |
2.5 Open problems flagged
- Choosing s and tau adaptively under heterogeneous device speeds (the congestion/straggler problem).
- ASP's accuracy collapse at scale is unresolved.
3. Dimension 2 — System Architecture (Section 4)
3.1 Three architectures
(a) Parameter Server (b) All-Reduce (c) Gossip
+---------+ +---+ -- +---+ +---+ ~ +---+
| Servers | | W | | W | | W | | W |
+----+----+ +-+-+ +-+-+ +-+-+ +-+-+
| | | | |
+-+--+--+-+ +--------+ (peer-to-peer
|W| |W| |W| (collective) graph; symmetric
+-+ +-+ +-+ doubly stochastic
matrix W)
3.2 Parameter Server (Section 4.1)
- DistBelief, GeePS, ps-lite — extensively studied; main weakness is server congestion.
- Server-side acceleration: programmable switches performing in-network aggregation (Sapio et al.); user-defined filters (Li et al.).
| Concept | Type |
|---|---|
| PS vs. All-Reduce vs. Gossip | [DESIGN] |
| Number of parameter servers | [KNOB] |
| Worker-relevance threshold (Wang) | [KNOB] |
| In-network aggregation (switches) | [DESIGN] (hardware feature) |
3.3 All-Reduce (Section 4.2)
The survey's Table 6 — communication cost of representative All-Reduce algorithms for an N-dim vector across n nodes (alpha = latency, beta = inverse bandwidth):
| Algorithm | Latency | Bandwidth |
|---|---|---|
| Binary tree | 2 alpha log n | 2 beta N log n |
| Recursive doubling | alpha log n | beta N log n |
| Ring | 2(n-1) alpha | 2(n-1)/n * beta * N |
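The formulas in this table can be evaluated directly to see where the preferred algorithm changes with message size. The sketch below assumes base-2 logarithms and uses hypothetical link parameters (`ALPHA`, `BETA`); the function names are illustrative.

```python
import math

def allreduce_cost(algo, n, N, alpha, beta):
    """Estimated All-Reduce time for an N-element vector on n nodes
    under the alpha-beta model, using the formulas from the table above."""
    log_n = math.log2(n)  # assumption: logarithms are base 2
    if algo == "binary_tree":
        return 2 * alpha * log_n + 2 * beta * N * log_n
    if algo == "recursive_doubling":
        return alpha * log_n + beta * N * log_n
    if algo == "ring":
        return 2 * (n - 1) * alpha + 2 * (n - 1) / n * beta * N
    raise ValueError(f"unknown algorithm: {algo}")

def best_algo(n, N, alpha, beta):
    algos = ("binary_tree", "recursive_doubling", "ring")
    return min(algos, key=lambda a: allreduce_cost(a, n, N, alpha, beta))

# Hypothetical link: 5 us startup latency, 1 ns per element transferred.
ALPHA, BETA = 5e-6, 1e-9
for N in (1_000, 100_000, 100_000_000):
    print(N, best_algo(32, N, ALPHA, BETA))
```

Under these assumed parameters the latency term dominates for small messages and the bandwidth term for large ones, which is the crossover a runtime tuner has to locate empirically.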
Algorithms covered:
- Ring [Patarasuk & Yuan 2009]: bandwidth-optimal; latency O(n). Used in Gloo and earlier NCCL.
- Double Binary Trees [Sanders 2009]: full bandwidth, O(log n) latency. Adopted in NCCL >= 2.4.
- Hierarchical All-Reduce [Goyal 2017, Jia 2018]: reduces latency by a factor equal to the number of hierarchies.
- 2D-Torus All-Reduce [Jouppi et al., Mikami et al.]: latency reduction via torus topology.
- BML [Wang 2018]: tailored to BCube.
- BLink [Wang 2020], PLink [Luo 2020]: topology-aware adaptive All-Reduce for cloud / heterogeneous interconnects.
| Concept | Type |
|---|---|
| Choice of Ring vs. Tree vs. Recursive | [KNOB] |
| Choice of hierarchical vs. flat | [KNOB] |
| Number of hierarchies / hierarchy mapping | [KNOB] |
| 2D-Torus vs. BCube vs. Fat-Tree topology | [DESIGN] (physical) |
| Tree branching factor (binary vs. m-ary) | [KNOB] |
This is the cell where DynamICCL operates. The survey explicitly notes the trade-off:
"for some small messages or small-scale clusters, recursive doubling or ring-based algorithms would be better" which mirrors the algorithm-vs-message-size trade-off DynamICCL learns.
3.4 Gossip (Section 4.3)
- Each worker has a local model; communicates with neighbors per a graph W.
- Trade-off: lower congestion at the cost of longer convergence ("consensus").
- Methods: DPSGD [Lian et al.], SGP+PUSHSUM [Assran et al.], CHOCO-SGD [Koloskova], COLA [He et al.].
- Asymmetric gossip (PUSHSUM) avoids deadlocks/symmetric-comm requirement.
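To make the consensus trade-off concrete, here is a small sketch of symmetric gossip averaging on a ring, where each worker mixes with its two neighbours using weight 1/3. The ring topology, the weights, and the scalar "models" are assumptions chosen for illustration; they correspond to one particular symmetric doubly stochastic W.

```python
def gossip_round(x):
    """One gossip step on a ring: x <- W x, where W puts weight 1/3 on
    each worker itself and on its left and right neighbour."""
    n = len(x)
    return [(x[(i - 1) % n] + x[i] + x[(i + 1) % n]) / 3 for i in range(n)]

models = [0.0, 4.0, 8.0, 12.0, 16.0, 20.0]  # hypothetical scalar models
target = sum(models) / len(models)          # the consensus value

for _ in range(60):
    models = gossip_round(models)

spread = max(models) - min(models)
print(target, spread)
```

The average is preserved at every round because W is doubly stochastic, while the disagreement between workers shrinks only geometrically, which is the "longer convergence" cost noted above.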
| Concept | Type |
|---|---|
| Gossip vs. centralized arch | [DESIGN] |
| Mixing matrix W | [DESIGN] |
| Number of peers per round | [KNOB] |
| Random vs. deterministic peer pick | [KNOB] |
3.5 Empirical comparison (Table 7)
BSP-SGD (PS) vs DPSGD (Gossip) on ResNet-20:
- 4 workers: 91.25 vs 91.08 (essentially equal).
- 32 workers: 89.21 vs 88.99 (gossip slightly behind).
4. Dimension 3 — Compression (Sections 5 and 6)
4.1 Quantization (Section 5)
Original 32-bit gradient -> Quant() -> low-bit gradient -> Unquant() -> approximate gradient
Update rule (Eqs. 9–11):
G_quant_{i,t} = Quant( G_{i,t} + delta_{i,t-1} )
delta_{i,t} = G_{i,t} - Unquant( G_quant_{i,t} ) (error feedback)
x_{t+1} = x_t - gamma * (1/n) * sum_i Unquant( G_quant_{i,t} )
Quantization methods covered:
| Method | Bits | Idea |
|---|---|---|
| 1-bit SGD [Seide et al. 2014] | 1 | Sign + threshold; speech apps; 10x speedup |
| QSGD [Alistarh 2016] | family | Stochastic quantization unbiased estimator with bit-budget knob |
| TernGrad [Wen 2017] | 2 bits | Ternary {-1,0,+1} with scalar-sharing & layer-wise scalars |
| SignSGD / signSGD-MV [Bernstein] | 1 | Sign + majority vote; convergence proofs |
| DIANA [Mishchenko 2019] | varies | Block-wise quantization |
| SRQ + VLC [Suresh 2017] | varies | Random rotation + Huffman coding |
| Adaptive quant [Faghri / Jhunjhunwala] | adaptive | Adjust bits during training |
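As an illustration of the "unbiased stochastic quantization with a bit-budget knob" idea, the sketch below implements a QSGD-style quantizer with s levels in pure Python. Function names and the test vector are assumptions; a real implementation would also encode the integer levels compactly.

```python
import math
import random

def qsgd_quantize(v, s, rng):
    """Stochastically round each |v_i| / ||v||_2 onto a grid of s levels.
    Returns (norm, signs, integer levels), i.e. the low-bit payload."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return 0.0, [0] * len(v), [0] * len(v)
    signs, levels = [], []
    for x in v:
        scaled = abs(x) / norm * s  # lies in [0, s]
        low = math.floor(scaled)
        # Round up with probability equal to the fractional part,
        # which makes the dequantized value unbiased.
        level = low + (1 if rng.random() < scaled - low else 0)
        signs.append(1 if x >= 0 else -1)
        levels.append(level)
    return norm, signs, levels

def qsgd_dequantize(norm, signs, levels, s):
    return [norm * sg * lv / s for sg, lv in zip(signs, levels)]

rng = random.Random(0)
v = [0.3, -1.2, 0.05, 2.0]
trials = 5000
mean = [0.0] * len(v)
for _ in range(trials):
    q = qsgd_dequantize(*qsgd_quantize(v, 4, rng), 4)
    mean = [m + x / trials for m, x in zip(mean, q)]
print(mean)  # empirical mean should be close to v
```

Here s plays the role of the bit-budget knob: more levels mean lower variance but more bits per coordinate.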
| Concept | Type |
|---|---|
| Quantization vs. no compression | [DESIGN] (chosen by user / framework) |
| Number of bits b | [KNOB] (per-tensor or global) |
| Layer-wise vs. global scaling | [KNOB] |
| Use of error feedback | [KNOB] |
| Quantizer family (uniform, dither, ternary) | [DESIGN] |
Theoretical maximum compression is 32x (single-precision FP).
4.2 Sparsification (Section 6)
Goal: send only k of N coordinates. Compression up to 1000x reported.
Sub-taxonomy (Section 6 introduction):
Sparsification
|
+-- 6.1 Random sparsification (Random-k, Random Mask, Subsampling)
+-- 6.2 Deterministic sparsification
| +-- 6.2.1 Fixed Threshold (Strom)
| +-- 6.2.2 Adaptive Threshold (Top-k, AdaComp, gTop-k, SBC, STC)
+-- 6.3 Coordinate Descent (BCD, IBCD)
+-- 6.4 Proximal methods (L0/L1 regularization)
Representative methods:
| Method | Idea |
|---|---|
| Random-k | Pick k random indices each iteration |
| Random Mask | Pre-defined random sparsity pattern, regenerated per iteration |
| Top-k [Aji&Heafield, Lin DGC] | Pick k largest |
| gTop-k [Shi] | Top-k applied a second time after global aggregation |
| AdaComp [Chen] | Self-adapting compression rate per layer |
| SBC [Sattler] | Sparse Binary Compression: sparsify + sign average + quantize |
| STC [Sattler] | Sparse Ternary Compression — same idea for federated learning |
| Truncated grad [Langford] | Threshold-based sparsity (online learning origin) |
| Concept | Type |
|---|---|
| Sparsification vs. no compression | [DESIGN] |
| k value or sparsity ratio | [KNOB] (per-tensor or global) |
| Fixed-threshold value | [KNOB] |
| Random-k vs. Top-k vs. Threshold | [DESIGN] |
| Use of error feedback (EF-Top-k) | [KNOB] (essential at high compression) |
| Layer-wise vs. global threshold | [KNOB] |
The survey notes (open problem 3): adaptive per-layer / per-peer compression ratios are an open research direction.
4.3 Empirical comparison (Table 8)
ResNet-20 / 32 workers / gamma=0.1:
| Compression scheme | Compression Ratio | Final Accuracy |
|---|---|---|
| BSP, no compression | 1 | 89.21% |
| BSP + quant (16-bit) | 2 | 89.34% |
| BSP + quant (2-bit) | 16 | 85.37% |
| BSP + Top-k | 10 | 86.75% |
| BSP + Top-k | 100 | 77.66% |
| BSP + Top-k | 1000 | 61.98% |
| BSP + EF-Top-k | 10 | 88.65% |
| BSP + EF-Top-k | 100 | 88.08% |
| BSP + EF-Top-k | 1000 | 87.76% |
| DPSGD (gossip, no comp) | 1 | 88.99% |
| DCD-PSGD (gossip + comp 4) | 4 | 85.78% |
| CHOCO-SGD (gossip + comp 100) | 100 | 89.00% |
Key empirical insight: error feedback rescues sparsification at extreme compression; gossip + compression (CHOCO-SGD) at 100x matches BSP at 1x.
5. Dimension 4 — Computation/Communication Parallelism (Section 7)
5.1 Pipelining
Backward layers: |bwd_3|bwd_2|bwd_1|
Communication: |c_3 |c_2 |c_1| <- WFBP overlaps c_l with bwd_{l-1}
- WFBP (Wait-Free Backward Propagation) [Awan, Zhang]: overlap layer-l gradient comm with layer-(l-1) gradient computation.
- MG-WFBP (merged-gradient WFBP) [Shi]: tensor fusion to amortize alpha startup; addresses small-tensor latency dominance at large scale.
- For sparsified gradients: three-stage pipeline (compute, sparsify, communicate).
- All-Reduce decomposed into two operations (reduce-scatter + all-gather) to enable interleaving with forward/backward compute [Wang et al.].
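The benefit of WFBP-style overlap can be estimated with a simple timeline model. The sketch below assumes a single communication link that handles one layer at a time and takes per-layer backward and communication times as given; the function name and the example durations are hypothetical.

```python
def iteration_time(bwd, comm, overlap):
    """bwd[l], comm[l]: backward and communication time of each layer,
    listed in backward order (last layer first). Returns the time at
    which the final gradient has been communicated."""
    if not overlap:
        # Blocking communication: start only after the backward pass ends.
        return sum(bwd) + sum(comm)
    t_bwd = 0.0   # time at which the current layer's gradient is ready
    t_comm = 0.0  # time at which the link becomes free
    for b, c in zip(bwd, comm):
        t_bwd += b
        # A layer's transfer starts once its gradient exists and the link is idle.
        t_comm = max(t_comm, t_bwd) + c
    return t_comm

bwd = [3.0, 2.0, 1.0]   # bwd_3, bwd_2, bwd_1 (hypothetical durations)
comm = [2.0, 2.0, 2.0]  # c_3, c_2, c_1
blocking = iteration_time(bwd, comm, overlap=False)
wfbp = iteration_time(bwd, comm, overlap=True)
print(blocking, wfbp)
```

Only the communication of the first layers (the last ones to finish their backward pass) remains exposed, which is why small early-layer tensors and their startup latency matter at scale.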
5.2 Scheduling
- Priority scheduling + tensor partitioning [Peng et al.]: parallelize feed-forward computation with communication of earlier-layer gradients.
- Communication-contention-aware scheduling [Wang]: avoids cross-job interference on shared GPU clusters.
- DAG-aware ordering [Shi]: enforces a global tensor order across all workers to prevent deadlock and ensure correctness.
| Concept | Type |
|---|---|
| WFBP vs. blocking comm | [DESIGN] (framework feature) |
| Merge-gradient (tensor fusion) threshold | [KNOB] |
| Tensor partition size | [KNOB] |
| Priority order of layers | [KNOB] |
| Concurrent-collective scheduling | [DESIGN] |
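The tensor-fusion threshold knob can be sketched as a greedy bucketing rule: consecutive gradient tensors are merged until the bucket reaches a size threshold, so that each bucket pays the startup cost alpha only once. This is a simplified illustration with hypothetical names and numbers; it ignores the extra waiting time that MG-WFBP balances against the saved startups.

```python
def fuse(sizes, threshold):
    """Group consecutive tensors into buckets of at least `threshold`
    elements (the last bucket may be smaller)."""
    buckets, current = [], []
    for s in sizes:
        current.append(s)
        if sum(current) >= threshold:
            buckets.append(current)
            current = []
    if current:
        buckets.append(current)
    return buckets

def comm_time(buckets, alpha, beta):
    """Alpha-beta cost: one startup per bucket plus a per-element cost."""
    return sum(alpha + beta * sum(b) for b in buckets)

sizes = [100, 200, 50, 4000, 300, 300]  # hypothetical tensor sizes
unfused = fuse(sizes, threshold=0)      # every tensor sent on its own
fused = fuse(sizes, threshold=1000)
print(len(unfused), comm_time(unfused, alpha=10.0, beta=0.001))
print(len(fused), comm_time(fused, alpha=10.0, beta=0.001))
```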
5.3 Open issues
- For very large tensors, the "long communication" can stall subsequent small tensors — exactly the problem MG-WFBP and tensor partitioning try to solve; no single solution dominates.
- 2022 benchmark [Agarwal et al.] shows WFBP-style overlap reduces training time substantially; without overlap, compression alone is much less useful.
6. Convergence Analysis (Section 8)
The survey collects standard assumptions:
- L-Lipschitz gradient.
- Unbiased stochastic gradient.
- Bounded variance: E ||grad F - grad f||^2 <= sigma^2.
- Bounded second moment: E ||grad F||^2 <= M^2.
- (Optional) mu-strong convexity.
For gossip:
- W is symmetric doubly stochastic.
- Spectral gap rho < 1.
For compression:
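- k-contraction: E ||x - C(x)||^2 <= (1 - d/n) ||x||^2, where d is the number of retained coordinates and n here denotes the vector dimension (not the worker count).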
- Unbiased compression (sometimes): E[C(x)] = x.
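The contraction property can be checked numerically for Top-k. In the sketch below, `kept` plays the role of d and `dim` the role of n in the inequality above (this mapping is an interpretation of the notation); the random test vector is arbitrary.

```python
import random

def top_k(x, k):
    """Keep the k largest-magnitude entries of x and zero the rest."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]

def sq_norm(x):
    return sum(v * v for v in x)

rng = random.Random(0)
dim, kept = 1000, 10
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]

err = sq_norm([a - b for a, b in zip(x, top_k(x, kept))])
bound = (1 - kept / dim) * sq_norm(x)
print(err <= bound, err / sq_norm(x))
```

For Top-k the inequality holds deterministically, because the discarded entries are the dim - kept smallest squares and so carry at most that fraction of the total energy.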
Selected convergence rates (Table 9 of survey):
| Architecture | Sync | Compression | Convex | Non-convex |
|---|---|---|---|---|
| PS / AR | BSP | None | O(1/T) | O(1/sqrt T) |
| PS / AR | BSP | Quant | O(1/T) | O(1/sqrt T) |
| PS / AR | BSP | Spars. | O(1/T) | O(1/sqrt T) |
| PS | SSP | None | -- | O(1/sqrt T) |
| PS | ASP | None | O(1/T) | O(1/sqrt T) |
| PS / AR | LocalSGD | None | O(1/T) | O(1/sqrt T) |
| Gossip | BSP | None | -- | O(1/sqrt T) |
| Gossip | BSP | Quant | -- | O(1/sqrt T) |
| Gossip | ASP | None | -- | O(1/sqrt T) |
[INFO] Every combination listed recovers the same asymptotic rate O(1/sqrt T) for non-convex problems. The differences lie in the constant factors (variance, compression contraction d, staleness s, peer count, etc.), which govern the practically important wall-clock time.
7. Auxiliary Technologies (Section 9)
These are orthogonal correctness/convergence helpers that plug into any compression scheme. All are [KNOB] or [DESIGN] choices that augment the base compression.
7.1 Error Accumulation (9.1)
Step recipe:
C_{i,t} = Sparse(v_{i,t-1} + grad_{i,t})
v_{i,t} = (v_{i,t-1} + grad_{i,t}) - C_{i,t} [residual carried forward]
x_{t+1} = x_t - gamma * (1/n) * sum_i C_{i,t}
Used in 1-bit SGD, EF-SignSGD, ECQ-SGD, EF-Top-k, etc.
[KNOB]: turn on/off error feedback; controls whether high-compression schemes will converge.
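The step recipe above can be sketched for several workers with Top-k as the `Sparse` operator. The helper names and the tiny gradient trace are assumptions chosen so that the arithmetic is exact.

```python
def sparse_top_k(vec, k):
    """Top-k sparsifier: keep the k largest-magnitude entries."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [vec[i] if i in keep else 0.0 for i in range(len(vec))]

def ef_step(residuals, grads, k):
    """One round of error accumulation for all workers.
    Returns (new residuals, averaged sparse update)."""
    sent, new_res = [], []
    for v, g in zip(residuals, grads):
        acc = [a + b for a, b in zip(v, g)]              # v_{i,t-1} + grad_{i,t}
        c = sparse_top_k(acc, k)                         # C_{i,t}
        new_res.append([a - b for a, b in zip(acc, c)])  # v_{i,t}
        sent.append(c)
    n = len(grads)
    update = [sum(col) / n for col in zip(*sent)]        # (1/n) * sum_i C_{i,t}
    return new_res, update

# Single worker, three rounds, k = 1 (hypothetical gradients).
trace = [[1.0, 0.5], [0.25, 0.5], [0.0, 0.5]]
res = [[0.0, 0.0]]
total_sent = [0.0, 0.0]
for g in trace:
    res, upd = ef_step(res, [g], k=1)
    total_sent = [a + b for a, b in zip(total_sent, upd)]
print(total_sent, res[0])
```

Nothing is discarded: everything that was not transmitted stays in the residual, so transmitted mass plus residual always equals the sum of the raw gradients.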
7.2 Momentum Correction (9.2)
DGC-style momentum applied to the residual error vector:
u_{i,t} = m * u_{i,t-1} + grad_{i,t}
v_{i,t} = v_{i,t-1} + u_{i,t}
x_{t+1} = x_t - gamma * sum_i sparse(v_{i,t})
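A single-worker sketch of the momentum-correction recursion follows. It assumes Top-k as the `sparse` operator and assumes that coordinates which were sent are cleared from the accumulator v; the recursion above does not spell out the latter. All names are hypothetical.

```python
def momentum_correction_step(u, v, grad, m, k):
    """One iteration for one worker; returns (u, v, sent)."""
    u = [m * ui + gi for ui, gi in zip(u, grad)]  # u_t = m * u_{t-1} + grad_t
    v = [vi + ui for vi, ui in zip(v, u)]         # v_t = v_{t-1} + u_t
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    sent = [v[i] if i in keep else 0.0 for i in range(len(v))]  # sparse(v_t)
    # Assumption: whatever was sent is removed from the local accumulator.
    v = [0.0 if i in keep else v[i] for i in range(len(v))]
    return u, v, sent

# With k equal to the vector length nothing is dropped, and the sent
# vector reduces to the ordinary momentum velocity.
u, v = [0.0, 0.0], [0.0, 0.0]
history = []
for _ in range(3):
    u, v, sent = momentum_correction_step(u, v, [1.0, 1.0], m=0.5, k=2)
    history.append(sent)
print(history)
```

The dense case is a useful sanity check: the scheme should degenerate to plain momentum SGD when no coordinate is held back.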
7.3 Low-rank Decomposition (9.3)
- ATOMO [Wang]: gradients sparsified in an atomic decomposition basis (entry-wise, SVD, Fourier). 1-bit-QSGD and TernGrad are special cases.
- Spectral-ATOMO: SVD-based ATOMO; reported as up to 2x and 3x faster than QSGD and TernGrad, respectively.
- Count Sketch [Ivkin]: O((1/eps) log n) sketch approximates every coordinate and l2 norm; server recovers d largest after summing.
7.4 Local Gradient Clipping (9.4)
- Standard gradient clipping is applied after aggregation; for sparsified BSP, clipping must instead be applied locally on each worker, with the threshold scaled by 1/sqrt(N) for N workers (N^{-1/2} in DGC's notation), to recover variance equivalence.
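A minimal sketch of local clipping, assuming the DGC convention in which each worker uses the global threshold divided by sqrt(N), with N the number of workers. Function names and the example values are illustrative.

```python
import math

def clip_by_norm(g, threshold):
    """Rescale g so that its L2 norm does not exceed `threshold`."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= threshold:
        return list(g)
    return [x * threshold / norm for x in g]

def local_clip(g, global_threshold, n_workers):
    # Assumption: local threshold = global threshold * N^(-1/2).
    return clip_by_norm(g, global_threshold / math.sqrt(n_workers))

g = [3.0, 4.0]  # norm 5
print(local_clip(g, global_threshold=10.0, n_workers=4))   # local threshold 5
print(local_clip(g, global_threshold=10.0, n_workers=16))  # local threshold 2.5
```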
7.5 Warm-up Training (9.5)
- During warm-up: lower learning rate AND lower compression aggressiveness (sparsity grows exponentially toward final value).
- Avoids extreme delayed-gradient effects in the first few epochs.
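One way to realize "sparsity grows exponentially toward its final value" is a geometric schedule on the fraction of coordinates sent (the density). The start density, final density, and warm-up length below are hypothetical values, not settings taken from the survey.

```python
def density_at(epoch, warmup_epochs=4, start=0.25, final=0.001):
    """Fraction of gradient coordinates sent at a given epoch.
    Decays geometrically from `start` to `final` over the warm-up."""
    if epoch >= warmup_epochs:
        return final
    ratio = (final / start) ** (1.0 / warmup_epochs)
    return start * ratio ** epoch

schedule = [density_at(e) for e in range(6)]
print([f"{1 - d:.4%} sparse" for d in schedule])
```

The learning-rate warm-up would run alongside this schedule; it is omitted here to keep the sketch focused on the compression side.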
8. Conclusion and Future Directions (Section 10)
The survey explicitly enumerates four open problems:
(1) Foundation model training: do current comm-efficient methods scale to GPT-3 / GShard / RecSys-class models?
(2) Higher compression level: above 1000x without accuracy loss?
(3) Adaptive compression: per-layer, per-tensor, or per-peer compression ratios chosen automatically.
(4) Fault-tolerant algorithms: handling stragglers, network congestion, and worker failures in heterogeneous deployments.
(3) and (4) are directly aligned with DynamICCL's research thesis: an RL agent that adapts collective configuration to heterogeneous, congested, runtime-variable conditions.
9. Knob-vs-Design Master Table (across all four dimensions)
This is the practical reference for DynamICCL design.
+---------------------+----------+------+--------------------------------+
| Decision | Type | Who | Notes |
+---------------------+----------+------+--------------------------------+
| BSP/SSP/ASP/Local | [DESIGN] | User | Set in training script (DDP=BSP)|
| s (staleness bound) | [KNOB] | RL? | Only in SSP regimes |
| tau (local steps) | [KNOB] | RL? | Only in Local-SGD regimes |
| Architecture | [DESIGN] | User | PS/AR/Gossip — DDP=AR |
| AR algorithm | [KNOB] | RL | Ring / Tree / Recursive |
| AR protocol | [KNOB] | RL | LL / LL128 / Simple |
| nChannels | [KNOB] | RL | NCCL channels |
| numThreads | [KNOB] | RL | NCCL threads per channel |
| chunkSize | [KNOB] | RL | NCCL chunk granularity |
| Hierarchical AR | [KNOB] | RL? | Whether to use hierarchy |
| numHierarchies | [KNOB] | RL? | Levels of hierarchy |
| Topology | [DESIGN] | HW | Torus / BCube / Fat-Tree |
| Quantization on/off | [DESIGN] | User | Library / training script |
| Quantization bits | [KNOB] | User | If on |
| Sparsification on/off| [DESIGN]| User | Library / training script |
| Top-k k value | [KNOB] | User | If on |
| Error feedback | [KNOB] | User | If compression on |
| WFBP overlap | [DESIGN] | Frmwk| Built into Horovod / DDP |
| Tensor fusion thresh| [KNOB]   | User | HOROVOD_FUSION_THRESHOLD       |
| Priority schedule | [KNOB] | User | BytePS / P3 / TicTac |
+---------------------+----------+------+--------------------------------+
10. Mapping the Taxonomy to DynamICCL
DynamICCL's RL agent (Agent-2) outputs <algo, proto, nChannels, numThreads> for each NCCL collective call. In taxonomy terms:
DynamICCL operates ENTIRELY INSIDE this cell of the survey:
Architecture = All-Reduce (FIXED by user)
Synchronization = BSP (FIXED by user)
Compression = None (FIXED by user)
Pipelining = WFBP (FIXED by framework)
AR algorithm choice <-- DynamICCL ACTION
AR protocol choice <-- DynamICCL ACTION
nChannels / numThreads <-- DynamICCL ACTION
chunkSize <-- adjacent action / next experiment
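The action tuple can be sketched as a small discrete configuration space. The algorithm and protocol names follow the master table in Section 9; the `N_CHANNELS` and `NUM_THREADS` grids and the class name `CollectiveConfig` are hypothetical choices for illustration, not DynamICCL's actual action set.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class CollectiveConfig:
    algo: str
    proto: str
    n_channels: int
    num_threads: int

ALGOS = ("Ring", "Tree")            # illustrative subset of AR algorithms
PROTOS = ("LL", "LL128", "Simple")
N_CHANNELS = (1, 2, 4, 8)           # hypothetical grid
NUM_THREADS = (128, 256, 512)       # hypothetical grid

# Enumerate every <algo, proto, nChannels, numThreads> combination.
ACTIONS = [
    CollectiveConfig(a, p, c, t)
    for a, p, c, t in product(ALGOS, PROTOS, N_CHANNELS, NUM_THREADS)
]
print(len(ACTIONS), ACTIONS[0])
```

Even these coarse grids yield dozens of candidate configurations per collective call, which is the scale at which a learned policy becomes more attractive than exhaustive search.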
What DynamICCL gains from this survey
Analytical reward shaping: Table 6 gives closed-form latency and bandwidth costs for Ring / Tree / Recursive. The agent's optimal policy should converge toward these analytically-predicted regions, with deviations capturing real runtime effects (congestion, contention) that the analytical model omits.
Action-space justification: The survey explicitly notes that Ring is bandwidth-optimal but has latency linear in n, while Tree has logarithmic latency. This justifies including BOTH in DynamICCL's action set rather than defaulting to one. The crossover between the two is exactly the point at which NCCL's heuristic switches between algorithms and where the heuristic frequently mis-decides.
Confirmed orthogonality: Synchronization (BSP), architecture (AR), compression (off), and pipelining (WFBP) are the user / framework's choice. DynamICCL does NOT need to model them — they are exogenous constants, simplifying the RL state space.
State-feature ideas: Table 5 highlights "communication congestion" as a system-level signal that varies by architecture and sync. DynamICCL's state can include a real-time congestion estimate (per the LSTM detector in Saraswati's notes) without modeling the underlying architecture choice.
What this survey does NOT address (DynamICCL's research gap)
Runtime configuration tuning of the All-Reduce primitive itself. The survey treats AR as a single algorithmic choice (Ring or Tree) — but modern NCCL exposes (algo, proto, nChannels, numThreads, chunkSize) as a configuration manifold with dozens of points per collective. The survey's cost model (Table 6) is too coarse to predict performance within this manifold; this is exactly the gap DynamICCL fills.
Adaptive selection per collective call. The survey only considers static algorithm choice across an entire training run. DynamICCL adapts per-collective, per-iteration, per-message-size — a finer granularity consistent with the survey's open problem (3) ("different compression ratios for different layers/tensors").
Online congestion-aware reconfiguration. The survey acknowledges fault-tolerance and congestion as open problem (4) but offers no mechanism. DynamICCL's RL+LSTM stack is one such mechanism.
Future-extension targets within the taxonomy
If DynamICCL's action space expands beyond pure NCCL knobs:
| Future axis | Taxonomy cell unlocked |
|---|---|
| Tune tau for Local-SGD / FedAvg | Sync dim: knob 2.2 |
| Tune compression k per tensor | Compression dim: open problem (3) |
| Tune tensor-fusion threshold | Pipelining dim: knob 5.2 |
| Tune number of AR hierarchies | Architecture dim: knob 3.3 |
| Switch BSP vs Local-SGD at runtime | Sync dim (very ambitious) |
The survey thus serves both as a map of where DynamICCL currently lives (the AR + BSP cell, with NCCL-specific sub-knobs the survey does not detail) and as a map of where DynamICCL could expand (the other taxonomy cells, each with their own knob-vs-design decomposition that the same RL framework could in principle absorb).