Toward a Standardized Representation for Deep Learning Collective Algorithms — Detailed Summary
Jinsun Yoo, William Won, Meghan Cowan, Nan Jiang, Benjamin Klenk, Srinivas Sridharan, Tushar Krishna | Georgia Institute of Technology + NVIDIA | IEEE Micro, Vol. 45, Issue 2 (March/April 2025), Theme Article: Hot Interconnects 31 | DOI: 10.1109/MM.2025.3547363
Per-section summary organized by paper headings. Each section includes paragraph-level bullet points and exact quantitative results where the paper provides them.
Abstract
- The explosion of ML model size has forced training onto distributed clusters at very large scale; collective communication has emerged as a primary bottleneck.
- Many tools attempt to optimize collective-algorithm production and collective execution, but each tool uses its own representation — MSCCLang has MSCCL-IR (XML), TACCL has its solver output, TACOS has the Time-Expanded-Network (TEN) format — and downstream consumers (MSCCL-Runtime, ASTRA-sim) each implement their own internal collective algorithms.
- The fragmentation balkanizes the optimization environment: every new algorithm must be re-encoded for every consumer, and every consumer must re-implement every algorithm it wants to support.
- The paper proposes a standardized workflow built on a common representation: an extension of the Chakra Execution Trace (ET) graph-based format already used to represent distributed-ML workloads.
- The key conceptual move: encode collective algorithms in the same format used for workload operators, so communication messages and compute operators live at the same level of abstraction — enabling fine-grained co-optimization and decoupling producer and consumer tools.
- The proof of concept: simulate algorithms generated by MSCCLang and TACOS through ASTRA-sim across multiple network configurations.
1. Introduction
Background and motivation:
- Deep-learning models have grown into hundreds of billions of parameters, forcing distributed execution across many GPUs/NPUs.
- Modern training runs at the cluster scale routinely depend on collective operations such as All-Reduce and All-Gather to synchronize gradients, weights, or activations across workers.
- The collective phase is increasingly the dominant scaling bottleneck; this has spawned a rich ecosystem of synthesizers and runtimes aimed at squeezing performance out of the network.
The fragmentation problem:
- Producers of collective algorithms each emit their own representation: MSCCLang -> MSCCL-IR (XML), TACCL -> MILP solution structure, TACOS -> TEN. NCCL itself uses templated CUDA kernels.
- Consumers (MSCCL-Runtime, ASTRA-sim, LIBRA) each carry their own internal collective implementations, often hand-coded and not interchangeable.
- Net effect: O(producers x consumers) engineering work, with no easy way to plug a new synthesizer into an existing simulator or to test a new algorithm in a real runtime without re-coding it for that runtime's ABI.
The opportunity:
- A standardized intermediate representation would decouple producers and consumers and enable rapid algorithm exploration.
- It would also enable fine-grained co-optimization between collective algorithms and the surrounding compute graph: today, NPUs typically wait for an entire collective to complete before resuming compute, even when only one chunk's worth of data is needed.
Contributions:
- Define a small set of node types extending Chakra ET to express arbitrary collective algorithms as per-NPU DAGs.
- Build the converters: MSCCLang/MSCCL-IR -> Chakra ET, and TACOS -> Chakra ET.
- Extend ASTRA-sim to consume Chakra-ET-encoded collectives in place of its native implementations.
- Demonstrate end-to-end simulation across two topologies (2-D mesh, 3-D hypercube) with three algorithms (MSCCLang-Ring, MSCCLang-Direct, TACOS), showing the bandwidth spread that the pipeline preserves.
2. Background
2.1 Chakra Execution Trace (ET)
- Chakra ET is a graph-based representation of distributed-ML workloads, supported by MLCommons as a community standard.
- A Chakra ET captures the per-rank execution graph of a distributed workload, including compute operators, dependencies, and (today) coarse-grained collective calls.
- The format already supports the abstractions needed to represent per-rank DAGs, making it a natural target for extension.
2.2 Upstream Producers of Collective Algorithms
- MSCCLang is a Python-based domain-specific language for expressing collective algorithms; it compiles to MSCCL-IR, an XML-based intermediate representation consumed by MSCCL-Runtime.
- TACCL uses mixed-integer linear programming (MILP) over a topology graph to synthesize collective schedules.
- TACOS uses a Time-Expanded Network (TEN) abstraction to synthesize topology-aware collective algorithms.
- All three solve very similar problems but emit incompatible outputs.
2.3 Downstream Consumers of Collective Algorithms
- MSCCL-Runtime executes MSCCL-IR algorithms on real GPU clusters through an NCCL-based runtime.
- ASTRA-sim is a distributed-ML simulator with native implementations of standard collective algorithms (Ring, Tree, Recursive Halving-Doubling) and configurable network models (analytical and detailed).
- Each consumer has its own format requirements, and adding a new algorithm to a consumer requires writing implementation code in that consumer's stack.
3. Representing Collective Algorithms with Chakra ET
3.1 Motivation
- A standard format must capture the smallest meaningful units of work in a collective: point-to-point sends, point-to-point receives, and per-chunk compute operations (e.g., partial reduction).
- It must express dependencies between these units within and across NPUs.
- It must elevate communication to the same DAG level as compute, so joint scheduling is expressible.
3.2 Node Type Extension
The authors propose three new Chakra ET node types (Table 1 in the paper):
| Chakra ET Node Type | Description |
|---|---|
COMM_SEND |
Send a point-to-point message to a destination NPU |
COMM_RECV |
Wait for a point-to-point message that a source will send |
COMP |
Run a compute task (e.g., reduction) |
- A collective algorithm becomes a per-NPU DAG over these node types.
- Edges encode inter-operator dependencies: e.g., a
COMPnode that performs a reduction depends on its incomingCOMM_RECVand any prior partial-resultCOMPs; aCOMM_SENDof a reduced chunk depends on theCOMPthat produced it.
3.3 Worked Example: Ring-Based Reduce-Scatter
- The paper walks through a four-NPU ring-based Reduce-Scatter to illustrate the encoding (referenced as Figure 3).
- Each NPU's DAG alternates
COMM_RECV->COMP(reduce) ->COMM_SENDfor N-1 rounds, with explicit dependency edges. - Figure 5 in the paper presents complete Chakra-ET graphs for a 3-NPU All-Reduce and a 4-NPU All-Reduce, demonstrating that the format scales naturally to more participants.
3.4 Fine-Grained Compute-Communicate Overlap
- The paper's most original conceptual point: by reordering nodes within the per-NPU DAG, one can express chunk-level overlap between compute and communication.
- Today's coarse-grained execution waits for an entire collective to
finish before resuming compute. With Chakra-ET-level visibility, the
scheduler can fire a
COMPthat depends on chunk i as soon as chunk i'sCOMM_RECVlands, without waiting for chunks i+1, i+2, .... - The format thus exposes overlap opportunities that the previous hard-coded collective abstractions hid.
4. Proof-of-Concept Methodology
4.1 MSCCL-IR -> Chakra ET Converter
- A standalone converter parses MSCCL-IR XML files and constructs the corresponding per-NPU Chakra-ET vertex/edge graph.
- The converter handles MSCCLang's send/recv/compute primitives directly by mapping them to the new Chakra ET node types.
- Quantitative cost (Table 2 in the paper):
| Number of NPUs | 16 | 32 | 64 | 128 |
|---|---|---|---|---|
| Conversion duration (ms) | 259 | 398 | 1485 | 7662 |
- The 128-NPU All-Reduce conversion takes 7.66 s, which is dominated by IR parsing and graph construction; this remains within practical offline-synthesis budgets.
- The roughly super-linear growth (4x NPUs -> ~30x time, 16 -> 128) is consistent with All-Reduce's O(N^2) message count in some encodings.
4.2 TACOS -> Chakra ET Generator
- TACOS was modified to emit Chakra ET directly out of its TEN solution rather than its native TEN format.
- Quantitative cost: TACOS takes 1080 ms to synthesize an All-Reduce for 128 NPUs (substantially faster than the MSCCL pipeline at the same scale because TACOS skips an XML serialization round).
4.3 ASTRA-sim Extension
- ASTRA-sim was extended with a new input parameter that accepts a Chakra-ET file specifying the collective algorithm to simulate.
- When this parameter is supplied, ASTRA-sim bypasses its native implementations of Ring/Tree/Recursive-Halving-Doubling and instead walks the supplied DAG node-by-node, dispatching point-to-point messages on its analytical/detailed network model.
- This is the key consumer-side change that closes the loop: producer (any tool) -> Chakra ET -> consumer (ASTRA-sim).
5. Evaluation
5.1 Setup
| Component | Value |
|---|---|
| Simulator | ASTRA-sim 2.0 (analytical network model) |
| NPUs | 64 |
| Topologies | 2-D Mesh (8x8); 3-D Hypercube (4x4x4) |
| Link latency | 500 ns |
| Link bandwidth | 50 GB/s |
| Workloads | All-Gather, All-Reduce |
| Algorithms | MSCCLang-Ring, MSCCLang-Direct, TACOS topology-aware |
| Metric | Achieved bus bandwidth (GB/s) vs. chunk size (KB - GB) |
5.2 Bandwidth on 2-D Mesh (64 NPUs)
- TACOS-synthesized All-Gather reaches ~100 GB/s at large chunk sizes (1 MB - 1 GB).
- MSCCLang-Ring reaches ~20 GB/s in the same regime.
- MSCCLang-Direct reaches ~5 GB/s.
- The 5x-20x spread between TACOS and the MSCCLang baselines demonstrates that the algorithm choice (not the simulator overhead) drives the result, and that Chakra-ET-piped algorithms preserve the fidelity of the original synthesizer.
5.3 Bandwidth on 3-D Hypercube (64 NPUs)
- TACOS-synthesized All-Gather reaches ~150 GB/s.
- MSCCLang-Ring reaches ~40 GB/s (richer connectivity helps the ring more than on the 2-D mesh).
- MSCCLang-Direct is again limited to ~5 GB/s.
- The hypercube gives TACOS a 1.5x boost over the mesh result, and ring a 2x boost — confirming that algorithm sensitivity to topology is captured end-to-end through the Chakra-ET pipeline.
5.4 Cross-Cutting Observations
| Observation | Quantitative evidence |
|---|---|
| Algorithm choice dominates topology | TACOS 100-150 GB/s vs. Direct 5 GB/s on same fabric |
| Topology shifts ring performance | Ring: 20 GB/s (Mesh) -> 40 GB/s (Hypercube) |
| Direct is uniformly poor | Direct ~5 GB/s on both topologies |
| Pipeline overhead is acceptable | 7.66 s to convert 128-NPU All-Reduce; one-shot offline |
| TACOS synthesis is fast | 1080 ms for 128-NPU All-Reduce |
6. Conclusion and Future Work
- The paper closes by re-emphasizing that representation standardization is a foundational enabler for the rest of the collective-optimization research agenda.
- Future Work:
- Use Chakra ET to study collective optimizations in the context of actual ML workloads (compute + communication co-optimization), not standalone collective microbenchmarks.
- Onboard more producers (NCCL kernel templates, TACCL, MSCCL++) and more consumers (LIBRA, real GPU runtimes, MSCCL-Runtime) to the Chakra-ET ecosystem.
- Extend the format to capture hardware-offloaded collectives such as NVIDIA SHARP, where reductions execute inside the network fabric itself rather than at the NPUs.
7. Major Focal-Point Tools/Papers Cited
| Tool / Paper | Role | Producer or Consumer |
|---|---|---|
| Chakra ET | Standardized format (extended in this work) | Format |
| MSCCLang | Python DSL for collective algorithms; emits MSCCL-IR | Producer |
| MSCCL-IR | XML-based IR for collective algorithms | Format |
| TACCL | MILP-based topology-aware synthesizer | Producer |
| TACOS | TEN-based topology-aware synthesizer | Producer |
| MSCCL-Runtime | NCCL-based runtime executing MSCCL-IR | Consumer |
| ASTRA-sim | Distributed-ML simulator (extended in this work) | Consumer |
| NCCL | Standard collective library; baseline for runtimes | Both (template-based) |
| LIBRA | Distributed-ML simulator | Consumer (future work) |
| NVIDIA SHARP | Hardware-offloaded in-network reductions | Future-work target |
| MLCommons | Custodians of the Chakra ET standard | Standard body |
8. Related Work Positioning
- Distributed-training optimization: Megatron-LM, ZeRO++, PyTorch FSDP. The paper places itself orthogonally — it does not propose new parallelism strategies but a substrate to evaluate them more uniformly.
- Collective synthesizers: NCCL templated kernels, TACCL (MILP), TACOS (TEN), MSCCLang (DSL). The paper unifies their outputs into one format rather than competing as another synthesizer.
- Network simulators: ASTRA-sim, LIBRA. Standardizing the input format means simulators can simulate any algorithm, not only those natively implemented.
- Hardware offload: NVIDIA SHARP. Acknowledged as out-of-scope for the current node-type set.
9. Limitations of the Work
- Simplistic workloads: The evaluation uses single-collective microbenchmarks; the interaction with realistic full-model training graphs (Megatron, GPT, BERT) is not measured.
- Limited ecosystem coverage: Only MSCCLang and TACOS are shown as producers; only ASTRA-sim is shown as a consumer. NCCL, MSCCL-Runtime, TACCL, MSCCL++, and LIBRA are not yet integrated.
- Hardware-offloaded primitives missing: SHARP-style in-network reductions and multicast accelerators are not representable in the current node-type set.
- Analytical-network simulation only: Achievable bandwidth numbers in the evaluation are upper bounds; congestion, contention, and protocol-level effects are abstracted away.
- Representation-centric, not optimization-centric: The paper is a format paper, not a design-space exploration. It shows that the pipeline preserves quality, not that it discovers new algorithms.
10. Discussion of NCCL
- NCCL is mentioned as the runtime substrate of MSCCL-Runtime — the XML-driven dispatcher runs on top of NCCL kernels.
- NCCL's Ring and Recursive Halving-Doubling algorithms are cited as the canonical implementations the paper aims to make expressible inside Chakra ET as well.
- NCCL's chunkSize-driven pipelining is implicitly the intended target for the fine-grained overlap discussion: representing chunk-level COMM/COMP nodes is the substrate on which a tuner can reason about overlapping the N-th chunk's compute with the (N+1)-th chunk's send.
- The paper does not propose NCCL changes; it proposes a representation that NCCL-driven runtimes (like MSCCL-Runtime) can target without rebuilding their dispatch logic for each new algorithm.
11. Cross-Cutting Take-Aways
| Take-away | Derived from |
|---|---|
| Representation fragmentation is the actual bottleneck blocking ecosystem composition | Sec. 1, 2 |
Three node types (COMM_SEND, COMM_RECV,
COMP) suffice to encode all common collective
algorithms |
Sec. 3 |
| MSCCL-IR -> Chakra ET conversion at 128 NPUs costs 7.66 s — practical for offline use | Table 2 |
| TACOS synthesis at 128 NPUs costs 1.08 s — practical for online use | Sec. 4.2 |
| Topology + algorithm interaction is preserved end-to-end | Sec. 5 (Mesh vs. Hypercube) |
| Algorithm choice swings achievable bandwidth by 5x-30x | Sec. 5 evaluation |
| Joint compute-collective scheduling is the next frontier this format unblocks | Sec. 6 |
12. Relevance to DynamICCL
DynamICCL is an RL-based NCCL configuration optimizer that selects
per-collective parameters — algorithm (Ring / Tree /
CollNet / NVLS), protocol (LL / LL128 / Simple),
nChannels, numThreads, chunkSize
— to minimize collective wall-clock time on HPC GPU clusters. It
conditions on state features including message size (log-binned), model
intensity I = C/D, local batch size, topology fingerprint (NVLink-only /
NVLink+PCIe / PCIe+IB / Ethernet), and an LSTM-encoded recent-collective
timing window. Reward is -collective_wall_clock_us. It
operates inside NCCL via the tuner-plugin API. This paper informs
DynamICCL in several concrete ways.
Direct mappings:
| Paper finding | DynamICCL design implication |
|---|---|
| Chakra ET as universal collective representation | Adopt Chakra ET as the canonical format for logging RL trajectories: each (state, action, reward) tuple stores the resulting algorithm DAG, enabling cross-cluster reproducibility and policy distillation. |
| MSCCL-IR -> Chakra ET conversion is fast and lossless | DynamICCL can ingest MSCCLang/TACOS/TACCL warm-start policies via Chakra ET — imitation-learning prior over expert synthesizers. |
| Algorithm choice swings bandwidth 5x-30x | Confirms algorithm as the highest-leverage action axis
in DynamICCL's action space; spend exploration budget here before
fine-tuning numThreads/chunkSize. |
| Topology shifts the optimal algorithm (Mesh vs. Hypercube) | Topology fingerprint must be a first-class state feature; consider GNN-encoded topology to generalize across fabrics, not just a categorical embedding. |
| Fine-grained compute-comm overlap is currently absent | DynamICCL's reward should optionally include an overlap-quality term, not just collective wall-clock — the Chakra-ET DAG exposes this surface area. |
| Ecosystem fragmentation = duplicated engineering | DynamICCL should publish its tuner trajectories in Chakra-ET form to feed the open ecosystem and avoid recreating the fragmentation problem at the RL layer. |
| TACOS-style topology-aware synthesis dominates baseline ring/direct | Use TACOS-generated schedules as imitation-learning targets when bootstrapping DynamICCL on a new topology. |
Specific design priors for the RL agent:
Trajectory format: Log every (s, a, r) tuple with the executed collective stored as a Chakra-ET DAG, not as opaque NCCL parameters. This allows post-hoc inspection of why a configuration won or lost — necessary for credit assignment in long-horizon training.
Action-space initialization: Pre-train the actor by behavioral cloning on Chakra-ET traces of TACOS, MSCCLang-Ring, and MSCCLang-Direct schedules. The 5x-30x bandwidth gap between TACOS and Direct is a strong supervised signal.
Topology encoding: Encode topology as a GNN-derived embedding from the Chakra-ET-compatible topology graph, rather than as a categorical
{NVLink-only, NVLink+PCIe, PCIe+IB, Ethernet}feature. The Mesh-vs-Hypercube swing on the same algorithm class motivates richer topology features.Reward shaping for overlap:
- Primary:
r = -collective_wall_clock_us - Optional:
r += overlap_bonus * (compute_overlap_us / collective_wall_clock_us) - The Chakra-ET DAG makes the overlap term measurable directly from the trace, not estimable via heuristics.
- Primary:
State features (per the paper's predictive variables):
- Message size (log-binned)
- Per-chunk DAG depth (a Chakra-ET-derived feature)
- Topology embedding (GNN over the Chakra-ET topology subgraph)
- Recent-collective timing window (LSTM-encoded)
- Producer-of-record (Ring template / Tree template / TACOS schedule / MSCCLang custom) — categorical, since algorithms have different reward signatures.
Research positioning: This paper is upstream substrate for DynamICCL. It does not compete; it provides the format on which DynamICCL's actions can be expressed, logged, and reproduced across simulators (ASTRA-sim) and real clusters (NCCL-driven runtimes). DynamICCL should adopt Chakra ET both for its trajectory store and for its policy outputs, and contribute back a public corpus of tuner-plugin trajectories — directly addressing the authors' own open problem of "leveraging the standard representation to explore collective optimizations in actual ML workloads."
Open-problem alignment: The paper's call for joint compute+collective scheduling at chunk granularity is precisely the regime where DynamICCL's
chunkSizeaction gains its meaning. With Chakra ET as the substrate, DynamICCL can move from "minimize this collective" to "minimize this iteration" — closing the loop between local NCCL configuration and global iteration time.