Toward a Standardized Representation for Deep Learning Collective Algorithms

Jinsun Yoo, William Won, Meghan Cowan, Nan Jiang, Benjamin Klenk, Srinivas Sridharan, Tushar Krishna | Georgia Institute of Technology + NVIDIA | IEEE Micro, Vol. 45, Issue 2 (March/April 2025), Theme Article: Hot Interconnects 31 | DOI: 10.1109/MM.2025.3547363

Problem

The deep learning collective communication ecosystem has fragmented into a patchwork of incompatible representations. Upstream producers of collective algorithms each emit their own format: MSCCLang ships an XML-based MSCCL-IR; TACCL produces a mixed-integer-linear-programming solution structure; TACOS uses a Time-Expanded Network (TEN) representation; NCCL itself encodes algorithms in templated CUDA kernels. Downstream consumers of collective algorithms, namely runtime systems (MSCCL-Runtime, NCCL) and simulators (ASTRA-sim, LIBRA), each carry their own internal collective implementations, often duplicated and not interchangeable with the producer side. The consequence is that every new synthesizer must re-implement integration code for every runtime/simulator it wants to target, and every simulator must re-implement every algorithm it wants to model. With P producers and C consumers, this O(P x C) engineering burden blocks rapid exploration. Additionally, current representations are too coarse to express fine-grained compute-communication overlap at the chunk level: the NPU typically waits for an entire collective to complete before starting the next compute task, even though most algorithms could in principle interleave with neighboring matmul chunks.

Core Insight

Extend the existing Chakra Execution Trace (ET) format — already an MLCommons-blessed graph-based representation for whole distributed-ML workloads — to also encode arbitrary collective algorithms as a per-NPU directed acyclic graph of point-to-point send/receive/compute nodes. This single move elevates communication messages to the same level as compute operators and decouples producers from consumers: any synthesizer can target any runtime or simulator through one universal intermediate representation.

Method

Chakra ET is extended with three node types sufficient to express any deterministic collective algorithm:

+----------------------------------------------------------+
| Chakra ET Node Types for Collective Algorithms           |
+--------------+-------------------------------------------+
| COMM_SEND    | Point-to-point send to a destination NPU  |
| COMM_RECV    | Wait for a P2P message from a source NPU  |
| COMP         | Run a compute task (e.g., reduction op)   |
+--------------+-------------------------------------------+

A collective algorithm becomes a per-NPU DAG. Edges express inter-operator dependencies: e.g., a reduction COMP depends on its input COMM_RECV; a COMM_SEND of a reduced chunk depends on the COMP that produced it. The authors validate the format against three concrete algorithms: ring-AllReduce, ring Reduce-Scatter, and a TACOS-synthesized topology-aware AllReduce.
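
The per-NPU DAG described above can be sketched in a few lines of Python. This is a minimal illustration, not the Chakra ET schema: the `Node` class, its field names, and `ring_reduce_scatter_dag` are hypothetical, and only the three node-type names come from the table above. The sketch builds the graph one NPU would execute for a ring Reduce-Scatter, wiring each reduction to its receive and each later send to the reduction that produced its chunk.

```python
from dataclasses import dataclass, field
from enum import Enum
from graphlib import TopologicalSorter
from typing import Optional


class NodeType(Enum):
    COMM_SEND = "COMM_SEND"
    COMM_RECV = "COMM_RECV"
    COMP = "COMP"


@dataclass
class Node:  # hypothetical container, not the Chakra ET protobuf message
    node_id: int
    kind: NodeType
    chunk: int                   # index of the data chunk this node touches
    peer: Optional[int] = None   # destination (send) or source (recv) NPU
    deps: list = field(default_factory=list)  # ids of parent nodes


def ring_reduce_scatter_dag(rank: int, n: int) -> list:
    """Build the DAG that NPU `rank` executes in an n-NPU ring Reduce-Scatter."""
    nodes = []
    prev_comp = None
    right, left = (rank + 1) % n, (rank - 1) % n
    for step in range(n - 1):
        send_chunk = (rank - step) % n
        recv_chunk = (rank - step - 1) % n
        base = 3 * step
        # After step 0, the chunk being sent is the one reduced in the previous step.
        send = Node(base, NodeType.COMM_SEND, send_chunk, right,
                    [] if prev_comp is None else [prev_comp])
        recv = Node(base + 1, NodeType.COMM_RECV, recv_chunk, left)
        # The reduction can only run once its input chunk has arrived.
        comp = Node(base + 2, NodeType.COMP, recv_chunk, None, [recv.node_id])
        nodes += [send, recv, comp]
        prev_comp = comp.node_id
    return nodes


dag = ring_reduce_scatter_dag(rank=0, n=4)
# One valid execution order that respects every dependency edge.
order = list(TopologicalSorter({x.node_id: x.deps for x in dag}).static_order())
```

Because sends and receives within a step carry no edge between them, a consumer is free to issue them concurrently; only the data dependencies are encoded.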

Producer integration:

MSCCL-IR -> Chakra ET converter parses MSCCLang's XML and emits a Chakra ET DAG per NPU.
TACOS was modified to emit Chakra ET directly out of its TEN solution.
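
To make the converter idea concrete, here is a rough sketch of the parse-and-emit pattern. The XML below is a toy schema invented for this example and is not the real MSCCL-IR format; the attribute names (`type`, `peer`, `chunk`, `deps`) and the `convert` function are likewise assumptions. The output is a plain dictionary of per-NPU node lists standing in for Chakra ET graphs.

```python
import xml.etree.ElementTree as ET

# Toy input: NOT the actual MSCCL-IR schema, only an illustrative stand-in.
TOY_XML = """
<algo name="toy_exchange" ngpus="2">
  <gpu id="0">
    <step id="0" type="send" peer="1" chunk="0" deps=""/>
    <step id="1" type="recv" peer="1" chunk="1" deps=""/>
    <step id="2" type="reduce" chunk="1" deps="1"/>
  </gpu>
  <gpu id="1">
    <step id="0" type="send" peer="0" chunk="1" deps=""/>
    <step id="1" type="recv" peer="0" chunk="0" deps=""/>
    <step id="2" type="reduce" chunk="0" deps="1"/>
  </gpu>
</algo>
"""

# Map the toy step types onto the three node types used for collectives.
KIND = {"send": "COMM_SEND", "recv": "COMM_RECV", "reduce": "COMP"}


def convert(xml_text: str) -> dict:
    """Return {rank: [node, ...]} with one dependency-annotated node per step."""
    root = ET.fromstring(xml_text.strip())
    graphs = {}
    for gpu in root.iter("gpu"):
        rank = int(gpu.get("id"))
        nodes = []
        for step in gpu.iter("step"):
            peer = step.get("peer")
            nodes.append({
                "id": int(step.get("id")),
                "type": KIND[step.get("type")],
                "peer": int(peer) if peer is not None else None,
                "chunk": int(step.get("chunk")),
                "deps": [int(d) for d in step.get("deps", "").split(",") if d],
            })
        graphs[rank] = nodes
    return graphs


graphs = convert(TOY_XML)
```

A real converter would additionally have to handle whatever scheduling constructs the producer's format carries; those details are outside what this summary specifies.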

Consumer integration:

ASTRA-sim was extended with a new input parameter accepting a Chakra ET trace for the collective; when supplied, it bypasses its native hard-coded algorithm implementations and simulates the supplied DAG directly.

This demonstrates the workflow end-to-end: two producers emit Chakra ET, one simulator ingests it, and the simulated bandwidth varies with both algorithm and topology in the direction one would expect.

Experimental Setup

Component	Value
Simulator	ASTRA-sim 2.0 (analytical network model)
NPUs	64
Topologies	2-D Mesh (8x8) and 3-D Hypercube (4x4x4)
Link latency	500 ns
Link bandwidth	50 GB/s
Workloads	All-Gather, All-Reduce
Algorithms compared	MSCCLang-Ring, MSCCLang-Direct, TACOS (topology-aware)
Scaling test	Chakra-ET generation time at 16/32/64/128 NPUs (All-Reduce)
Metric	Achieved collective bus bandwidth (GB/s) vs. chunk size (KB - GB)
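
The link parameters in the table give a feel for why bandwidth is reported against chunk size. The sketch below assumes a simple latency-plus-serialization (alpha-beta) cost for a single message on a single link; this is an illustrative model and may be simpler than what ASTRA-sim's analytical backend actually computes. The function names are made up for this example.

```python
LINK_LATENCY_S = 500e-9       # 500 ns, from the setup table
LINK_BANDWIDTH_BPS = 50e9     # 50 GB/s, from the setup table


def link_time(nbytes: float) -> float:
    """Alpha-beta cost of one message over one link, in seconds."""
    return LINK_LATENCY_S + nbytes / LINK_BANDWIDTH_BPS


def link_efficiency(nbytes: float) -> float:
    """Fraction of the link time spent actually moving bytes."""
    return (nbytes / LINK_BANDWIDTH_BPS) / link_time(nbytes)


# Message size at which latency and serialization time are equal.
crossover_bytes = LINK_LATENCY_S * LINK_BANDWIDTH_BPS

for size in (1e3, 25e3, 1e6, 1e9):
    print(f"{size:>12.0f} B  efficiency = {link_efficiency(size):.3f}")
```

Under this assumption the crossover sits at about 25 KB: smaller messages are latency-dominated, while megabyte-scale chunks use nearly the full link bandwidth.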

Headline Quantitative Results

Chakra ET generation cost (MSCCL-IR -> Chakra ET, All-Reduce):

NPUs	16	32	64	128
Generation time (ms)	259	398	1485	7662

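Conversion time grows superlinearly with NPU count (about 1.5x from 16 to 32 NPUs, then about 3.7x and 5.2x for the next two doublings), presumably because both the number of per-NPU DAGs and the size of each DAG increase. The 128-NPU case (~7.66 s) is still well within practical offline-synthesis budgets.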
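
The growth rate can be read directly off the table. The short calculation below uses only the four reported timings; the per-doubling exponent is just log2 of the ratio and is meant as a rough indicator, not a fitted complexity model.

```python
import math

npus = [16, 32, 64, 128]
gen_ms = [259, 398, 1485, 7662]   # reported generation times

# Growth factor for each doubling of the NPU count.
ratios = [later / earlier for earlier, later in zip(gen_ms, gen_ms[1:])]
# Since N doubles each time, log2(ratio) is the local exponent k in time ~ N^k.
exponents = [math.log2(r) for r in ratios]

for n0, n1, r, k in zip(npus, npus[1:], ratios, exponents):
    print(f"{n0:>3} -> {n1:<3} NPUs: x{r:.2f}  (local exponent ~{k:.2f})")
```

An exponent above 1 for the last two doublings indicates faster-than-linear growth over that range.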

TACOS synthesis cost (All-Reduce, 128 NPUs): 1080 ms.

End-to-end simulated bandwidth on 64 NPUs, 2-D Mesh:

TACOS-optimized All-Gather: ~100 GB/s at large chunk sizes (1 MB - 1 GB)
MSCCLang-Ring: ~20 GB/s
MSCCLang-Direct: ~5 GB/s

End-to-end simulated bandwidth on 64 NPUs, 3-D Hypercube:

TACOS-optimized All-Gather: ~150 GB/s
MSCCLang-Ring: ~40 GB/s
MSCCLang-Direct: ~5 GB/s

The roughly 4x-30x bandwidth spread between algorithms on the same topology (5x and 20x on the Mesh, 3.75x and 30x on the Hypercube) suggests that the Chakra-ET pipeline preserves enough information to reproduce the algorithmic-quality differences observed in prior literature; that is, the standardized format is not lossy with respect to scheduling quality.
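
The spread follows from the approximate figures listed above. The snippet below simply divides the TACOS bandwidth by each baseline on the same topology; the inputs are the rounded values from this summary, so the ratios are approximate as well.

```python
# Approximate large-chunk bandwidths (GB/s) as listed in this summary.
reported_gbps = {
    "2-D Mesh": {"TACOS": 100, "MSCCLang-Ring": 20, "MSCCLang-Direct": 5},
    "3-D Hypercube": {"TACOS": 150, "MSCCLang-Ring": 40, "MSCCLang-Direct": 5},
}

# Speedup of TACOS over each baseline, per topology.
speedups = {
    topo: {alg: bw["TACOS"] / value for alg, value in bw.items() if alg != "TACOS"}
    for topo, bw in reported_gbps.items()
}

for topo, per_alg in speedups.items():
    for alg, s in per_alg.items():
        print(f"{topo:<14} TACOS vs {alg:<16} x{s:g}")
```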

Limitations

The evaluation uses synthetic workloads (single-collective microbenchmarks) rather than full distributed-ML training traces. The interaction with realistic compute graphs is not measured.
Only two synthesizers (MSCCLang, TACOS) and one simulator (ASTRA-sim) are integrated. NCCL itself, MSCCL-Runtime, and other consumers (LIBRA) are not yet on board.
The Chakra-ET node set covers point-to-point sends, recvs, and compute but does not yet capture hardware-offloaded collectives like NVIDIA SHARP, multicast accelerators, or in-network reduction trees. Algorithms that exploit these are flattened to software-equivalent DAGs.
ASTRA-sim's analytical model abstracts away congestion, contention, and link-level effects; the bandwidth numbers are ideal-case upper bounds.
The paper is a representation paper, not a design-space exploration: it does not propose new algorithms, only the substrate to compare them uniformly.

Open Problems Called Out

Co-optimization with workload graphs. With collective communication now expressed at the same DAG level as compute operators, the natural next step is to build joint compute+collective schedulers that overlap matmul chunks with specific point-to-point messages — beyond today's coarse "wait for AllReduce to finish" pattern.
Ecosystem expansion. More producers (NCCL templates, TACCL, MSCCL++) and more consumers (real GPU runtimes such as MSCCL-Runtime, and further simulators such as LIBRA) need Chakra-ET front/back ends.
Hardware-offloaded collectives. Extending Chakra ET to capture SHARP-style in-network reductions, multicast, and hardware-accelerated collective kernels remains future work.
Standardization governance. The authors implicitly call for a community-driven schema for the new node types; without governance, format drift will recreate the fragmentation problem inside Chakra ET.

Relevance to DynamICCL

DynamICCL is an RL-based NCCL configuration optimizer that selects the per-collective algorithm (Ring / Tree / CollNet / NVLS), protocol (LL / LL128 / Simple), nChannels, numThreads, and chunkSize to minimize collective wall-clock time. It conditions on state features including message size (log-binned), model intensity I = C/D, local batch size, topology fingerprint, and an LSTM-encoded recent-collective timing window. Reward is -collective_wall_clock_us. It operates inside NCCL via the tuner-plugin API. This paper informs DynamICCL in five ways:

Chakra ET is the natural format for trajectory replay. DynamICCL's replay buffer stores (state, action, reward) tuples; each action instantiates a specific collective algorithm at the NCCL layer. Storing the executed algorithm as a Chakra-ET DAG (rather than as opaque NCCL parameters) lets the agent reason about why one configuration beat another, e.g., "MSCCLang-Ring All-Gather on the 8x8 mesh achieved ~20 GB/s, but the TACOS schedule reached ~100 GB/s." This is a fidelity upgrade for the reward-attribution analysis.
Topology fingerprint as a state feature. The 1.5x-2x bandwidth swing between Mesh and Hypercube on the same algorithm (TACOS: ~100 vs. ~150 GB/s; MSCCLang-Ring: ~20 vs. ~40 GB/s) indicates that DynamICCL's policy should observe topology as a first-class feature: the same algorithm choice has different reward signatures across fabrics. Chakra ET's DAG-level representation would also allow DynamICCL to feed graph-encoded topology features (via GNN) into the policy, instead of a flat fingerprint.
Action-space priors from synthesizer outputs. TACOS-synthesized schedules outperform ring/direct schedules by roughly 4x-30x in the reported results. A pre-trained imitation-learning prior built from Chakra-ET traces of expert synthesizers (MSCCLang, TACOS, TE-CCL) would give DynamICCL a strong warm start; the agent then need only fine-tune NCCL-tunable knobs around the synthesized algorithm class.
Reward shaping via per-chunk overlap potential. The paper's call for fine-grained compute-comm overlap maps directly to DynamICCL's chunkSize action. By representing compute and communication at the same node level, Chakra ET exposes the overlap surface area between a collective DAG and the surrounding workload DAG. DynamICCL could shape its reward to favor configurations that maximize achievable overlap, not just minimize collective wall-clock in isolation.
Research positioning. This paper is upstream of DynamICCL: it provides a substrate (Chakra ET) on which a learned tuner can express its actions, log its decisions, and reproduce its experiments across simulators and real clusters. DynamICCL should adopt Chakra ET as its canonical action-and-trace logging format, and contribute back its tuner-plugin trajectories as a public Chakra-ET corpus — feeding the open problem of "co-optimizing workload + collective at scale" that the authors leave unresolved.
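
The tuner description and the reward-shaping idea above can be sketched together. Everything in the block is an assumption made for illustration: the knob grids, the `Action` class, the binning rule, and the `shaped_reward` form with its weight `lam` are hypothetical, and the code does not touch the NCCL tuner-plugin API. It also enumerates every combination, although a real NCCL build may not support all algorithm/protocol pairs.

```python
from dataclasses import dataclass
from itertools import product

ALGORITHMS = ("Ring", "Tree", "CollNet", "NVLS")
PROTOCOLS = ("LL", "LL128", "Simple")
# Hypothetical knob grids, chosen only for illustration.
N_CHANNELS = (1, 2, 4, 8, 16)
NUM_THREADS = (128, 256, 512)
CHUNK_SIZES = (64 << 10, 256 << 10, 1 << 20)   # bytes


@dataclass(frozen=True)
class Action:
    algorithm: str
    protocol: str
    n_channels: int
    num_threads: int
    chunk_size: int


# Full Cartesian product; validity filtering is left out of this sketch.
ACTIONS = [
    Action(*combo)
    for combo in product(ALGORITHMS, PROTOCOLS, N_CHANNELS, NUM_THREADS, CHUNK_SIZES)
]


def message_size_bin(nbytes: int) -> int:
    """Log-binned message size, floor(log2(nbytes)); requires nbytes > 0."""
    return nbytes.bit_length() - 1


def reward(collective_wall_clock_us: float) -> float:
    """Baseline reward as stated above: negative collective wall-clock time."""
    return -collective_wall_clock_us


def shaped_reward(collective_us: float, overlappable_compute_us: float,
                  lam: float = 0.0) -> float:
    """Hypothetical shaping: credit the part of the collective hidden by compute.

    lam = 0 recovers the baseline reward; lam = 1 penalizes only exposed time.
    """
    hidden_us = min(collective_us, overlappable_compute_us)
    return -collective_us + lam * hidden_us
```

With these grids the action space has 4 x 3 x 5 x 3 x 3 = 540 entries. How `overlappable_compute_us` would be measured from a combined workload and collective DAG is left open here, as it is in the text above.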