Brief Summary: nnScaler

Full title: nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training
Authors: Zhiqi Lin, Youshan Miao, Quanlu Zhang, Fan Yang, Yi Zhu, Cheng Li, Saeed Maleki, Xu Cao, Ning Shang, Yilei Yang, Weijiang Xu, Mao Yang, Rachee Singh, Janardhan Kulkarni, Lidong Zhou (Microsoft Research)
Year: 2024
Venue: OSDI '24


Problem

Existing distributed training frameworks (Megatron-LM, Alpa, DeepSpeed) hard-code a small set of parallelization strategies. When a new model architecture (e.g., SwinTransformer, AlphaFold2) does not fit the assumptions of any pre-built strategy, the engineer must either manually re-implement framework internals or accept a suboptimal plan. The combinatorial search space for parallelization (which operators to split, how to split them, and where to place the pieces) is enormous and keeps growing with model diversity.

Core Insight

Parallelization plans can be expressed as combinations of three primitive operations: op-trans (how to partition an operator), op-assign (which device executes each partition), and op-order (the temporal ordering of execution). Constraints on these primitives, derived from model semantics, prune the search space enough to make plan search 11.7× faster than Alpa's unconstrained search while still covering the best-performing plans.
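To make the primitive vocabulary concrete, here is a minimal Python sketch of a plan expressed as op-trans/op-assign/op-order decisions. The dataclass names and fields are illustrative assumptions, not nnScaler's actual API.

```python
# Hypothetical encoding of the three primitives (illustrative, not nnScaler's API).
from dataclasses import dataclass

@dataclass(frozen=True)
class OpTrans:            # how an operator is partitioned
    op: str               # operator name in the dataflow graph
    algo: str             # e.g. "split-batch", "split-hidden", "replicate"
    degree: int           # number of partitions

@dataclass(frozen=True)
class OpAssign:           # which devices execute the partitions
    op: str
    devices: tuple[int, ...]

@dataclass(frozen=True)
class OpOrder:            # temporal ordering between two operators
    before: str
    after: str

# A full plan is just a set of these decisions; e.g. plain data parallelism
# over 4 GPUs for a two-operator graph:
plan = [
    OpTrans("matmul_0", "split-batch", 4), OpAssign("matmul_0", (0, 1, 2, 3)),
    OpTrans("matmul_1", "split-batch", 4), OpAssign("matmul_1", (0, 1, 2, 3)),
    OpOrder("matmul_0", "matmul_1"),
]
```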

Method

nnScaler represents a training computation as a directed acyclic graph of operators and introduces the following mechanisms:

vTensor/pTensor abstraction: Data dependencies are tracked via logical vTensors (full tensors) and physical pTensors (partitioned shards). Dependency violations are detected symbolically and resolved by inserting communication operators (AllReduce, AllGather, ReduceScatter, AllToAll, or point-to-point send/recv).
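As a rough illustration of this resolution step, the sketch below maps a producer/consumer layout mismatch to a communication operator. The layout encoding and the rule table are simplified assumptions, not the paper's actual algorithm.

```python
# Hedged sketch: resolving a vTensor dependency between a producer's pTensor
# layout and the layout a consumer expects. Rules are illustrative only.
def resolve_comm(producer_layout: str, consumer_layout: str) -> str | None:
    """Layouts: 'R' replicated, 'S0'/'S1' split on dim 0/1, 'P' partial-sum."""
    if producer_layout == consumer_layout:
        return None                       # layouts already match: no comm op
    rules = {
        ("P", "R"): "AllReduce",          # partial results -> full replica
        ("P", "S0"): "ReduceScatter",     # partial results -> sharded sum
        ("S0", "R"): "AllGather",         # shards -> full replica
        ("S1", "R"): "AllGather",
        ("S0", "S1"): "AllToAll",         # re-shard along a different dim
        ("S1", "S0"): "AllToAll",
    }
    # anything not covered falls back to point-to-point send/recv
    return rules.get((producer_layout, consumer_layout), "SendRecv")
```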

Constraint tables: Users or framework developers provide constraint tables (as in Tables 2-7 of the paper) that specify which op-trans choices are legal given a model's semantic invariants. For example, data parallelism is captured as: all operators use the same op-trans splitting on the batch dimension.
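A hedged sketch of how such a table can prune the op-trans space, with data parallelism encoded as "the same batch-dimension split everywhere": the candidate lists and function names below are hypothetical, not the paper's format.

```python
# Hypothetical candidate op-trans choices per operator type.
CANDIDATE_TRANS = {
    "matmul":  ["split-batch", "split-hidden", "replicate"],
    "softmax": ["split-batch", "replicate"],
}

# Data parallelism as a constraint: every operator must split on batch.
def data_parallel_constraint(op: str, algo: str) -> bool:
    return algo == "split-batch"

def prune(constraint) -> dict[str, list[str]]:
    """Keep only the op-trans choices the constraint permits."""
    return {op: [a for a in algos if constraint(op, a)]
            for op, algos in CANDIDATE_TRANS.items()}

# prune(data_parallel_constraint) leaves one legal choice per operator,
# collapsing the combinatorial space before any cost-based search runs.
```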

Novel parallelization plans enabled: the constraint-guided search surfaces plans outside the standard DP/TP/PP templates, including Coshard, interlaced pipeline, and 3F1B scheduling (all discussed under Relevance below); a Coshard example follows.
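For instance, the Coshard idea (as summarized here and in the Relevance section) can be written in the primitive vocabulary from the earlier sketch; this snippet reuses those illustrative OpTrans/OpAssign types and is an assumption about the plan's shape, not the paper's notation.

```python
# Hedged sketch of Coshard: all partitions of an operator are assigned to the
# SAME device and run sequentially, so the partitioned edge needs no
# inter-device communication (reuses OpTrans/OpAssign from the sketch above).
coshard_plan = [
    OpTrans("attention_0", "split-hidden", 4),
    OpAssign("attention_0", (0, 0, 0, 0)),   # 4 shards, all on device 0
]
```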

Implementation: ~24K lines of Python. Constraint resolution uses the Z3 or Gurobi solvers; temporal scheduling (Tessel) uses heuristic search over topological sorts. nnScaler is deployed in Microsoft production and has been used for Phi-3, LongRoPE, RetNet, and YOCO.
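The scheduling side can be pictured as a search over topological sorts of the operator DAG. The brute-force enumeration below is only illustrative (the summary credits Tessel with a heuristic version), and the cost function is a stand-in.

```python
# Hedged sketch: pick the cheapest execution order consistent with the DAG.
from itertools import permutations

def topo_orders(ops, deps):
    """Yield all orders of `ops` consistent with `deps`, a set of
    (before, after) pairs. Brute force; real systems use heuristics."""
    for order in permutations(ops):
        pos = {op: i for i, op in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in deps):
            yield order

def best_schedule(ops, deps, cost):
    """`cost` maps an order to a scalar, e.g. estimated pipeline bubble time."""
    return min(topo_orders(ops, deps), key=cost)
```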

Key Results

Model              Hardware         Best Baseline   nnScaler Speedup
SwinTransformer-H  DGX-2 (32 V100)  Megatron-LM     3.5×
T5-3B              DGX-2 (32 V100)  Megatron-LM     2.2×
AlphaFold2         DGX-2            DeepSpeed       1.8×

Plan search: 11.7× faster than Alpa without constraint guidance.

Limitations

Relevance to DynamICCL

nnScaler is relevant to DynamICCL because it makes the communication pattern of a training job an output of the parallelization plan search, not a fixed input. This has two direct consequences.

First, nnScaler inserts NCCL collectives (AllReduce, AllGather, ReduceScatter, AllToAll, Send/Recv) automatically based on the dependency analysis between vTensors and pTensors. The resulting communication graph can differ radically from standard DP/MP patterns — for example, the Coshard plan eliminates inter-device AllReduces for the coupled operators entirely. DynamICCL must be able to identify which collective types appear in any given plan and tune accordingly.
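One way DynamICCL could consume a generated plan is to profile its communication operators before tuning. The sketch below is a hypothetical pre-pass (the input format and bucketing are assumptions, not an existing nnScaler or DynamICCL interface).

```python
# Hedged sketch: summarize which collectives a plan contains so a tuner can
# pick per-collective settings. `plan_comm_ops` is a hypothetical input.
from collections import Counter

def collective_profile(plan_comm_ops: list[dict]) -> Counter:
    """plan_comm_ops: e.g. [{'kind': 'AllReduce', 'bytes': 1 << 26}, ...]"""
    profile = Counter()
    for op in plan_comm_ops:
        # bucket by kind and rough message size, the two features that most
        # affect algorithm/protocol choice (ring vs. tree, chunking, etc.)
        bucket = "small" if op["bytes"] < (1 << 20) else "large"
        profile[(op["kind"], bucket)] += 1
    return profile
```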

Second, nnScaler plans (especially 3F1B and interlaced pipeline) produce irregular temporal communication patterns — bursts of small point-to-point sends interleaved with large AllReduces. DynamICCL's LSTM+CUSUM Trigger Agent must generalize beyond the regular periodic AllReduce bursts of standard DP to handle these plan-specific patterns.
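For reference, the CUSUM half of such a trigger agent is small. The sketch below flags sustained shifts in per-step communication volume; the parameter values and the pairing with an LSTM forecaster are assumptions, not a published DynamICCL design.

```python
# Hedged sketch of one-sided CUSUM change detection over a stream of
# per-step communication volumes (standardized samples).
def cusum(samples, mean, drift=0.5, threshold=5.0):
    """Return indices where a sustained upward shift from `mean` is detected."""
    s, changes = 0.0, []
    for i, x in enumerate(samples):
        s = max(0.0, s + (x - mean) - drift)   # accumulate positive drift only
        if s > threshold:
            changes.append(i)                  # sustained shift detected
            s = 0.0                            # reset after a detection
    return changes
```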