Brief Summary: nnScaler
Full title: nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training
Authors: Zhiqi Lin, Youshan Miao, Quanlu Zhang, Fan Yang, Yi Zhu, Cheng Li, Saeed Maleki, Xu Cao, Ning Shang, Yilei Yang, Weijiang Xu, Mao Yang, Rachee Singh, Janardhan Kulkarni, Lidong Zhou (Microsoft Research)
Year: 2024
Venue: OSDI '24
Problem
Existing distributed training frameworks (Megatron-LM, Alpa, DeepSpeed) hard-code a small set of parallelization strategies. When a new model architecture (e.g., SwinTransformer, AlphaFold2) does not fit the assumptions of any pre-built strategy, the engineer must manually re-implement the framework internals or accept a suboptimal plan. The combinatorial search space for parallelization (which operators to split, how, and where to place them) is enormous and growing with model diversity.
Core Insight
Parallelization plans can be expressed as combinations of three primitive operations: `op-trans` (how to partition an operator), `op-assign` (which device handles it), and `op-order` (temporal execution ordering). Constraints on these primitives, derived from model semantics, can prune the search space by 11.7× while still covering the optimal plan.
Method
nnScaler represents a training computation as a directed acyclic graph of operators. It introduces three primitives (illustrated in the sketch after this list):
- `op-trans(op, algo, n)`: partitions operator `op` using algorithm `algo` across `n` devices, producing n sub-operators on n vTensors.
- `op-assign(op, d)`: places a sub-operator on physical device `d`; an ILP solver finds the optimal assignment respecting device capacity.
- `op-order(op1, op2)`: constrains `op1` to execute before `op2`; a temporal ordering engine (Tessel) enumerates valid schedules subject to these constraints and data dependencies.
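To make the primitives concrete, here is a minimal self-contained sketch. The function names mirror the primitives, but the data model (a `SubOp` record and a global constraint list) is invented for illustration and is not nnScaler's API. It expresses plain data parallelism over two devices:

```python
from dataclasses import dataclass

@dataclass
class SubOp:
    name: str
    shard: int
    device: int | None = None

def op_trans(op_name: str, algo: str, n: int) -> list[SubOp]:
    """op-trans: partition one operator into n sub-operators."""
    assert algo == "split_batch_dim"  # the only algorithm this toy knows
    return [SubOp(f"{op_name}[{algo}]", i) for i in range(n)]

def op_assign(subop: SubOp, device: int) -> None:
    """op-assign: pin one sub-operator to a physical device."""
    subop.device = device

order_constraints: list[tuple[SubOp, SubOp]] = []

def op_order(first: SubOp, second: SubOp) -> None:
    """op-order: record that `first` must run before `second`."""
    order_constraints.append((first, second))

# Data parallelism as a plan: every layer split on the batch dimension,
# shard i pinned to device i, original layer order preserved per device.
layers = ["embed", "attn", "mlp"]
plan = [op_trans(layer, "split_batch_dim", n=2) for layer in layers]
for subops in plan:
    for i, s in enumerate(subops):
        op_assign(s, device=i)
for prev, curr in zip(plan, plan[1:]):
    for i in range(2):
        op_order(prev[i], curr[i])
```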
Data dependencies are tracked via the vTensor/pTensor abstraction: vTensors are logical (full) tensors; pTensors are physical (partitioned) shards. Dependency violations are detected symbolically and resolved by inserting communication operators (AllReduce, AllGather, ReduceScatter, AllToAll, or point-to-point send/recv).
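The collective inserted depends on how the producer's pTensor layout relates to the layout the consumer expects. A toy decision table (the layout vocabulary here, "partial_sum" / "sharded" / "replicated", is this sketch's own, not the paper's):

```python
# Illustrative mapping from (producer layout, consumer layout) to the
# communication op inserted to repair a dependency violation.
COMM_RULES = {
    ("partial_sum", "replicated"): "AllReduce",      # sum partial results everywhere
    ("partial_sum", "sharded"):    "ReduceScatter",  # sum, then keep one shard each
    ("sharded",     "replicated"): "AllGather",      # collect all shards everywhere
    ("sharded",     "resharded"):  "AllToAll",       # switch the split dimension
}

def resolve(producer_layout: str, consumer_layout: str) -> str | None:
    if producer_layout == consumer_layout:
        return None  # layouts already match: no communication needed
    try:
        return COMM_RULES[(producer_layout, consumer_layout)]
    except KeyError:
        return "Send/Recv"  # fall back to point-to-point transfers
```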
Constraint tables: Users or framework developers provide constraint tables (as in Tables 2-7 of the paper) that specify which `op-trans` choices are legal given a model's semantic invariants. For example, data parallelism is captured as: all operators use the same `op-trans`, splitting on the batch dimension.
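A toy encoding of one such rule (field names invented here; the paper's tables use their own schema) shows how a constraint shrinks each operator's choice set:

```python
# Hypothetical encoding of one constraint-table row: data parallelism means
# every operator must use the same op-trans algorithm, split on the batch dim.
DATA_PARALLEL_RULE = {
    "applies_to": "all_operators",
    "op_trans": {"algo": "split_batch_dim"},
    "rationale": "batch elements are independent, so splitting preserves semantics",
}

# Pruning: of all partition algorithms an operator supports, keep only the
# ones the rule permits; the plan search then explores only this reduced set.
supported = ["split_batch_dim", "split_hidden_dim", "replicate"]
legal = [a for a in supported if a == DATA_PARALLEL_RULE["op_trans"]["algo"]]
assert legal == ["split_batch_dim"]
```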
Novel parallelization plans enabled:
- Coshard (SwinTransformer): Two operators with coupled non-batch splits are co-placed on the same device, avoiding inter-device communication for the intermediate activation.
- Interlaced pipeline (T5): Embedding table is replicated across all pipeline stages; non-embedding layers use standard 1F1B. This eliminates the pipeline bubble caused by large embedding lookups.
- 3F1B (AlphaFold2): Three forward passes followed by one backward pass, exploiting AlphaFold2's unique computation graph structure (a toy schedule generator follows this list).
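To make these schedule shapes concrete, here is a toy generator for "k forwards, then one backward" steady-state event lists. It ignores warmup phases and per-stage structure, and is only an illustration of the pattern, not Tessel's actual search:

```python
def kf1b_steady_state(k: int, microbatches: int) -> list[str]:
    """Emit a steady-state event list for a 'k forwards, 1 backward' schedule.
    kf1b_steady_state(1, n) gives classic 1F1B; kf1b_steady_state(3, n) gives 3F1B."""
    events, fwd_issued, bwd_issued = [], 0, 0
    while bwd_issued < microbatches:
        # issue up to k forwards before each backward, while any remain
        for _ in range(k):
            if fwd_issued < microbatches:
                events.append(f"F{fwd_issued}")
                fwd_issued += 1
        events.append(f"B{bwd_issued}")
        bwd_issued += 1
    return events

print(kf1b_steady_state(1, 4))  # F/B strictly alternating, as in 1F1B
print(kf1b_steady_state(3, 6))  # three forwards between consecutive backwards
```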
Implementation: 24K lines of Python; constraint resolution uses Z3 or the Gurobi ILP solver; Tessel uses heuristic search over topological sorts; deployed in Microsoft production for Phi-3, LongRoPE, RetNet, and YOCO.
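For flavor, here is a minimal device-placement ILP in the spirit of `op-assign`, written with PuLP rather than the paper's Z3/Gurobi; all op costs, memory footprints, and capacities are made-up numbers:

```python
import pulp  # generic ILP front end; the paper itself uses Z3 or Gurobi

# Hypothetical per-op compute costs and memory footprints (GB).
cost = {"attn": 3.0, "mlp": 2.0, "embed": 1.0, "head": 1.5}
mem  = {"attn": 6,   "mlp": 5,   "embed": 8,   "head": 3}
devices, capacity = ["gpu0", "gpu1"], 12  # per-device memory budget (GB)

prob = pulp.LpProblem("op_assign", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (cost, devices), cat="Binary")
T = pulp.LpVariable("bottleneck", lowBound=0)

prob += T  # objective: minimize the load on the most-loaded device
for o in cost:
    prob += pulp.lpSum(x[o][d] for d in devices) == 1  # each op placed exactly once
for d in devices:
    prob += pulp.lpSum(cost[o] * x[o][d] for o in cost) <= T         # load bound
    prob += pulp.lpSum(mem[o]  * x[o][d] for o in cost) <= capacity  # memory cap

prob.solve(pulp.PULP_CBC_CMD(msg=False))
placement = {o: next(d for d in devices if x[o][d].value() == 1) for o in cost}
print(placement)
```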
Key Results
| Workload | Hardware | Best baseline | nnScaler speedup |
|---|---|---|---|
| SwinTransformer-H | DGX-2 (32 V100) | Megatron-LM | 3.5× |
| T5-3B | DGX-2 (32 V100) | Megatron-LM | 2.2× |
| AlphaFold2 | DGX-2 | DeepSpeed | 1.8× |
| Plan search time | — | Alpa (no constraints) | 11.7× faster |
- 84.1% of HuggingFace models are automatically parallelizable without any manual constraints.
- Used in production for 4 Microsoft models at the time of publication.
Limitations
- Does not support asynchronous pipeline parallelism (PipeDream-style weight stashing) — only synchronous plans.
- Does not support TeraPipe-style token-level pipeline masks, which require dynamic masking that nnScaler cannot express statically.
- ILP placement solver can become a bottleneck for very large operator graphs (hundreds of operators × many devices).
- Constraint tables must be written manually for genuinely novel model architectures; coverage is not 100%.
- Evaluation hardware (DGX-2, 32 V100s) is high-end; performance on bandwidth-constrained clusters (e.g., Ethernet) is not evaluated.
Relevance to DynamICCL
nnScaler is relevant to DynamICCL because it makes the communication pattern of a training job an output of the parallelization plan search, not a fixed input. This has two direct consequences.
First, nnScaler inserts NCCL collectives (AllReduce, AllGather, ReduceScatter, AllToAll, Send/Recv) automatically based on the dependency analysis between vTensors and pTensors. The resulting communication graph can differ radically from standard DP/MP patterns — for example, the Coshard plan eliminates inter-device AllReduces for the coupled operators entirely. DynamICCL must be able to identify which collective types appear in any given plan and tune accordingly.
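For instance (the plan representation below is invented for illustration), a tuner could inventory a plan's communication ops before deciding what to profile:

```python
from collections import Counter

# Hypothetical flat dump of a plan's inserted communication operators:
# (collective type, participating devices, payload bytes).
comm_ops = [
    ("AllReduce",     (0, 1, 2, 3), 512 << 20),
    ("AllGather",     (0, 1),        64 << 20),
    ("Send/Recv",     (1, 2),         2 << 20),
    ("Send/Recv",     (2, 3),         2 << 20),
    ("ReduceScatter", (0, 1, 2, 3), 128 << 20),
]

# Which collective types occur, and how much traffic each carries: the
# minimal signature a tuner needs before choosing what to profile.
kinds = Counter(kind for kind, _, _ in comm_ops)
bytes_by_kind = Counter()
for kind, _, nbytes in comm_ops:
    bytes_by_kind[kind] += nbytes

for kind in kinds:
    print(f"{kind}: {kinds[kind]} ops, {bytes_by_kind[kind] / 2**20:.0f} MiB total")
```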
Second, nnScaler plans (especially 3F1B and interlaced pipeline) produce irregular temporal communication patterns — bursts of small point-to-point sends interleaved with large AllReduces. DynamICCL's LSTM+CUSUM Trigger Agent must generalize beyond the regular periodic AllReduce bursts of standard DP to handle these plan-specific patterns.
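As a reference point for the detection side, here is a minimal one-sided CUSUM over a stream of inter-burst intervals; the parameter values are placeholders, and DynamICCL's actual agent pairs detection like this with an LSTM predictor:

```python
def cusum_change_point(xs, baseline_mean, slack=0.5, threshold=5.0):
    """One-sided CUSUM: flag the first index where cumulative positive
    drift of xs above (baseline_mean + slack) exceeds threshold."""
    s = 0.0
    for t, x in enumerate(xs):
        s = max(0.0, s + (x - baseline_mean - slack))
        if s > threshold:
            return t  # change point detected at step t
    return None  # no change detected

# Regular DP-style AllReduce cadence (~10 ms between bursts), then a plan
# change shifts the pattern to ~14 ms: CUSUM flags the shift a few steps in.
intervals = [10.1, 9.8, 10.2, 9.9, 10.0, 14.2, 13.9, 14.1, 14.0, 13.8]
print(cusum_change_point(intervals, baseline_mean=10.0))  # prints 6
```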