Detailed Summary: nnScaler
Full title: nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training
Authors: Zhiqi Lin, Youshan Miao, Quanlu Zhang, Fan Yang, Yi Zhu, Cheng Li, Saeed Maleki, Xu Cao, Ning Shang, Yilei Yang, Weijiang Xu, Mao Yang, Rachee Singh, Janardhan Kulkarni, Lidong Zhou (Microsoft Research)
Year: 2024
Venue: OSDI '24
Code: https://github.com/microsoft/nnscaler
Abstract (paraphrased)
Deep learning training increasingly requires model parallelism, but existing frameworks support only a fixed menu of parallelization strategies. nnScaler introduces three orthogonal primitives — operator transformation, operator assignment, and operator ordering — plus a constraint language to prune the exponential search space. Given a model and a set of semantic constraints, nnScaler automatically generates a parallelization plan that correctly handles data dependencies (by inserting communication operators) and is provably deadlock-free. The system achieves up to 3.5× speedup over Megatron-LM on SwinTransformer, 2.2× on T5, and 1.8× over DeepSpeed on AlphaFold2. It is deployed in Microsoft production for Phi-3, LongRoPE, RetNet, and YOCO.
Motivation
Three trends drive the need for nnScaler:
Model diversity is outpacing frameworks. Megatron-LM handles GPT/BERT/T5. Alpa handles model-parallel plans for Transformer variants. DeepSpeed handles memory-constrained DP. None of them handle SwinTransformer's window-attention (non-batch spatial splits), AlphaFold2's triangular updates, or custom RNN/MoE hybrids without manual modification of framework internals.
Manual parallelization is error-prone and costly. A practitioner implementing a new model in Megatron-LM must manually reason about tensor shapes, AllReduce insertion points, and pipeline schedule construction. An incorrect insertion silently corrupts gradients.
Automatic methods are either too slow or too restrictive. Alpa uses an ILP over operator placement but does not support temporal ordering constraints. Its search space is exponential in the number of operators, making it impractical without aggressive pruning. Without semantic constraints, pruning is either unsound (it can admit incorrect plans) or overly aggressive (it prunes away optimal plans).
Background
Existing Parallelism Strategies
Data Parallelism (DP): All operators are replicated. A single AllReduce at the end of each backward pass synchronizes gradients. Communication volume ≈ 2Ψ per device per step (a ring AllReduce is a ReduceScatter plus an AllGather, each moving Ψ elements), where Ψ is the parameter count.
Tensor/Model Parallelism (MP): Operators are partitioned along non-batch dimensions. Communication appears at tensor boundary crossings (AllReduce after each column-then-row GEMM pair, as in Megatron-LM). Communication volume is proportional to activation size × number of operator boundaries.
Pipeline Parallelism (PP): Layers are partitioned into sequential stages across devices. Communication = point-to-point activation sends between adjacent stages. Standard 1F1B schedule minimizes bubbles.
These strategies are not mutually exclusive, but combining them correctly requires careful insertion of collectives at dependency boundaries.
Alpa (Prior Best Automatic Approach)
Alpa (OSDI 2022) divides the problem into inter-operator and intra-operator parallelism and uses a two-level ILP+DP search. It does not support operator ordering constraints, so it cannot express PipeDream-style overlapping schedules or plans where two operators must share a device for performance (Coshard). Its search is also unconstrained — no user-provided semantic knowledge — making it 11.7× slower than nnScaler for the same quality.
System Design
Three Primitives
op-trans(op, algo, n):
Partitions operator `op` using transformation algorithm `algo`
across n sub-operators, each operating on 1/n of the data.
Result: n sub-operators on n vTensors.
Examples:
algo = "by_batch": split input tensor on batch dimension
algo = "by_column": split weight matrix on column dimension
algo = "by_row": split weight matrix on row dimension
algo = "replicate": no split; op is replicated
op-assign(op, d):
Places sub-operator `op` on physical device d ∈ {0,...,D-1}.
Two sub-operators may share a device (co-location).
op-order(op1, op2):
Enforces that op1 executes before op2 on their respective devices.
Used to express pipeline schedules and avoid deadlocks.
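To make the primitives concrete, here is a minimal runnable sketch of how a plan could be expressed programmatically. The `Plan`/`SubOp` classes and method names are illustrative assumptions, not nnScaler's actual API (see the repository for the real interface).

```python
from dataclasses import dataclass, field

# Illustrative stand-ins for nnScaler's three primitives; all names are
# hypothetical, not the actual nnScaler API.
@dataclass
class SubOp:
    name: str
    device: int | None = None

@dataclass
class Plan:
    sub_ops: dict[str, list[SubOp]] = field(default_factory=dict)
    orders: list[tuple[str, str]] = field(default_factory=list)

    def op_trans(self, op: str, algo: str, n: int) -> list[SubOp]:
        # Partition `op` into n sub-operators (replication if algo == "replicate").
        subs = [SubOp(f"{op}/{algo}/{i}") for i in range(n)]
        self.sub_ops[op] = subs
        return subs

    def op_assign(self, sub: SubOp, device: int) -> None:
        # Place one sub-operator on a physical device; co-location is allowed.
        sub.device = device

    def op_order(self, first: SubOp, second: SubOp) -> None:
        # Require `first` to execute before `second`.
        self.orders.append((first.name, second.name))

# A 2-way Megatron-style column split of one GEMM, shard i on device i:
plan = Plan()
for i, sub in enumerate(plan.op_trans("mlp_gemm1", "by_column", 2)):
    plan.op_assign(sub, device=i)
```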
vTensor-pTensor Abstraction
vTensor: logical tensor (full shape, e.g., [batch, seq, hidden])
pTensor: physical shard (partition of a vTensor on one device)
Data dependency rule:
If op B consumes vTensor T, and op A produces vTensor T:
→ A must complete before B starts (data dependency edge in DAG)
→ If A and B are on different devices with different partitioning:
→ Communication operator (AllReduce, AllGather, ReduceScatter,
AllToAll, or Send/Recv) is automatically inserted between them.
The communication operator is determined by the transformation mismatch:
| A partitions T as | B requires T as | Communication inserted |
|---|---|---|
| split(batch, n) | full(T) | AllGather |
| full(T) | split(batch, n) | Split (local slice; no reduction needed) |
| split(batch, n) | split(batch, n), same partition | None |
| split(batch, n) | split(batch, m), n ≠ m | AllToAll |
| partial_sum(T) | full(T) | AllReduce |
| partial_sum(T) | split(batch, n) | ReduceScatter |
Constraint Language
Users provide constraint tables that restrict the search space. A constraint is a logical assertion over op-trans, op-assign, and op-order choices:
Example — Data Parallelism constraint (Table 2):
For all operators op_i:
op-trans(op_i, algo_i, n) such that algo_i = "by_batch"
(all operators split on batch dimension, same n)
Example — Pipeline Parallelism constraint (Table 3):
Partition operators into S sequential stages.
All operators in stage s → op-assign(op, devices_s).
For ops at stage boundary: op-order(last_in_stage_s, first_in_stage_{s+1}).
Example — 1F1B schedule constraint (Table 4):
For microbatch m and stage s:
op-order(forward_{m,s}, forward_{m,s+1}) [forward in order]
op-order(backward_{m,s+1}, backward_{m,s}) [backward in reverse]
For m < m':
op-order(forward_{m,s}, forward_{m',s}) [in-order within stage]
Constraints are expressed in a Python DSL and compiled to Z3 SMT formulas. The solver outputs a valid assignment or UNSAT (infeasible constraint set).
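As a hedged sketch (the actual nnScaler DSL differs in syntax and detail), a constraint table can be read as a generator of assertions over the three primitives; the function and operator names below are assumptions for illustration:

```python
# Hypothetical constraint generators; nnScaler's real DSL differs in detail.
def data_parallel_constraints(ops, n):
    # Table 2: every operator is split by batch with the same degree n.
    for op in ops:
        yield ("op-trans", op, "by_batch", n)

def one_f_one_b_order(num_microbatches, num_stages):
    # Table 4: forwards flow downstream, backwards flow upstream,
    # and microbatches stay in order within each stage.
    for m in range(num_microbatches):
        for s in range(num_stages - 1):
            yield ("op-order", f"fwd_{m}_{s}", f"fwd_{m}_{s+1}")
            yield ("op-order", f"bwd_{m}_{s+1}", f"bwd_{m}_{s}")
    for m in range(num_microbatches - 1):
        for s in range(num_stages):
            yield ("op-order", f"fwd_{m}_{s}", f"fwd_{m+1}_{s}")

constraints = list(one_f_one_b_order(num_microbatches=4, num_stages=2))
```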
Plan Search Architecture
Model (PyTorch graph)
│
┌──────▼──────┐
│ Tracer │
│ (op DAG + │
│ vTensors) │
└──────┬──────┘
│
Constraint Tables (user-provided)
│
┌──────▼───────────┐
│ op-trans search │
│ (prune by │
│ constraints) │
└──────┬───────────┘
│
┌──────▼───────────┐
│ ILP Solver │
│ op-assign │
│ (device place- │
│ ment) │
└──────┬───────────┘
│
┌──────▼───────────┐
│ Tessel │
│ op-order │
│ (temporal │
│ ordering) │
└──────┬───────────┘
│
┌──────▼───────────┐
│ Communication │
│ insertion │
│ (AllReduce, │
│ AllGather, etc.)│
└──────┬───────────┘
│
Distributed Training Plan
(Python + NCCL calls)
Novel Parallelization Plans
Coshard (SwinTransformer, Table 5):
SwinTransformer uses window-attention: the attention mechanism operates on spatial windows, not the full sequence. The window partition operator and the attention operator share a non-batch spatial split. Standard Megatron-LM places them on separate devices and inserts an AllReduce between them. Coshard assigns both operators to the same device, eliminating the communication entirely.
Standard:
op_window_partition (device 0) → [AllReduce] → op_attention (device 1)
Coshard:
op_window_partition (device 0) → op_attention (device 0) [no comm]
op_window_partition (device 1) → op_attention (device 1) [no comm]
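In primitive terms, Coshard is purely an op-assign choice: producer and consumer shards of the spatial split land on the same device, so no dependency crosses a device boundary and the insertion pass emits no collective. A sketch using the hypothetical Plan API from the primitives section (the "by_spatial" algorithm name is also an assumption):

```python
# Coshard via the hypothetical Plan API sketched earlier; "by_spatial" is assumed.
plan = Plan()
windows = plan.op_trans("window_partition", "by_spatial", 2)
attns = plan.op_trans("window_attention", "by_spatial", 2)
for dev, (w, a) in enumerate(zip(windows, attns)):
    plan.op_assign(w, dev)  # producer shard on device `dev`...
    plan.op_assign(a, dev)  # ...consumer shard co-located: no communication inserted
```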
Interlaced Pipeline (T5, Table 6):
T5 has a large embedding table that does not fit cleanly into a linear pipeline without causing large pipeline bubbles. The interlaced plan:
- Replicates the embedding table across all pipeline stages (small memory, frequently accessed).
- Assigns the remaining (non-embedding) layers to a standard 1F1B pipeline.
- Result: all stages are active simultaneously; the embedding lookup never blocks a stage.
3F1B (AlphaFold2, Table 7):
AlphaFold2's recycling mechanism processes the same structure representation three times (three forward passes) before computing gradients (one backward pass). Standard 1F1B wastes GPU cycles during the three-forward phase. 3F1B schedules three forward passes and one backward pass per microbatch, matching the model's computation graph structure and eliminating the forward-phase idle time.
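A hedged sketch of the 3F1B ordering constraints, extending the 1F1B generator above: each microbatch contributes three forward traversals of the pipeline before its single backward. All operator names are illustrative.

```python
def three_f_one_b_order(num_microbatches, num_stages, recycles=3):
    # 3F1B: `recycles` forward passes per microbatch, then one backward.
    for m in range(num_microbatches):
        for r in range(recycles):
            for s in range(num_stages - 1):
                yield ("op-order", f"fwd{r}_{m}_{s}", f"fwd{r}_{m}_{s+1}")
            if r + 1 < recycles:
                # each recycling pass finishes before the next one starts
                yield ("op-order", f"fwd{r}_{m}_{num_stages-1}", f"fwd{r+1}_{m}_0")
        # backward begins only after the final forward pass completes
        yield ("op-order", f"fwd{recycles-1}_{m}_{num_stages-1}", f"bwd_{m}_{num_stages-1}")
        for s in range(num_stages - 1):
            yield ("op-order", f"bwd_{m}_{s+1}", f"bwd_{m}_{s}")
```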
Key Algorithm
Constraint-Guided Search Speedup
Without constraints: for N operators, K transformation algorithms, and D devices, there are K^N transformation choices × D^N placements, further multiplied by the number of valid temporal orderings (up to factorial in the number of operators per device).
With constraints (example: data parallelism): the search space collapses to a single point (all operators use by_batch with the same degree). For mixed plans, the constraint table eliminates every assignment that would violate a semantic invariant, pruning the ILP variable domains before solving.
Measured speedup vs. Alpa (unconstrained): 11.7× average across the benchmarks.
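As a back-of-the-envelope check of the transformation and placement terms alone (ordering excluded), even a toy graph is intractable without pruning:

```python
# K^N x D^N for a toy graph: N = 8 operators, K = 4 algorithms, D = 8 devices.
N, K, D = 8, 4, 8
unconstrained = (K ** N) * (D ** N)  # 65,536 * 16,777,216 = 2**40 ~ 1.1e12
dp_constrained = 1                   # "all ops by_batch, same n" pins every choice
print(f"{unconstrained:.2e} combinations -> {dp_constrained} under the DP constraint")
```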
Communication Insertion (Formal Rule)
For every pair (producer op A, consumer op B) connected by vTensor T:
- Determine A's output partition type P_A and B's required input partition type P_B.
- If P_A == P_B and same device: no communication.
- If P_A == P_B and different devices: insert point-to-point Send/Recv.
- If P_A != P_B: insert collective (type from table above).
The inserted collectives are standard NCCL calls, so the output plan is directly executable without framework modification.
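The dispatch rule reads naturally as a lookup table. A minimal sketch, assuming simple string partition descriptors (nnScaler's internal representation differs):

```python
def comm_for(p_a: str, dev_a: int, p_b: str, dev_b: int) -> str | None:
    # Pick the communication op for a producer/consumer pair on vTensor T.
    if p_a == p_b:
        return None if dev_a == dev_b else "SendRecv"
    table = {
        ("split_batch", "full"): "AllGather",
        ("full", "split_batch"): "Split",                # local slice; no reduction
        ("split_batch_n", "split_batch_m"): "AllToAll",  # re-partition, n != m
        ("partial_sum", "full"): "AllReduce",
        ("partial_sum", "split_batch"): "ReduceScatter",
    }
    return table[(p_a, p_b)]

assert comm_for("partial_sum", 0, "full", 1) == "AllReduce"
assert comm_for("split_batch", 0, "split_batch", 0) is None
```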
Deadlock Prevention
op-order constraints form a DAG over operators. Deadlock occurs if a cycle exists in the dependency-plus-ordering graph. Tessel performs a topological sort and rejects any op-order addition that would create a cycle. The final schedule is a valid linear order of all operator executions across all devices.
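A minimal version of the deadlock check (the paper does not describe Tessel at this level; this is just the standard cycle test): accept a new op-order edge only if its target cannot already reach its source.

```python
from collections import defaultdict

def creates_cycle(edges: list[tuple[str, str]], new_edge: tuple[str, str]) -> bool:
    # Adding (u, v) closes a cycle iff v already reaches u in the graph.
    graph = defaultdict(list)
    for a, b in edges:
        graph[a].append(b)
    u, v = new_edge
    stack, seen = [v], set()
    while stack:  # DFS from v, looking for u
        node = stack.pop()
        if node == u:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return False

edges = [("A", "B"), ("B", "C")]
assert creates_cycle(edges, ("C", "A"))      # rejected: would deadlock
assert not creates_cycle(edges, ("A", "C"))  # accepted: a topo order still exists
```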
Evaluation Methodology
Hardware:
- DGX-2H node: 16 V100 32 GB GPUs, NVLink (600 GB/s intra-node), InfiniBand 100 Gbps inter-node.
- 2 × DGX-2H = 32 V100 GPUs for multi-node experiments.
- DGX-1: 8 V100 16 GB GPUs (for AlphaFold2 experiments).
Baselines:
- Megatron-LM v2.0 (Tensor Parallelism for Transformer variants)
- Alpa v0.2 (automatic search, unconstrained)
- DeepSpeed v0.7 (ZeRO + pipeline parallelism for AlphaFold2)
Models:
- SwinTransformer-H (window attention, 66M parameters, vision)
- T5-3B (encoder-decoder, 3B parameters, NLP)
- AlphaFold2 (triangular self-attention, protein structure prediction)
Metrics: training throughput (samples/second or tokens/second), memory footprint, and plan search time.
NCCL version: 2.14
Results
Throughput
| Model | Hardware | nnScaler plan | Baseline | Speedup |
|---|---|---|---|---|
| SwinTransformer-H | 32 V100, DGX-2 | Coshard plan | Megatron-LM | 3.5× |
| T5-3B | 32 V100, DGX-2 | Interlaced pipeline | Megatron-LM | 2.2× |
| AlphaFold2 | DGX-1 | 3F1B plan | DeepSpeed | 1.8× |
Plan Search Time
| Method | Search Time | Plan Quality |
|---|---|---|
| Alpa (unconstrained) | 47 minutes | same plan quality |
| nnScaler (constrained) | 4 minutes | same plan quality |
| Speedup | 11.7× | — |
HuggingFace Coverage
- 84.1% of tested HuggingFace models are automatically parallelizable with default constraint tables (no user annotation required).
- Remaining 15.9% require user-provided constraint tables for the non-standard operators.
Production Deployment
| Model | Application |
|---|---|
| Phi-3 | Small language model (Microsoft) |
| LongRoPE | Long-context extension |
| RetNet | Retention-based LM |
| YOCO | You Only Cache Once |
Limitations
No async pipeline parallelism: nnScaler does not support weight stashing (PipeDream) or other asynchronous schedules. All plans are synchronous. This is a design choice for correctness guarantees but limits throughput for bandwidth-constrained settings where communication-compute overlap is critical.
No TeraPipe support: TeraPipe's token-level pipeline masks require dynamic tensor shapes that nnScaler cannot express statically in its op-trans framework.
ILP scaling: For models with hundreds of distinct operators (e.g., a 96-layer Transformer with heterogeneous blocks), the ILP placement problem grows super-linearly. The paper does not report ILP solve time for models >3B parameters.
Constraint authoring burden: While 84.1% of HuggingFace models work out of the box, novel architectures require manually authored constraint tables. Correctness of these tables is user responsibility; an incorrect constraint silently produces a semantically wrong plan.
Bandwidth-constrained evaluation gap: All experiments use DGX-2 hardware with NVLink (600 GB/s) and InfiniBand (100 Gbps). Performance on Ethernet clusters (e.g., Chameleon Cloud 1 GbE) is not evaluated. The Coshard plan's benefit (eliminating AllReduce) would be even larger on slow networks, but the interlaced pipeline's point-to-point embedding sends might become a bottleneck.
Static plan: Plans are generated once before training and do not adapt to hardware state changes (e.g., network congestion, GPU slowdowns). This is the same limitation as PipeDream's static partitioner.
Related Work Highlights
- Megatron-LM (Shoeybi et al., 2019): Intra-layer tensor parallelism for GPT/BERT. nnScaler subsumes and generalizes Megatron's MLP/attention split as a special case of op-trans with by_column/by_row constraints.
- Alpa (Zheng et al., 2022): Two-level ILP+DP search for intra- and inter-operator parallelism. nnScaler adds op-order (temporal constraints) and the constraint language for pruning. 11.7× faster search.
- PipeDream (Narayanan et al., 2019): 1F1B async pipeline. nnScaler's 1F1B is synchronous only; PipeDream's weight stashing is out of scope.
- ZeRO (Rajbhandari et al., 2020): Parameter partitioning across DP ranks. ZeRO is complementary — nnScaler handles plan generation; ZeRO handles memory optimization within a plan.
- FlexFlow (Jia et al., 2019): Operator-parallel search via MCMC simulation. nnScaler differs by using symbolic constraint satisfaction rather than simulation.
- GShard / Switch Transformers: MoE routing introduces AllToAll collectives. nnScaler handles AllToAll insertion via the pTensor mismatch detection framework.
RL Formulation Table
This paper contains no reinforcement learning. Not applicable.
The plan search uses ILP (integer linear programming) for placement and heuristic topological sort (Tessel) for temporal ordering. These are deterministic optimization methods, not RL.
Relevance to DynamICCL
nnScaler makes the collective communication pattern of a training job a function of the parallelization plan, not just the model architecture. This has four direct implications for DynamICCL.
1. Plan-dependent collective types.
Data Parallel plan: AllReduce (end of backward)
Megatron-LM plan: AllReduce (after each GEMM pair, intra-node)
ZeRO P_os+g plan: ReduceScatter + AllGather (bucketed, overlapped)
Coshard plan: No AllReduce for co-located ops; AllReduce elsewhere
Interlaced pipeline: Send/Recv (inter-stage) + AllGather (embedding)
3F1B plan: Send/Recv (inter-stage, 3 forward + 1 backward)
MoE plan: AllToAll (dispatch + combine) + AllReduce (DP gradients)
DynamICCL's Config Agent must identify which collective types are active in the current training step and select different NCCL parameters for each. nnScaler's automatic insertion framework means the collective types are deterministic given the plan — they can be enumerated statically at plan-generation time and passed to DynamICCL as prior knowledge.
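A hedged sketch of that hand-off, with all names (including the Config Agent interface) assumed rather than taken from any existing DynamICCL or nnScaler API: walk the emitted plan once at load time and produce a census of collective call sites as prior knowledge.

```python
from collections import Counter

def enumerate_collectives(plan_comm_ops: list[str]) -> Counter:
    # Static census of collective types in a generated plan (format assumed).
    return Counter(plan_comm_ops)

# E.g., an interlaced-pipeline plan, flattened to its communication ops:
plan_comm_ops = ["SendRecv"] * 12 + ["AllGather"] * 2 + ["AllReduce"]
priors = enumerate_collectives(plan_comm_ops)
for kind, count in priors.items():
    # The Config Agent could pre-select NCCL algo/protocol/channel settings per kind.
    print(f"{kind}: {count} call sites")
```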
2. Communication pattern irregularity.
The 3F1B and interlaced pipeline plans produce irregular temporal patterns: three forward-pass Send/Recv bursts followed by one backward-pass Send/Recv burst, with no AllReduce until the end. DynamICCL's LSTM+CUSUM Trigger Agent, trained primarily on periodic DP AllReduce patterns, must be tested and potentially retrained on the irregular burst patterns generated by these plans.
3. Static plan + dynamic NCCL configuration = complementary optimization levels.
nnScaler optimizes at the plan level (which collectives appear) and is static. DynamICCL optimizes at the NCCL configuration level (how each collective is executed) and is dynamic. They operate at orthogonal abstraction levels. For maximum throughput, nnScaler could be used to select the plan, and DynamICCL could then dynamically tune the NCCL parameters for the collectives that plan generates. This is a natural integration point for future work.
4. Practical relevance for Chameleon Cloud.
nnScaler's Coshard plan eliminates inter-device AllReduces for co-located operators entirely. On Chameleon Cloud 1 GbE, where AllReduce is the dominant bottleneck, the Coshard plan (combined with DynamICCL's tuning of the remaining AllReduces) would have a compounding benefit. The paper's DGX-2 evaluation likely underestimates Coshard's speedup on bandwidth-constrained hardware.