Brief Summary: Switch Transformers

Full title: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Authors: William Fedus, Barret Zoph, Noam Shazeer (Google)
Year: 2022 (JMLR 23, submitted 8/21, published 4/22; arXiv version 2021)
Venue: Journal of Machine Learning Research 23 (2022)


Problem

Dense transformer models apply every parameter to every input token, so FLOPs per token grow in lockstep with parameter count. Mixture-of-Experts (MoE) models can increase parameter count without proportionally increasing FLOPs per token, but prior MoE implementations suffered from training instability, high communication cost (all-to-all dispatch across devices), and complexity (top-k routing with k ≥ 2). Scaling to trillions of parameters remained impractical.

Core Insight

Routing each token to exactly one expert (k=1, "Switch Routing") rather than top-k experts simplifies routing computation, reduces communication cost (expert capacity can be halved), and empirically matches or exceeds the quality of top-2 routing. The parameter count (and thus model quality via scaling laws) can be increased by adding more experts, while keeping the FLOPs per token constant — a fourth scaling axis orthogonal to model depth, width, and training compute.
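
A minimal PyTorch sketch of the k=1 routing step; the class name SwitchLayer, the FFN shape, and the capacity handling are illustrative assumptions, not the paper's Mesh-TensorFlow implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchLayer(nn.Module):
    """Illustrative k=1 switch-routing layer (names are hypothetical)."""

    def __init__(self, d_model: int, num_experts: int, capacity_factor: float = 1.25):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)  # W_r
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])
        self.capacity_factor = capacity_factor

    def forward(self, x):  # x: [num_tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)  # p_i(x) over experts
        gate, expert_idx = probs.max(dim=-1)       # k=1: one expert per token
        capacity = int(self.capacity_factor * x.size(0) / len(self.experts))
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = (expert_idx == i).nonzero(as_tuple=True)[0][:capacity]
            if sel.numel():
                # Scaling by the gate probability keeps the router differentiable.
                out[sel] = gate[sel, None] * expert(x[sel])
        # Over-capacity tokens keep out == 0 and pass through the residual.
        return x + out
```

Compared with top-2 routing, only one expert runs per token, so the per-expert buffer (and hence the all-to-all payload) can be roughly halved at the same capacity factor.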

Method

Each Transformer block's dense FFN is replaced by a sparse Switch layer. A router with weights W_r computes p(x) = softmax(W_r x) over the N experts; the token is dispatched only to the argmax expert, and that expert's output is scaled by its router probability so the router still receives a gradient. Each expert has a fixed capacity, (tokens per batch / N) × capacity factor; tokens that overflow an expert's capacity skip the expert computation and pass through the residual connection. A differentiable auxiliary loss α · N · Σ_i f_i · P_i (with f_i the fraction of tokens dispatched to expert i, P_i the fraction of router probability assigned to it, and α = 0.01) encourages a uniform load. Training stability techniques include computing the router in float32 while keeping the rest of the model in bfloat16, a 10x smaller weight initialization, and a higher dropout rate inside the experts ("expert dropout") during fine-tuning.
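
A short sketch of this auxiliary loss, assuming router_probs holds the softmax outputs for a flat batch of tokens and expert_idx the argmax assignments (both names are illustrative):

```python
import torch

def load_balance_loss(router_probs: torch.Tensor,  # [num_tokens, num_experts]
                      expert_idx: torch.Tensor,    # [num_tokens], argmax assignments
                      alpha: float = 0.01) -> torch.Tensor:
    """alpha * N * sum_i f_i * P_i, per the Method sketch above."""
    num_experts = router_probs.size(-1)
    # f_i: fraction of tokens hard-routed to expert i (non-differentiable).
    f = torch.bincount(expert_idx, minlength=num_experts).float() / expert_idx.numel()
    # P_i: mean router probability for expert i (gradient flows through here).
    P = router_probs.mean(dim=0)
    return alpha * num_experts * torch.dot(f, P)
```

The loss reaches its minimum value α when both f and P are uniform at 1/N, so minimizing it pushes the router toward an even token spread.
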
Key Results

At fixed FLOPs per token, Switch-Base reaches the same pre-training quality as T5-Base up to 7x faster, and the 1.6-trillion-parameter Switch-C (2048 experts) pre-trains 4x faster than T5-XXL. The gains appear even with as few as 2 experts, carry over to multilingual pre-training (improvements across all 101 languages of mC4), and translate into downstream fine-tuning wins on SuperGLUE and knowledge-heavy tasks. Sparse models can also be distilled into dense ones while preserving roughly 30% of the quality gain.
Limitations

Training the largest, most FLOP-dense models remains unstable, and the stability techniques only partly mitigate this. Pre-training perplexity gains do not always transfer fully to fine-tuned downstream quality at the largest scales. Sparse models are memory-hungry (all expert parameters must be stored and sharded even though each token touches few of them), low capacity factors drop tokens and waste their computation, and the all-to-all dispatch adds communication overhead that grows with device count.
Relevance to DynamICCL

Switch Transformers introduce a new collective communication type — all-to-all — into the training loop, which DynamICCL must handle alongside AllReduce/ReduceScatter/AllGather.
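
For concreteness, a minimal sketch of the dispatch/combine pattern that produces those all-to-alls, assuming one expert per rank and equal-sized, capacity-padded buffers; this illustrates the pattern only, not the paper's Mesh-TensorFlow implementation:

```python
import torch
import torch.distributed as dist

def switch_dispatch_combine(send_bufs, expert):
    """send_bufs: one [capacity, d_model] tensor per rank, holding the local
    tokens routed to that rank's expert. Two all-to-alls bracket the FFN."""
    recv_bufs = [torch.empty_like(t) for t in send_bufs]
    dist.all_to_all(recv_bufs, send_bufs)      # dispatch: tokens -> owning expert
    outputs = [expert(t) for t in recv_bufs]   # each rank runs only its own expert
    return_bufs = [torch.empty_like(t) for t in outputs]
    dist.all_to_all(return_bufs, outputs)      # combine: results -> source ranks
    return return_bufs
```

Note that the buffers are padded to expert capacity, so message sizes stay fixed even though the token-to-expert assignment (and thus the traffic's contents) is data-dependent.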