1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed

Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He | Microsoft / University of Rochester / ETH Zurich | ICML 2021 (PMLR Vol. 139)

Problem

Large-model pre-training (BERT, GPT) is dominated by inter-GPU allreduce of full-precision gradients. On a 64-GPU cluster connected by 40 GbE, profiling of BERT-Large shows allreduce consuming up to 94% of each iteration; even on 64 GPUs with 100 Gb InfiniBand it consumes 75%.

Error-compensated 1-bit compression already exists for SGD-class optimizers, whose updates are linear in the gradient. Adam is the de-facto optimizer for transformers, since vanilla SGD does not converge on these models, but it resists naive compression because its variance term v_t = beta_2 * v_{t-1} + (1 - beta_2) * g_t^2 is non-linear in the gradient. The square introduces residual error terms (delta_{t-1} - delta_t)^2 that do not telescope, and the coordinate-wise, time-varying learning rate eta / sqrt(v_t + eps) admits no clean global correction factor. As a result, direct error-compensated 1-bit Adam diverges on BERT (paper's Figure 1).

Core Insight

During BERT-Large pre-training the L1 norm of Adam's variance term v_t becomes approximately stable after roughly 23k steps (paper's Figure 2); once frozen at v_{T_w}, Adam reduces to preconditioned momentum SGD — which is linear in the gradient and therefore amenable to error-compensated 1-bit compression of the momentum term itself.
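The linearity argument can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: the function names, the default hyperparameters, and the placement of eps outside the square root are assumptions made for illustration. It shows that the frozen-variance update is additive in (momentum, gradient), while the variance recursion is not.

```python
import numpy as np

def frozen_adam_update(m_prev, g, v_frozen, lr=1e-3, beta1=0.9, eps=1e-8):
    """Parameter update when the variance is frozen (hypothetical helper)."""
    m = beta1 * m_prev + (1.0 - beta1) * g          # linear in m_prev and g
    return -lr * m / (np.sqrt(v_frozen) + eps)      # fixed per-coordinate rescaling

def adam_variance(v_prev, g, beta2=0.999):
    """Adam's variance recursion; the g**2 term makes it non-linear in g."""
    return beta2 * v_prev + (1.0 - beta2) * g ** 2

rng = np.random.default_rng(0)
g1, g2 = rng.normal(size=5), rng.normal(size=5)
m1, m2 = rng.normal(size=5), rng.normal(size=5)
v_frozen = np.abs(rng.normal(size=5)) + 0.1

# Additivity holds for the frozen-variance update ...
lhs = frozen_adam_update(m1 + m2, g1 + g2, v_frozen)
rhs = frozen_adam_update(m1, g1, v_frozen) + frozen_adam_update(m2, g2, v_frozen)
print("frozen update additive:", np.allclose(lhs, rhs))

# ... but not for the variance term, because (g1 + g2)**2 != g1**2 + g2**2.
zero = np.zeros(5)
print("variance additive:", np.allclose(adam_variance(zero, g1 + g2),
                                        adam_variance(zero, g1) + adam_variance(zero, g2)))
```

Additivity is the property that lets a compression error carried from one step be cancelled in the next, which is why error compensation can be applied to the momentum once the variance no longer changes.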

Method

The 1-bit Adam algorithm (Algorithm 1 in the paper) has two phases:

Phase 1 — warmup (0..T_w-1): Run vanilla Adam unchanged so the variance can stabilize. Store v_{T_w}.

Phase 2 — compression (T_w..T):

On worker i:
  g_t^(i) = grad F_i(x_t^(i), zeta_t^(i))
  m_t^(i) = beta_1 * m_{t-1} + (1 - beta_1) * g_t^(i)
  hat_m_t^(i) = C_omega[m_t^(i) + delta_{t-1}^(i)]    # 1-bit compress
  delta_t^(i) = m_t^(i) + delta_{t-1}^(i) - hat_m_t^(i)
  send hat_m_t^(i) to server

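On server:
  bar_m_t = C_omega[(1/n) * sum_i hat_m_t^(i) + bar_delta_{t-1}]
  bar_delta_t = (1/n) * sum_i hat_m_t^(i) + bar_delta_{t-1} - bar_m_t    # server-side error, kept separate from the workers' delta^(i)
  broadcast bar_m_t

On worker i:
  m_t = bar_m_t    # shared momentum, used as m_{t-1} in the next step
  x_{t+1} = x_t - gamma * m_t / sqrt(v_{T_w})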
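The compression phase above can be sketched as a single-process NumPy simulation. This is an illustrative sketch under stated assumptions, not the DeepSpeed implementation: workers are rows of an array, one scale factor is used per whole tensor, zeros are quantized to +1, and eps is added outside the square root. The names `one_bit_compress` and `compression_phase_step` are hypothetical.

```python
import numpy as np

def one_bit_compress(x):
    """Sign quantization with a single scale ||x||_1 / ||sign(x)||_1."""
    signs = np.where(x >= 0, 1.0, -1.0)     # assumption: treat 0 as +1
    scale = np.abs(x).sum() / x.size        # ||sign(x)||_1 equals the element count
    return scale * signs

def compression_phase_step(x, m_prev, grads, worker_err, server_err, v_frozen,
                           lr=1e-3, beta1=0.9, eps=1e-8):
    """One phase-2 step for n simulated workers; grads and worker_err have shape (n, d)."""
    # Workers: local momentum, error-compensated compression, new local error.
    local_m = beta1 * m_prev + (1.0 - beta1) * grads
    corrected = local_m + worker_err
    m_hat = np.stack([one_bit_compress(row) for row in corrected])
    new_worker_err = corrected - m_hat

    # Server: average, add its own carried error, compress again.
    averaged = m_hat.mean(axis=0) + server_err
    m_bar = one_bit_compress(averaged)
    new_server_err = averaged - m_bar

    # Workers: adopt the broadcast momentum and apply the frozen preconditioner.
    x_new = x - lr * m_bar / (np.sqrt(v_frozen) + eps)
    return x_new, m_bar, new_worker_err, new_server_err

rng = np.random.default_rng(0)
n, d = 4, 8
x, m = np.zeros(d), np.zeros(d)
worker_err, server_err = np.zeros((n, d)), np.zeros(d)
v_frozen = np.ones(d)                       # stand-in for v_{T_w} from the warmup phase
for _ in range(3):
    grads = rng.normal(size=(n, d))         # stand-in for per-worker stochastic gradients
    x, m, worker_err, server_err = compression_phase_step(
        x, m, grads, worker_err, server_err, v_frozen)
print(x)
```

Returning the updated error buffers rather than mutating them keeps the step a pure function, which makes it easier to compare against an uncompressed reference run.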

Key design choices:

Compress momentum, not raw gradient — momentum has lower variance and quantizes more cleanly.
Sign-based 1-bit quantization with per-block scale factor ||m + delta||_1 / ||sign(m + delta)||_1 to preserve L1 magnitude.
Automatic choice of T_w: warmup ends once the stability ratio ||v_t||_1 / ||v_{t-Delta}||_1 reaches 0.96 or higher.
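The auto-detect rule in the last item can be sketched as a small helper that tracks the L1 norm of the variance over a sliding window. This is a hypothetical illustration: the class name `WarmupDetector` and the way `delta` is supplied are assumptions, and the summary does not state how Delta is chosen. The ratio is implemented exactly as written above.

```python
from collections import deque
import numpy as np

class WarmupDetector:
    """Signals the end of warmup when ||v_t||_1 / ||v_{t-delta}||_1 >= threshold."""

    def __init__(self, delta, threshold=0.96):
        self.threshold = threshold
        # Keep the L1 norms for steps t-delta .. t.
        self.history = deque(maxlen=delta + 1)

    def update(self, v):
        """Record the current variance tensor; return True once it looks stable."""
        self.history.append(float(np.abs(v).sum()))
        if len(self.history) < self.history.maxlen:
            return False                     # not enough history yet
        past, now = self.history[0], self.history[-1]
        return past > 0 and now / past >= self.threshold

# Toy trace of variance norms that shrink and then flatten out.
detector = WarmupDetector(delta=2)
for step, norm in enumerate([10.0, 8.0, 6.0, 5.9, 5.8]):
    if detector.update(np.array([norm])):
        print("variance stable at step", step)
        break
```

In a training loop the detector would be fed the optimizer's variance state once per step, and the step at which it first returns True would be used as T_w.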

The system implements compressed allreduce as a three-phase MPI collective (MPI_Alltoall → local average + re-quantize → MPI_Allgather) because NCCL < 2.7 exposed neither Alltoall nor send/recv, and its allreduce only supported sum/min/max on uncompressed tensors. CUDA-aware (MVAPICH2-GDR) and basic (CPU-staged) variants ship as a DeepSpeed optimizer plugin.
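The three-phase data flow can be illustrated without MPI by simulating the ranks in one process. This is a sketch under assumptions, not the paper's collective: lists stand in for MPI_Alltoall and MPI_Allgather, each rank acts as the "server" for one chunk, and the helper names are hypothetical.

```python
import numpy as np

def one_bit(x):
    """Sign quantization with a single L1-preserving scale (0 treated as +1)."""
    return (np.abs(x).sum() / x.size) * np.where(x >= 0, 1.0, -1.0)

def compressed_allreduce(tensors, worker_err, server_err):
    """Simulated compressed allreduce over n ranks.

    tensors, worker_err: arrays of shape (n, d); server_err: list of n chunk-sized arrays.
    Both error buffers are updated in place. Returns the tensor every rank ends up with.
    """
    n = len(tensors)
    # Local step: error-compensated 1-bit compression, then split into n chunks.
    chunks = []
    for i in range(n):
        corrected = tensors[i] + worker_err[i]
        q = one_bit(corrected)
        worker_err[i] = corrected - q
        chunks.append(np.array_split(q, n))

    # Phase 1 (stands in for MPI_Alltoall): rank j receives chunk j from every rank.
    received = [[chunks[i][j] for i in range(n)] for j in range(n)]

    # Phase 2: each rank averages its chunk, adds its carried error, re-quantizes.
    reduced = []
    for j in range(n):
        avg = np.mean(received[j], axis=0) + server_err[j]
        q = one_bit(avg)
        server_err[j] = avg - q
        reduced.append(q)

    # Phase 3 (stands in for MPI_Allgather): every rank gets all re-quantized chunks.
    return np.concatenate(reduced)

n, d = 4, 16
rng = np.random.default_rng(0)
tensors = rng.normal(size=(n, d))
worker_err = np.zeros((n, d))
server_err = [np.zeros(d // n) for _ in range(n)]
print(compressed_allreduce(tensors, worker_err, server_err))
```

Splitting the reduction across ranks means each rank only averages and re-quantizes 1/n of the tensor, so no single rank has to receive every worker's full compressed momentum.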

Convergence is proven to match distributed SGD's O(1/sqrt(nT)) linear-speedup rate under standard assumptions (Lipschitz gradient, bounded variance, bounded compression error).

Experimental Setup

Component	Value
Ethernet cluster	4 V100 GPUs/node, 40 GbE TCP (~4.1 Gbps effective)
InfiniBand cluster	8 V100 GPUs/node, 100 Gbps EDR IB
Max GPUs tested	256
Models	BERT-Base (110M), BERT-Large (340M), ResNet-18 (CIFAR-10), ResNet-152 (ImageNet), DCGAN
Fine-tuning	SQuAD 1.1, GLUE
Frameworks	PyTorch + DeepSpeed
MPI	MVAPICH2-GDR (CUDA-aware) / generic MPI (basic)
BERT pre-training peak LR	4e-4
BERT LR schedule	linear warmup over 12.5k steps, then 0.99 decay every 520 steps
BERT total batch size	4096
Adam beta_1, beta_2	0.9, 0.999
Warmup T_w (BERT-Large seq128)	23k of 152k steps
Warmup T_w (BERT-Large seq512)	1.5k of 10k steps
SQuAD batch / LR / warmup	96 / 3e-5 / 400 of 1848 steps
ResNet-18 batch / LR / warmup	1024 / 1e-4 / 13 of 200 epochs

Headline Quantitative Results

Communication-share profiling (Table 1, BERT-Large seq128):

Ethernet 64 GPUs, BS=64: allreduce = 94% of step (2205.86 ms allreduce vs. 36.65 ms forward).
InfiniBand 64 GPUs, BS=64: allreduce = 75% of step.
InfiniBand 8 GPUs (single node): allreduce = 16% — local NVLink/PCIe is bandwidth-rich.

Throughput / training-time:

BERT-Large seq128, 64-GPU Ethernet: up to 3.3× higher throughput.
BERT-Large total wall-clock: 174.3 h baseline → 51.5 h 1-bit Adam = 3.4× end-to-end.
SQuAD fine-tuning: up to 2.9× higher throughput.
Compression-stage-only speedup: 5.48× (BERT-Large) and 6.17× (SQuAD).
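As a rough sanity check on how these figures relate, an Amdahl-style estimate combines the compression-stage speedup with the uncompressed warmup. The sketch below is a back-of-envelope calculation, not the paper's analysis: it assumes the warmup's share of baseline time equals its share of steps (23k of 152k in the seq128 phase) and ignores the seq512 phase. The function name is hypothetical.

```python
def overall_speedup(warmup_fraction, stage_speedup):
    """Warmup runs at baseline speed; the remaining steps run stage_speedup times faster."""
    return 1.0 / (warmup_fraction + (1.0 - warmup_fraction) / stage_speedup)

reported = 174.3 / 51.5                       # wall-clock ratio quoted above
estimate = overall_speedup(23 / 152, 5.48)    # assumed warmup share, reported stage speedup
print(f"reported {reported:.2f}x, estimated {estimate:.2f}x")
```

Under these assumptions the estimate comes out slightly below the reported 3.4×, which is consistent with the warmup being the main thing that separates the stage-only speedup from the end-to-end one.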

Communication-volume reduction:

FP32 baseline → 1-bit: 97% reduction (32× compression).
FP16 baseline → 1-bit: 94% reduction (16× compression).
End-to-end (warmup included) ≈ 5× reduction.
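The percentages above follow from simple bit-width arithmetic. The sketch below reproduces them, assuming one bit per element after compression and ignoring the small cost of transmitting scale factors; the end-to-end figure additionally assumes an FP16 baseline and the seq128 warmup of 23k out of 152k steps from the setup table. The function names are hypothetical.

```python
def volume_reduction(baseline_bits, compressed_bits=1):
    """Fraction of communication volume removed during the compression phase."""
    return 1.0 - compressed_bits / baseline_bits

def end_to_end_factor(warmup_steps, total_steps, baseline_bits, compressed_bits=1):
    """Overall volume reduction factor when warmup steps are sent uncompressed."""
    warm = warmup_steps / total_steps
    relative_volume = warm + (1.0 - warm) * compressed_bits / baseline_bits
    return 1.0 / relative_volume

print(f"FP32 -> 1-bit: {volume_reduction(32):.1%}")
print(f"FP16 -> 1-bit: {volume_reduction(16):.1%}")
print(f"end-to-end, FP16 baseline: {end_to_end_factor(23_000, 152_000, 16):.2f}x")
```

The end-to-end factor is dominated by the warmup fraction: even with infinitely aggressive compression, sending 23k of 152k steps uncompressed caps the reduction at about 6.6×.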

Cross-network parity: 1-bit Adam on 40 GbE matches vanilla FP16 Adam on ~100 Gb InfiniBand in throughput.

Convergence parity (Table 2): total step counts identical to baseline: BERT-Base 118K seq128 + 22K seq512; BERT-Large 152K seq128 + 10K seq512.

GLUE (Table 3, median over 10 seeds): within-noise on every task. E.g., BERT-Large MNLI-(m/mm): baseline uncompressed 86.1/86.2 vs. 1-bit Adam 85.7/85.4. SQuAD F1: baseline 93.33 vs. 1-bit 93.32.

Naive 1-bit Adam baseline: diverges on BERT (Figure 1) — the freezing of variance is the missing ingredient.

Limitations

Warmup phase runs uncompressed Adam, so for short jobs the warmup fraction can be around 20% (e.g. DCGAN; SQuAD fine-tuning uses 400 of 1848 steps, about 22%).
Variance stability is empirically observed for the studied models; no sufficient theoretical condition is given, only the 0.96 auto-detect heuristic.
Custom MPI compressed-allreduce path required because NCCL < 2.7 lacked Alltoall and send/recv; cannot run on stock NCCL allreduce alone.
Evaluation is data-parallel only; ZeRO / pipeline / tensor parallelism not measured.
Scaling capped at 256 GPUs; thousand-GPU regime is open.
Auto-detect threshold validated for BERT only; not stress-tested across AdamW, LAMB, etc.

Open Problems

A theoretical sufficient condition for variance stability — would let practitioners predict 1-bit Adam viability without running warmup.
Generalizing the recipe (freeze the non-linear state once stable, compress the linear state) to LAMB, Adafactor, Lion.
Eliminating the warmup phase by warm-starting v_{T_w} from a prior run.
Compressed-allreduce scaling beyond 256 GPUs — whether MPI Alltoall + Allgather remains competitive with ring/tree at thousand-GPU scale.
Integration with model-, pipeline-, and ZeRO-style sharded optimizer states where Adam's state itself is partitioned.

Note on NCCL Tuning

The paper documents that NCCL versions prior to 2.7 supported only sum/min/max allreduce on uncompressed tensors and exposed neither Alltoall nor send/recv (Section 6.1), forcing the authors to build their compressed allreduce on MPI rather than NCCL — a concrete capability gap that any NCCL-tuner work should be aware of when targeting compressed-collective workloads. The Table 1 finding that allreduce consumes 94% of BERT-Large iteration time on 64-GPU Ethernet versus 75% on InfiniBand also bounds the ceiling that algorithm/protocol selection can reach when payloads are large and frequent on bandwidth-constrained interconnects.