1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed

Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He | Microsoft / University of Rochester / ETH Zurich | ICML 2021 (PMLR Vol. 139)


Problem

Large-model pre-training (BERT, GPT) is dominated by inter-GPU allreduce of full-precision gradients. Profiling BERT-Large on a 64-GPU 40 GbE cluster shows allreduce consuming up to 94% of each iteration; even on 64-GPU 100 Gb InfiniBand it consumes 75%. Error-compensated 1-bit compression already exists for SGD-class optimizers, whose updates are linear in the gradient. Adam, the de-facto optimizer for transformers (vanilla SGD does not converge on these models), resists naive compression because its variance term v_t = beta_2 * v_{t-1} + (1 - beta_2) * g_t^2 is non-linear in the gradient. The square introduces residual cross-terms (delta_{t-1} - delta_t)^2 that do not telescope, and the coordinate-wise, time-varying learning rate gamma / sqrt(v_t + eps) admits no clean global correction factor. As a result, direct error-compensated 1-bit Adam diverges on BERT (paper's Figure 1).
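
The telescoping argument can be checked numerically. The sketch below is illustrative only and is not an experiment from the paper: it assumes a scaled-sign compressor (sign times mean absolute value) as a stand-in for 1-bit compression, feeds synthetic Gaussian gradients through error compensation, and compares a linear accumulator with a squared one. The names one_bit and accumulate are hypothetical.

```python
import math
import random

def one_bit(x):
    """Assumed 1-bit compressor: sign of each entry times one shared scale (mean |x|)."""
    scale = sum(abs(v) for v in x) / len(x)
    return [math.copysign(scale, v) for v in x]

def accumulate(d=200, steps=200, seed=0):
    """Feed error-compensated 1-bit gradients into a linear and a squared accumulator.

    Returns (linear_gap, square_gap, final_error_norm), all as L2 norms.
    """
    rng = random.Random(seed)
    delta = [0.0] * d                      # compression error carried between steps
    lin_true, lin_comp = [0.0] * d, [0.0] * d
    sq_true, sq_comp = [0.0] * d, [0.0] * d
    for _ in range(steps):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        buf = [gi + di for gi, di in zip(g, delta)]
        hat = one_bit(buf)                 # what would actually be communicated
        delta = [b - h for b, h in zip(buf, hat)]
        for k in range(d):
            lin_true[k] += g[k]
            lin_comp[k] += hat[k]
            sq_true[k] += g[k] ** 2
            sq_comp[k] += hat[k] ** 2

    def gap(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    return gap(lin_true, lin_comp), gap(sq_true, sq_comp), math.sqrt(sum(v * v for v in delta))
```

In the linear accumulator the per-step errors cancel, so the gap equals the norm of the single residual error delta_T. The squared accumulator has no such cancellation, and its gap is many times larger.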


Core Insight

During BERT-Large pre-training, the L1 norm of Adam's variance term v_t becomes approximately stable after roughly 23k steps (paper's Figure 2). Once v_t is frozen at v_{T_w}, Adam reduces to preconditioned momentum SGD, which is linear in the gradient and therefore amenable to error-compensated 1-bit compression of the momentum term itself.
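
One way to turn "approximately stable" into a concrete check is to log the L1 norm of v_t and look for the first step after which it stays within a small relative band. This is a hypothetical heuristic, not the paper's procedure: the function name first_stable_step, the window length, and the tolerance are all assumptions chosen for illustration.

```python
def first_stable_step(v_l1_norms, window=100, rel_tol=0.01):
    """Return the first step t at which the logged ||v_t||_1 stays within
    rel_tol (relative to its value at t) over the next `window` steps.

    Returns None if no such step exists in the log.
    """
    norms = list(v_l1_norms)
    for t in range(len(norms) - window):
        segment = norms[t:t + window + 1]
        if max(segment) - min(segment) <= rel_tol * abs(segment[0]):
            return t
    return None
```

A call such as first_stable_step(logged_norms, window=1000, rel_tol=0.05) would give a candidate for T_w, where logged_norms is whatever per-step record of the variance norm the training loop keeps.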


Method

The 1-bit Adam algorithm (Algorithm 1 in the paper) has two phases:

Phase 1 — warmup (0..T_w-1): Run vanilla Adam unchanged so the variance can stabilize. Store v_{T_w}.

Phase 2 — compression (T_w..T):

On worker i:
  g_t^(i) = grad F_i(x_t^(i), zeta_t^(i))
  m_t^(i) = beta_1 * m_{t-1} + (1 - beta_1) * g_t^(i)    # m_{t-1}: momentum shared by all workers (bar_m_{t-1} once compression has started)
  hat_m_t^(i) = C_omega[m_t^(i) + delta_{t-1}^(i)]    # 1-bit compress
  delta_t^(i) = m_t^(i) + delta_{t-1}^(i) - hat_m_t^(i)
  send hat_m_t^(i) to server

On server:
  bar_m_t = C_omega[(1/n) * sum_i hat_m_t^(i) + delta_{t-1}]    # delta without (i) is the server-side error
  delta_t = (1/n) * sum_i hat_m_t^(i) + delta_{t-1} - bar_m_t
  broadcast bar_m_t

On worker i:
  x_{t+1} = x_t - gamma * bar_m_t / sqrt(v_{T_w})
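
The compression-phase steps above can be simulated in a single process. The following is a minimal sketch under stated assumptions, not the authors' implementation: C_omega is taken to be a scaled-sign compressor, the frozen variance is set to all ones, the objective is a toy quadratic with noisy gradients, and the names phase2_step and run_demo are hypothetical.

```python
import math
import random

def one_bit(x):
    """Assumed stand-in for C_omega: signs plus one shared scale (mean |x|)."""
    scale = sum(abs(v) for v in x) / len(x)
    return [math.copysign(scale, v) for v in x]

def phase2_step(x, grads, m_prev, worker_err, server_err, v_frozen, lr=0.01, beta1=0.9):
    """One compression-phase step; grads and worker_err hold one vector per worker."""
    n, d = len(grads), len(x)
    hats, new_worker_err = [], []
    for g, err in zip(grads, worker_err):
        # Worker side: local momentum, compress with error feedback.
        m_local = [beta1 * mp + (1.0 - beta1) * gi for mp, gi in zip(m_prev, g)]
        buf = [m + e for m, e in zip(m_local, err)]
        hat = one_bit(buf)
        hats.append(hat)
        new_worker_err.append([b - h for b, h in zip(buf, hat)])
    # Server side: average, add server error, compress again.
    avg = [sum(h[k] for h in hats) / n + server_err[k] for k in range(d)]
    m_bar = one_bit(avg)
    new_server_err = [a - m for a, m in zip(avg, m_bar)]
    # Every worker applies the same update with the frozen variance (assumed > 0).
    new_x = [xi - lr * mi / math.sqrt(vi) for xi, mi, vi in zip(x, m_bar, v_frozen)]
    return new_x, m_bar, new_worker_err, new_server_err

def run_demo(steps=2000, n=4, d=20, seed=0):
    """Minimize 0.5 * ||x||^2 with noisy per-worker gradients; return (initial, final) loss."""
    rng = random.Random(seed)
    x = [5.0 * rng.gauss(0.0, 1.0) for _ in range(d)]
    m = [0.0] * d
    worker_err = [[0.0] * d for _ in range(n)]
    server_err = [0.0] * d
    v_frozen = [1.0] * d                      # placeholder for v_{T_w}

    def loss(z):
        return 0.5 * sum(v * v for v in z)

    initial = loss(x)
    for _ in range(steps):
        grads = [[xi + 0.1 * rng.gauss(0.0, 1.0) for xi in x] for _ in range(n)]
        x, m, worker_err, server_err = phase2_step(x, grads, m, worker_err, server_err, v_frozen)
    return initial, loss(x)
```

Calling run_demo() returns the loss before and after training, which makes it easy to confirm that the doubly compressed momentum still drives the toy objective down.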

Key design choices:

The system implements compressed allreduce as a three-phase MPI collective (MPI_Alltoall → local average + re-quantize → MPI_Allgather) because NCCL < 2.7 exposed neither Alltoall nor send/recv, and its allreduce only supported sum/min/max on uncompressed tensors. CUDA-aware (MVAPICH2-GDR) and basic (CPU-staged) variants ship as a DeepSpeed optimizer plugin.
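
The data movement of the three-phase collective can be sketched without MPI by treating each list entry as one rank. This is a single-process illustration under assumptions, not the DeepSpeed code: the inputs are taken to be already-compressed worker vectors, each rank acts as the "server" for one chunk, and each chunk is re-quantized with its own scale. The name simulated_compressed_allreduce is hypothetical.

```python
import math

def one_bit(x):
    """Assumed 1-bit compressor: signs plus one shared scale (mean |x|)."""
    scale = sum(abs(v) for v in x) / len(x)
    return [math.copysign(scale, v) for v in x]

def simulated_compressed_allreduce(worker_vecs, server_errs):
    """worker_vecs: n compressed vectors of length d, with d divisible by n.
    server_errs: n error chunks of length d // n, one per rank.

    Returns (the vector every rank ends up with, updated server_errs).
    """
    n = len(worker_vecs)
    c = len(worker_vecs[0]) // n
    # Phase 1 (all-to-all): rank j receives chunk j of every rank's vector.
    received = [[vec[j * c:(j + 1) * c] for vec in worker_vecs] for j in range(n)]
    # Phase 2: rank j averages its chunks, adds its carried error, re-quantizes.
    out_chunks, new_errs = [], []
    for j in range(n):
        avg = [sum(chunk[k] for chunk in received[j]) / n + server_errs[j][k]
               for k in range(c)]
        q = one_bit(avg)
        out_chunks.append(q)
        new_errs.append([a - b for a, b in zip(avg, q)])
    # Phase 3 (all-gather): every rank collects all re-quantized chunks.
    result = [v for chunk in out_chunks for v in chunk]
    return result, new_errs
```

Splitting the server role across ranks means no single node receives all n vectors, and both the all-to-all and the all-gather carry only compressed data.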

Convergence is proven to match distributed SGD's O(1/sqrt(nT)) linear-speedup rate under standard assumptions (Lipschitz gradient, bounded variance, bounded compression error).


Experimental Setup

Component                         Value
Ethernet cluster                  4 V100 GPUs/node, 40 GbE TCP (~4.1 Gbps effective)
InfiniBand cluster                8 V100 GPUs/node, 100 Gbps EDR IB
Max GPUs tested                   256
Models                            BERT-Base (110M), BERT-Large (340M), ResNet-18 (CIFAR-10), ResNet-152 (ImageNet), DCGAN
Fine-tuning                       SQuAD 1.1, GLUE
Frameworks                        PyTorch + DeepSpeed
MPI                               MVAPICH2-GDR (CUDA-aware) / generic MPI (basic)
BERT pre-training peak LR         4e-4
BERT LR schedule                  linear warmup over 12.5k steps, then 0.99 decay every 520 steps
BERT total batch size             4096
Adam beta_1, beta_2               0.9, 0.999
Warmup T_w (BERT-Large seq128)    23k of 152k steps
Warmup T_w (BERT-Large seq512)    1.5k of 10k steps
SQuAD batch / LR / warmup         96 / 3e-5 / 400 of 1848 steps
ResNet-18 batch / LR / warmup     1024 / 1e-4 / 13 of 200 epochs
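
The BERT learning-rate schedule in the table can be written as a small function. This is a sketch under two assumptions the table does not settle: the linear warmup starts from zero, and the 0.99 decay is counted in 520-step blocks starting at the end of warmup. The name bert_lr is hypothetical.

```python
def bert_lr(step, peak_lr=4e-4, warmup_steps=12_500, decay=0.99, decay_every=520):
    """Learning rate at a given 0-based step.

    Assumes linear warmup from 0 to peak_lr, then one multiplication by
    `decay` for every full `decay_every` steps completed after warmup.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    completed_blocks = (step - warmup_steps) // decay_every
    return peak_lr * decay ** completed_blocks
```

Under these assumptions the rate is 2e-4 halfway through warmup, reaches 4e-4 at step 12,500, and drops to 3.96e-4 at step 13,020.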

Headline Quantitative Results

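Communication-share profiling (Table 1, BERT-Large seq128): allreduce takes up to 94% of iteration time on the 64-GPU 40 GbE cluster and 75% on 64-GPU 100 Gb InfiniBand.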

Throughput / training-time:

Communication-volume reduction:
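
A rough back-of-envelope estimate of the end-to-end volume reduction follows from the warmup fractions in the setup table. It is a sketch, not a figure taken from the paper's tables: it assumes the warmup phase sends 16-bit (FP16) values, the compression phase sends 1 bit per value, and the per-tensor scale factors are negligible. The name avg_volume_reduction is hypothetical.

```python
def avg_volume_reduction(warmup_steps, total_steps, full_bits=16, compressed_bits=1):
    """Ratio of uncompressed to actual communication volume over a whole run.

    Warmup steps are sent at full_bits per value, the rest at compressed_bits.
    """
    w = warmup_steps / total_steps
    return 1.0 / (w + (1.0 - w) * compressed_bits / full_bits)
```

For BERT-Large, avg_volume_reduction(23_000, 152_000) and avg_volume_reduction(1_500, 10_000) both come out a little under 5, against a ceiling of 16 with no warmup at all. The uncompressed warmup phase therefore dominates the remaining traffic.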

Cross-network parity: 1-bit Adam on 40 GbE matches the throughput of vanilla FP16 Adam on ~100 Gb InfiniBand.

Convergence parity (Table 2): total step counts identical to baseline: BERT-Base 118K seq128 + 22K seq512; BERT-Large 152K seq128 + 10K seq512.

GLUE (Table 3, median over 10 seeds): within-noise on every task. E.g., BERT-Large MNLI-(m/mm): baseline uncompressed 86.1/86.2 vs. 1-bit Adam 85.7/85.4. SQuAD F1: baseline 93.33 vs. 1-bit 93.32.

Naive 1-bit Adam baseline: diverges on BERT (Figure 1) — the freezing of variance is the missing ingredient.


Limitations


Open Problems

  1. A theoretical sufficient condition for variance stability — would let practitioners predict 1-bit Adam viability without running warmup.
  2. Generalizing the recipe (freeze the non-linear state once stable, compress the linear state) to LAMB, Adafactor, Lion.
  3. Eliminating the warmup phase by warm-starting v_{T_w} from a prior run.
  4. Compressed-allreduce scaling beyond 256 GPUs — whether MPI Alltoall + Allgather remains competitive with ring/tree at thousand-GPU scale.
  5. Integration with model-, pipeline-, and ZeRO-style sharded optimizer states where Adam's state itself is partitioned.

Note on NCCL Tuning

The paper documents (Section 6.1) that NCCL versions prior to 2.7 supported only sum/min/max allreduce on uncompressed tensors and exposed neither Alltoall nor send/recv, which forced the authors to build their compressed allreduce on MPI rather than NCCL. Any NCCL-tuner work that targets compressed-collective workloads should be aware of this capability gap. The Table 1 finding that allreduce consumes 94% of BERT-Large iteration time on 64-GPU Ethernet versus 75% on InfiniBand also bounds what algorithm/protocol selection alone can achieve when payloads are large and frequent on bandwidth-constrained interconnects.