The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits — Brief Summary

Authors: Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei (Microsoft Research; University of Chinese Academy of Sciences)
Venue: arXiv preprint, February 2024
arXiv: 2402.17764


Problem

Full-precision (FP16/BF16) LLMs have two dominant cost drivers: (1) floating-point multiply-accumulate operations in matrix multiplication (the dominant compute cost), and (2) memory bandwidth consumption when loading model weights from DRAM to on-chip SRAM during inference. Post-training quantization reduces these costs but is sub-optimal compared to quantization-aware training. The original 1-bit BitNet (Wang et al., 2023) reduced weights to {-1, +1} but could not match full-precision perplexity at smaller model sizes.

Core Insight

Replace every weight in the Transformer with a ternary value from {-1, 0, +1}; each weight therefore carries log2(3) ≈ 1.58 bits of information. Ternary weights eliminate floating-point multiplications entirely: matrix-vector products reduce to pure addition and subtraction. The inclusion of 0 enables explicit feature filtering (a weight can ignore an input dimension), recovering the modeling expressiveness lost in strict binary {-1, +1} quantization. At 3B parameters with 100B training tokens, BitNet b1.58 matches FP16 LLaMA in perplexity and zero-shot accuracy while running 2.71x faster and using 3.55x less GPU memory.
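
To make the no-multiplication claim concrete, here is a minimal NumPy sketch (illustrative only, not from the paper) of a matrix-vector product with ternary weights computed entirely with additions and subtractions:

import numpy as np

def ternary_matvec(W, x):
    # W: (m, n) matrix with entries in {-1, 0, +1}; x: (n,) activation vector.
    # Each output element adds the inputs where W is +1 and subtracts those
    # where W is -1; zeros simply skip (filter out) that input dimension.
    y = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        y[i] = x[W[i] == 1].sum() - x[W[i] == -1].sum()
    return y

W = np.array([[1, 0, -1], [0, 1, 1]])
x = np.array([2.0, 3.0, 5.0])
print(ternary_matvec(W, x))   # [-3.  8.], matches W @ x with no multiplications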

Method

BitNet b1.58 is a Transformer variant that replaces all nn.Linear layers with BitLinear. The quantization scheme:

Weight quantization (absmean):

gamma = (1 / (n * m)) * sum_{i,j} |W_{ij}|   # mean absolute value of the n x m weight matrix
W_tilde = RoundClip(W / (gamma + eps), -1, 1)
RoundClip(x, a, b) = max(a, min(b, round(x)))

Each weight is divided by the mean absolute value of its weight matrix, then rounded and clipped to {-1, 0, +1}.
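
A minimal PyTorch sketch of the absmean scheme (the function name and eps value are illustrative, not from the paper):

import torch

def absmean_quantize(W: torch.Tensor, eps: float = 1e-5):
    gamma = W.abs().mean()                            # mean |W| over all n*m entries
    W_q = (W / (gamma + eps)).round().clamp(-1, 1)    # ternary values in {-1, 0, +1}
    return W_q, gamma                                 # gamma is kept to rescale outputs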

Activation quantization: Activations are quantized to 8 bits with per-token absmax scaling into [-Q_b, Q_b], where Q_b = 2^(b-1); unlike the original BitNet, activations before non-linearities are not shifted to [0, Q_b], so no zero-point offset is needed.
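
A corresponding sketch of per-token absmax activation quantization (names and the eps guard are illustrative assumptions):

import torch

def absmax_quantize_activations(x: torch.Tensor, bits: int = 8, eps: float = 1e-5):
    # x: (tokens, features). Q_b = 2^(b-1) is the symmetric integer bound.
    Qb = 2 ** (bits - 1)
    # Per-token scale: the max absolute value along the feature dimension.
    scale = x.abs().max(dim=-1, keepdim=True).values.clamp(min=eps)
    # Symmetric quantization with no zero-point offset.
    x_q = (x * Qb / scale).round().clamp(-Qb, Qb)
    return x_q, scale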

The model adopts LLaMA-compatible components: RMSNorm, SwiGLU activations, rotary embeddings, no bias terms. This makes BitNet b1.58 a drop-in replacement in the LLaMA ecosystem (Huggingface, vLLM, llama.cpp).

Training uses the straight-through estimator (inherited from the original BitNet) to pass gradients through the non-differentiable rounding operations; latent weights are kept in higher precision and quantized on the fly in each forward pass.
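
A common way to express the straight-through estimator in PyTorch is the detach trick; the following BitLinear-style forward is a sketch under that assumption, not code from the paper:

import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    # Drop-in nn.Linear that ternarizes its weights on the fly.
    # Activation quantization is omitted here for brevity.
    def forward(self, x):
        w = self.weight                                           # high-precision latent weights
        gamma = w.abs().mean()                                    # absmean scale
        w_q = (w / (gamma + 1e-5)).round().clamp(-1, 1) * gamma   # ternary, rescaled
        # Straight-through estimator: the forward pass sees the quantized
        # weights, while .detach() makes the backward pass treat quantization
        # as identity, so gradients update the latent weights directly.
        w_ste = w + (w_q - w).detach()
        return F.linear(x, w_ste, self.bias)

Swapping each nn.Linear in a LLaMA-style Transformer for BitLinear is then exactly the drop-in replacement described above.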

Key Results

Evaluated on RedPajama (100B training tokens) against a reproduced FP16 LLaMA baseline; parenthesized figures are ratios relative to same-size LLaMA:

Model         Size   Memory          Latency         PPL
LLaMA         700M   2.08 GB (1x)    1.18 ms (1x)    12.33
BitNet b1.58  700M   0.80 GB (2.6x)  0.96 ms (1.23x) 12.87
LLaMA         3B     7.89 GB (1x)    5.07 ms (1x)    10.04
BitNet b1.58  3B     2.22 GB (3.55x) 1.87 ms (2.71x) 9.91
BitNet b1.58  3.9B   2.38 GB (3.32x) 2.11 ms (2.40x) 9.62

At 3B, BitNet b1.58 matches or beats LLaMA perplexity with 3.55x less memory and 2.71x lower latency. At 70B, BitNet b1.58 is 4.1x faster and 7.16x more memory-efficient.

Throughput (70B, two 80GB A100s): BitNet b1.58 supports an 11x larger batch size and delivers 8.9x higher throughput than LLaMA 70B.

Energy: BitNet b1.58 reduces arithmetic-operation energy for matrix multiplication by 71.4x on 7nm chips, since INT8 additions replace FP16 multiply-adds.
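
The 71.4x figure is consistent with the 7nm per-operation energy estimates the paper draws on: assuming roughly 0.34 pJ for an FP16 multiply, 0.16 pJ for an FP16 add, and 0.007 pJ for an INT8 add (treat these exact values as this summary's assumption), each FP16 multiply-add pair replaced by one INT8 add saves (0.34 + 0.16) / 0.007 ≈ 71.4x.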

Zero-shot tasks (3B scale): the BitNet b1.58 3.9B variant achieves a 51.2 average vs. 49.7 for LLaMA 3B across ARC-Easy, ARC-Challenge, HellaSwag, BoolQ, OpenBookQA, PIQA, and WinoGrande.

2T token training: BitNet b1.58 3B outperforms StableLM-3B on all benchmarks (74.34 vs. 73.22 average).

Limitations

BitNet b1.58 requires quantization-aware training from scratch; it is not a post-training conversion, so existing FP16 checkpoints cannot simply be re-quantized. The latency, throughput, and energy gains also presuppose kernels, and ultimately hardware, optimized for ternary and INT8 arithmetic (the paper explicitly calls for such hardware). Finally, quality parity is demonstrated up to 3.9B parameters; the 13B-70B comparisons measure inference cost, not trained-model quality.

Relevance to DynamICCL

BitNet b1.58 is indirectly relevant to DynamICCL through its effect on collective communication patterns during distributed training and inference.

Smaller tensors, different AllReduce dynamics. One caveat: standard BitNet b1.58 training keeps latent weights, and therefore gradients, in higher precision, so gradient AllReduce shrinks only if the training scheme additionally quantizes the communicated gradients or updates. Where quantized values are actually exchanged (compressed-gradient training, or sharded inference moving ternary weights), the effective message size drops far below FP16, shifting collectives from bandwidth-bound toward latency-bound. DynamICCL's Agent-2 would then need to select different configurations (e.g., NCCL's Tree algorithm and LL protocol, which suit small messages) than for full-precision training; see the sketch below.
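
As a concrete illustration (NCCL_ALGO and NCCL_PROTO are real NCCL tuning variables; choosing these particular values for this workload is an assumption, not a measured recommendation), a latency-oriented override might look like:

import os

# Hypothetical override for small-message, latency-bound collectives.
# Must be set before the NCCL communicators are created.
os.environ["NCCL_ALGO"] = "Tree"   # tree reduction: lower latency than ring at scale
os.environ["NCCL_PROTO"] = "LL"    # low-latency protocol for small messages

import torch.distributed as dist
dist.init_process_group(backend="nccl")   # NCCL picks up the overrides here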

Communication-compute ratio shift. Because BitNet b1.58 replaces floating-point multiply-accumulates with integer additions, per-layer compute shrinks while collective traffic does not shrink proportionally, so communication consumes a larger share of step time. This raises the marginal value of NCCL tuning, which is precisely DynamICCL's target.

MoE scaling direction. The paper explicitly discusses 1-bit MoE as future work, noting that ternary weights reduce inter-chip activation transfer overhead. MoE models are heavy users of AllToAll collectives; DynamICCL's algorithm selection capability is directly applicable there.