The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits — Detailed Summary
Authors: Shuming Ma*, Hongyu Wang*, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei (Microsoft Research; University of Chinese Academy of Sciences) (* equal contribution)
Venue: arXiv preprint, 27 February 2024, arXiv:2402.17764v1
Abstract
BitNet b1.58 is a 1-bit LLM variant in which every weight is ternary: drawn from {-1, 0, +1}, representing 1.58 bits of information per parameter. It extends the original binary BitNet (Wang et al., 2023) by introducing 0 as a third weight value, enabling explicit feature filtering. BitNet b1.58 matches full-precision (FP16/BF16) LLaMA in both perplexity and zero-shot task accuracy starting at 3B parameters, while being 2.71x faster at inference and consuming 3.55x less GPU memory. At 70B scale, it achieves 4.1x lower latency, 7.16x lower memory, and 8.9x higher throughput. Ternary weights transform matrix multiplication from floating-point multiply-accumulate to pure integer addition, defining a new computation paradigm that calls for dedicated hardware optimized for 1-bit arithmetic.
1. Motivation and Problem Statement
1.1 The Cost of Full-Precision LLMs
LLMs at modern scale (7B–70B parameters) face two deployment bottlenecks:
Compute: The dominant operation in a Transformer is matrix multiplication (nn.Linear). In FP16, each multiply-accumulate requires a floating-point multiplication and a floating-point addition. Per the energy models used in the paper (Horowitz 2014 at 45nm, with corresponding figures at 7nm), an FP16 multiplication costs roughly 2-3x the energy of an FP16 addition, and both cost far more than an integer addition. As LLM model size grows, the fraction of total energy attributable to nn.Linear increases.
Memory bandwidth: During autoregressive inference, model weights must be loaded from DRAM to on-chip SRAM for every token generated. For a 70B FP16 model, this requires ~140 GB of weight data to be transferred per forward pass, severely bottlenecking throughput on even the largest GPU clusters.
1.2 Post-Training Quantization vs. Quantization-Aware Training
Post-training quantization (PTQ) methods (GPTQ, SmoothQuant, QuIP#) compress existing FP16 models to 4-bit or 2-bit after training. While widely used in production, PTQ is sub-optimal: quantization error compounds through layers, and the model was never trained to be robust to low-precision weights. Quantization-aware training (QAT) trains the model with quantization applied from the start, allowing the model to adapt. The original 1-bit BitNet (WMD+23) applied QAT to Transformers with binary {-1, +1} weights and showed promise, but could not fully match FP16 perplexity at small model sizes due to limited representational capacity.
1.3 Why Ternary Weights (1.58 bits)
Binary weights {-1, +1} force every neuron to either transmit or negate an input — there is no way to suppress (zero out) an input. Ternary weights {-1, 0, +1} allow explicit feature filtering via 0-valued weights, which improves the model's ability to learn sparse representations. The 0 weight value costs 0.58 additional bits per parameter (log2(3) ≈ 1.58 bits vs. 1 bit for binary), a minimal overhead that yields significant representational benefit.
2. Background
2.1 The Original BitNet
BitNet (Wang et al., 2023) replaces all nn.Linear layers in a Transformer with a 1-bit variant. Weights are binarized to {-1, +1} using a sign function with straight-through gradient estimation for backpropagation. Activations are quantized to 8 bits using absmax per-token scaling. As a result, each matrix-vector product reduces to, for every output unit, summing the input elements where the weight is +1 and subtracting those where the weight is -1; no multiplication is needed.
2.2 Straight-Through Estimator
The rounding operation in quantization (round(x)) has zero gradient almost everywhere. The straight-through estimator (STE) bypasses this by passing the gradient of the loss with respect to the quantized value directly through the rounding as if it were the identity function. This allows end-to-end gradient-based training of quantized networks.
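A minimal PyTorch sketch of the trick (the `round_ste` name is ours, not from the paper): the rounding residual is detached from the autograd graph, so the forward pass sees the rounded value while the backward pass treats rounding as the identity.

```python
import torch

def round_ste(x: torch.Tensor) -> torch.Tensor:
    # Forward: round(x). Backward: gradient of 1 (identity), because the
    # rounding residual is detached from the autograd graph.
    return x + (x.round() - x).detach()

# Gradients reach `w` even though the forward value is rounded.
w = torch.randn(4, requires_grad=True)
round_ste(w).sum().backward()
print(w.grad)  # tensor([1., 1., 1., 1.])
```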
2.3 Energy Cost of Arithmetic
Per the energy models used in the paper (Horowitz 2014 at 45nm CMOS, with corresponding figures at 7nm):
- FP16 multiplication: ~1.1 pJ (45nm) / ~0.34 pJ (7nm)
- FP16 addition: ~0.4 pJ (45nm) / ~0.16 pJ (7nm)
- INT8 addition: ~0.03 pJ (45nm) / ~0.007 pJ (7nm)
At 7nm, one FP16 multiply-add (MAC) therefore costs ~0.5 pJ, versus ~0.007 pJ for the INT8 addition that replaces it in BitNet b1.58.
Ratio: ~71.4x energy savings for arithmetic operations, as cited in the paper. The energy savings translate directly into throughput for power-bound chips.
3. BitNet b1.58 System Design
3.1 Architecture Overview
BitNet b1.58 is a standard Transformer with nn.Linear replaced by BitLinear. It uses LLaMA-compatible components to maximize ecosystem compatibility:
- Normalization: RMSNorm (no mean subtraction, only RMS scaling)
- Activation: SwiGLU in feed-forward layers
- Position encoding: Rotary embeddings (RoPE)
- Biases: Removed from all layers
- Embedding + output projection: Remain in full precision (not quantized)
Standard Transformer layer:
x -> LayerNorm -> QKV Linear -> Attention -> Out Linear -> Add -> ...
x -> LayerNorm -> FFN Linear_1 -> SwiGLU -> FFN Linear_2 -> Add -> ...
BitNet b1.58:
Replace all Linear with BitLinear (ternary weights, INT8 activations)
LayerNorm -> RMSNorm
Retain full-precision embeddings and LM head
3.2 Weight Quantization: absmean
The absmean quantizer scales weights by the mean absolute value of the matrix and rounds to the nearest integer in {-1, 0, +1}:
gamma = (1 / (n*m)) * sum_{i,j} |W_{ij}| (mean absolute value)
W_tilde = RoundClip(W / (gamma + eps), -1, 1)
RoundClip(x, a, b) = max(a, min(b, round(x)))
where n and m are the dimensions of W, and eps is a small constant for numerical stability. The resulting W_tilde has values in {-1, 0, +1}. The scale factor gamma is stored in full precision and used to dequantize activations after the ternary multiply.
The choice of absmean (vs. absmax used in the original BitNet for weights) is motivated by its robustness to outliers: the mean absolute value is less sensitive to extreme weight values than the maximum.
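A minimal PyTorch sketch of the absmean quantizer described above (the function name is ours, not from the paper's released code):

```python
import torch

def absmean_quantize_weights(w: torch.Tensor, eps: float = 1e-5):
    # gamma = mean absolute value of the whole weight matrix.
    gamma = w.abs().mean()
    # RoundClip(W / (gamma + eps), -1, 1): ternary values in {-1, 0, +1}.
    w_tilde = (w / (gamma + eps)).round().clamp(-1, 1)
    return w_tilde, gamma
```

Weights whose magnitude is less than about half of gamma round to 0, which is what provides the feature-filtering behavior discussed in Section 1.3.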
3.3 Activation Quantization
Activations are quantized to INT8 using per-token absmax scaling:
For each token's activation vector x:
alpha = max(|x_i|) (per-token absmax)
x_q = RoundClip(x / alpha * Q_b, -Q_b, Q_b) where Q_b = 127 for INT8
Scaling is per-token (not per-tensor), which provides better precision for activations that vary in magnitude across the sequence. Zero-point quantization is omitted: all activations are scaled symmetrically to [-Q_b, Q_b], simplifying both implementation and hardware.
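A matching sketch of the per-token absmax activation quantizer (again, names are ours):

```python
import torch

def absmax_quantize_activations(x: torch.Tensor, q_b: int = 127, eps: float = 1e-5):
    # Per-token scale: max |x_i| over the feature dimension.
    alpha = x.abs().max(dim=-1, keepdim=True).values.clamp(min=eps)
    # Symmetric INT8 range [-Q_b, Q_b]; no zero point.
    x_q = (x / alpha * q_b).round().clamp(-q_b, q_b)
    return x_q, alpha
```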
3.4 Forward Pass Computation
During inference, the BitLinear computation proceeds as:
1. Quantize weights: W_tilde = quantize_weights(W) [ternary]
2. Quantize activations: X_q = quantize_activations(X) [INT8]
3. Ternary matmul: Y = X_q @ W_tilde^T [INT8 add/sub only]
4. Dequantize: Y_fp = Y * (alpha * gamma) / Q_b [FP16 scaling]
The ternary matmul step involves only addition and subtraction (no multiplication) because each weight is in {-1, 0, +1}. This is the key compute advantage.
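A minimal end-to-end sketch combining the four steps (our illustration; a real deployment would pack the ternary weights and run an integer add/sub kernel, whereas step 3 here is simulated with a dense float matmul):

```python
import torch

def bitlinear_forward(x: torch.Tensor, w: torch.Tensor,
                      q_b: int = 127, eps: float = 1e-5) -> torch.Tensor:
    # 1. Quantize weights to ternary values with the absmean scale gamma.
    gamma = w.abs().mean()
    w_tilde = (w / (gamma + eps)).round().clamp(-1, 1)
    # 2. Quantize activations to INT8 with per-token absmax scale alpha.
    alpha = x.abs().max(dim=-1, keepdim=True).values.clamp(min=eps)
    x_q = (x / alpha * q_b).round().clamp(-q_b, q_b)
    # 3. "Ternary matmul": with weights in {-1, 0, +1} this needs only
    #    additions/subtractions on hardware; simulated here as a dense matmul.
    y = x_q @ w_tilde.t()
    # 4. Dequantize the accumulated result with the stored scales.
    return y * (alpha * gamma) / q_b

# Rough sanity check against the full-precision layer.
x, w = torch.randn(4, 64), torch.randn(32, 64) * 0.02
print((bitlinear_forward(x, w) - x @ w.t()).abs().mean())
```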
3.5 Training Procedure
Training uses standard Adam optimizer with straight-through estimators for gradient flow through the quantization operations. The model is trained from scratch on the full training corpus — there is no fine-tuning or distillation from a full-precision model. The weight quantization is applied at every forward pass during training, so the model learns to operate with ternary weights from initialization.
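A sketch of how the training-time forward pass might look, with quantization applied at every step and STE re-routing gradients to the full-precision latent weights (our illustration under the equations above, not the paper's released code; normalization and other details are simplified):

```python
import torch
import torch.nn as nn

class BitLinear(nn.Module):
    # Latent weights stay in full precision and receive gradients; the forward
    # pass sees (simulated) ternary weights and INT8 activations via STE.
    def __init__(self, in_features: int, out_features: int,
                 q_b: int = 127, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.q_b, self.eps = q_b, eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gamma = self.weight.abs().mean()
        alpha = x.abs().max(dim=-1, keepdim=True).values.clamp(min=self.eps)
        # Quantize-dequantize, then re-attach gradients with the STE trick.
        w_q = (self.weight / (gamma + self.eps)).round().clamp(-1, 1) * gamma
        w_q = self.weight + (w_q - self.weight).detach()
        x_q = (x / alpha * self.q_b).round().clamp(-self.q_b, self.q_b) * alpha / self.q_b
        x_q = x + (x_q - x).detach()
        return x_q @ w_q.t()
```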
The paper pre-trains on RedPajama (100B tokens) for the primary comparison experiments and on a 2T-token corpus (following the StableLM-3B data recipe) for the scaled comparison.
4. Key Results
4.1 Perplexity and Inference Cost vs. LLaMA (Table 1)
Pre-trained on RedPajama (100B tokens), WikiText2 perplexity, inference measured with FasterTransformer + Ladder (2-bit kernel):
| Model | Size | Memory (GB) | Latency (ms) | PPL |
|---|---|---|---|---|
| LLaMA LLM | 700M | 2.08 (1.00x) | 1.18 (1.00x) | 12.33 |
| BitNet b1.58 | 700M | 0.80 (2.60x) | 0.96 (1.23x) | 12.87 |
| LLaMA LLM | 1.3B | 3.34 (1.00x) | 1.62 (1.00x) | 11.25 |
| BitNet b1.58 | 1.3B | 1.14 (2.93x) | 0.97 (1.67x) | 11.29 |
| LLaMA LLM | 3B | 7.89 (1.00x) | 5.07 (1.00x) | 10.04 |
| BitNet b1.58 | 3B | 2.22 (3.55x) | 1.87 (2.71x) | 9.91 |
| BitNet b1.58 | 3.9B | 2.38 (3.32x) | 2.11 (2.40x) | 9.62 |
At 700M and 1.3B, BitNet b1.58 has slightly worse perplexity than LLaMA. At 3B, it matches and exceeds LLaMA with 3.55x memory reduction and 2.71x latency improvement. The 3.9B BitNet b1.58 model (slightly larger, but still cheaper than LLaMA 3B) achieves 9.62 PPL — substantially better, confirming a Pareto improvement.
4.2 Zero-Shot Accuracy vs. LLaMA (Table 2)
Zero-shot accuracy on ARC-Easy (ARCe), ARC-Challenge (ARCc), HellaSwag (HS), BoolQ (BQ), OpenBookQA (OQ), PIQA (PQ), WinoGrande (WGe):
| Model | Size | ARCe | ARCc | HS | BQ | OQ | PQ | WGe | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA LLM | 700M | 54.7 | 23.0 | 37.0 | 60.0 | 20.2 | 68.9 | 54.8 | 45.5 |
| BitNet b1.58 | 700M | 51.8 | 21.4 | 35.1 | 58.2 | 20.0 | 68.1 | 55.2 | 44.3 |
| LLaMA LLM | 1.3B | 56.9 | 23.5 | 38.5 | 59.1 | 21.6 | 70.0 | 53.9 | 46.2 |
| BitNet b1.58 | 1.3B | 54.9 | 24.2 | 37.7 | 56.7 | 19.6 | 68.8 | 55.8 | 45.4 |
| LLaMA LLM | 3B | 62.1 | 25.6 | 43.3 | 61.8 | 24.6 | 72.1 | 58.2 | 49.7 |
| BitNet b1.58 | 3B | 61.4 | 28.3 | 42.9 | 61.5 | 26.6 | 71.5 | 59.3 | 50.2 |
| BitNet b1.58 | 3.9B | 64.2 | 28.7 | 44.2 | 63.5 | 24.2 | 73.2 | 60.5 | 51.2 |
At 3B and above, BitNet b1.58 matches or exceeds LLaMA across all seven tasks.
4.3 Latency and Memory Scaling (Figure 2)
Evaluating from 1.3B to 70B parameters:
| Scale | Latency speedup | Memory reduction |
|---|---|---|
| 1.3B | 1.67x | 2.93x |
| 3B | 2.71x | 3.55x |
| 7B | 2.90x | 4.40x |
| 13B | 3.68x | 5.12x |
| 70B | 4.10x | 7.16x |
Benefits grow with scale because: (i) for larger models, nn.Linear contributes a larger fraction of total parameters (embeddings and other full-precision components shrink in relative size), (ii) the DRAM bandwidth savings scale with model size, and (iii) the compute savings scale with the number of multiply-accumulate operations.
4.4 Throughput at 70B Scale (Table 3)
Two 80GB A100 GPUs, pipeline parallelism (GPipe), sequence length 512:
| Model | Max Batch Size | Throughput (tokens/s) |
|---|---|---|
| LLaMA LLM 70B | 16 (1.0x) | 333 (1.0x) |
| BitNet b1.58 70B | 176 (11.0x) | 2977 (8.9x) |
BitNet b1.58 supports an 11x larger batch size (the ternary weights and 8-bit activations leave far more GPU memory free for batching), translating to 8.9x higher throughput.
4.5 Energy Consumption (Figure 3)
On 7nm process nodes, for matrix multiplication:
- BitNet b1.58: primarily INT8 addition (~0.007 pJ per op)
- LLaMA: FP16 multiplication + FP16 addition (~0.5 pJ per multiply-accumulate)
- Energy savings: ~71.4x for arithmetic operations
End-to-end energy cost (including embedding, normalization, attention) grows with model size but BitNet b1.58 remains 18.6x–41.2x more energy-efficient than LLaMA at 1.3B–70B scale.
4.6 Long-Token Training (Table 4)
BitNet b1.58 3B trained on 2T tokens (StableLM-3B data recipe) vs. StableLM-3B (2T tokens):
| Model | Tokens | Winogrande | PIQA | SciQ | LAMBADA | ARC-easy | Avg. |
|---|---|---|---|---|---|---|---|
| StableLM-3B | 2T | 64.56 | 76.93 | 90.75 | 66.09 | 67.78 | 73.22 |
| BitNet b1.58 3B | 2T | 66.37 | 78.40 | 91.20 | 67.63 | 68.12 | 74.34 |
BitNet b1.58 outperforms StableLM-3B on all five benchmarks with the same training compute.
4.7 New Scaling Law
The paper argues BitNet b1.58 defines a new efficiency-performance scaling law:
- 13B BitNet b1.58 is more efficient (latency, memory, energy) than 3B FP16 LLM
- 30B BitNet b1.58 is more efficient than 7B FP16 LLM
- 70B BitNet b1.58 is more efficient than 13B FP16 LLM
This means that for a given performance target, practitioners can use a larger BitNet b1.58 model at lower cost than a smaller FP16 model.
5. Discussion and Future Directions
5.1 1-bit MoE LLMs
Mixture-of-Experts (MoE) models reduce compute FLOPs per token but incur high memory and inter-chip communication overhead (AllToAll for routing). Ternary weights reduce MoE's memory footprint, potentially eliminating the need to distribute expert parameters across devices and reducing activation communication between devices.
5.2 Long Sequence Support
BitNet b1.58 reduces activation precision from 16 bits to 8 bits, halving KV cache memory. This directly doubles the maximum context length for a given memory budget — important for long-sequence inference workloads.
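A back-of-the-envelope check, using a hypothetical 70B-class attention configuration (the dimensions below are illustrative assumptions, not figures from the paper):

```python
# Hypothetical 70B-class configuration (illustrative only, not from the paper):
layers, kv_heads, head_dim, seq_len = 80, 64, 128, 4096

def kv_cache_gib(bytes_per_value: int) -> float:
    # K and V caches together hold 2 * layers * heads * head_dim * seq_len values.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 2**30

print(f"FP16 KV cache: {kv_cache_gib(2):.1f} GiB per sequence")  # ~10 GiB
print(f"INT8 KV cache: {kv_cache_gib(1):.1f} GiB per sequence")  # ~5 GiB, i.e. ~2x context per GiB
```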
5.3 Edge and Mobile Deployment
Ternary weights are CPU-friendly (no SIMD floating-point units required). BitNet b1.58 could run large models on mobile CPUs, enabling on-device LLM inference.
5.4 New Hardware
The paper argues that 1-bit arithmetic (addition/subtraction only) enables a new class of AI accelerators — hardware without floating-point multiply units, with larger SRAM capacity due to fewer weight bits, and with energy profiles dominated by memory access rather than compute. Groq's LPU architecture is cited as an early example of this direction.
6. Limitations
- Minimum scale for parity. Full-precision parity is achieved only at 3B parameters with 100B training tokens. Smaller models or fewer training tokens may not reach parity.
- Training from scratch. Post-training quantization of existing FP16 LLMs to ternary precision is not demonstrated. Organizations with large pre-trained FP16 models cannot directly convert to BitNet b1.58.
- Kernel maturity. Results use a 2-bit kernel (FasterTransformer + Ladder); a dedicated 1.58-bit kernel would further improve results. The paper acknowledges there is "still room for optimization."
- Embedding layers excluded. Input embeddings and LM head remain in full precision. For very large vocabularies, these components can be substantial.
- No fine-tuning study. The behavior of ternary models under instruction tuning, RLHF, or domain fine-tuning is not studied.
- Single hardware platform. All inference measurements use NVIDIA A100 GPUs. Performance on other hardware (AMD, Intel, custom ASICs) is not reported.
- Activation quantization error. 8-bit per-token activation quantization introduces quantization error that is not fully analyzed. The paper reports it has "negligible effects on performance" but provides no ablation.
7. Related Work
- BitNet (WMD+23, Wang et al. 2023): Binary {-1, +1} weights, 8-bit activations, QAT for LLMs. BitNet b1.58 extends this with ternary weights.
- GPTQ (Frantar et al., ICLR 2023): Post-training quantization to 4-bit using second-order weight perturbation. Sub-optimal vs. QAT.
- SmoothQuant (Xiao et al., ICML 2023): Post-training activation + weight quantization. 8-bit inference.
- QuIP# (Tseng et al., 2024): Post-training quantization with Hadamard incoherence and lattice codebooks. 2-bit.
- AWQ (Lin et al., 2023): Activation-aware weight quantization, 4-bit PTQ.
- Mesh-TensorFlow / SPMD: Training-time parallelism — orthogonal to BitNet b1.58's quantization approach.
- LLaMA / LLaMA-2 (Touvron et al., 2023): The FP16 baselines used throughout this paper.
- GPipe (Huang et al., 2019): The pipeline parallelism approach used in throughput experiments for the 70B comparison.
8. Relevance to DynamICCL
BitNet b1.58 is a workload characterization input for DynamICCL — it describes how the fundamental nature of LLM training and inference workloads is changing.
Effect on AllReduce message characteristics. During distributed training of a BitNet b1.58 model, gradient accumulation still produces full-precision gradients (STE passes float gradients), but the effective model state being synchronized has a dramatically smaller memory footprint. If gradient communication were performed on quantized gradients (a research direction not yet in this paper), AllReduce message sizes would shrink roughly 8-16x. This shifts collective communication from bandwidth-bound to latency-bound, requiring DynamICCL to favor tree-based algorithms and low-latency protocols (e.g., NCCL's LL) over ring algorithms optimized for large bandwidth-bound transfers.
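A rough sizing sketch of that hypothetical quantized-gradient case (our illustration; the 3B parameter count and 2-bit packing of ternary values are assumptions):

```python
# Speculative: gradient quantization is NOT part of the BitNet b1.58 paper; this
# just quantifies the "roughly 8-16x" claim above for a 3B-parameter model,
# assuming ternary gradients packed at 2 bits per value.
params = 3e9

payloads = {
    "FP32 gradients": params * 4,      # bytes: full-precision sync
    "FP16 gradients": params * 2,      # bytes: mixed-precision sync
    "2-bit packed":   params * 2 / 8,  # bytes: hypothetical ternary gradients
}
for name, size in payloads.items():
    print(f"{name:>14}: {size / 1e9:5.2f} GB per AllReduce step")
# 2-bit packed is 8x smaller than FP16 and 16x smaller than FP32.
```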
Higher compute-to-communication ratio for inference serving. With 8.9x higher throughput per hardware unit, a BitNet b1.58 serving cluster generates proportionally more output per unit time. This increases the frequency of AllReduce-equivalent synchronization operations (in multi-node inference) per wall-clock second, increasing the marginal value of reducing per-collective latency — precisely DynamICCL's target.
MoE AllToAll relevance. The paper's projection of 1-bit MoE models with reduced inter-chip communication is directly relevant to DynamICCL's collective selection capability. AllToAll (the MoE routing collective) has very different algorithm/protocol trade-offs than AllReduce. DynamICCL would need to extend its action space to cover AllToAll configurations for this future workload.
Training traffic pattern. BitNet b1.58 is trained from scratch with the same mini-batch structure as FP16 LLMs. The AllReduce pattern — once per backward pass, predictable timing, well-defined tensor size — is the same regular pattern that DynamICCL's LSTM+CUSUM Agent-1 is designed to model. The main difference is that AllReduce message size is smaller, and gradient synchronization completes faster, potentially changing the congestion signature that Agent-1 must detect.
Citation
@article{ma2024era,
title = {The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits},
author = {Ma, Shuming and Wang, Hongyu and Ma, Lingxiao and Wang, Lei and
Wang, Wenhui and Huang, Shaohan and Dong, Li and Wang, Ruiping and
Xue, Jilong and Wei, Furu},
journal = {arXiv preprint arXiv:2402.17764},
year = {2024}
}