Probabilistic Programming
Chapter 15 — Probabilistic Programming. Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.), pages 500–527.
What is Probabilistic Programming?
A probabilistic programming language (PPL) embeds probability distributions as first-class objects in a programming language, allowing:

- Arbitrary generative models expressed as programs
- Automatic inference (the language handles the “how to compute posteriors” part)
- The expressive power of Turing-complete programs combined with principled probabilistic reasoning
Core idea: a probabilistic program = a prior over execution traces. Conditioning on observations = posterior over traces.
The Generative Model View
Any probabilistic model can be written as:
```python
def generative_model():
    # Sample latent variables (the "story" of how the data was generated)
    theta = sample('theta', prior_distribution)
    # Generate observations from the likelihood
    obs = sample('obs', likelihood(theta))
    return obs
```
Inference: given the observation obs = data, compute P(theta | obs = data).
The PPL automates this inference — the programmer only writes the generative model.
Example: Coin Flipping
```python
def coin_model(data):
    # Prior: belief about the coin's bias
    p = sample('p', Beta(2, 2))
    # Likelihood: each flip is Bernoulli(p)
    for i, d in enumerate(data):
        observe(f'flip_{i}', Bernoulli(p), d)
    return p

# Inference: posterior P(p | data), with flips encoded as 0/1
posterior = infer(coin_model, data=[1, 1, 0, 1, 1, 0])
```
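The same model written in a concrete PPL makes these constructs explicit. Below is a sketch in NumPyro (assuming `numpyro` and `jax` are installed); note that NumPyro conditions via `obs=` in `sample` rather than a separate `observe` primitive:

```python
# Sketch of the coin model in NumPyro (illustrative, not from the chapter).
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def coin_model(data):
    # Prior over the coin bias
    p = numpyro.sample('p', dist.Beta(2.0, 2.0))
    # Condition each flip on its observed outcome
    with numpyro.plate('flips', len(data)):
        numpyro.sample('obs', dist.Bernoulli(p), obs=data)

data = jnp.array([1, 1, 0, 1, 1, 0])
mcmc = MCMC(NUTS(coin_model), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(0), data)
print(mcmc.get_samples()['p'].mean())  # posterior mean of the bias
```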
Key PPL Constructs
| Construct | Purpose |
|---|---|
| `sample(name, dist)` | Sample a random variable from a distribution |
| `observe(name, dist, value)` | Condition on an observed value |
| `infer(model, data)` | Run inference → return the posterior distribution |
Inference Algorithms in PPLs
PPLs provide multiple backends:
MCMC-based
- Metropolis-Hastings: propose → accept/reject based on probability ratio
- HMC (Hamiltonian Monte Carlo): use gradient information for efficient proposals
- NUTS (No U-Turn Sampler): adaptive HMC — used in Stan, PyMC3
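As an illustration of the propose/accept-reject loop, here is a minimal Metropolis-Hastings sketch for the coin-bias posterior above (plain Python; the Gaussian proposal, step size, and chain length are arbitrary choices):

```python
# Minimal Metropolis-Hastings sketch for the Beta-Bernoulli coin model.
import math
import random

def log_joint(p, data):
    """Unnormalized log P(data, p): Beta(2, 2) prior + Bernoulli likelihood."""
    if not 0.0 < p < 1.0:
        return -math.inf
    log_prob = math.log(p) + math.log(1.0 - p)  # Beta(2, 2) prior, up to a constant
    for d in data:
        log_prob += math.log(p) if d else math.log(1.0 - p)
    return log_prob

def metropolis_hastings(data, steps=5000, step_size=0.1):
    p, samples = 0.5, []
    for _ in range(steps):
        proposal = p + random.gauss(0.0, step_size)         # symmetric random-walk proposal
        log_ratio = log_joint(proposal, data) - log_joint(p, data)
        if random.random() < math.exp(min(0.0, log_ratio)):  # accept with prob min(1, ratio)
            p = proposal
        samples.append(p)
    return samples

samples = metropolis_hastings([1, 1, 0, 1, 1, 0])
print(sum(samples) / len(samples))  # ≈ posterior mean of the bias
```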
Variational Inference (VI)
- Approximate posterior with a family Q(θ|φ) parameterized by φ
- Optimize φ to minimize KL(Q || P), equivalently maximize the ELBO:
  ELBO(φ) = E_Q[log P(x, θ)] − E_Q[log Q(θ|φ)]
- Mean-field VI: Q factors as Π Qᵢ(θᵢ) — coordinate ascent updates
- ADVI: automatic differentiation variational inference — gradient-based
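For intuition about what the ELBO measures, here is a Monte Carlo sketch for the coin model using a Beta(a, b) variational family (SciPy/NumPy assumed available; the candidate parameters compared at the end are arbitrary):

```python
# Monte Carlo estimate of the ELBO for the coin model with a Beta(a, b)
# variational family over the bias p (illustrative sketch).
import numpy as np
from scipy import stats

def elbo(a, b, data, num_samples=2000, seed=0):
    rng = np.random.default_rng(seed)
    q = stats.beta(a, b)                                 # variational distribution Q(p | a, b)
    p = q.rvs(size=num_samples, random_state=rng)        # samples p ~ Q
    # log P(data, p): Beta(2, 2) prior plus Bernoulli likelihood of each flip
    log_joint = stats.beta(2, 2).logpdf(p)
    for d in data:
        log_joint += stats.bernoulli(p).logpmf(d)
    # ELBO(a, b) = E_Q[log P(data, p)] - E_Q[log Q(p)]
    return np.mean(log_joint - q.logpdf(p))

data = [1, 1, 0, 1, 1, 0]
print(elbo(6, 4, data), elbo(1, 1, data))  # higher ELBO = better variational fit
```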
Sequential Monte Carlo (SMC)
- Particle-based — good for sequential/temporal models
- Integrates well with the forward-backward structure of Ch.14
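A minimal bootstrap particle filter sketch for a 1-D Gaussian random-walk state-space model (the model and noise scales are illustrative, not from the chapter):

```python
# Bootstrap particle filter (SMC) sketch: propagate, weight, resample.
import numpy as np

def particle_filter(observations, n_particles=500, proc_noise=0.5, obs_noise=1.0, seed=0):
    rng = np.random.default_rng(seed)
    particles = rng.normal(0.0, 1.0, n_particles)        # initial particle cloud
    estimates = []
    for y in observations:
        # Propagate each particle through the transition model x_t = x_{t-1} + noise
        particles = particles + rng.normal(0.0, proc_noise, n_particles)
        # Weight particles by the observation likelihood N(y; x_t, obs_noise^2)
        weights = np.exp(-0.5 * ((y - particles) / obs_noise) ** 2)
        weights /= weights.sum()
        # Resample: concentrate particles where the posterior mass is
        particles = rng.choice(particles, size=n_particles, p=weights)
        estimates.append(particles.mean())
    return estimates

print(particle_filter([0.2, 0.5, 1.1, 1.4]))  # filtered state estimate per time step
```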
Major PPLs
| Language | Backend | Notes |
|---|---|---|
| Stan | HMC/NUTS | Explicit BUGS-like syntax; excellent for statistics |
| PyMC3/PyMC | NUTS + VI (Theano/JAX) | Python-native; flexible |
| Pyro / NumPyro | SVI + NUTS (PyTorch/JAX) | Neural PPL; deep probabilistic models |
| Edward2 | VI + MCMC (TF) | TensorFlow-based |
| Church / Venture | MH over traces | Original LISP-based research PPLs |
| WebPPL | MH + SMC | JavaScript; browser-based |
Open-Universe Probability Models
Classical BNs have a fixed set of variables. PPLs support open-universe models where:

- The number of objects is unknown (random)
- New objects can be created during inference
Example: How many vehicles are in a surveillance video? Sample the count, then explain the observations.
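A toy forward-sampling sketch of that pattern in plain Python/NumPy (the Poisson prior and detection noise are illustrative assumptions): the vehicle count is itself sampled, and per-vehicle latents exist only for the sampled count.

```python
# Open-universe toy model: the number of objects is a latent variable.
import numpy as np

def surveillance_model(seed=0):
    rng = np.random.default_rng(seed)
    # Unknown object count: Poisson prior over how many vehicles exist
    n_vehicles = rng.poisson(3)
    # Per-object latents are created only for the objects that exist
    positions = rng.uniform(0, 100, size=n_vehicles)
    # Observations (noisy detections) are explained by the sampled objects
    detections = positions + rng.normal(0, 2.0, size=n_vehicles)
    return n_vehicles, detections

print(surveillance_model())  # forward sample: a count plus that many detections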
Languages: BLOG (Bayesian Logic), IBAL.
Neural Probabilistic Programs
Modern deep learning meets PPLs:

- Variational Autoencoders (VAEs): a PPL with a neural encoder (approximate posterior) + neural decoder (likelihood)
- Normalizing Flows: learnable invertible transformations → exact posterior samples
- Deep Probabilistic Models (Pyro): neural networks as components inside a PPL
This enables fitting complex distributions to high-dimensional data (images, text) while maintaining probabilistic semantics.
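As a sketch of that last point, a neural decoder can serve as the likelihood inside a Pyro model (Pyro and PyTorch assumed installed; the architecture and latent dimension are illustrative, and a full VAE would also need a neural guide/encoder):

```python
# Sketch: a neural network as the likelihood inside a Pyro model (illustrative).
import torch
import torch.nn as nn
import pyro
import pyro.distributions as dist

# Decoder network: 2-D latent code -> pixel probabilities for 28x28 binary images
decoder = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 784), nn.Sigmoid())

def decoder_model(images):
    pyro.module("decoder", decoder)  # register the network's parameters with Pyro
    with pyro.plate("data", images.shape[0]):
        # Latent code per image, with a standard-normal prior
        z = pyro.sample("z", dist.Normal(torch.zeros(2), torch.ones(2)).to_event(1))
        # Neural likelihood: the decoder maps z to Bernoulli pixel probabilities
        pyro.sample("obs", dist.Bernoulli(decoder(z)).to_event(1), obs=images)
```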
Relevance to DynamICCL
PPLs enable Bayesian experimental design for NCCL:

- Model NCCL throughput as a probabilistic function of its parameters
- Use VI or HMC to infer a posterior over those parameters given measured throughput
- Select the next configuration to try using expected improvement (an acquisition function, sketched below)
This is Bayesian optimization — a PPL-enabled approach to hyperparameter tuning that is more sample-efficient than RL for small parameter spaces.
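A hedged sketch of the expected-improvement acquisition, given a posterior predictive mean and standard deviation of throughput for each candidate configuration (the function and names below are illustrative, not part of DynamICCL):

```python
# Expected improvement (EI) for maximizing throughput; mu/sigma come from the
# posterior predictive over a candidate configuration (illustrative sketch).
import math

def expected_improvement(mu, sigma, best_so_far, xi=0.01):
    """Expected amount by which a candidate beats the best observed value."""
    if sigma <= 0.0:
        return 0.0
    z = (mu - best_so_far - xi) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (mu - best_so_far - xi) * cdf + sigma * pdf

# Usage: score each candidate from the fitted model's posterior predictive, e.g.
# candidates = [(config, mu, sigma), ...]
# next_config = max(candidates, key=lambda c: expected_improvement(c[1], c[2], best))
```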