Approximate Inference: Sampling Methods
Why Approximate Inference?
Exact inference (variable elimination, junction tree) has cost exponential in the network's treewidth, so it is intractable for high-treewidth networks.
Monte Carlo methods: estimate probabilities by sampling.
Direct Sampling (Prior Sampling)
Generate samples from the full joint P(X₁,…,Xₙ) using the topological order:
function PRIOR-SAMPLE(bn) returns event
x ← {}
for each Xᵢ in topological order do
x[Xᵢ] ← sample from P(Xᵢ | Parents(Xᵢ)) given x
return x
For query P(X=x): count samples where X=x, divide by total samples.
Consistency: as N→∞, estimates converge to true probabilities.
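As a concrete sketch, prior sampling on a tiny two-node network Rain → WetGrass (the network and its CPTs are illustrative assumptions, not taken from these notes):

```python
import random

random.seed(0)

# Illustrative two-node network (an assumption for this sketch):
# Rain -> WetGrass, with P(Rain=T) = 0.2,
# P(Wet=T | Rain=T) = 0.9, P(Wet=T | Rain=F) = 0.1.
def prior_sample():
    # Sample each variable in topological order, given its sampled parents.
    rain = random.random() < 0.2
    wet = random.random() < (0.9 if rain else 0.1)
    return rain, wet

# Estimate P(Wet=T) by counting; the true value is 0.2*0.9 + 0.8*0.1 = 0.26.
N = 100_000
estimate = sum(prior_sample()[1] for _ in range(N)) / N
```

With N = 100,000 the estimate lands close to the exact marginal 0.26, illustrating the consistency claim above.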
Problem: for conditional queries P(X | e), most samples may not agree with evidence e → rejection sampling is inefficient.
Rejection Sampling
function REJECTION-SAMPLING(X, e, bn, N) returns an estimate of P(X | e)
counts ← a vector of counts over values of X, initially all zero
for j = 1 to N do
x ← PRIOR-SAMPLE(bn)
if x is consistent with e then
counts[x[X]] += 1
return NORMALIZE(counts)
Problem: if P(e) is small, almost all samples are rejected. The expected number of samples needed per accepted sample is 1/P(e), which typically grows exponentially with the number of evidence variables.
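The rejection loop can be sketched on the same illustrative Rain → WetGrass network used above (CPTs are assumptions), answering the query P(Rain | Wet = true):

```python
import random

random.seed(1)

# Illustrative network (an assumption): Rain -> WetGrass,
# P(Rain=T)=0.2, P(Wet=T|Rain=T)=0.9, P(Wet=T|Rain=F)=0.1.
def prior_sample():
    rain = random.random() < 0.2
    wet = random.random() < (0.9 if rain else 0.1)
    return rain, wet

def rejection_sampling(n):
    """Estimate P(Rain=T | Wet=T) by discarding samples with Wet=F."""
    kept = rain_count = 0
    for _ in range(n):
        rain, wet = prior_sample()
        if wet:                     # keep only samples consistent with e
            kept += 1
            rain_count += rain
    return rain_count / kept, kept

estimate, kept = rejection_sampling(100_000)
# Exact posterior: (0.2*0.9) / 0.26 ≈ 0.692; only about 26% of samples
# survive the consistency check, and the rest of the work is wasted.
```

The `kept` count makes the inefficiency visible: roughly 74% of the generated samples are thrown away even for this mild evidence.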
Likelihood Weighting
Fix evidence variables to observed values; weight each sample by the probability of evidence:
function LIKELIHOOD-WEIGHTING(X, e, bn, N) returns an estimate of P(X | e)
W ← a vector of weighted counts over values of X, initially all zero
for j = 1 to N do
x, w ← WEIGHTED-SAMPLE(bn, e)
W[x[X]] += w
return NORMALIZE(W)
function WEIGHTED-SAMPLE(bn, e)
x ← {}; w ← 1.0
for each Xᵢ in topological order do
if Xᵢ has observed value xᵢ in e then
w ← w · P(xᵢ | Parents(Xᵢ))
x[Xᵢ] ← xᵢ
else
x[Xᵢ] ← sample from P(Xᵢ | Parents(Xᵢ))
return x, w
Advantage over rejection: never rejects; uses all samples.
Problem: evidence influences only the sampling of variables downstream of it (its descendants); variables upstream of the evidence (its ancestors) are still sampled from the prior, not the posterior. With many or late-appearing evidence variables, most weights become very small and a few samples dominate the estimate.
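On the same illustrative Rain → WetGrass network (an assumption), WEIGHTED-SAMPLE reduces to: sample the unobserved Rain from its prior, fix the evidence Wet = true, and multiply in its conditional probability:

```python
import random

random.seed(2)

# Illustrative network (an assumption): Rain -> WetGrass,
# P(Rain=T)=0.2, P(Wet=T|Rain=T)=0.9, P(Wet=T|Rain=F)=0.1.
# Evidence: Wet = True.
def weighted_sample():
    # Rain is unobserved: sample it from P(Rain).
    rain = random.random() < 0.2
    # Wet is evidence: fix it to True; the weight is P(Wet=T | rain).
    w = 0.9 if rain else 0.1
    return rain, w

def likelihood_weighting(n):
    w_rain = w_total = 0.0
    for _ in range(n):
        rain, w = weighted_sample()
        w_total += w               # every sample contributes; none rejected
        if rain:
            w_rain += w
    return w_rain / w_total        # normalized weighted count

estimate = likelihood_weighting(100_000)
# Exact posterior P(Rain=T | Wet=T) = 0.18 / 0.26 ≈ 0.692.
```

Every sample contributes, unlike rejection sampling; the cost is that the weights (here 0.9 vs 0.1) carry the evidence, and with many evidence variables their product shrinks rapidly.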
MCMC: Markov Chain Monte Carlo
Sample from the posterior P(non-evidence | evidence) directly using a Markov chain.
Gibbs Sampling
Repeatedly re-sample one variable at a time, conditioned on all other current values:
function GIBBS-ASK(X, e, bn, N) returns an estimate of P(X | e)
Z ← non-evidence variables
x ← random assignment to Z; set e values
counts ← 0
for j = 1 to N do
for each Zᵢ in Z do
sample Zᵢ from P(Zᵢ | Markov-blanket(Zᵢ))
update x
counts[x[X]] += 1
return NORMALIZE(counts)
Markov blanket of Xᵢ = Parents(Xᵢ) ∪ Children(Xᵢ) ∪ Parents-of-Children(Xᵢ)
Key property: P(Xᵢ | all other vars) = P(Xᵢ | MarkovBlanket(Xᵢ)) — only local knowledge needed.
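A minimal Gibbs sampler for a three-variable chain A → B → C with evidence C = true (the chain and its CPTs are assumptions chosen so the exact posterior is easy to check by hand). Each non-evidence variable is resampled from its Markov-blanket conditional:

```python
import random

random.seed(3)

# Illustrative chain A -> B -> C (all binary); CPTs are assumptions:
P_A = 0.5                        # P(A=T)
P_B = {True: 0.8, False: 0.2}    # P(B=T | A)
P_C = {True: 0.9, False: 0.3}    # P(C=T | B)
# Evidence: C = True. Non-evidence variables: A and B.

def gibbs(n, burn_in=1_000):
    a = random.random() < 0.5    # random initial assignment to non-evidence
    b = random.random() < 0.5
    count_a = 0
    for t in range(n + burn_in):
        # Resample A given its Markov blanket {B}: P(A|b) ∝ P(A) P(b|A).
        pa = P_A * (P_B[True] if b else 1 - P_B[True])
        pna = (1 - P_A) * (P_B[False] if b else 1 - P_B[False])
        a = random.random() < pa / (pa + pna)
        # Resample B given its blanket {A, C=T}: P(B|a,C=T) ∝ P(B|a) P(C=T|B).
        pb = P_B[a] * P_C[True]
        pnb = (1 - P_B[a]) * P_C[False]
        b = random.random() < pb / (pb + pnb)
        if t >= burn_in:         # discard burn-in samples
            count_a += a
    return count_a / n

estimate = gibbs(100_000)
# Exact posterior P(A=T | C=T) = 0.39 / 0.6 = 0.65.
```

Note that resampling A never touches C: the Markov blanket of A in this chain is just {B}, which is exactly the locality property stated above.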
Why MCMC Works
The Markov chain defined by the Gibbs transitions is ergodic (it can reach any state from any state, provided all conditional probabilities are nonzero) and has the posterior P(Z | e) as its stationary distribution, so the long-run sample frequencies converge to the posterior.
Burn-in: first few hundred samples may not represent the stationary distribution; discard them.
Comparison
| Method | Uses evidence during sampling | Variance | Notes |
|---|---|---|---|
| Prior sampling | No | High for conditional queries | Simple baseline |
| Rejection sampling | No (discards inconsistent samples) | Grows as 1/P(e) | Inefficient for rare evidence |
| Likelihood weighting | Partially (fixes evidence, weights samples) | Better; degrades with many/late evidence | Ancestors of evidence still sampled from the prior |
| Gibbs sampling (MCMC) | Yes (conditions at every step) | Low after mixing | Samples from the posterior; needs burn-in |
Particle Filtering
A sequential Monte Carlo (SMC) method for temporal models (Ch. 14); despite the name similarity, it is not MCMC:
- Maintain N particles (sampled states)
- Propagate each particle forward using the transition model
- On each new observation: weight each particle by the observation likelihood, then resample N particles in proportion to the weights
Particle filtering is the practical algorithm for online Bayesian inference in dynamic systems (used in robot localization, tracking).
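The propagate/weight/resample loop can be sketched as a bootstrap particle filter on a toy 1-D model (the random-walk dynamics, unit noise variances, and observation sequence are all assumptions for illustration):

```python
import math
import random

random.seed(4)

# Toy 1-D model (assumptions for this sketch): the hidden state follows a
# random walk x_t = x_{t-1} + N(0,1); each observation is y_t = x_t + N(0,1).
def gauss_pdf(y, mu):
    # Density of N(mu, 1) at y, used as the observation likelihood.
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

def particle_filter(observations, n=5_000):
    particles = [random.gauss(0, 1) for _ in range(n)]  # prior x_0 ~ N(0,1)
    for y in observations:
        # 1. Propagate each particle through the transition model.
        particles = [x + random.gauss(0, 1) for x in particles]
        # 2. Weight particles by the observation likelihood P(y | x).
        weights = [gauss_pdf(y, x) for x in particles]
        # 3. Resample n particles in proportion to their weights.
        particles = random.choices(particles, weights=weights, k=n)
    return sum(particles) / n    # posterior-mean estimate of the final state

estimate = particle_filter([1.0, 1.5, 2.0, 2.5])
# For this linear-Gaussian model an exact Kalman filter gives a posterior
# mean of about 2.19, which the particle estimate should approach.
```

Because this toy model is linear-Gaussian, the exact answer is available from a Kalman filter, which makes it a convenient sanity check; in the nonlinear settings where particle filters are actually used, no such closed form exists.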
Connection to RL
- MCMC sampling = exploration in model-based RL (sampling from the belief state)
- Particle filters = belief state representation in POMDPs (finite approximation of continuous belief)
- Variational inference (ELBO, VAE) is the deterministic alternative to MCMC — used in deep generative models