Approximate Inference: Sampling Methods
Why Approximate Inference?
Exact inference (variable elimination, junction tree) has cost exponential in the network's treewidth, so it is intractable for high-treewidth networks.
Monte Carlo methods: estimate probabilities by sampling.
Direct Sampling (Prior Sampling)
Generate samples from the full joint P(X₁,…,Xₙ) using the topological order:
function PRIOR-SAMPLE(bn) returns event
x ← {}
for each Xᵢ in topological order do
x[Xᵢ] ← sample from P(Xᵢ | Parents(Xᵢ)) given x
return x
For query P(X=x): count samples where X=x, divide by total samples.
Consistency: as N→∞, estimates converge to true probabilities.
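As a concrete sketch, prior sampling on a tiny two-node network Rain → WetGrass (the network and its CPTs are illustrative assumptions, not taken from these notes):

```python
import random

random.seed(0)

# Illustrative two-node network (an assumption for this sketch):
# Rain -> WetGrass, with P(Rain=T) = 0.2,
# P(Wet=T | Rain=T) = 0.9, P(Wet=T | Rain=F) = 0.1.
def prior_sample():
    # Sample each variable in topological order, given its sampled parents.
    rain = random.random() < 0.2
    wet = random.random() < (0.9 if rain else 0.1)
    return rain, wet

# Estimate P(Wet=T) by counting; the true value is 0.2*0.9 + 0.8*0.1 = 0.26.
N = 100_000
estimate = sum(prior_sample()[1] for _ in range(N)) / N
```

With N = 100,000 the estimate lands close to the exact marginal 0.26, illustrating the consistency claim above.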
Problem: for conditional queries P(X | e), most samples may not agree with evidence e → rejection sampling is inefficient.
Rejection Sampling
function REJECTION-SAMPLING(X, e, bn, N) returns an estimate of P(X | e)
counts ← a vector of counts over values of X, initially all zero
for j = 1 to N do
x ← PRIOR-SAMPLE(bn)
if x is consistent with e then
counts[x[X]] += 1
return NORMALIZE(counts)
Problem: if P(e) is small, almost all samples are rejected. The expected number of samples needed per accepted sample is 1/P(e), which typically grows exponentially with the number of evidence variables.
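The rejection loop can be sketched on the same illustrative Rain → WetGrass network used above (CPTs are assumptions), answering the query P(Rain | Wet = true):

```python
import random

random.seed(1)

# Illustrative network (an assumption): Rain -> WetGrass,
# P(Rain=T)=0.2, P(Wet=T|Rain=T)=0.9, P(Wet=T|Rain=F)=0.1.
def prior_sample():
    rain = random.random() < 0.2
    wet = random.random() < (0.9 if rain else 0.1)
    return rain, wet

def rejection_sampling(n):
    """Estimate P(Rain=T | Wet=T) by discarding samples with Wet=F."""
    kept = rain_count = 0
    for _ in range(n):
        rain, wet = prior_sample()
        if wet:                     # keep only samples consistent with e
            kept += 1
            rain_count += rain
    return rain_count / kept, kept

estimate, kept = rejection_sampling(100_000)
# Exact posterior: (0.2*0.9) / 0.26 ≈ 0.692; only about 26% of samples
# survive the consistency check, and the rest of the work is wasted.
```

The `kept` count makes the inefficiency visible: roughly 74% of the generated samples are thrown away even for this mild evidence.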
Likelihood Weighting
Fix evidence variables to observed values; weight each sample by the probability of evidence:
function LIKELIHOOD-WEIGHTING(X, e, bn, N) returns an estimate of P(X | e)
W ← a vector of weighted counts over values of X, initially all zero
for j = 1 to N do
x, w ← WEIGHTED-SAMPLE(bn, e)
W[x[X]] += w
return NORMALIZE(W)
function WEIGHTED-SAMPLE(bn, e)
x ← {}; w ← 1.0
for each Xᵢ in topological order do
if Xᵢ has observed value xᵢ in e then
w ← w · P(xᵢ | Parents(Xᵢ))
x[Xᵢ] ← xᵢ
else
x[Xᵢ] ← sample from P(Xᵢ | Parents(Xᵢ))
return x, w
Advantage over rejection: never rejects; uses all samples.
Problem: evidence influences only the sampling of variables downstream of it (its descendants); variables upstream of the evidence (its ancestors) are still sampled from the prior, not the posterior. With many or late-appearing evidence variables, most weights become very small and a few samples dominate the estimate.
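On the same illustrative Rain → WetGrass network (an assumption), WEIGHTED-SAMPLE reduces to: sample the unobserved Rain from its prior, fix the evidence Wet = true, and multiply in its conditional probability:

```python
import random

random.seed(2)

# Illustrative network (an assumption): Rain -> WetGrass,
# P(Rain=T)=0.2, P(Wet=T|Rain=T)=0.9, P(Wet=T|Rain=F)=0.1.
# Evidence: Wet = True.
def weighted_sample():
    # Rain is unobserved: sample it from P(Rain).
    rain = random.random() < 0.2
    # Wet is evidence: fix it to True; the weight is P(Wet=T | rain).
    w = 0.9 if rain else 0.1
    return rain, w

def likelihood_weighting(n):
    w_rain = w_total = 0.0
    for _ in range(n):
        rain, w = weighted_sample()
        w_total += w               # every sample contributes; none rejected
        if rain:
            w_rain += w
    return w_rain / w_total        # normalized weighted count

estimate = likelihood_weighting(100_000)
# Exact posterior P(Rain=T | Wet=T) = 0.18 / 0.26 ≈ 0.692.
```

Every sample contributes, unlike rejection sampling; the cost is that the weights (here 0.9 vs 0.1) carry the evidence, and with many evidence variables their product shrinks rapidly.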
MCMC: Markov Chain Monte Carlo
Sample from the posterior P(non-evidence | evidence) directly using a Markov chain.
Gibbs Sampling
Repeatedly re-sample one variable at a time, conditioned on all other current values:
function GIBBS-ASK(X, e, bn, N) returns an estimate of P(X | e)
Z ← non-evidence variables
x ← random assignment to Z; set e values
counts ← 0
for j = 1 to N do
for each Zᵢ in Z do
sample Zᵢ from P(Zᵢ | Markov-blanket(Zᵢ))
update x
counts[x[X]] += 1
return NORMALIZE(counts)
Markov blanket of Xᵢ = Parents(Xᵢ) ∪ Children(Xᵢ) ∪ Parents-of-Children(Xᵢ)
Key property: P(Xᵢ | all other vars) = P(Xᵢ | MarkovBlanket(Xᵢ)) — only local knowledge needed.
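A minimal Gibbs sampler for a three-variable chain A → B → C with evidence C = true (the chain and its CPTs are assumptions chosen so the exact posterior is easy to check by hand). Each non-evidence variable is resampled from its Markov-blanket conditional:

```python
import random

random.seed(3)

# Illustrative chain A -> B -> C (all binary); CPTs are assumptions:
P_A = 0.5                        # P(A=T)
P_B = {True: 0.8, False: 0.2}    # P(B=T | A)
P_C = {True: 0.9, False: 0.3}    # P(C=T | B)
# Evidence: C = True. Non-evidence variables: A and B.

def gibbs(n, burn_in=1_000):
    a = random.random() < 0.5    # random initial assignment to non-evidence
    b = random.random() < 0.5
    count_a = 0
    for t in range(n + burn_in):
        # Resample A given its Markov blanket {B}: P(A|b) ∝ P(A) P(b|A).
        pa = P_A * (P_B[True] if b else 1 - P_B[True])
        pna = (1 - P_A) * (P_B[False] if b else 1 - P_B[False])
        a = random.random() < pa / (pa + pna)
        # Resample B given its blanket {A, C=T}: P(B|a,C=T) ∝ P(B|a) P(C=T|B).
        pb = P_B[a] * P_C[True]
        pnb = (1 - P_B[a]) * P_C[False]
        b = random.random() < pb / (pb + pnb)
        if t >= burn_in:         # discard burn-in samples
            count_a += a
    return count_a / n

estimate = gibbs(100_000)
# Exact posterior P(A=T | C=T) = 0.39 / 0.6 = 0.65.
```

Note that resampling A never touches C: the Markov blanket of A in this chain is just {B}, which is exactly the locality property stated above.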
Why MCMC Works
The Markov chain defined by the Gibbs transitions is ergodic (it can reach any state from any state, provided all conditional probabilities are nonzero) and has the posterior P(Z | e) as its stationary distribution, so the long-run sample frequencies converge to the posterior.
Burn-in: first few hundred samples may not represent the stationary distribution; discard them.
Comparison
| Method | Uses evidence during sampling | Variance | Notes |
|---|---|---|---|
| Prior sampling | No | High for conditional queries | Simple baseline |
| Rejection sampling | No (discards inconsistent samples) | Grows as 1/P(e) | Inefficient for rare evidence |
| Likelihood weighting | Partially (fixes evidence, weights samples) | Better; degrades with many/late evidence | Ancestors of evidence still sampled from the prior |
| Gibbs sampling (MCMC) | Yes (conditions at every step) | Low after mixing | Samples from the posterior; needs burn-in |
Particle Filtering
A sequential Monte Carlo (SMC) method for temporal models (Ch. 14); despite the name similarity, it is not MCMC:
- Maintain N particles (sampled states)
- Propagate each particle forward using the transition model
- On each new observation: weight each particle by the observation likelihood, then resample N particles in proportion to the weights
Particle filtering is the practical algorithm for online Bayesian inference in dynamic systems (used in robot localization, tracking).
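The propagate/weight/resample loop can be sketched as a bootstrap particle filter on a toy 1-D model (the random-walk dynamics, unit noise variances, and observation sequence are all assumptions for illustration):

```python
import math
import random

random.seed(4)

# Toy 1-D model (assumptions for this sketch): the hidden state follows a
# random walk x_t = x_{t-1} + N(0,1); each observation is y_t = x_t + N(0,1).
def gauss_pdf(y, mu):
    # Density of N(mu, 1) at y, used as the observation likelihood.
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

def particle_filter(observations, n=5_000):
    particles = [random.gauss(0, 1) for _ in range(n)]  # prior x_0 ~ N(0,1)
    for y in observations:
        # 1. Propagate each particle through the transition model.
        particles = [x + random.gauss(0, 1) for x in particles]
        # 2. Weight particles by the observation likelihood P(y | x).
        weights = [gauss_pdf(y, x) for x in particles]
        # 3. Resample n particles in proportion to their weights.
        particles = random.choices(particles, weights=weights, k=n)
    return sum(particles) / n    # posterior-mean estimate of the final state

estimate = particle_filter([1.0, 1.5, 2.0, 2.5])
# For this linear-Gaussian model an exact Kalman filter gives a posterior
# mean of about 2.19, which the particle estimate should approach.
```

Because this toy model is linear-Gaussian, the exact answer is available from a Kalman filter, which makes it a convenient sanity check; in the nonlinear settings where particle filters are actually used, no such closed form exists.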
Connection to RL
- MCMC sampling = exploration in model-based RL (sampling from the belief state)
- Particle filters = belief state representation in POMDPs (finite approximation of continuous belief)
- Variational inference (ELBO, VAE) is the deterministic alternative to MCMC — used in deep generative models