Learning Probabilistic Models

Chapter 20: Learning Probabilistic Models
Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.)
Pages: 710–752


Statistical Learning Framework

Goal: learn the parameters (and sometimes structure) of a probabilistic model from data.

Maximum Likelihood Estimation (MLE):

θ_MLE = argmax_θ L(θ | data) = argmax_θ Σᵢ log P(xᵢ | θ)

(for i.i.d. data, L(θ | data) = Πᵢ P(xᵢ | θ); taking the log does not change the argmax)

Maximum A Posteriori (MAP):

θ_MAP = argmax_θ [ log P(θ) + Σᵢ log P(xᵢ | θ) ]

MAP is MLE plus the extra term log P(θ): the prior acts as a regularizer, and with a uniform prior MAP reduces to MLE.
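As a minimal illustration of the two estimators, here is a sketch for a Bernoulli (coin-flip) model with a Beta prior; the data, the prior hyperparameters α = β = 2, and all names are invented for the example.

import numpy as np

# Illustrative coin-flip data (1 = heads, 0 = tails); invented for this sketch.
data = np.array([1, 1, 0, 1, 1, 1, 0, 1])
heads, n = data.sum(), len(data)

# MLE: the empirical frequency maximizes the log-likelihood.
theta_mle = heads / n

# MAP with a Beta(alpha, beta) prior (conjugate to the Bernoulli): the prior's
# pseudo-counts act as a regularizer that shrinks the estimate toward 0.5.
alpha, beta = 2.0, 2.0
theta_map = (heads + alpha - 1) / (n + alpha + beta - 2)

print(theta_mle, theta_map)   # MAP lies between the MLE and the prior mean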


Learning Naïve Bayes

For classification with features F₁,…,Fₙ and class C:

Model: P(C | F₁,…,Fₙ) ∝ P(C) · Π P(Fᵢ | C)

MLE parameters:

P̂(C=c) = count(C=c) / N
P̂(Fᵢ=f | C=c) = count(Fᵢ=f, C=c) / count(C=c)

Laplace smoothing (MAP with Dirichlet prior):

P̂(Fᵢ=f | C=c) = (count(Fᵢ=f, C=c) + 1) / (count(C=c) + |values(Fᵢ)|)
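A short sketch of these estimates in code; the toy dataset, feature names, and class labels below are invented for illustration, not taken from the chapter.

from collections import Counter, defaultdict

# Toy dataset (invented): each example is (feature dict, class label).
data = [({"outlook": "sunny", "windy": "no"},  "play"),
        ({"outlook": "rain",  "windy": "yes"}, "stay"),
        ({"outlook": "sunny", "windy": "yes"}, "play"),
        ({"outlook": "rain",  "windy": "no"},  "stay")]

N = len(data)
class_counts = Counter(c for _, c in data)            # count(C = c)
joint_counts = defaultdict(Counter)                   # count(Fi = f, C = c)
feature_values = defaultdict(set)                     # observed values of each feature
for feats, c in data:
    for f, v in feats.items():
        joint_counts[(f, c)][v] += 1
        feature_values[f].add(v)

def p_class(c):
    """MLE prior: count(C = c) / N."""
    return class_counts[c] / N

def p_feature(f, v, c):
    """Laplace-smoothed estimate of P(F = v | C = c)."""
    return (joint_counts[(f, c)][v] + 1) / (class_counts[c] + len(feature_values[f]))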

Learning Bayesian Networks

Known Structure, Fully Observed Variables

MLE: each CPT entry is estimated by counting, P̂(X=x | Parents=u) = count(X=x, Parents=u) / count(Parents=u). Closed-form; no iterative search needed.
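For instance, a counting routine for a single CPT might look like the sketch below, assuming fully observed records stored as dicts (the function and variable names are illustrative):

from collections import Counter

def mle_cpt(records, child, parents):
    """Estimate P(child = x | parents = u) as count(x, u) / count(u) from complete data."""
    joint = Counter()    # counts of (parent configuration, child value)
    marg = Counter()     # counts of each parent configuration
    for r in records:
        u = tuple(r[p] for p in parents)
        joint[(u, r[child])] += 1
        marg[u] += 1
    return {(u, x): c / marg[u] for (u, x), c in joint.items()}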

Known Structure, Hidden Variables: EM Algorithm

Expectation-Maximization (EM):

function EM(data, bn) returns parameters θ
    θ ← random initial parameters
    repeat until convergence:
        -- E-step: compute expected sufficient statistics
        for each example xᵢ do
            for each hidden node H do
                compute P(H | visible(xᵢ), θ) by inference in the current BN
            accumulate expected counts, weighting each completion by its posterior
        -- M-step: re-estimate parameters
        θ ← MLE of the CPTs given the expected counts from the E-step
    return θ

EM converges to a local maximum of the likelihood (not necessarily global).

Key property: likelihood is non-decreasing at each EM iteration.
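To make the loop concrete, here is a hedged sketch of EM for the simplest hidden-variable network: a naive Bayes model whose class node C is never observed (a mixture of independent Bernoulli features). The synthetic data and the choice of two hidden-class values are invented for illustration; the printed log-likelihood should not decrease across iterations.

import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))          # toy binary feature data (invented)
n, d, k = X.shape[0], X.shape[1], 2            # k = number of hidden-class values

pi = np.full(k, 1.0 / k)                       # P(C = c)
theta = rng.uniform(0.25, 0.75, size=(k, d))   # P(F_j = 1 | C = c)

for it in range(50):
    # E-step: posterior responsibility P(C = c | features) for each example
    log_p = np.log(pi) + X @ np.log(theta.T) + (1 - X) @ np.log(1 - theta.T)
    log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
    r = np.exp(log_p - log_norm)               # shape (n, k)

    # M-step: MLE from the expected counts (tiny smoothing avoids log(0))
    Nk = r.sum(axis=0)
    pi = Nk / n
    theta = (r.T @ X + 1e-3) / (Nk[:, None] + 2e-3)

    print(it, log_norm.sum())                  # observed-data log-likelihood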


The EM Algorithm in Detail

Formal EM:

Let Z be hidden variables, X observed:

Q(θ | θ_old) = E_{Z|X,θ_old}[log P(X, Z | θ)]

E-step: compute Q(θ | θ_old) -- expected complete-data log likelihood
M-step: θ_new = argmax_θ Q(θ | θ_old)

By Jensen's inequality, increasing Q cannot decrease the observed-data likelihood, so L(θ_new) ≥ L(θ_old): the likelihood never decreases.


Learning Structure

Structure learning: find the best DAG for the data.

Score-based: define a scoring function (BIC, BDe score); search over DAG space.
- BIC score: log L(data | θ_MLE) - (d/2)·log n, where d = number of free parameters and n = number of examples; penalizes complexity (see the sketch below)
- NP-hard to optimize over all DAGs → greedy local search over edge additions, deletions, and reversals
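A minimal sketch of the BIC computation under these definitions (the function name is illustrative):

import math

def bic_score(log_likelihood_at_mle, num_params, num_examples):
    """BIC = log L(data | theta_MLE) - (d/2) * log n; higher is better when comparing DAGs."""
    return log_likelihood_at_mle - 0.5 * num_params * math.log(num_examples)

# A greedy search would evaluate bic_score for each candidate edge change and keep the best.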

Constraint-based: test conditional independence relations in the data; build a DAG consistent with them.
- PC algorithm: start from a complete undirected graph, remove edges that fail conditional-independence tests (with growing conditioning sets), then orient edges using collider (v-structure) detection


Gaussian Mixture Models (GMMs)

Data from a mixture of k Gaussian distributions:

P(x) = Σₖ πₖ · N(x; μₖ, Σₖ),  with mixing weights πₖ ≥ 0 that sum to 1

EM for GMMs:

E-step: for each point x_n, compute responsibilities:

r_{nk} = πₖ · N(xₙ; μₖ, Σₖ) / Σⱼ πⱼ · N(xₙ; μⱼ, Σⱼ)

M-step:

Nₖ = Σₙ r_{nk}
μₖ = (1/Nₖ) Σₙ r_{nk} · xₙ
Σₖ = (1/Nₖ) Σₙ r_{nk} · (xₙ - μₖ)(xₙ - μₖ)ᵀ
πₖ = Nₖ / N
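A hedged, self-contained sketch of these updates, assuming NumPy and SciPy are available and k = 2; the synthetic two-cluster data and all names are invented for illustration. The log-likelihood printed each iteration should be non-decreasing.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Synthetic 2-D data drawn from two Gaussian clusters (invented for the sketch).
X = np.vstack([rng.normal([0, 0], 1.0, (150, 2)),
               rng.normal([4, 4], 1.5, (150, 2))])
N, d, k = X.shape[0], X.shape[1], 2

pi = np.full(k, 1.0 / k)
mu = X[rng.choice(N, k, replace=False)]            # random initial means
Sigma = np.array([np.eye(d) for _ in range(k)])

for it in range(50):
    # E-step: responsibilities r[n, j] = P(component j | x_n)
    dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                            for j in range(k)])
    r = dens / dens.sum(axis=1, keepdims=True)

    # M-step: weighted MLE updates from the responsibilities
    Nk = r.sum(axis=0)
    mu = (r.T @ X) / Nk[:, None]
    for j in range(k):
        diff = X - mu[j]
        Sigma[j] = (r[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(d)
    pi = Nk / N

    print(it, np.log(dens.sum(axis=1)).sum())      # log-likelihood (should be non-decreasing)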

Nonparametric Methods

k-Nearest Neighbors (k-NN)

Predict label by majority vote of k nearest neighbors (by some distance metric).

No training phase — “instance-based learning.” Test: O(n·d) per query (unless indexed).

As n→∞, 1-NN achieves an error rate at most twice the Bayes-optimal error; letting k grow with n (k→∞, k/n→0) drives the error to the Bayes-optimal rate. Sensitive to irrelevant features and to the curse of dimensionality.
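A minimal k-NN classifier sketch, assuming Euclidean distance and a brute-force O(n·d) scan per query (the toy arrays are illustrative):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # brute-force O(n * d) scan
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Toy usage (invented data): two clusters with labels 0 and 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.8, 5.1]), k=3))   # -> 1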

Kernel Density Estimation (KDE)

Nonparametric density estimation:

p̂(x) = (1/(n·h)) Σᵢ K((x - xᵢ)/h)    (in d dimensions the normalizer is 1/(n·hᵈ))

Where K is a kernel function (Gaussian, Epanechnikov) and h is the bandwidth.
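A short sketch of a 1-D Gaussian-kernel KDE, with the data and bandwidth chosen arbitrarily for illustration:

import numpy as np

def kde(x, samples, h=0.5):
    """Gaussian-kernel estimate p_hat(x) = (1 / (n*h)) * sum_i K((x - x_i) / h)."""
    u = (x - samples) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)       # standard Gaussian kernel
    return K.sum() / (len(samples) * h)

samples = np.array([-1.2, -0.8, 0.1, 0.3, 1.5, 1.7])   # toy 1-D data (invented)
print(kde(0.0, samples, h=0.5))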


Gaussian Processes (GPs)

A Gaussian Process is a distribution over functions:

f ~ GP(m(x), k(x,x'))

Where m(x) is the mean function and k(x, x') is the covariance kernel.

The posterior given observations is another GP (Gaussian conjugacy):

f* | X, y, X* ~ N(μ*, Σ*)
μ* = m(X*) + K(X*,X) · [K(X,X) + σ²I]⁻¹ · (y - m(X))
Σ* = K(X*,X*) - K(X*,X) · [K(X,X) + σ²I]⁻¹ · K(X,X*)

Applications: Bayesian optimization (acquisition function needs posterior mean + variance).
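A hedged NumPy sketch of these posterior formulas for GP regression with a squared-exponential kernel and zero mean function; the kernel hyperparameters, noise level, and toy data are arbitrary choices for illustration.

import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel k(x, x') = exp(-||x - x'||^2 / (2 * l^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / length_scale**2)

def gp_posterior(X, y, X_star, noise=1e-2):
    """Posterior mean and covariance at test inputs X_star, assuming m(x) = 0."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))      # K(X,X) + sigma^2 I
    K_s = rbf_kernel(X_star, X)                        # K(X*,X)
    K_ss = rbf_kernel(X_star, X_star)                  # K(X*,X*)
    alpha = np.linalg.solve(K, y)
    mu_star = K_s @ alpha
    Sigma_star = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mu_star, Sigma_star

# Toy 1-D regression data (invented): noisy sine observations.
X = np.linspace(0, 5, 8).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.default_rng(0).normal(size=8)
mu_star, Sigma_star = gp_posterior(X, y, np.array([[2.5], [4.0]]))
print(mu_star, np.diag(Sigma_star))   # posterior mean and variance at the test points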


Connection to RL / DynamICCL