Learning Probabilistic Models
Chapter 20 — Learning Probabilistic Models
Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.)
Pages: 710–752
Statistical Learning Framework
Goal: learn the parameters (and sometimes structure) of a probabilistic model from data.
Maximum Likelihood Estimation (MLE):
θ_MLE = argmax_θ L(θ | data) = argmax_θ Σᵢ log P(xᵢ | θ),  where L(θ | data) = Πᵢ P(xᵢ | θ) (taking the log leaves the arg max unchanged)
Maximum A Posteriori (MAP):
θ_MAP = argmax_θ [ log P(θ) + Σᵢ log P(xᵢ | θ) ]
MAP = MLE with a regularization term log P(θ) (prior acts as regularizer).
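A minimal Python sketch of the contrast above, using a hypothetical coin-flip dataset: the MLE for a Bernoulli parameter is the empirical frequency, while a MAP estimate under an assumed Beta prior pulls it toward the prior mean.

    import numpy as np

    # Hypothetical data: 1 = heads, 0 = tails
    flips = np.array([1, 1, 0, 1, 0, 1, 1, 1])
    heads, n = flips.sum(), len(flips)

    # MLE: maximizes Σᵢ log P(xᵢ | θ) for a Bernoulli model
    theta_mle = heads / n

    # MAP with an assumed Beta(a, b) prior; log P(θ) acts as a regularizer
    a, b = 2.0, 2.0
    theta_map = (heads + a - 1) / (n + a + b - 2)

    print(theta_mle, theta_map)   # 0.75 vs 0.7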
Learning Naïve Bayes
For classification with features F₁,…,Fₙ and class C:
Model: P(C | F₁,…,Fₙ) ∝ P(C) · Π P(Fᵢ | C)
MLE parameters:
P̂(C=c) = count(C=c) / N
P̂(Fᵢ=f | C=c) = count(Fᵢ=f, C=c) / count(C=c)
Laplace smoothing (MAP with Dirichlet prior):
P̂(Fᵢ=f | C=c) = (count(Fᵢ=f, C=c) + 1) / (count(C=c) + |values(Fᵢ)|)
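A small Python sketch (hypothetical toy data and feature names) of Naive Bayes parameter estimation with the count formulas and Laplace smoothing shown above.

    from collections import Counter, defaultdict

    data = [
        ({"outlook": "sunny", "windy": "no"},  "play"),
        ({"outlook": "rainy", "windy": "yes"}, "stay"),
        ({"outlook": "sunny", "windy": "yes"}, "play"),
    ]
    values = {"outlook": {"sunny", "rainy"}, "windy": {"yes", "no"}}

    class_counts = Counter(c for _, c in data)
    feat_counts = defaultdict(Counter)          # (feature, class) -> value counts
    for feats, c in data:
        for f, v in feats.items():
            feat_counts[(f, c)][v] += 1

    def p_class(c):
        # P̂(C=c) = count(C=c) / N
        return class_counts[c] / len(data)

    def p_feature(f, v, c):
        # Laplace smoothing: (count(Fᵢ=f, C=c) + 1) / (count(C=c) + |values(Fᵢ)|)
        return (feat_counts[(f, c)][v] + 1) / (class_counts[c] + len(values[f]))

    print(p_class("play"), p_feature("outlook", "rainy", "play"))   # 0.667, 0.25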
Learning Bayesian Networks
Known Structure, Known Variables
MLE: for each CPT entry, count occurrences in data. Closed-form solution.
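Illustrative sketch (hypothetical binary variables A and B) of the closed-form MLE for one CPT entry P(B | A) when the network is fully observed: count and divide.

    from collections import Counter

    samples = [("a", "b"), ("a", "b"), ("a", "~b"), ("~a", "b")]   # (A, B) pairs
    joint = Counter(samples)
    parent = Counter(a for a, _ in samples)

    def cpt(b, a):
        # P̂(B=b | A=a) = count(A=a, B=b) / count(A=a)
        return joint[(a, b)] / parent[a]

    print(cpt("b", "a"))   # 2/3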
Known Structure, Hidden Variables: EM Algorithm
Expectation-Maximization (EM):
function EM(data, bn) returns parameters θ
    θ ← random initial parameters
    repeat until convergence:
        -- E-step: compute expected counts
        for each example xᵢ do
            for each hidden node H do
                P(H | visible(xᵢ), θ) ← infer in current BN
        -- M-step: update parameters
        θ ← new MLE given expected counts from E-step
    return θ
EM converges to a local maximum of the likelihood (not necessarily global).
Key property: likelihood is non-decreasing at each EM iteration.
The EM Algorithm in Detail
Formal EM:
Let Z be hidden variables, X observed:
Q(θ | θ_old) = E_{Z|X,θ_old}[log P(X, Z | θ)]
E-step: compute Q(θ | θ_old) -- expected complete-data log likelihood
M-step: θ_new = argmax_θ Q(θ | θ_old)
Via Jensen’s inequality, Q(θ | θ_old) lower-bounds the log likelihood (up to a term constant in θ) and the bound is tight at θ_old, so the M-step guarantees L(θ_new) ≥ L(θ_old): the likelihood never decreases.
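A toy Python sketch (synthetic data, hypothetical setup) making the E-step / M-step concrete: a mixture of two biased coins, where the identity of the coin that generated each row is the hidden variable Z.

    import numpy as np

    # Each row is a batch of 5 flips from one (hidden) coin
    X = np.array([[1, 1, 1, 0, 1], [0, 0, 1, 0, 0], [1, 1, 0, 1, 1], [0, 1, 0, 0, 0]])
    heads, m = X.sum(axis=1), X.shape[1]

    pi, p = 0.5, np.array([0.6, 0.4])          # initial mixing weight and coin biases
    for _ in range(50):
        # E-step: responsibilities P(Z=k | xᵢ, θ_old) under the current parameters
        lik = np.stack([p[k] ** heads * (1 - p[k]) ** (m - heads) for k in (0, 1)], axis=1)
        lik = lik * np.array([pi, 1 - pi])
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: maximize Q(θ | θ_old), the expected complete-data log likelihood
        pi = r[:, 0].mean()
        p = (r * heads[:, None]).sum(axis=0) / (r * m).sum(axis=0)

    print(pi, p)   # converges to a local optimum of the likelihood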
Learning Structure
Structure learning: find the best DAG for the data.
Score-based: define a scoring function (BIC, BDe score) and search over DAG space.
- BIC score: log L(data | θ_MLE) − (d/2)·log n, where d is the number of free parameters and n the number of examples; the second term penalizes complexity.
- Optimizing over all DAGs is NP-hard → greedy local search (add / remove / reverse edges) in practice.
Constraint-based: test conditional independence relations in the data; build a DAG consistent with them.
- PC algorithm: start from a complete undirected graph, remove an edge X–Y whenever X and Y are found conditionally independent given some subset of neighbors, then orient edges using collider (v-structure) detection.
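A toy Python sketch of score-based selection (hypothetical binary variables A, B): compute the BIC score for two candidate structures, no edge vs. A → B, and prefer the higher-scoring one.

    import numpy as np
    from collections import Counter

    data = [("a", "b")] * 30 + [("a", "~b")] * 10 + [("~a", "b")] * 10 + [("~a", "~b")] * 30
    n = len(data)
    cnt = Counter(data)
    cnt_a = Counter(a for a, _ in data)
    cnt_b = Counter(b for _, b in data)

    def loglik_independent():
        # no edge: P(a)·P(b), d = 2 free parameters
        return sum(c * (np.log(cnt_a[a] / n) + np.log(cnt_b[b] / n)) for (a, b), c in cnt.items())

    def loglik_edge():
        # A → B: P(a)·P(b | a), d = 1 + 2 = 3 free parameters
        return sum(c * (np.log(cnt_a[a] / n) + np.log(cnt[(a, b)] / cnt_a[a])) for (a, b), c in cnt.items())

    def bic(loglik, d):
        return loglik - (d / 2) * np.log(n)

    print(bic(loglik_independent(), 2), bic(loglik_edge(), 3))   # the A → B model wins here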
Gaussian Mixture Models (GMMs)
Data from a mixture of k Gaussian distributions:
P(x) = Σₖ πₖ · N(x; μₖ, Σₖ),  where the mixing weights satisfy Σₖ πₖ = 1
EM for GMMs:
E-step: for each point x_n, compute responsibilities:
r_{nk} = πₖ · N(xₙ; μₖ, Σₖ) / Σⱼ πⱼ · N(xₙ; μⱼ, Σⱼ)
M-step:
Nₖ = Σₙ r_{nk}
μₖ = (1/Nₖ) Σₙ r_{nk} · xₙ
Σₖ = (1/Nₖ) Σₙ r_{nk} · (xₙ - μₖ)(xₙ - μₖ)ᵀ
πₖ = Nₖ / N
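A minimal 1-D Python/NumPy sketch (synthetic data, K = 2) that follows the update equations above; a fuller implementation would monitor the log likelihood for convergence and guard against degenerate components.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 0.5, 100)])
    N, K = len(x), 2

    pi = np.full(K, 1 / K)
    mu = rng.choice(x, K)                        # random initial means
    var = np.full(K, x.var())

    def normal_pdf(x, mu, var):
        return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    for _ in range(100):
        # E-step: responsibilities r[n, k] = P(component k | xₙ)
        dens = np.stack([pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(K)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates (Nₖ, μₖ, σₖ², πₖ)
        Nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / N

    print(pi, mu, var)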
Nonparametric Methods
k-Nearest Neighbors (k-NN)
Predict label by majority vote of k nearest neighbors (by some distance metric).
No training phase — “instance-based learning.” Test: O(n·d) per query (unless indexed).
Asymptotics: as n→∞ the 1-NN error rate is at most twice the Bayes-optimal error; k-NN approaches the Bayes-optimal rate when k→∞ with k/n→0. Sensitive to irrelevant features and to feature scaling.
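A minimal Python k-NN sketch (Euclidean distance, majority vote); the data and names are illustrative only.

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_query, k=3):
        dists = np.linalg.norm(X_train - x_query, axis=1)   # O(n·d) per query
        nearest = np.argsort(dists)[:k]
        return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y_train = np.array(["red", "red", "blue", "blue"])
    print(knn_predict(X_train, y_train, np.array([0.95, 0.9])))   # "blue"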
Kernel Density Estimation (KDE)
Non-parametric density estimation:
p̂(x) = (1/(n·h)) Σᵢ K((x − xᵢ)/h)   (in d dimensions the normalizer is n·hᵈ)
Where K is a kernel function (Gaussian, Epanechnikov) and h is the bandwidth.
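A 1-D Gaussian KDE sketch in Python implementing the formula above; the bandwidth h is fixed by hand here, whereas practice would use a rule of thumb or cross-validation.

    import numpy as np

    def kde(x_eval, samples, h=0.5):
        u = (x_eval[:, None] - samples[None, :]) / h        # scaled differences (x − xᵢ)/h
        K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)      # Gaussian kernel
        return K.sum(axis=1) / (len(samples) * h)           # (1/(n·h)) Σᵢ K(·)

    samples = np.array([-1.2, -0.8, 0.0, 0.9, 1.1, 1.3])
    grid = np.linspace(-3, 3, 7)
    print(kde(grid, samples))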
Gaussian Processes (GPs)
A Gaussian Process is a distribution over functions:
f ~ GP(m(x), k(x,x'))
Where m(x) = mean function, k(x,x’) = covariance kernel.
Posterior given observations = another GP (conjugate):
f* | X, y, X* ~ N(μ*, Σ*)
μ* = m(X*) + K(X*,X) · [K(X,X) + σ²I]⁻¹ · (y − m(X))
Σ* = K(X*,X*) − K(X*,X) · [K(X,X) + σ²I]⁻¹ · K(X,X*)
Applications: Bayesian optimization (acquisition function needs posterior mean + variance).
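A GP-regression posterior sketch in Python (zero mean function, RBF kernel; kernel hyperparameters and noise level are illustrative assumptions), ending with a UCB-style acquisition value built from the posterior mean and variance as mentioned above.

    import numpy as np

    def rbf(A, B, length=1.0, amp=1.0):
        d2 = (A[:, None] - B[None, :]) ** 2
        return amp * np.exp(-0.5 * d2 / length ** 2)

    X = np.array([-2.0, -0.5, 1.0, 2.5])                 # training inputs
    y = np.sin(X)                                        # training targets
    Xs = np.linspace(-3, 3, 5)                           # test inputs
    noise = 1e-2

    K = rbf(X, X) + noise * np.eye(len(X))               # K(X,X) + σ²I
    Ks = rbf(Xs, X)                                      # K(X*,X)
    Kss = rbf(Xs, Xs)                                    # K(X*,X*)

    mu_star = Ks @ np.linalg.solve(K, y)                 # posterior mean μ* (m ≡ 0)
    Sigma_star = Kss - Ks @ np.linalg.solve(K, Ks.T)     # posterior covariance Σ*
    ucb = mu_star + 2.0 * np.sqrt(np.diag(Sigma_star))   # UCB acquisition: μ* + κ·σ*
    print(mu_star, ucb)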
Connection to RL / DynamICCL
- EM for HMMs: learn the transition and emission models of hidden NCCL performance states from observed performance traces
- GMMs: model multimodal throughput distributions (different network topologies)
- GPs: Bayesian optimization of NCCL parameters — GP surrogate model + acquisition function (UCB, EI) → sample-efficient hyperparameter tuning
- Structure learning: discover causal relationships between NCCL parameters and performance