Learning from Examples
Chapter 19 — Learning from Examples
Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.)
Pages: 652–709
The Machine Learning Framework
An agent learns from experience to improve its performance on a task.
Formal setup (supervised learning):
- Training data: (x₁, y₁), …, (xₙ, yₙ) where xᵢ ∈ X, yᵢ ∈ Y
- Hypothesis space H: set of possible functions h: X → Y
- Loss function L(h(x), y): penalty for an incorrect prediction
- Goal: find h ∈ H that minimizes expected loss over the true distribution P(X, Y)
Types:
- Classification: Y is discrete (spam/not spam, cat/dog)
- Regression: Y is continuous (price, temperature)
- Structured prediction: Y is structured (tree, sequence)
Bias-Variance Tradeoff
Generalization error = training error + generalization gap (the "overfitting penalty"), so minimizing training error alone does not guarantee good performance on unseen data.
- Bias: error due to wrong assumptions in H (underfitting — too simple)
- Variance: error due to sensitivity to small fluctuations in the training data (overfitting — too complex)
Expected Error = Bias² + Variance + Irreducible Noise
More complex model → lower bias, higher variance.
Regularization reduces variance at the cost of slight bias increase.
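The decomposition can be checked empirically by refitting a model on many noisy resamples of the same data. A minimal numpy sketch, using polynomial fits to a synthetic target (all names and parameter values here are illustrative, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 20)
x_test = np.linspace(-1, 1, 50)
true_f = lambda x: np.sin(3 * x)   # fixed target function

def bias_variance(degree, n_trials=200, noise=0.3):
    """Refit a degree-d polynomial on many noisy resamples; estimate bias^2 and variance."""
    preds = np.empty((n_trials, len(x_test)))
    for t in range(n_trials):
        y = true_f(x_train) + rng.normal(0, noise, len(x_train))
        preds[t] = np.polyval(np.polyfit(x_train, y, degree), x_test)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - true_f(x_test)) ** 2)      # error of the average fit
    variance = np.mean(preds.var(axis=0))                   # spread across resamples
    return bias2, variance

b_lo, v_lo = bias_variance(1)   # simple model: high bias, low variance
b_hi, v_hi = bias_variance(9)   # complex model: low bias, high variance
```

Running this shows the tradeoff directly: the linear fit has much higher bias², while the degree-9 fit has much higher variance.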
Decision Trees
Recursive partitioning of X using axis-aligned splits.
function DECISION-TREE-LEARNING(examples, attributes, parent_examples):
    if examples is empty: return MODE(parent_examples)
    if all examples have the same label: return that label
    if attributes is empty: return MODE(examples)
    A ← argmax_a IMPORTANCE(a, examples)
    tree ← a new decision node that tests A
    for each value vₖ of A:
        subtree ← DECISION-TREE-LEARNING(examples with A=vₖ, attributes − A, examples)
        tree.add_branch(A=vₖ, subtree)
    return tree
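The pseudocode translates almost line-for-line into Python. A sketch for categorical attributes, with IMPORTANCE implemented as information gain; examples are (feature-dict, label) pairs, and all function names are mine:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def importance(attr, examples):
    """Information gain of splitting `examples` on `attr`."""
    labels = [y for _, y in examples]
    remainder = 0.0
    for v in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def mode(examples):
    return Counter(y for _, y in examples).most_common(1)[0][0]

def decision_tree_learning(examples, attributes, parent_examples):
    if not examples:
        return mode(parent_examples)
    labels = {y for _, y in examples}
    if len(labels) == 1:
        return labels.pop()                       # pure leaf
    if not attributes:
        return mode(examples)
    a = max(attributes, key=lambda attr: importance(attr, examples))
    branches = {}
    for v in {x[a] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[a] == v]
        branches[v] = decision_tree_learning(subset, attributes - {a}, examples)
    return (a, branches)                          # internal node testing attribute a

def tree_predict(tree, x):
    while isinstance(tree, tuple):                # descend until a leaf label
        attr, branches = tree
        tree = branches[x[attr]]
    return tree
</imports>```

Leaves are bare labels and internal nodes are (attribute, branches) tuples, which keeps the recursive structure of the pseudocode visible.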
Splitting Criterion
Information gain (ID3):
Gain(A) = H(Y) - Σ_v (|examples with A=v| / |examples|) · H(Y | A=v)
H(Y) = -Σ_y P(y) log₂ P(y) -- entropy
Gini impurity (CART):
Gini(Y) = 1 - Σ_y P(y)²
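Both criteria are maximal for a 50/50 split and zero for a pure node; a quick numerical check (helper names are mine):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# balanced: maximal impurity; pure: zero impurity; skewed: in between
balanced, skewed, pure = [0, 1] * 5, [0] * 9 + [1], [0] * 10
```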
Overfitting Prevention
- Pruning: remove branches that don’t improve on validation set (reduced-error pruning, χ² pruning)
- Minimum samples per leaf: stop splitting if fewer than k examples remain
Evaluation and Cross-Validation
Holdout: split data into train/test sets. Simple but wasteful.
k-fold cross-validation: partition data into k folds; train on k-1, test on 1; repeat k times; average scores.
Leave-one-out (LOO): k = n. Most data-efficient but expensive.
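A dependency-free sketch of k-fold splitting and scoring; the function names and the majority-class baseline used in the example are mine, not from the book:

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(train_fn, score_fn, X, y, k=5):
    scores = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        X_tr = [x for i, x in enumerate(X) if i not in held_out]
        y_tr = [v for i, v in enumerate(y) if i not in held_out]
        model = train_fn(X_tr, y_tr)                     # train on k-1 folds
        preds = [model(X[i]) for i in fold]              # test on the held-out fold
        scores.append(score_fn([y[i] for i in fold], preds))
    return sum(scores) / k                               # average over folds

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Majority-class baseline: always predict the most common training label
majority = lambda X_tr, y_tr: (lambda x, m=Counter(y_tr).most_common(1)[0][0]: m)
```

`train_fn` returns a callable model, so the same harness works for any learner.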
Metrics:
- Accuracy: correct/total
- Precision: TP/(TP+FP)
- Recall: TP/(TP+FN)
- F1 = 2·P·R/(P+R)
- AUC-ROC: area under the ROC curve
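The threshold metrics reduce to counting confusion-matrix cells; a sketch (function name is mine):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```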
Linear Regression
Model: h_w(x) = w · x (dot product; a bias term is handled by adding a constant feature x₀ = 1).
Loss: squared error L(w) = Σᵢ (h_w(xᵢ) - yᵢ)²; dividing by n gives the mean squared error.
Closed-form solution (normal equations): w* = (XᵀX)⁻¹ Xᵀy
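On synthetic data the normal equations recover the generating weights. A numpy sketch (solving the linear system is preferred over forming the explicit inverse; the data here is made up):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(scale=0.01, size=100)   # targets with small noise

# w* = (X^T X)^{-1} X^T y, computed via solve() for numerical stability
w_star = np.linalg.solve(X.T @ X, X.T @ y)
```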
Gradient descent: w ← w - α · ∇L(w); for squared error this gives w ← w + 2α · Σᵢ(yᵢ - w·xᵢ)·xᵢ (the constant factor is usually absorbed into α).
Regularization:
- Ridge (L2): L + λ||w||²
- Lasso (L1): L + λ||w||₁
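Gradient descent with the L2 penalty is a few lines of numpy; a sketch (the function name, learning rate, and λ values are illustrative):

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=0.1, alpha=0.01, iters=2000):
    """Minimize mean squared error + lam * ||w||^2 by batch gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = 2 * X.T @ (X @ w - y) / n + 2 * lam * w   # MSE gradient + L2 term
        w -= alpha * grad
    return w
```

With λ = 0 this reduces to plain least-squares gradient descent; increasing λ shrinks ||w|| toward zero, trading a little bias for lower variance.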
Logistic Regression
For binary classification: predict P(y=1|x) using sigmoid:
h_w(x) = σ(w·x) = 1 / (1 + e^(-w·x))
Loss: cross-entropy / log-likelihood:
L(w) = -Σ [yᵢ log h_w(xᵢ) + (1-yᵢ) log(1-h_w(xᵢ))]
Gradient descent: same update rule as linear regression (due to sigmoid properties).
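Because the cross-entropy gradient has the same form as the linear case, the implementation is nearly identical. A minimal sketch on a made-up 1-D separable dataset (a constant bias feature is appended as a second column):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.5, iters=2000):
    """Binary logistic regression with y in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)   # same form as linear regression
        w -= alpha * grad
    return w

X = np.array([[-2.0, 1], [-1.0, 1], [1.0, 1], [2.0, 1]])  # column 2 = bias feature
y = np.array([0, 0, 1, 1])
w = logistic_gradient_descent(X, y)
```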
Support Vector Machines (SVMs)
Find the maximum margin hyperplane — maximize the gap between classes.
Primal formulation:
min_{w,b} ½||w||²
subject to yᵢ(w·xᵢ + b) ≥ 1 for all i
Soft margin SVM (with misclassification tolerance):
min ½||w||² + C·Σ ξᵢ
subject to yᵢ(w·xᵢ + b) ≥ 1 - ξᵢ, ξᵢ ≥ 0
Kernel trick: replace x·x’ with K(x, x’) to handle nonlinear boundaries:
- Polynomial: K(x, x’) = (x·x’ + c)^d
- RBF: K(x, x’) = exp(-γ||x - x’||²)
- Neural: K(x, x’) = tanh(κ·x·x’ + θ)
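The kernels themselves are one-liners, and a valid kernel's Gram matrix is symmetric positive semidefinite, a property worth checking numerically (the γ, c, d values are illustrative):

```python
import numpy as np

def poly_kernel(x, z, c=1.0, d=2):
    return (np.dot(x, z) + c) ** d

def rbf_kernel(x, z, gamma=0.5):
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-gamma * np.dot(diff, diff))

# Gram matrix of the RBF kernel on random points
pts = np.random.default_rng(3).normal(size=(10, 2))
K = np.array([[rbf_kernel(a, b) for b in pts] for a in pts])
```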
Ensemble Methods
Boosting (AdaBoost)
Iteratively add weak learners, reweighting misclassified examples more heavily (labels and predictions are in {-1, +1}).
for t = 1 to T:
    train hₜ on the weighted examples
    εₜ ← weighted error of hₜ
    αₜ ← ½ ln((1-εₜ)/εₜ) -- learner weight
    update weights: wᵢ ← wᵢ · exp(-αₜ yᵢ hₜ(xᵢ))
    normalize weights
H(x) = sign(Σₜ αₜ hₜ(x))
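The loop above, with decision stumps as the weak learners, in numpy (labels in {-1, +1}; the exhaustive stump search and all names are my own sketch):

```python
import numpy as np

def train_stump(X, y, w):
    """Exhaustive search for the weighted-error-minimizing decision stump."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = pol * np.where(X[:, j] >= thr, 1, -1)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best[1:]

def stump_predict(stump, X):
    j, thr, pol = stump
    return pol * np.where(X[:, j] >= thr, 1, -1)

def adaboost(X, y, T=10):
    n = len(y)
    w = np.full(n, 1 / n)                        # uniform initial weights
    ensemble = []
    for _ in range(T):
        stump = train_stump(X, y, w)
        pred = stump_predict(stump, X)
        eps = max(np.sum(w[pred != y]), 1e-12)   # clamp to avoid division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)    # learner weight
        ensemble.append((alpha, stump))
        w = w * np.exp(-alpha * y * pred)        # upweight mistakes, downweight hits
        w = w / w.sum()                          # normalize
    return ensemble

def boosted_predict(ensemble, X):
    return np.sign(sum(a * stump_predict(s, X) for a, s in ensemble))
```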
Random Forests
Bagging (bootstrap aggregation) + random feature subsets for each tree:
- Grows many trees on bootstrap samples
- Each split considers only √d random features
- Final prediction = majority vote (classification) / average (regression)
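A toy version using depth-1 trees (stumps) as the base learners makes both sources of randomness visible: bootstrap row sampling and random feature subsets. All names and hyperparameters are illustrative:

```python
import numpy as np

def best_stump(X, y, features):
    """Best (feature, threshold, polarity) single split over the given feature subset."""
    best = None
    for j in features:
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                acc = np.mean(pol * np.where(X[:, j] >= thr, 1, -1) == y)
                if best is None or acc > best[0]:
                    best = (acc, j, thr, pol)
    return best[1:]

def stump_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = max(1, int(np.sqrt(d)))          # consider only ~sqrt(d) features per tree
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, n)     # bootstrap sample (with replacement)
        feats = rng.choice(d, size=k, replace=False)
        forest.append(best_stump(X[rows], y[rows], feats))
    return forest

def forest_predict(forest, X):
    votes = sum(pol * np.where(X[:, j] >= thr, 1, -1) for j, thr, pol in forest)
    return np.sign(votes)                # majority vote; odd n_trees => no ties
```

A real random forest grows full (unpruned) trees and re-draws the feature subset at every split; stumps keep the sketch short while preserving the idea.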
Connection to RL / DynamICCL
- Decision trees: interpretable NCCL parameter rules
- SVMs: classify network states (congested/healthy) for NCCL algorithm selection
- Ensemble methods: random forests for NCCL performance prediction (feature = network metrics)
- All supervised learning methods = function approximation in RL (value function, policy)