Learning from Examples

Chapter 19 — Learning from Examples
Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.)
Pages: 652–709


The Machine Learning Framework

An agent learns from experience to improve its performance on a task.

Formal setup (supervised learning):
- Training data: (x₁, y₁), …, (xₙ, yₙ) where xᵢ ∈ X, yᵢ ∈ Y
- Hypothesis space H: set of possible functions h: X → Y
- Loss function L(h(x), y): penalty for an incorrect prediction
- Goal: find h ∈ H that minimizes expected loss over the true distribution P(X, Y)

Types:
- Classification: Y is discrete (spam/not spam, cat/dog)
- Regression: Y is continuous (price, temperature)
- Structured prediction: Y is structured (tree, sequence)
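
A minimal sketch of this setup for a tiny classification problem (the threshold hypothesis space and 0/1 loss are illustrative choices, not from the text): enumerate H and pick the hypothesis with the lowest empirical loss on the training pairs.

# Training data: pairs (x_i, y_i) with X = reals and Y = {0, 1}.
data = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]

# Hypothesis space H: threshold classifiers h_t(x) = 1 if x >= t else 0.
thresholds = [0.5, 1.5, 2.5, 3.5]

def h(t, x):
    return 1 if x >= t else 0

# 0/1 loss averaged over the training set (empirical risk).
def empirical_loss(t):
    return sum(int(h(t, x) != y) for x, y in data) / len(data)

# Goal, approximated on the training data: pick the hypothesis in H
# with the smallest empirical loss.
best_t = min(thresholds, key=empirical_loss)
print(best_t, empirical_loss(best_t))    # 2.5, 0.0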


Bias-Variance Tradeoff

Generalization error (loss on unseen data) ≈ training error + an overfitting penalty, the gap between training and test performance.

Bias: error due to wrong assumptions in H (underfitting — too simple)
Variance: error due to sensitivity to small fluctuations in the training data (overfitting — too complex)

Expected Error = Bias² + Variance + Irreducible Noise

More complex model → lower bias, higher variance.

Regularization reduces variance at the cost of slight bias increase.
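
A small Monte-Carlo sketch of the decomposition (the sine target, noise level, sample size, and polynomial degrees are illustrative assumptions): repeatedly fit a simple and a complex model to fresh noisy training sets, then measure the bias² and variance of their predictions.

import numpy as np

rng = np.random.default_rng(0)
f = np.sin                          # "true" target function
x_test = np.linspace(0, 3, 50)      # points where error is measured
noise = 0.3

def fit_and_predict(degree):
    # Fit a polynomial of the given degree to one fresh noisy training set
    # and return its predictions at x_test.
    x = rng.uniform(0, 3, 20)
    y = f(x) + rng.normal(0, noise, 20)
    return np.polyval(np.polyfit(x, y, degree), x_test)

for degree in (1, 9):               # simple model vs. complex model
    preds = np.array([fit_and_predict(degree) for _ in range(200)])
    bias_sq = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    # degree 1: high bias, low variance; degree 9: low bias, high variance
    print(degree, round(bias_sq, 3), round(variance, 3))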


Decision Trees

Recursive partitioning of X using axis-aligned splits.

function DECISION-TREE-LEARNING(examples, attributes, parent_examples)
    if examples is empty: return MODE(parent_examples)
    if all examples have same label: return that label
    if attributes is empty: return MODE(examples)
    A ← argmax_a IMPORTANCE(a, examples)
    tree ← new decision node on A
    for each value vₖ of A:
        subtree ← DECISION-TREE-LEARNING(examples with A=vₖ, attributes-A, examples)
        tree.add_branch(A=vₖ, subtree)
    return tree

Splitting Criterion

Information gain (ID3):

Gain(A) = H(Y) - Σ_v (|examples with A=v| / |examples|) · H(Y | A=v)
H(Y) = -Σ_y P(y) log₂ P(y)    -- entropy

Gini impurity (CART):

Gini(Y) = 1 - Σ_y P(y)²
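
A sketch of these criteria in Python, assuming each example is a dict mapping attribute names to values with the class stored under a "class" key; information_gain is one possible IMPORTANCE function for the pseudocode above, not the book's implementation, and the sample data is illustrative.

import math
from collections import Counter

def entropy(labels):
    # H(Y) = -Σ_y P(y) log₂ P(y)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini(Y) = 1 - Σ_y P(y)²
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(examples, attribute, label="class"):
    # Gain(A) = H(Y) - Σ_v (|examples with A=v| / |examples|) · H(Y | A=v)
    labels = [e[label] for e in examples]
    gain = entropy(labels)
    for v in {e[attribute] for e in examples}:
        subset = [e[label] for e in examples if e[attribute] == v]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

examples = [{"outlook": "sunny", "class": "no"}, {"outlook": "rain", "class": "yes"},
            {"outlook": "sunny", "class": "no"}, {"outlook": "overcast", "class": "yes"}]
print(information_gain(examples, "outlook"), gini([e["class"] for e in examples]))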

Overfitting Prevention

Pre-pruning: stop growing the tree early (maximum depth, minimum examples per node, or a χ² significance test before accepting a split). Post-pruning: grow the full tree, then collapse subtrees whose splits do not improve accuracy on held-out data.


Evaluation and Cross-Validation

Holdout: split data into train/test sets. Simple but wasteful.

k-fold cross-validation: partition data into k folds; train on k-1, test on 1; repeat k times; average scores.

Leave-one-out (LOO): k = n. Most data-efficient but expensive.
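
A sketch of k-fold cross-validation in Python; train_fn and score_fn are hypothetical callables standing in for any learner and any of the metrics below.

import numpy as np

def k_fold_score(X, y, k, train_fn, score_fn, seed=0):
    # Shuffle the indices, split them into k folds, train on k-1 folds
    # and evaluate on the held-out fold, then average the k scores.
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])
        scores.append(score_fn(model, X[test], y[test]))
    return float(np.mean(scores))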

Metrics:
- Accuracy: correct/total
- Precision: TP/(TP+FP)
- Recall: TP/(TP+FN)
- F1 = 2·P·R/(P+R)
- AUC-ROC: area under the ROC curve
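
A sketch computing these metrics from hard binary predictions (AUC-ROC is omitted because it needs ranked scores rather than 0/1 predictions):

def binary_metrics(y_true, y_pred):
    # Count the four confusion-matrix cells, then apply the formulas above.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

print(binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))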


Linear Regression

Model: h_w(x) = w · x (dot product; a dummy feature x₀ = 1 is included so w₀ serves as the intercept).

Loss: sum of squared errors, L(w) = Σ (h_w(xᵢ) - yᵢ)² (dividing by n gives the mean squared error).

Closed-form solution (normal equations): w* = (XᵀX)⁻¹ Xᵀy

Gradient descent: w ← w - α · ∇L(w) = w + α · Σ(yᵢ - w·xᵢ)·xᵢ
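
A NumPy sketch of both solvers; it assumes X already contains a column of ones so the intercept is part of w, and the learning rate and step count are illustrative.

import numpy as np

def fit_normal_equations(X, y):
    # w* = (XᵀX)⁻¹ Xᵀy, computed with a linear solve rather than an explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_gradient_descent(X, y, alpha=0.01, steps=1000):
    # w ← w + α · Σ(yᵢ - w·xᵢ)·xᵢ  (batch gradient descent on squared error)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w += alpha * X.T @ (y - X @ w)
    return w

# Tiny example: y = 1 + 2x, with the column of ones carrying the intercept.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(fit_normal_equations(X, y), fit_gradient_descent(X, y))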

Regularization:
- Ridge (L2): L + λ||w||²
- Lasso (L1): L + λ||w||₁


Logistic Regression

For binary classification: predict P(y=1|x) using sigmoid:

h_w(x) = σ(w·x) = 1 / (1 + e^(-w·x))

Loss: cross-entropy / log-likelihood:

L(w) = -Σ [yᵢ log h_w(xᵢ) + (1-yᵢ) log(1-h_w(xᵢ))]

Gradient descent: the gradient has the same form as for linear regression, ∇L(w) = Σ (h_w(xᵢ) - yᵢ)·xᵢ, because of the sigmoid's derivative σ′ = σ(1-σ).
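
A NumPy sketch of logistic regression trained by gradient descent (the learning rate, step count, and toy data are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, steps=2000):
    # Gradient descent on the cross-entropy loss; the gradient is
    # Σ (h_w(xᵢ) - yᵢ)·xᵢ, the same form as the linear-regression update.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= alpha * X.T @ (sigmoid(X @ w) - y)
    return w

# Column of ones for the intercept; labels are 0/1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logistic(X, y)
print(sigmoid(X @ w))    # predicted P(y=1|x) for the training points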


Support Vector Machines (SVMs)

Find the maximum margin hyperplane — maximize the gap between classes.

Primal formulation:

min_{w,b} ½||w||²
subject to yᵢ(w·xᵢ + b) ≥ 1 for all i

Soft margin SVM (with misclassification tolerance):

min ½||w||² + C·Σ ξᵢ
subject to yᵢ(w·xᵢ + b) ≥ 1 - ξᵢ, ξᵢ ≥ 0
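
As a usage sketch (assuming scikit-learn; the data, kernel, and C value are illustrative), this soft-margin problem is what standard SVM solvers optimize, with C controlling the penalty on the slack variables ξᵢ:

import numpy as np
from sklearn.svm import SVC

# Tiny two-class dataset just to make the sketch runnable.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1, -1, 1, 1])

# C is the slack penalty from the objective above:
# larger C tolerates fewer margin violations (a "harder" margin).
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print(clf.predict([[0.1, 0.0], [1.0, 0.9]]))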

Kernel trick: replace x·x′ with K(x, x′) to handle nonlinear boundaries:
- Polynomial: K(x, x′) = (x·x′ + c)^d
- RBF: K(x, x′) = exp(-γ||x - x′||²)
- Neural: K(x, x′) = tanh(κ·x·x′ + θ)
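
The three kernels above, sketched in NumPy (the parameter defaults c, d, γ, κ, θ are illustrative); a kernelized SVM uses these in place of the plain dot product:

import numpy as np

def polynomial_kernel(x, xp, c=1.0, d=3):
    return (x @ xp + c) ** d

def rbf_kernel(x, xp, gamma=0.5):
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def neural_kernel(x, xp, kappa=1.0, theta=0.0):
    return np.tanh(kappa * (x @ xp) + theta)

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, xp), rbf_kernel(x, xp), neural_kernel(x, xp))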


Ensemble Methods

Boosting (AdaBoost)

Iteratively add weak learners, weighting misclassified examples more heavily.

initialize weights: wᵢ ← 1/N for all i
for t = 1 to T:
    train hₜ on the weighted examples
    εₜ ← Σᵢ wᵢ · [hₜ(xᵢ) ≠ yᵢ]    -- weighted error of hₜ
    αₜ ← ½ ln((1-εₜ)/εₜ)    -- learner weight
    update weights: wᵢ ← wᵢ · exp(-αₜ yᵢ hₜ(xᵢ))
    normalize weights
H(x) = sign(Σ αₜ hₜ(x))
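
A sketch of this loop in Python, using scikit-learn decision stumps as the weak learners (an assumption, as is the synthetic data); labels are taken to be in {-1, +1}:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    # Depth-1 trees (stumps) serve as the weak learners h_t.
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                     # initialize example weights uniformly
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = w[pred != y].sum()                # weighted training error ε_t
        if eps == 0 or eps >= 0.5:              # stop on a perfect or useless learner
            break
        alpha = 0.5 * np.log((1 - eps) / eps)   # learner weight α_t
        w = w * np.exp(-alpha * y * pred)       # upweight misclassified examples
        w = w / w.sum()                         # normalize
        learners.append(h)
        alphas.append(alpha)
    # Final classifier: H(x) = sign(Σ α_t h_t(x)).
    return lambda Xq: np.sign(sum(a * h.predict(Xq) for a, h in zip(alphas, learners)))

X, y01 = make_classification(n_samples=200, n_features=5, random_state=0)
H = adaboost(X, 2 * y01 - 1)                    # map {0, 1} labels to {-1, +1}
print((H(X) == 2 * y01 - 1).mean())             # training accuracy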

Random Forests

Bagging (bootstrap aggregation) + a random feature subset at each split.
- Grows many trees on bootstrap samples
- Each split considers only √d random features
- Final prediction = majority vote / average
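
A usage sketch assuming scikit-learn and synthetic data; max_features="sqrt" gives the √d random feature subset at each split described above:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=16, random_state=0)   # synthetic data

# Each of the 100 trees is grown on a bootstrap sample of the training set,
# and every split considers only √d randomly chosen features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))    # final prediction = majority vote over the trees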


Connection to RL / DynamICCL