Quantifying Uncertainty: Probability Basics

Chapter 12 — Quantifying Uncertainty
Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.)
Pages: 409–450


Why Probability?

Logic-based agents fail when:
- Partial observability: the agent can’t know the full state
- Noisy sensors: percepts don’t uniquely identify states
- Stochastic outcomes: actions don’t have deterministic results
- Complexity: too many variables to reason about exactly

Probability provides a principled framework for reasoning under uncertainty.


The Axioms of Probability (Kolmogorov)

A probability function P satisfies:
1. Non-negativity: P(A) ≥ 0
2. Normalization: P(Ω) = 1 (the certain event)
3. Additivity: if A ∩ B = ∅ then P(A ∪ B) = P(A) + P(B)

From these, everything else follows:
- P(¬A) = 1 - P(A)
- P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
- P(A) ≤ 1
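
A minimal Python sketch checking these derived rules on a fair six-sided die (event names are illustrative):

```python
# Finite sample space: a fair die; P(event) = sum of its outcomes' probabilities.
omega = {outcome: 1 / 6 for outcome in range(1, 7)}

def P(event):
    return sum(p for outcome, p in omega.items() if outcome in event)

A = {2, 4, 6}          # "even"
B = {4, 5, 6}          # "at least 4"

assert abs(P(omega) - 1) < 1e-12                          # normalization
assert abs(P(set(omega) - A) - (1 - P(A))) < 1e-12        # P(¬A) = 1 - P(A)
assert abs(P(A | B) - (P(A) + P(B) - P(A & B))) < 1e-12   # inclusion-exclusion
```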


Random Variables

A random variable X maps outcomes to values.

Joint distribution: P(X=x, Y=y) for all combinations of values.

Marginal: P(X=x) = Σᵧ P(X=x, Y=y) — sum out Y.
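
A sketch of marginalization with the joint stored as a dict (the numbers are invented):

```python
# P(X=x, Y=y) for a toy weather/temperature example.
joint = {
    ("sun", "warm"): 0.40, ("sun", "cold"): 0.10,
    ("rain", "warm"): 0.05, ("rain", "cold"): 0.45,
}

# P(X=x) = Σᵧ P(X=x, Y=y): sum out Y.
marginal_x = {}
for (x, y), p in joint.items():
    marginal_x[x] = marginal_x.get(x, 0.0) + p

print(marginal_x)   # {'sun': 0.5, 'rain': 0.5}
```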


Conditional Probability

P(A | B) = P(A ∧ B) / P(B)     for P(B) > 0

Product rule: P(A ∧ B) = P(A | B) · P(B) = P(B | A) · P(A)
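
The same kind of table gives conditionals directly; a short sketch (numbers invented):

```python
joint = {   # P(X=x, Y=y)
    ("sun", "warm"): 0.40, ("sun", "cold"): 0.10,
    ("rain", "warm"): 0.05, ("rain", "cold"): 0.45,
}

p_warm = sum(p for (x, y), p in joint.items() if y == "warm")   # P(Y=warm) = 0.45
p_sun_given_warm = joint[("sun", "warm")] / p_warm              # P(X=sun | Y=warm)

# Product rule check: P(X=sun ∧ Y=warm) = P(X=sun | Y=warm) · P(Y=warm)
assert abs(p_sun_given_warm * p_warm - joint[("sun", "warm")]) < 1e-12
print(round(p_sun_given_warm, 3))   # 0.889
```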


Bayes’ Theorem

P(A | B) = P(B | A) · P(A) / P(B)

Or in full:

P(H | E) = P(E | H) · P(H) / P(E)

Where:
- P(H): prior — belief before evidence
- P(E | H): likelihood — probability of the evidence given the hypothesis
- P(H | E): posterior — belief after evidence
- P(E): normalizing constant = Σₕ P(E | h) · P(h)

The core of probabilistic AI: update beliefs from prior to posterior upon receiving evidence.
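
A sketch of one such update for a hypothetical diagnostic test (all numbers invented):

```python
# Prior P(disease) = 0.01; likelihoods P(positive | disease) = 0.95,
# P(positive | healthy) = 0.05.
prior = {"disease": 0.01, "healthy": 0.99}
likelihood_pos = {"disease": 0.95, "healthy": 0.05}   # P(E = positive | H)

# Normalizing constant: P(E) = Σₕ P(E | h) · P(h)
p_evidence = sum(likelihood_pos[h] * prior[h] for h in prior)

# Posterior P(H | E)
posterior = {h: likelihood_pos[h] * prior[h] / p_evidence for h in prior}
print(posterior)   # P(disease | positive) ≈ 0.161 despite the 0.95 likelihood
```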


Full Joint Distribution

For n binary random variables: the full joint distribution requires 2^n - 1 numbers.

For n variables each with d values: d^n entries.

Problem: exponential in n → intractable for large n.

Solution: exploit independence and conditional independence to compactly represent the joint (Bayesian networks, Ch.13).
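
Any query can in principle be answered straight from the full joint; the catch is that every answer scans all d^n entries. A sketch with three binary variables (numbers chosen to sum to 1):

```python
joint = {   # P(Cavity, Toothache, Catch)
    (True, True, True): 0.108,   (True, True, False): 0.012,
    (True, False, True): 0.072,  (True, False, False): 0.008,
    (False, True, True): 0.016,  (False, True, False): 0.064,
    (False, False, True): 0.144, (False, False, False): 0.576,
}

def prob(pred):
    """Sum the joint entries selected by a predicate over (cavity, toothache, catch)."""
    return sum(p for outcome, p in joint.items() if pred(*outcome))

# P(Cavity | Toothache) = P(Cavity ∧ Toothache) / P(Toothache)
print(round(prob(lambda c, t, k: c and t) / prob(lambda c, t, k: t), 3))   # 0.6
```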


Independence

X and Y are independent (X ⊥ Y) if:

P(X, Y) = P(X) · P(Y)

Equivalently: P(X | Y) = P(X).

Independence dramatically reduces the representation cost (see the sketch below):
- 2 independent binary variables: 2 numbers (vs. 3 for the full joint)
- 10 independent binary variables: 10 numbers (vs. 1023 for the full joint)
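
Under full independence, n per-variable numbers reconstruct the entire 2^n-entry joint; a sketch (variable names and values invented):

```python
from itertools import product

p_true = {"rain": 0.3, "traffic": 0.5, "late": 0.1}   # one number per variable: P(Xi = true)

def joint_entry(assignment):
    """P(x1, ..., xn) = Π P(xi) under full independence."""
    result = 1.0
    for var, value in assignment.items():
        result *= p_true[var] if value else 1 - p_true[var]
    return result

# The 3 numbers above define a valid 2**3 = 8-entry joint distribution.
total = sum(joint_entry(dict(zip(p_true, values)))
            for values in product([True, False], repeat=len(p_true)))
assert abs(total - 1.0) < 1e-12
```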


Conditional Independence

X and Y are conditionally independent given Z (X ⊥ Y | Z) if:

P(X, Y | Z) = P(X | Z) · P(Y | Z)
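
A quick numeric check of the definition, using a common-cause structure in which two symptoms are conditionally independent given the disease yet marginally dependent (all numbers invented); the Naïve Bayes example follows below:

```python
from itertools import product

p_d = 0.1                           # P(Disease)
p_s1 = {True: 0.8, False: 0.1}      # P(Symptom1 | Disease)
p_s2 = {True: 0.7, False: 0.2}      # P(Symptom2 | Disease)

def pr(b, p):                       # P(X = b) when P(X = true) = p
    return p if b else 1 - p

# Joint P(D, S1, S2) built assuming S1 ⊥ S2 | D.
joint = {(d, s1, s2): pr(d, p_d) * pr(s1, p_s1[d]) * pr(s2, p_s2[d])
         for d, s1, s2 in product([True, False], repeat=3)}

# Given D = true, P(S1, S2 | D) factorizes into P(S1 | D) · P(S2 | D) ...
assert abs(joint[(True, True, True)] / p_d - p_s1[True] * p_s2[True]) < 1e-12

# ... but marginally S1 and S2 are NOT independent:
p_s1_marg = sum(p for (d, s1, s2), p in joint.items() if s1)
p_s2_marg = sum(p for (d, s1, s2), p in joint.items() if s2)
p_both    = sum(p for (d, s1, s2), p in joint.items() if s1 and s2)
print(round(p_both, 4), round(p_s1_marg * p_s2_marg, 4))   # 0.074 vs 0.0425
```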

Example (Naïve Bayes): Given the class label C, all features F₁,…,Fₙ are conditionally independent:

P(C | F₁,...,Fₙ) ∝ P(C) · Π P(Fᵢ | C)

This requires only on the order of n·d·|C| parameters instead of |C|·d^n (n features with d values each, |C| classes).
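
A minimal Naïve Bayes sketch for this formula, with two classes and three binary features (all CPT numbers invented):

```python
prior = {"spam": 0.4, "ham": 0.6}                          # P(C)
likelihood = {                                             # P(Fi = true | C)
    "spam": {"money": 0.7, "link": 0.6, "friend": 0.1},
    "ham":  {"money": 0.1, "link": 0.2, "friend": 0.5},
}

def posterior(features):
    """P(C | F1..Fn) ∝ P(C) · Π P(Fi | C), then normalize."""
    scores = {}
    for c in prior:
        score = prior[c]
        for f, value in features.items():
            p = likelihood[c][f]
            score *= p if value else 1 - p
        scores[c] = score
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(posterior({"money": True, "link": True, "friend": False}))
# 'spam' gets roughly 0.96 of the probability mass here
```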


The Diagnostic Problem (Wumpus World)

Agent knows:
- P(Pit) = 0.2 for each square (prior: 20% chance of a pit)
- The visited squares (1,1), (1,2), (2,1) contain no pits
- Breeze felt in (1,2) and (2,1); no breeze in (1,1)

Updated via Bayes, summing over pit configurations of the unvisited frontier squares (see the sketch below):
- P(Pit₁₃ | breeze evidence) ≈ 0.31, while P(Pit₂₂ | breeze evidence) ≈ 0.86
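
A small enumeration sketch reproducing these numbers, assuming standard 4-neighbour adjacency and that a square is breezy iff at least one adjacent square holds a pit:

```python
from itertools import product

PRIOR = 0.2
frontier = [(1, 3), (2, 2), (3, 1)]   # unvisited squares adjacent to the breezes

def prior_prob(assignment):
    """Probability of one pit assignment under independent 0.2 priors."""
    p = 1.0
    for has_pit in assignment.values():
        p *= PRIOR if has_pit else 1 - PRIOR
    return p

def consistent(assignment):
    """Breeze in (1,2) and (2,1): each needs at least one neighbouring pit."""
    breeze_12 = assignment[(1, 3)] or assignment[(2, 2)]
    breeze_21 = assignment[(2, 2)] or assignment[(3, 1)]
    return breeze_12 and breeze_21

numerator = {sq: 0.0 for sq in frontier}
evidence = 0.0
for values in product([True, False], repeat=len(frontier)):
    assignment = dict(zip(frontier, values))
    if consistent(assignment):            # other unvisited squares don't affect
        p = prior_prob(assignment)        # the observed breezes, so they sum out
        evidence += p
        for sq in frontier:
            if assignment[sq]:
                numerator[sq] += p

for sq in frontier:
    print(sq, round(numerator[sq] / evidence, 2))   # (1,3): 0.31, (2,2): 0.86, (3,1): 0.31
```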

Classical logic gives only “possible” or “impossible”; probability gives a calibrated confidence that supports rational decision-making under uncertainty.