Utility Theory and Decision Networks
Chapter 16 — Making Simple Decisions
Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.), pp. 528–570
From Probability to Decisions
Probability tells us what is likely. To make decisions, we also need to know what is desirable.
Decision theory = probability theory + utility theory.
Principle: an agent should choose the action that maximizes expected utility (MEU):
a* = argmax_a Σ_s P(s | a) · U(s)
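As a minimal sketch of MEU in code — the actions, outcome distributions, and utilities below are invented for illustration, not from the book:

```python
# MEU sketch: choose a* = argmax_a sum_s P(s | a) * U(s).
# All numbers here are illustrative.

P = {  # P(s | a): distribution over outcome states for each action
    "take_umbrella":  {"dry_carrying": 1.0},
    "leave_umbrella": {"dry": 0.7, "wet": 0.3},
}
U = {"dry": 100, "dry_carrying": 80, "wet": 0}  # U(s)

def expected_utility(action: str) -> float:
    return sum(p * U[s] for s, p in P[action].items())

a_star = max(P, key=expected_utility)
print(a_star, expected_utility(a_star))  # take_umbrella 80.0 (leave: 70.0)
```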
Utility Functions
A utility function U(s) maps states to real numbers representing desirability.
Axioms of utility (von Neumann–Morgenstern). If preferences satisfy:
1. Orderability: exactly one of A ≻ B, A ≺ B, or A ~ B holds
2. Transitivity: A ≻ B ∧ B ≻ C → A ≻ C
3. Continuity: A ≻ B ≻ C → ∃p [p, A; (1−p), C] ~ B
4. Substitutability: A ~ B → [p, A; (1−p), C] ~ [p, B; (1−p), C]
5. Monotonicity: A ≻ B → (p > q ↔ [p, A; (1−p), B] ≻ [q, A; (1−q), B])
6. Decomposability: compound lotteries reduce to simple ones
Then there exists a utility function U such that:
A ≻ B iff EU(A) > EU(B)
U([p₁, S₁; ...; pₙ, Sₙ]) = Σᵢ pᵢ · U(Sᵢ)
Utility functions are unique up to positive affine transformations U′(s) = aU(s) + b with a > 0 (like temperature scales).
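A quick illustrative check of this invariance (made-up states and lotteries): a positive affine rescaling never flips a preference between lotteries.

```python
# U'(s) = a*U(s) + b with a > 0 preserves the EU ordering of lotteries,
# just as Celsius vs. Fahrenheit preserves which day is hotter.
U = {"s1": 10.0, "s2": 4.0, "s3": 0.0}
A = {"s1": 0.5, "s3": 0.5}  # lottery A: EU = 5
B = {"s2": 1.0}             # lottery B: EU = 4

def eu(lottery, u):
    return sum(p * u[s] for s, p in lottery.items())

a, b = 3.0, -7.0
U2 = {s: a * v + b for s, v in U.items()}
assert (eu(A, U) > eu(B, U)) == (eu(A, U2) > eu(B, U2))  # ordering unchanged
```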
Risk Attitudes
Expected monetary value (EMV): linear utility in money.
In practice, people have risk-averse preferences for large sums:
- 50% chance of $1M vs. 100% chance of $400k → most people prefer the sure $400k
- Utility function U(x) = ln(x): concave → risk-averse
| Utility shape | Risk attitude |
|---|---|
| Concave (U'' < 0) | Risk-averse |
| Linear (U'' = 0) | Risk-neutral |
| Convex (U'' > 0) | Risk-seeking |
Insurance = paying to avoid risk (risk-averse behavior is rational if utility is concave).
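A small sketch of the $1M lottery above with U(x) = ln(x): the certainty equivalent falls far below the expected monetary value, which is why a risk-averse agent rationally pays a premium to shed risk. (The $1 floor for the losing outcome is an assumption, since ln(0) is undefined.)

```python
import math

# Lottery: 50% chance of $1,000,000, 50% chance of ~$0 ($1 floor).
p, win, lose = 0.5, 1_000_000, 1

emv = p * win + (1 - p) * lose                      # ~ $500,000
eu = p * math.log(win) + (1 - p) * math.log(lose)   # expected utility ~ 6.91
certainty_equivalent = math.exp(eu)                 # ~ $1,000

print(emv, round(certainty_equivalent, 2))
# ln(400_000) ~ 12.9 > eu ~ 6.9, so a sure $400k beats the lottery.
```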
Multi-Attribute Utility
When outcomes have multiple dimensions (cost, safety, time):
Preference independence: X₁ and X₂ are preferentially independent of X₃ if preferences between outcomes differing only in X₁ and X₂ do not depend on the value of X₃.
Mutual preferential independence: allows additive decomposition:
U(x₁, ..., xₙ) = Σᵢ wᵢ · Uᵢ(xᵢ)
Utility independence (stronger, defined over lotteries): mutual utility independence yields a multiplicative utility function when the additive form doesn't hold.
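A sketch of the additive form above, assuming mutual preferential independence holds; the attributes, weights, and normalizations are invented for illustration.

```python
# U(x1, ..., xn) = sum_i w_i * U_i(x_i), with each U_i scaled to [0, 1].
WEIGHTS = {"cost": 0.5, "safety": 0.3, "time": 0.2}
RANGES  = {"cost": (0, 100), "safety": (0, 10), "time": (0, 60)}  # made up

def sub_utility(attr: str, x: float) -> float:
    lo, hi = RANGES[attr]
    scaled = (x - lo) / (hi - lo)
    # Lower is better for cost and time; higher is better for safety.
    return 1 - scaled if attr in ("cost", "time") else scaled

def utility(outcome: dict) -> float:
    return sum(w * sub_utility(a, outcome[a]) for a, w in WEIGHTS.items())

print(utility({"cost": 40, "safety": 8, "time": 30}))  # 0.3 + 0.24 + 0.1 = 0.64
```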
Decision Networks (Influence Diagrams)
Extend Bayesian networks with:
- Chance nodes (ovals): random variables (same as BN nodes)
- Decision nodes (rectangles): variables the agent controls
- Utility node (diamond): the objective function
(Weather) ──→ (Rain) ──→ <Utility>
                            ↑
[Umbrella?] ────────────────┘

(Parentheses = chance nodes, brackets = decision node, angle brackets = utility node.)
Arcs into decision nodes (information arcs) indicate what information is available when the decision is made.
Evaluating a Decision Network
- Set the decision variable D to each possible value dᵢ
- Compute posterior P(parents of U | D=dᵢ, evidence)
- Compute expected utility EU(dᵢ) = Σ P(parents) · U(dᵢ, parents)
- Return decision with maximum EU
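A minimal sketch of this loop for the umbrella diagram above (illustrative numbers); with no other evidence, the posterior over the utility node's chance parent is just its prior.

```python
# Enumerate decisions, compute EU over the utility node's parents, pick the max.
P_rain = {"rain": 0.3, "no_rain": 0.7}       # chance parent of Utility
DECISIONS = ("take", "leave")                # values of the decision node

U = {("take", "rain"): 70, ("take", "no_rain"): 80,   # U(decision, rain)
     ("leave", "rain"): 0, ("leave", "no_rain"): 100}

def eu(d: str) -> float:
    return sum(P_rain[r] * U[(d, r)] for r in P_rain)  # steps 2-3

best = max(DECISIONS, key=eu)                          # step 4
print(best, {d: eu(d) for d in DECISIONS})             # take: 77.0, leave: 70.0
```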
The Value of Information
How much is it worth to learn the value of a variable X before acting?
Value of perfect information (VPI):
VPI(Xⱼ | e) = ( Σₓ P(Xⱼ = x | e) · EU(αₓ | e, Xⱼ = x) ) − EU(α | e)

where α is the best action given only evidence e, and αₓ is the best action chosen after also observing Xⱼ = x.
Computed by:
1. For each possible value xⱼ of Xⱼ, determine what the best action would be
2. Weight by P(Xⱼ = xⱼ | e)
3. Subtract the current EU
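A sketch of these three steps, reusing the umbrella numbers from the evaluation example; perfectly observing Rain before deciding is worth 91 − 77 = 14 here.

```python
P_rain = {"rain": 0.3, "no_rain": 0.7}
U = {("take", "rain"): 70, ("take", "no_rain"): 80,
     ("leave", "rain"): 0, ("leave", "no_rain"): 100}
ACTIONS = ("take", "leave")

# EU of the best action without the observation: max_a sum_r P(r) * U(a, r)
eu_without = max(sum(P_rain[r] * U[(a, r)] for r in P_rain) for a in ACTIONS)  # 77

# Steps 1-2: best action for each observed value, weighted by P(value)
eu_with = sum(P_rain[r] * max(U[(a, r)] for a in ACTIONS) for r in P_rain)     # 91

vpi = eu_with - eu_without  # step 3: 91 - 77 = 14
print(vpi)
```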
Properties:
- VPI ≥ 0 always (information cannot hurt in expectation)
- VPI = 0 if the best action is the same regardless of Xⱼ's value
- VPI is not additive in general: VPI(X, Y) ≠ VPI(X) + VPI(Y)
Applications: deciding which sensor to query next, which experiment to run. VPI guides active learning and information gathering.
Connection to RL
- MEU principle = foundation of all rational decision making
- MDP value function V*(s) = expected utility under the optimal policy
- Reward function R(s,a) = instantaneous utility of (state, action) pair
- VPI = motivation for exploration in RL: how much is it worth to try action a to learn its Q-value?
- The exploration bonus in UCB (bandit algorithms) can be viewed as a cheap approximation of VPI (sketch below)
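A sketch of the UCB1 score for comparison (standard form; c is an assumed exploration constant): the bonus shrinks as an arm is sampled, mirroring how VPI falls as uncertainty about an action's value falls.

```python
import math

def ucb1_score(mean_reward: float, n_pulls: int, t: int, c: float = 2.0) -> float:
    # Exploitation term + exploration bonus; the bonus stands in for
    # (an approximation of) the value of trying the arm again.
    return mean_reward + math.sqrt(c * math.log(t) / n_pulls)
```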
For DynamICCL: the RL reward (NCCL throughput) is the utility; the agent maximizes EU (expected throughput) by choosing NCCL parameters, which play the role of decision variables.