Utility Theory and Decision Networks
Chapter 16 — Making Simple Decisions
Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.), pp. 528–570
From Probability to Decisions
Probability tells us what is likely. To make decisions, we also need to know what is desirable.
Decision theory = probability theory + utility theory.
Principle: an agent should choose the action that maximizes expected utility (MEU):
a* = argmax_a Σ_s P(s | a) · U(s)
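As a minimal sketch of MEU in code — the actions, outcome distributions, and utilities below are invented for illustration, not from the book:

```python
# MEU sketch: choose a* = argmax_a sum_s P(s | a) * U(s).
# All numbers here are illustrative.

P = {  # P(s | a): distribution over outcome states for each action
    "take_umbrella":  {"dry_carrying": 1.0},
    "leave_umbrella": {"dry": 0.7, "wet": 0.3},
}
U = {"dry": 100, "dry_carrying": 80, "wet": 0}  # U(s)

def expected_utility(action: str) -> float:
    return sum(p * U[s] for s, p in P[action].items())

a_star = max(P, key=expected_utility)
print(a_star, expected_utility(a_star))  # take_umbrella 80.0 (leave: 70.0)
```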
Utility Functions
A utility function U(s) maps states to real numbers representing desirability.
Axioms of utility (von Neumann–Morgenstern). If preferences satisfy:
1. Orderability: exactly one of A ≻ B, A ≺ B, or A ~ B holds
2. Transitivity: A ≻ B ∧ B ≻ C → A ≻ C
3. Continuity: A ≻ B ≻ C → ∃p [p, A; (1−p), C] ~ B
4. Substitutability: A ~ B → [p, A; (1−p), C] ~ [p, B; (1−p), C]
5. Monotonicity: A ≻ B → (p > q ↔ [p, A; (1−p), B] ≻ [q, A; (1−q), B])
6. Decomposability: compound lotteries reduce to simple ones
Then there exists a utility function U such that:
A ≻ B iff EU(A) > EU(B)
U([p₁, S₁; ...; pₙ, Sₙ]) = Σᵢ pᵢ · U(Sᵢ)
Utility functions are unique up to positive affine transformations U′(s) = aU(s) + b with a > 0 (like temperature scales).
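A quick illustrative check of this invariance (made-up states and lotteries): a positive affine rescaling never flips a preference between lotteries.

```python
# U'(s) = a*U(s) + b with a > 0 preserves the EU ordering of lotteries,
# just as Celsius vs. Fahrenheit preserves which day is hotter.
U = {"s1": 10.0, "s2": 4.0, "s3": 0.0}
A = {"s1": 0.5, "s3": 0.5}  # lottery A: EU = 5
B = {"s2": 1.0}             # lottery B: EU = 4

def eu(lottery, u):
    return sum(p * u[s] for s, p in lottery.items())

a, b = 3.0, -7.0
U2 = {s: a * v + b for s, v in U.items()}
assert (eu(A, U) > eu(B, U)) == (eu(A, U2) > eu(B, U2))  # ordering unchanged
```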
Risk Attitudes
Expected monetary value (EMV): linear utility in money.
In practice, people have risk-averse preferences for large sums:
- 50% chance of $1M vs. 100% chance of $400k → most people prefer the sure $400k
- Utility function U(x) = ln(x): concave → risk-averse
| Utility shape | Risk attitude |
|---|---|
| Concave (U'' < 0) | Risk-averse |
| Linear (U'' = 0) | Risk-neutral |
| Convex (U'' > 0) | Risk-seeking |
Insurance = paying to avoid risk (risk-averse behavior is rational if utility is concave).
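A small sketch of the $1M lottery above with U(x) = ln(x): the certainty equivalent falls far below the expected monetary value, which is why a risk-averse agent rationally pays a premium to shed risk. (The $1 floor for the losing outcome is an assumption, since ln(0) is undefined.)

```python
import math

# Lottery: 50% chance of $1,000,000, 50% chance of ~$0 ($1 floor).
p, win, lose = 0.5, 1_000_000, 1

emv = p * win + (1 - p) * lose                      # ~ $500,000
eu = p * math.log(win) + (1 - p) * math.log(lose)   # expected utility ~ 6.91
certainty_equivalent = math.exp(eu)                 # ~ $1,000

print(emv, round(certainty_equivalent, 2))
# ln(400_000) ~ 12.9 > eu ~ 6.9, so a sure $400k beats the lottery.
```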
Multi-Attribute Utility
When outcomes have multiple dimensions (cost, safety, time):
Preference independence: X₁ and X₂ are preferentially independent of X₃ if preferences between outcomes differing only in X₁ and X₂ do not depend on the value of X₃.
Mutual preferential independence: allows additive decomposition:
U(x₁, ..., xₙ) = Σᵢ wᵢ · Uᵢ(xᵢ)
Utility independence (stronger, defined over lotteries): mutual utility independence yields a multiplicative utility function when the additive form doesn't hold.
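A sketch of the additive form above, assuming mutual preferential independence holds; the attributes, weights, and normalizations are invented for illustration.

```python
# U(x1, ..., xn) = sum_i w_i * U_i(x_i), with each U_i scaled to [0, 1].
WEIGHTS = {"cost": 0.5, "safety": 0.3, "time": 0.2}
RANGES  = {"cost": (0, 100), "safety": (0, 10), "time": (0, 60)}  # made up

def sub_utility(attr: str, x: float) -> float:
    lo, hi = RANGES[attr]
    scaled = (x - lo) / (hi - lo)
    # Lower is better for cost and time; higher is better for safety.
    return 1 - scaled if attr in ("cost", "time") else scaled

def utility(outcome: dict) -> float:
    return sum(w * sub_utility(a, outcome[a]) for a, w in WEIGHTS.items())

print(utility({"cost": 40, "safety": 8, "time": 30}))  # 0.3 + 0.24 + 0.1 = 0.64
```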
Decision Networks (Influence Diagrams)
Extend Bayesian networks with:
- Chance nodes (ovals): random variables (same as BN nodes)
- Decision nodes (rectangles): variables the agent controls
- Utility node (diamond): the objective function
(Weather) ──→ (Rain) ──→ <Utility>
                            ↑
[Umbrella?] ────────────────┘

(Parentheses = chance nodes, brackets = decision node, angle brackets = utility node.)
Arcs into decision nodes (information arcs) indicate what information is available when the decision is made.
Evaluating a Decision Network
- Set the decision variable D to each possible value dᵢ
- Compute posterior P(parents of U | D=dᵢ, evidence)
- Compute expected utility EU(dᵢ) = Σ P(parents) · U(dᵢ, parents)
- Return decision with maximum EU
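A minimal sketch of this loop for the umbrella diagram above (illustrative numbers); with no other evidence, the posterior over the utility node's chance parent is just its prior.

```python
# Enumerate decisions, compute EU over the utility node's parents, pick the max.
P_rain = {"rain": 0.3, "no_rain": 0.7}       # chance parent of Utility
DECISIONS = ("take", "leave")                # values of the decision node

U = {("take", "rain"): 70, ("take", "no_rain"): 80,   # U(decision, rain)
     ("leave", "rain"): 0, ("leave", "no_rain"): 100}

def eu(d: str) -> float:
    return sum(P_rain[r] * U[(d, r)] for r in P_rain)  # steps 2-3

best = max(DECISIONS, key=eu)                          # step 4
print(best, {d: eu(d) for d in DECISIONS})             # take: 77.0, leave: 70.0
```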
The Value of Information
How much is it worth to learn the value of a variable X before acting?
Value of perfect information (VPI):
VPI(Xⱼ | e) = ( Σₓ P(Xⱼ = x | e) · EU(αₓ | e, Xⱼ = x) ) − EU(α | e)

where α is the best action given only evidence e, and αₓ is the best action chosen after also observing Xⱼ = x.
Computed by:
1. For each possible value xⱼ of Xⱼ, determine what the best action would be
2. Weight by P(Xⱼ = xⱼ | e)
3. Subtract the current EU
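A sketch of these three steps, reusing the umbrella numbers from the evaluation example; perfectly observing Rain before deciding is worth 91 − 77 = 14 here.

```python
P_rain = {"rain": 0.3, "no_rain": 0.7}
U = {("take", "rain"): 70, ("take", "no_rain"): 80,
     ("leave", "rain"): 0, ("leave", "no_rain"): 100}
ACTIONS = ("take", "leave")

# EU of the best action without the observation: max_a sum_r P(r) * U(a, r)
eu_without = max(sum(P_rain[r] * U[(a, r)] for r in P_rain) for a in ACTIONS)  # 77

# Steps 1-2: best action for each observed value, weighted by P(value)
eu_with = sum(P_rain[r] * max(U[(a, r)] for a in ACTIONS) for r in P_rain)     # 91

vpi = eu_with - eu_without  # step 3: 91 - 77 = 14
print(vpi)
```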
Properties:
- VPI ≥ 0 always (information cannot hurt in expectation)
- VPI = 0 if the best action is the same regardless of Xⱼ's value
- VPI is not additive in general: VPI(X, Y) ≠ VPI(X) + VPI(Y)
Applications: deciding which sensor to query next, which experiment to run. VPI guides active learning and information gathering.
Connection to RL
- MEU principle = foundation of all rational decision making
- MDP value function V*(s) = expected utility under the optimal policy
- Reward function R(s,a) = instantaneous utility of (state, action) pair
- VPI = motivation for exploration in RL: how much is it worth to try action a to learn its Q-value?
- The exploration bonus in UCB (bandit algorithms) can be viewed as a cheap approximation of VPI (sketch below)
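A sketch of the UCB1 score for comparison (standard form; c is an assumed exploration constant): the bonus shrinks as an arm is sampled, mirroring how VPI falls as uncertainty about an action's value falls.

```python
import math

def ucb1_score(mean_reward: float, n_pulls: int, t: int, c: float = 2.0) -> float:
    # Exploitation term + exploration bonus; the bonus stands in for
    # (an approximation of) the value of trying the arm again.
    return mean_reward + math.sqrt(c * math.log(t) / n_pulls)
```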
For DynamICCL: the RL reward (NCCL throughput) is the utility; the agent maximizes EU (expected throughput) by choosing NCCL parameters, which play the role of decision variables.