Bayesian Networks

Chapter 13 — Probabilistic Reasoning Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed) Pages: 451–496

What is a Bayesian Network?

A Bayesian network (BN) is a directed acyclic graph (DAG) where: - Nodes: random variables - Directed edges: X → Y means “X directly influences Y” (X is a parent of Y) - Each node stores a conditional probability table (CPT): P(Xᵢ | Parents(Xᵢ))

The network represents the joint distribution compactly via the chain rule:

P(X₁, ..., Xₙ) = Π P(Xᵢ | Parents(Xᵢ))

Example: Alarm Network

Burglary  Earthquake
    \        /
     \      /
       Alarm
      /     \
JohnCalls  MaryCalls

Variables: B=Burglary, E=Earthquake, A=Alarm, J=JohnCalls, M=MaryCalls

CPTs: - P(B) = 0.001 - P(E) = 0.002 - P(A | B, E): 4 entries (2×2 parent combinations) - P(J | A): 2 entries - P(M | A): 2 entries

Total: 2 + 2 + 4 + 2 + 2 = 10 numbers vs. 2^5-1 = 31 for full joint.

Conditional Independence in BNs

Three key patterns:

1. Chain: A → B → C

A and C are d-separated given B: A ⊥ C | B
Information flows A → B → C when B is unobserved; blocked when B is observed

2. Fork (Common cause): A ← B → C

A and C are d-separated given B: A ⊥ C | B
B explains the correlation between A and C

3. Collider (V-structure): A → B ← C

A and C are d-separated when B is unobserved: A ⊥ C
Explaining away: observing B makes A and C correlated! (B is a collider)
Also active if any descendant of B is observed

D-Separation

d-separation (directed separation): a path between X and Y is blocked given evidence Z if: 1. Chain or Fork with middle node in Z: blocked 2. Collider with collider (and no descendants) in Z: blocked 3. Collider with collider OR descendant in Z: unblocked (explaining away)

If all paths between X and Y are blocked given Z, then X ⊥ Y | Z in the distribution.

Joint Distribution Factorization

For the Alarm network:

P(B,E,A,J,M) = P(B) · P(E) · P(A|B,E) · P(J|A) · P(M|A)

Any query can be computed from this factored form.

Query Types

Query	Description	Example
Marginal	P(X)	P(Burglary)
Posterior	P(X	e)
MAP	argmax_x P(x	e)
MPE	argmax_x₁..xₙ P(x₁..xₙ	e)

Compact Representations

Deterministic nodes

P(X | parents) = 1 for exactly one value — represented as a function.

Noisy-OR

Each cause Cᵢ independently can trigger X:

P(X=F | c₁,...,cₙ) = Π P(X=F | cᵢ) = Π (1 - qᵢ)

Where qᵢ = P(X=T | cᵢ alone).

Reduces CPT from 2^n entries to n entries.

Relevance to DynamICCL

BN for NCCL: - P(Congestion | NetworkLoad, NumGPUs, Algorithm) - P(Throughput | Congestion, BufferSize, ChunkSize) - Posterior inference: “given observed latency, what’s P(Algorithm=Ring is suboptimal)?”

Bayesian networks provide a principled uncertainty-aware model for NCCL performance — complementing the RL policy with probabilistic diagnostics.