Deep Learning
Chapter 21 — Deep Learning. Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.), pages 753–809.
Neural Networks: The Building Block
Neuron (perceptron): h_w(x) = g(w₀ + Σ wᵢ·xᵢ) where g is an activation function.
Common activations:

| Function | Formula | Use |
|---|---|---|
| Sigmoid | 1/(1+e^{−x}) | Binary output (old) |
| Tanh | (e^x − e^{−x})/(e^x + e^{−x}) | Zero-centered |
| ReLU | max(0, x) | Hidden layers (current default) |
| Leaky ReLU | max(αx, x) | Avoids dying ReLU |
| GELU | x·Φ(x) | Transformers |
| Softmax | e^{xᵢ} / Σⱼ e^{xⱼ} | Multi-class output |
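A minimal NumPy sketch of these activations (the leaky-ReLU slope α = 0.01 and the tanh approximation of GELU are common conventions, assumed here rather than taken from the chapter):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def gelu(x):
    # x * Phi(x); tanh approximation commonly used in practice
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()
```

(Tanh is simply `np.tanh`.)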
Multilayer Perceptron (MLP) and Backpropagation
Forward pass: propagate input through layers:
a₀ = x
aₗ = gₗ(Wₗ · aₗ₋₁ + bₗ) for l = 1,...,L
ŷ = a_L
Backpropagation: compute ∂L/∂W by chain rule, backward through layers:
δ_L = ∂L/∂a_L ⊙ g'_L(z_L) -- output-layer delta
δₗ = (Wₗ₊₁ᵀ · δₗ₊₁) ⊙ g'ₗ(zₗ) -- hidden-layer delta
∂L/∂Wₗ = δₗ · aₗ₋₁ᵀ -- gradient w.r.t. weights
SGD with momentum: v ← β·v − α·∇L, then w ← w + v (the velocity v accumulates past gradients).
Adam: adaptive learning rates with first and second moment estimates.
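A self-contained sketch of the forward pass, backpropagation, and the momentum update for a one-hidden-layer MLP with squared-error loss (layer sizes, learning rate, and momentum coefficient are illustrative choices, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)

# x (3) -> hidden (4, ReLU) -> yhat (1, linear), loss L = 0.5 * ||yhat - y||^2
W1, b1 = rng.normal(0, 0.1, (4, 3)), np.zeros(4)
W2, b2 = rng.normal(0, 0.1, (1, 4)), np.zeros(1)
params = [W1, b1, W2, b2]
velocity = [np.zeros_like(p) for p in params]
alpha, beta = 0.01, 0.9          # learning rate, momentum

def step(x, y):
    # Forward pass: a0 = x, a1 = g(W1 a0 + b1), yhat = W2 a1 + b2
    z1 = W1 @ x + b1
    a1 = np.maximum(0.0, z1)     # ReLU
    yhat = W2 @ a1 + b2
    loss = 0.5 * np.sum((yhat - y) ** 2)

    # Backward pass: deltas via the chain rule
    d2 = yhat - y                # output-layer delta (linear output, squared error)
    d1 = (W2.T @ d2) * (z1 > 0)  # hidden-layer delta; (z1 > 0) is ReLU's derivative
    grads = [np.outer(d1, x), d1, np.outer(d2, a1), d2]

    # Momentum update: v <- beta*v - alpha*grad; w <- w + v (arrays updated in place)
    for p, v, g in zip(params, velocity, grads):
        v *= beta
        v -= alpha * g
        p += v
    return loss
```

Repeated calls to `step` on (x, y) pairs drive the loss down; Adam replaces the momentum update with per-parameter first- and second-moment estimates.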
Convolutional Neural Networks (CNNs)
For spatial data (images, sequences with local structure).
Convolution layer: apply a filter W (kernel) at each position:
(I * W)[i,j] = Σ_{m,n} I[i+m, j+n] · W[m,n]
Key properties:
- Parameter sharing: same weights at all positions
- Translation equivariance: shifted input → shifted output
- Local connectivity: each neuron sees only a local patch
Architecture: Conv → ReLU → Pool → … → Flatten → Dense
Pooling: max-pool or average-pool reduces spatial dimensions.
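The convolution sum and a 2×2 max-pool, translated directly into NumPy (valid padding and stride 1 are assumed for the convolution; these defaults are illustrative):

```python
import numpy as np

def conv2d(image, kernel):
    # (I * W)[i, j] = sum_{m,n} I[i+m, j+n] * W[m, n]   (valid padding, stride 1)
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool2d(x, size=2):
    # Non-overlapping size x size max pooling (truncates ragged edges)
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))
```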
Modern CNNs: ResNet (skip connections), VGG, EfficientNet, ConvNeXt.
Recurrent Neural Networks (RNNs)
For sequential data. Hidden state hₜ depends on input xₜ and previous state hₜ₋₁:
hₜ = g(Wₕ · hₜ₋₁ + Wₓ · xₜ + b)
yₜ = Wᵧ · hₜ
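The recurrence as code, with tanh as a typical choice for g (the activation and all shapes are assumptions, not fixed by the notes):

```python
import numpy as np

def rnn_step(x_t, h_prev, Wh, Wx, b, Wy):
    # h_t = g(Wh h_{t-1} + Wx x_t + b);  y_t = Wy h_t
    h_t = np.tanh(Wh @ h_prev + Wx @ x_t + b)
    return h_t, Wy @ h_t

def rnn_forward(xs, h0, Wh, Wx, b, Wy):
    # Unroll the same cell (shared weights) across the sequence
    h, ys = h0, []
    for x_t in xs:
        h, y = rnn_step(x_t, h, Wh, Wx, b, Wy)
        ys.append(y)
    return h, ys
```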
Vanishing gradient problem: gradients decay exponentially through time → RNN forgets long-range dependencies.
LSTM (Long Short-Term Memory): uses gates to control information flow:
fₜ = σ(Wf · [hₜ₋₁, xₜ] + bf) -- forget gate
iₜ = σ(Wi · [hₜ₋₁, xₜ] + bi) -- input gate
oₜ = σ(Wo · [hₜ₋₁, xₜ] + bo) -- output gate
c̃ₜ = tanh(Wc · [hₜ₋₁, xₜ] + bc) -- candidate cell state
cₜ = fₜ⊙cₜ₋₁ + iₜ⊙c̃ₜ -- cell state update
hₜ = oₜ⊙tanh(cₜ) -- hidden state
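A minimal NumPy cell implementing the six equations above; `[h, x]` is concatenation, and the weight shapes are left to the caller (a sketch, not a reference implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, Wf, bf, Wi, bi, Wo, bo, Wc, bc):
    hx = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
    f = sigmoid(Wf @ hx + bf)           # forget gate
    i = sigmoid(Wi @ hx + bi)           # input gate
    o = sigmoid(Wo @ hx + bo)           # output gate
    c_tilde = np.tanh(Wc @ hx + bc)     # candidate cell state
    c = f * c_prev + i * c_tilde        # cell state update
    h = o * np.tanh(c)                  # hidden state
    return h, c
```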
GRU (Gated Recurrent Unit): a simplified LSTM with two gates (reset, update) and no separate cell state.
Attention and Transformers
Self-attention: each position attends to every position in the sequence (including itself).
Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V
where Q = queries, K = keys, V = values (all linear projections of the input).
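The formula in NumPy, with the softmax taken over each row of the score matrix (batch and head dimensions omitted for clarity):

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```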
Multi-head attention: run h parallel attention heads, concatenate:
MultiHead(Q,K,V) = Concat(head₁,...,headₕ) · W^O
headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)
Transformer block:
x = x + MultiHeadAttn(LayerNorm(x)) -- self-attention sublayer
x = x + FFN(LayerNorm(x)) -- feed-forward sublayer
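A pre-LayerNorm block matching the two sublayer equations, using single-head self-attention to keep the sketch short (real blocks use multi-head attention; projection shapes and the ReLU FFN are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(x, Wq, Wk, Wv, W1, b1, W2, b2):
    # x: (seq_len, d_model)
    h = layer_norm(x)
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    x = x + softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V   # self-attention sublayer
    h = layer_norm(x)
    return x + (np.maximum(0.0, h @ W1 + b1) @ W2 + b2)   # feed-forward sublayer
```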
Transformers replaced RNNs for NLP tasks (BERT, GPT, T5) and now dominate vision (ViT), RL (Decision Transformer), and speech.
Generative Models
Variational Autoencoder (VAE)
Encoder: q(z|x) = N(μ(x), σ²(x)) -- approximates the true posterior p(z|x)
Decoder: p(x|z) -- generative model
ELBO = E_{q(z|x)}[log p(x|z)] - KL(q(z|x)||p(z))
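For a diagonal-Gaussian encoder and a standard-normal prior, the KL term has a closed form, and sampling z uses the reparameterization trick z = μ + σ·ε so gradients flow through the encoder; a sketch, with the decoder likelihood left abstract as a hypothetical `decode_log_likelihood` callable:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps, eps ~ N(0, I): sampling stays differentiable in mu, sigma
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def kl_to_standard_normal(mu, log_var):
    # Closed form for KL( N(mu, diag(sigma^2)) || N(0, I) )
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def neg_elbo(x, mu, log_var, decode_log_likelihood, rng):
    # -ELBO = -E_q[log p(x|z)] + KL(q(z|x) || p(z)); one-sample Monte Carlo estimate
    z = reparameterize(mu, log_var, rng)
    return -decode_log_likelihood(x, z) + kl_to_standard_normal(mu, log_var)
```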
Generative Adversarial Network (GAN)
Generator G: z → x̂ (generate fake samples)
Discriminator D: x → [0,1] (distinguish real from fake)
min_G max_D E_{x~data}[log D(x)] + E_{z~p(z)}[log(1 − D(G(z)))]
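The two sides of the minimax objective as losses over batches of discriminator outputs (ε added for numerical safety; in practice the generator often maximizes log D(G(z)) instead, the non-saturating variant):

```python
import numpy as np

def d_loss(d_real, d_fake, eps=1e-12):
    # Discriminator ascends E[log D(x)] + E[log(1 - D(G(z)))]: minimize the negation
    return -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

def g_loss(d_fake, eps=1e-12):
    # Generator descends E[log(1 - D(G(z)))] (minimax form)
    return np.mean(np.log(1.0 - d_fake + eps))
```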
Diffusion Models
Forward process: gradually add Gaussian noise to x₀ → xₜ. Reverse process: learn to denoise xₜ → xₜ₋₁. Used in DALL-E 2, Stable Diffusion, and Sora.
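In the common DDPM parameterization (an assumption here; the notes don't fix one), the forward process has a closed form xₜ = √ᾱₜ·x₀ + √(1−ᾱₜ)·ε with ᾱₜ = ∏ₛ(1−βₛ):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    # Jump straight to step t: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps
    alpha_bar = np.cumprod(1.0 - betas)[t]   # t is a 0-based index into the schedule
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps
```

The returned ε is the regression target that a denoising network is trained to predict.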
Regularization Techniques
| Technique | Mechanism |
|---|---|
| Dropout | Randomly zero activations during training |
| Batch normalization | Normalize layer activations |
| Weight decay | L2 penalty on weights |
| Data augmentation | Random transforms of training examples |
| Early stopping | Stop when validation loss increases |
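Dropout as usually implemented ("inverted" dropout: scale by 1/(1−p) at train time so no rescaling is needed at test time; the scaling convention is standard practice, not spelled out in the table):

```python
import numpy as np

def dropout(a, p=0.5, training=True, rng=None):
    # Randomly zero activations with probability p during training
    if not training or p == 0.0:
        return a
    rng = rng or np.random.default_rng()
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)   # rescale so E[output] matches test time
```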
Connection to RL / DynamICCL
- Deep Q-Networks (DQN): Q*(s,a) approximated by CNN/MLP → breakthrough in RL
- PPO, SAC, TD3: policy + value networks as deep NNs
- DynamICCL RL agent: MLP/Transformer maps NCCL state observations to parameter choices (see the sketch after this list)
- Transformers for sequence modeling of NCCL operation histories → context-aware decisions
- VAE / flow models: learn latent representation of network state for better generalization
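A deliberately hypothetical sketch of that mapping, assuming a flat feature vector of NCCL state observations and a discrete set of parameter choices; every name and size here (NUM_FEATURES, NUM_CHOICES, the two-layer MLP) is a placeholder, not the project's actual interface:

```python
import numpy as np

NUM_FEATURES, HIDDEN, NUM_CHOICES = 16, 64, 8   # placeholder sizes, not DynamICCL's

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (HIDDEN, NUM_FEATURES))
W2 = rng.normal(0, 0.1, (NUM_CHOICES, HIDDEN))

def policy(obs):
    # MLP policy head: observation features -> distribution over parameter choices
    h = np.maximum(0.0, W1 @ obs)                # ReLU hidden layer
    logits = W2 @ h
    e = np.exp(logits - logits.max())            # stable softmax
    return e / e.sum()
```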