Deep Learning

Chapter 21: Deep Learning. Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.), pages 753–809.


Neural Networks: The Building Block

Neuron (perceptron): h_w(x) = g(w₀ + Σ wᵢ·xᵢ) where g is an activation function.

Common activations:

| Function | Formula | Use |
|---|---|---|
| Sigmoid | 1/(1+e^{-x}) | Binary output (old) |
| Tanh | (e^x - e^{-x})/(e^x + e^{-x}) | Zero-centered |
| ReLU | max(0, x) | Hidden layers (current default) |
| Leaky ReLU | max(αx, x) | Avoids dying ReLU |
| GELU | x·Φ(x) | Transformers |
| Softmax | e^{xᵢ} / Σⱼ e^{xⱼ} | Multi-class output |
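A minimal NumPy sketch of these activations (the GELU uses the common tanh approximation; function names are illustrative, not from the book):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def gelu(x):
    # tanh approximation of x * Phi(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    # subtract the max for numerical stability
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)
```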


Multilayer Perceptron (MLP) and Backpropagation

Forward pass: propagate input through layers:

a₀ = x
aₗ = gₗ(Wₗ · aₗ₋₁ + bₗ)   for l = 1,...,L
ŷ = aL

Backpropagation: compute ∂L/∂W by chain rule, backward through layers:

δL = ∂L/∂aL ⊙ g'L(zL)           -- output layer delta
δₗ = (Wₗ₊₁ᵀ · δₗ₊₁) ⊙ g'ₗ(zₗ)   -- hidden layer delta
∂L/∂Wₗ = δₗ · aₗ₋₁ᵀ             -- gradient w.r.t. weights
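A minimal sketch of the forward and backward pass for a two-layer MLP with a mean-squared-error loss (layer sizes, the ReLU/linear choice, and all variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))          # input: 4 features, 1 example
y = rng.normal(size=(2, 1))          # target

# parameters: 4 -> 3 hidden (ReLU) -> 2 output (linear)
W1, b1 = rng.normal(size=(3, 4)) * 0.1, np.zeros((3, 1))
W2, b2 = rng.normal(size=(2, 3)) * 0.1, np.zeros((2, 1))

# forward pass: a_l = g_l(W_l · a_{l-1} + b_l)
z1 = W1 @ x + b1
a1 = np.maximum(0.0, z1)             # ReLU
z2 = W2 @ a1 + b2
y_hat = z2                           # linear output layer
loss = 0.5 * np.sum((y_hat - y) ** 2)

# backward pass: deltas flow from the output layer toward the input
delta2 = (y_hat - y)                 # dLoss/dz2 for MSE + linear output
dW2, db2 = delta2 @ a1.T, delta2
delta1 = (W2.T @ delta2) * (z1 > 0)  # chain rule through the ReLU
dW1, db1 = delta1 @ x.T, delta1
```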

SGD with momentum: v ← β·v - α·∇L; w ← w + v

Adam: adaptive learning rates with first and second moment estimates.
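A sketch of both update rules applied to a single parameter vector (hyperparameter defaults are typical values, not prescribed by the chapter):

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """v <- beta*v - lr*grad;  w <- w + v"""
    v = beta * v - lr * grad
    return w + v, v

def adam_step(w, m, s, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: bias-corrected first (m) and second (s) moment estimates."""
    m = b1 * m + (1 - b1) * grad           # first moment: running mean of gradients
    s = b2 * s + (1 - b2) * grad ** 2      # second moment: running mean of squared gradients
    m_hat = m / (1 - b1 ** t)              # bias correction (t starts at 1)
    s_hat = s / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(s_hat) + eps)
    return w, m, s
```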


Convolutional Neural Networks (CNNs)

For spatial data (images, sequences with local structure).

Convolution layer: apply a filter W (kernel) at each position:

(I * W)[i,j] = Σ_{m,n} I[i+m, j+n] · W[m,n]

Key properties:

- Parameter sharing: same weights at all positions
- Translation equivariance: shifted input → shifted output
- Local connectivity: each neuron sees only a local patch

Architecture: Conv → ReLU → Pool → … → Flatten → Dense

Pooling: max-pool or average-pool reduces spatial dimensions.
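A naive NumPy sketch of a valid 2D convolution (implemented, as in most libraries, as cross-correlation) followed by non-overlapping 2×2 max pooling; stride 1 and no padding are simplifying assumptions:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation: out[i,j] = sum_{m,n} image[i+m, j+n] * kernel[m,n]."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

def max_pool2d(x, size=2):
    """Non-overlapping max pooling; trims edges not divisible by `size`."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]
    return x.reshape(H // size, size, W // size, size).max(axis=(1, 3))
```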

Modern CNNs: ResNet (skip connections), VGG, EfficientNet, ConvNeXt.


Recurrent Neural Networks (RNNs)

For sequential data. Hidden state hₜ depends on input xₜ and previous state hₜ₋₁:

hₜ = g(Wₕ · hₜ₋₁ + Wₓ · xₜ + b)
yₜ = Wᵧ · hₜ
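A sketch of these recurrence equations unrolled over a sequence (tanh activation and the layer sizes in the usage lines are illustrative):

```python
import numpy as np

def rnn_forward(xs, Wh, Wx, Wy, b):
    """Run a vanilla RNN over a sequence xs (list of input vectors)."""
    h = np.zeros(Wh.shape[0])
    ys = []
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x + b)   # h_t depends on h_{t-1} and x_t
        ys.append(Wy @ h)                  # y_t = Wy · h_t
    return ys, h

# illustrative usage with random weights
rng = np.random.default_rng(0)
d_in, d_h, d_out = 3, 5, 2
Wh, Wx = rng.normal(size=(d_h, d_h)) * 0.1, rng.normal(size=(d_h, d_in)) * 0.1
Wy, b = rng.normal(size=(d_out, d_h)) * 0.1, np.zeros(d_h)
ys, h_last = rnn_forward([rng.normal(size=d_in) for _ in range(10)], Wh, Wx, Wy, b)
```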

Vanishing gradient problem: gradients decay exponentially through time → RNN forgets long-range dependencies.

LSTM (Long Short-Term Memory): uses gates to control information flow:

fₜ = σ(Wf · [hₜ₋₁, xₜ] + bf)    -- forget gate
iₜ = σ(Wi · [hₜ₋₁, xₜ] + bi)    -- input gate
oₜ = σ(Wo · [hₜ₋₁, xₜ] + bo)    -- output gate
c̃ₜ = tanh(Wc · [hₜ₋₁, xₜ] + bc)  -- candidate cell state
cₜ = fₜ⊙cₜ₋₁ + iₜ⊙c̃ₜ            -- cell state update
hₜ = oₜ⊙tanh(cₜ)                 -- hidden state
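A sketch of a single LSTM cell step following the gate equations above, in the concatenated-input formulation (weight shapes and names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wo, Wc, bf, bi, bo, bc):
    """One LSTM step; each W has shape (hidden, hidden + input)."""
    hx = np.concatenate([h_prev, x])       # [h_{t-1}, x_t]
    f = sigmoid(Wf @ hx + bf)              # forget gate
    i = sigmoid(Wi @ hx + bi)              # input gate
    o = sigmoid(Wo @ hx + bo)              # output gate
    c_tilde = np.tanh(Wc @ hx + bc)        # candidate cell state
    c = f * c_prev + i * c_tilde           # cell state update
    h = o * np.tanh(c)                     # hidden state
    return h, c
```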

GRU: simplified LSTM with fewer gates.


Attention and Transformers

Self-attention: each position attends to all other positions.

Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V

Where Q=queries, K=keys, V=values (all linear projections of input).
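A sketch of scaled dot-product attention for a single unbatched sequence, with no masking (shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v); returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V
```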

Multi-head attention: run h parallel attention heads, concatenate:

MultiHead(Q,K,V) = Concat(head₁,...,headₕ) · W^O
headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)
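A sketch of multi-head attention; the per-head projections are taken as slices of full-width projections so the block stays short, which is a simplifying assumption (d_model must be divisible by n_heads):

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    d_head = X.shape[-1] // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for i in range(n_heads):
        sl = slice(i * d_head, (i + 1) * d_head)       # split projections per head
        heads.append(attention(Q[:, sl], K[:, sl], V[:, sl]))
    return np.concatenate(heads, axis=-1) @ Wo         # concat heads, output projection
```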

Transformer block:

x = x + MultiHeadAttn(LayerNorm(x))    -- self-attention sublayer
x = x + FFN(LayerNorm(x))             -- feed-forward sublayer
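A sketch of this pre-norm residual block with single-head attention and a two-layer ReLU FFN, kept deliberately small; the unparameterized layer norm and the single head are simplifying assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)     # no learned scale/shift, for brevity

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    """x: (seq_len, d_model). Single-head self-attention + FFN, each with a residual."""
    h = layer_norm(x)
    attn = softmax((h @ Wq) @ (h @ Wk).T / np.sqrt(Wq.shape[1])) @ (h @ Wv)
    x = x + attn                             # self-attention sublayer
    h = layer_norm(x)
    x = x + np.maximum(0.0, h @ W1) @ W2     # feed-forward sublayer (ReLU)
    return x
```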

Transformers replaced RNNs for NLP tasks (BERT, GPT, T5) and now dominate vision (ViT), RL (Decision Transformer), and speech.


Generative Models

Variational Autoencoder (VAE)

Encoder: q(z|x) ≈ N(μ(x), σ²(x))   -- approximate posterior
Decoder: p(x|z)                       -- generative model
ELBO = E_{q(z|x)}[log p(x|z)] - KL(q(z|x)||p(z))
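A sketch of the corresponding training loss (negative ELBO) for a Gaussian encoder and Bernoulli decoder, plus the reparameterization trick; the Bernoulli reconstruction term and the standard-normal prior are assumptions about a typical setup:

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Negative ELBO: reconstruction term + KL(q(z|x) || N(0, I)).

    x, x_recon in [0, 1] (Bernoulli decoder); mu, log_var come from the encoder.
    """
    eps = 1e-7
    recon = -np.sum(x * np.log(x_recon + eps) + (1 - x) * np.log(1 - x_recon + eps))
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))  # closed-form Gaussian KL
    return recon + kl

def reparameterize(mu, log_var, rng=np.random.default_rng()):
    """Sample z = mu + sigma * eps so gradients can flow through mu and log_var."""
    return mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)
```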

Generative Adversarial Network (GAN)

Generator G: z → x̂        (generate fake samples from noise z ~ p(z))
Discriminator D: x → [0,1]  (distinguish real from fake)
min_G max_D  E_{x~p_data}[log D(x)] + E_{z~p(z)}[log(1 - D(G(z)))]
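A sketch of the two loss terms implied by the minimax objective, given the discriminator's outputs on real and generated batches; the non-saturating generator loss shown is the variant commonly minimized in practice:

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-7):
    """Maximize E[log D(x)] + E[log(1 - D(G(z)))], i.e. minimize its negation."""
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1 - d_fake + eps))

def generator_loss(d_fake, eps=1e-7):
    """Original objective: minimize E[log(1 - D(G(z)))]; in practice the
    non-saturating form -E[log D(G(z))] below is usually used instead."""
    return -np.mean(np.log(d_fake + eps))
```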

Diffusion Models

Forward process: gradually add Gaussian noise to x₀ → xₜ. Reverse process: learn to denoise xₜ → xₜ₋₁. Used for: DALL-E 2, Stable Diffusion, Sora.
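A sketch of the closed-form forward (noising) step q(xₜ | x₀) under a simple linear β schedule; the schedule values are illustrative, and the learned reverse (denoising) model is omitted:

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)      # \bar{alpha}_t = prod_s (1 - beta_s)
    return alpha_bar

def q_sample(x0, t, alpha_bar, rng=np.random.default_rng()):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t)·x_0, (1 - abar_t)·I)."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
```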


Regularization Techniques

| Technique | Mechanism |
|---|---|
| Dropout | Randomly zero activations during training |
| Batch normalization | Normalize layer activations |
| Weight decay | L2 penalty on weights |
| Data augmentation | Random transforms of training examples |
| Early stopping | Stop when validation loss increases |
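A minimal sketch of two of these mechanisms, inverted dropout and an L2 weight-decay term; the rates and the "inverted" rescaling convention are assumptions about a typical implementation, not taken from the chapter:

```python
import numpy as np

def dropout(a, p=0.5, training=True, rng=np.random.default_rng()):
    """Inverted dropout: zero each activation with prob p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return a
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)

def l2_penalty(weights, lam=1e-4):
    """Weight-decay term added to the loss: (lam/2) * sum ||W||^2."""
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
```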

Connection to Reinforcement Learning