Deep Learning for NLP

Chapter 24 — Deep Learning for NLP
Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.)
Pages: 911–953


The Transformer Revolution

Pre-2017: RNNs/LSTMs dominated NLP. Problems: sequential computation (can’t parallelize), vanishing gradients, limited context window.

“Attention is All You Need” (Vaswani et al., 2017): replace RNNs entirely with self-attention → transformers.

Key benefits:
- Parallelizable: all positions computed simultaneously
- Long-range dependencies: direct attention from any position to any other
- Scalable: larger models → better performance (scaling laws)


Transformer Architecture (Recall)

- Encoder (e.g., BERT): bidirectional; processes the full input at once.
- Decoder (e.g., GPT): autoregressive; generates left-to-right.
- Encoder-decoder (e.g., T5, BART): for seq2seq tasks.
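A minimal numpy sketch of single-head scaled dot-product self-attention; the `causal` flag switches between bidirectional (encoder/BERT-style) and autoregressive (decoder/GPT-style) masking. Names, shapes, and the toy usage are illustrative assumptions, not from the chapter.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv, causal=False):
    """Single-head scaled dot-product self-attention over a sequence X (n x d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # (n x n) attention logits
    if causal:                                  # decoder: position t attends only to <= t
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V                          # weighted sum of values

# toy usage: 5 tokens, model dimension 8
n, d = 5, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out_bidirectional = self_attention(X, Wq, Wk, Wv, causal=False)  # BERT-style
out_autoregressive = self_attention(X, Wq, Wk, Wv, causal=True)  # GPT-style
```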

Position encoding: since attention is permutation-invariant, add positional information:

PE(pos, 2i)   = sin(pos / 10000^{2i/d})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d})
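A small numpy sketch of the sinusoidal encoding above; function and parameter names are illustrative, and the result is simply added to the token embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1) positions
    two_i = np.arange(0, d_model, 2)[None, :]  # even dimension indices = 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims get sine
    pe[:, 1::2] = np.cos(angles)               # odd dims get cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
```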

Modern alternatives: rotary position embeddings (RoPE), ALiBi, learned embeddings.


Pretraining and Fine-tuning

Transfer learning for NLP:
1. Pretrain on a large unlabeled corpus (self-supervised)
2. Fine-tune on a small labeled dataset for a specific task

This shifts the data burden from scarce labeled data to abundant unlabeled data.

BERT (Bidirectional Encoder Representations from Transformers, 2019)

Pretraining objectives:
1. Masked Language Modeling (MLM): mask 15% of tokens; predict them (sketch below)
2. Next Sentence Prediction (NSP): is sentence B the next sentence after A?
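A toy sketch of MLM input corruption following the BERT paper's 80/10/10 replacement rule; the vocabulary and token list here are made up for illustration.

```python
import random

MASK, VOCAB = "[MASK]", ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def mlm_mask(tokens, mask_prob=0.15):
    """Select ~15% of positions as MLM targets and corrupt the inputs (80/10/10 rule)."""
    inputs, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                      # model must predict the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK                  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(VOCAB)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return inputs, targets

masked, labels = mlm_mask(["the", "cat", "sat", "on", "the", "mat"])
```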

Fine-tuning: add task-specific head; train end-to-end on labeled data.

GPT (Generative Pretrained Transformer)

Pretraining: causal language modeling — predict next token.

P(w_t | w_1, ..., w_{t-1})
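A minimal numpy sketch of the causal-LM training loss: each position predicts the next token, so logits and targets are shifted by one. Shapes and the random inputs are illustrative.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from positions <= t.

    logits: (seq_len, vocab) outputs of a causal model; token_ids: (seq_len,) input ids.
    """
    logits, targets = logits[:-1], token_ids[1:]             # position t predicts token t+1
    logits = logits - logits.max(axis=-1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

loss = next_token_loss(np.random.randn(10, 50), np.random.randint(0, 50, size=10))
```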

GPT-3 (2020, 175B parameters): few-shot learning — remarkable performance from just a few examples in the prompt. No gradient updates needed.
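A toy illustration of few-shot prompting (the task, reviews, and labels are made up): the examples live entirely in the prompt, and the model is asked to continue the pattern with no parameter updates.

```python
# Hypothetical few-shot prompt: the "learning" happens in the context window.
examples = [("great movie, loved it", "positive"),
            ("boring and too long", "negative")]
query = "the plot made no sense"

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"   # the model completes with a label
```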

GPT-4 (2023): multimodal (text + images); RLHF alignment.


Large Language Models (LLMs)

Scaling laws (Kaplan et al., 2020): performance improves predictably with model size N, data size D, compute C:

L(N) ∝ N^{-α}    -- loss vs. parameters
L(D) ∝ D^{-β}    -- loss vs. data

Chinchilla scaling (Hoffmann et al., 2022): for a fixed compute budget, model size and training data should be scaled in equal proportion (roughly 20 training tokens per parameter); most earlier LLMs were undertrained on data for their size.
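A rough sketch of compute-optimal allocation, assuming the common approximations of C ≈ 6·N·D training FLOPs and the ~20 tokens-per-parameter heuristic; the constants are approximate rules of thumb, not values from the chapter.

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a compute budget so parameters N and tokens D grow in equal proportion.

    Assumes training compute C ~ 6 * N * D and the ~20 tokens-per-parameter heuristic.
    """
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# e.g. a 1e24-FLOP budget -> roughly a 90B-parameter model trained on ~1.8T tokens
N, D = chinchilla_optimal(1e24)
```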

Modern LLMs: GPT-4, Claude 3/4, Gemini, Llama 3, Mistral.


Instruction Tuning and RLHF

Instruction tuning: fine-tune on (instruction, response) pairs → model follows instructions better.

RLHF (Reinforcement Learning from Human Feedback):
1. Collect human preference comparisons (response A vs. B)
2. Train a reward model R(x, y) from the preferences
3. Fine-tune the LLM with PPO to maximize R(x, y)

max_π E_x,y~π [R(x,y)] - β · KL(π || π_ref)

The KL penalty prevents the policy from deviating too far from the reference model (prevents mode collapse).
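A minimal sketch of the per-sequence objective above, using a sampled (Monte Carlo) estimate of the KL term from token log-probabilities; in practice this quantity is optimized with PPO, which is omitted here, and all inputs are illustrative.

```python
import numpy as np

def rlhf_objective(reward, logprobs_policy, logprobs_ref, beta=0.1):
    """KL-penalized reward for one sampled response y given prompt x.

    reward: scalar R(x, y) from the reward model.
    logprobs_*: per-token log-probabilities of y under the policy / reference model.
    """
    # sequence-level KL estimate for a sample from the policy: sum_t [log pi - log pi_ref]
    kl = float(np.sum(logprobs_policy) - np.sum(logprobs_ref))
    return reward - beta * kl        # maximized with respect to the policy parameters

obj = rlhf_objective(reward=1.3,
                     logprobs_policy=np.array([-0.2, -1.1, -0.7]),
                     logprobs_ref=np.array([-0.3, -1.0, -0.9]))
```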

DPO (Direct Preference Optimization): skip reward model; directly fine-tune from preferences.
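A sketch of the DPO loss for a single preference pair (chosen y_w, rejected y_l), assuming sequence log-probabilities under the policy and reference model are already available; the numbers in the usage line are placeholders.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one (chosen, rejected) pair."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log sigmoid(margin)

loss = dpo_loss(logp_chosen=-12.4, logp_rejected=-15.1,
                ref_logp_chosen=-13.0, ref_logp_rejected=-14.2)
```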


Retrieval-Augmented Generation (RAG)

LLM hallucinations → augment with retrieved relevant documents:
1. Query → retrieve relevant documents from a vector database
2. Prompt = query + retrieved docs → LLM generates answer

Retrieval components: DPR (Dense Passage Retrieval) for dense embeddings; FAISS for fast nearest-neighbor search over the vectors.
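A toy dense-retrieval sketch using cosine similarity over precomputed embeddings; in practice DPR-style encoders produce the vectors and FAISS handles nearest-neighbor search, so the random vectors and documents here are stand-ins.

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    """Return the k documents whose embeddings are most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                              # cosine similarity
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top]

def build_prompt(query, retrieved_docs):
    """Assemble the augmented prompt: retrieved context + the original question."""
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# usage (embeddings here are random placeholders for real encoder outputs)
docs = ["Doc about transformers.", "Doc about RNNs.", "Doc about retrieval."]
doc_vecs = np.random.randn(3, 8)
prompt = build_prompt("What replaced RNNs?", retrieve(np.random.randn(8), doc_vecs, docs, k=2))
```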


Context Window and Long-Range Attention

Standard attention: O(n²) in sequence length.

Efficient attention variants:
- Sparse attention (Longformer, BigBird): attend to a local window + global tokens → O(n) (see the mask sketch below)
- Flash attention: reorder computation for IO efficiency → same result, 4-8× faster
- Linear attention: kernel approximation → O(n)
- Mamba (SSM): state space models as an RNN-like alternative → O(n) scaling
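A small sketch of the sliding-window mask used by local/sparse attention (window size is illustrative): each position attends only to its nearest neighbors, so the number of attended pairs grows as O(n·w) rather than O(n²).

```python
import numpy as np

def sliding_window_mask(n, window=2):
    """Boolean (n x n) mask: position i may attend to j only if |i - j| <= window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(8, window=2)
# each row has at most 2*window + 1 True entries -> O(n * window) work instead of O(n^2)
```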


Connection to DynamICCL

LLM training IS the key workload that DynamICCL optimizes:
- Data parallelism: each GPU has a full model copy; NCCL AllReduce synchronizes gradients
- Tensor parallelism: model split across GPUs; NCCL AllReduce for attention/MLP layers
- Pipeline parallelism: layers distributed across GPUs; NCCL Send/Recv between stages
- Flash attention: reduces memory, but creates different NCCL communication patterns
- Larger models (GPT-4 scale) = more nodes = more NCCL overhead = bigger DynamICCL impact
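A minimal sketch of the data-parallel gradient AllReduce step using torch.distributed, which dispatches to NCCL on GPUs; this illustrates the communication pattern referred to above and is not DynamICCL's own API. It assumes a process group was already initialized with the NCCL backend and one GPU per rank.

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model):
    """Data parallelism: average gradients across all ranks after the backward pass.

    Each rank holds a full model replica; NCCL AllReduce sums per-parameter gradients,
    then dividing by world size gives every replica the same averaged update.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)   # NCCL AllReduce
            param.grad /= world_size

# typical training loop (assumes dist.init_process_group("nccl") was called earlier):
#   loss.backward()
#   allreduce_gradients(model)
#   optimizer.step()
```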