Deep Learning for NLP
Chapter 24, Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.), pp. 911–953
The Transformer Revolution
Pre-2017: RNNs/LSTMs dominated NLP. Problems: sequential computation (can’t parallelize), vanishing gradients, limited context window.
“Attention is All You Need” (Vaswani et al., 2017): replace RNNs entirely with self-attention → transformers.
Key benefits:
- Parallelizable: all positions computed simultaneously
- Long-range dependencies: direct attention from any position to any other
- Scalable: larger models → better performance (scaling laws)
Transformer Architecture (Recall)
- Encoder (e.g., BERT): bidirectional; processes the full input at once.
- Decoder (e.g., GPT): autoregressive; generates left-to-right.
- Encoder-decoder (e.g., T5, BART): for seq2seq tasks.
Position encoding: since attention is permutation-invariant, add positional information:
PE(pos, 2i) = sin(pos / 10000^{2i/d})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d})
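The two formulas above can be computed directly. A minimal sketch (the function name `sinusoidal_pe` is illustrative):

```python
import math

def sinusoidal_pe(pos: int, d: int) -> list[float]:
    """Sinusoidal position encoding for one position (Vaswani et al., 2017):
    even dimensions get sin(pos / 10000^(2i/d)), odd dimensions get cos."""
    pe = [0.0] * d
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        pe[2 * i] = math.sin(angle)      # even index: sine
        pe[2 * i + 1] = math.cos(angle)  # odd index: cosine
    return pe
```

At position 0 every sine term is 0 and every cosine term is 1; the wavelengths grow geometrically with the dimension index, so nearby positions get similar encodings while distant ones differ.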
Modern: rotary position encoding (RoPE), ALiBi, learned embeddings.
Pretraining and Fine-tuning
Transfer learning for NLP:
1. Pretrain on a large unlabeled corpus (self-supervised)
2. Fine-tune on a small labeled dataset for the specific task
This shifts the data burden from scarce labeled data to abundant unlabeled data.
BERT (Bidirectional Encoder Representations from Transformers; Devlin et al., 2018)
Pretraining objectives:
1. Masked Language Modeling (MLM): mask 15% of tokens; predict them from bidirectional context
2. Next Sentence Prediction (NSP): is sentence B the actual next sentence after A?
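The MLM corruption step can be sketched as follows, using the BERT paper's 80/10/10 replacement scheme for the selected 15% of positions (function name and toy vocabulary are illustrative):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "mat"]  # toy vocabulary for random replacement

def mlm_corrupt(tokens: list[str], p: float = 0.15, seed: int = 0):
    """Select ~p of positions as prediction targets; of those,
    80% -> [MASK], 10% -> random token, 10% left unchanged (BERT scheme)."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            targets[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)
            # else: keep the original token (model still predicts it)
    return corrupted, targets
```

The loss is computed only at the target positions, not over the whole sequence.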
Fine-tuning: add task-specific head; train end-to-end on labeled data.
GPT (Generative Pretrained Transformer)
Pretraining: causal language modeling — predict next token.
P(w_t | w_1, ..., w_{t-1})
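By the chain rule, the sequence probability factorizes as a product of these next-token conditionals. A minimal sketch, where `next_token_probs` stands in for any model (the uniform toy model is illustrative):

```python
import math

def sequence_logprob(tokens, next_token_probs):
    """Causal LM chain rule: log P(w_1..w_T) = sum_t log P(w_t | w_<t).
    next_token_probs(prefix) returns a dict mapping token -> probability."""
    total = 0.0
    for t in range(len(tokens)):
        probs = next_token_probs(tokens[:t])  # condition only on the prefix
        total += math.log(probs[tokens[t]])
    return total

# toy "model": uniform over a 4-token vocabulary, ignoring the prefix
uniform = lambda prefix: {w: 0.25 for w in ["a", "b", "c", "d"]}
```

Training minimizes the negative of this quantity over the corpus; generation samples from `next_token_probs` one token at a time.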
GPT-3 (2020, 175B parameters): few-shot learning — remarkable performance from just a few examples in the prompt. No gradient updates needed.
GPT-4 (2023): multimodal (text + images); RLHF alignment.
Large Language Models (LLMs)
Scaling laws (Kaplan et al., 2020): performance improves predictably with model size N, data size D, compute C:
L(N) ∝ N^{-α} -- loss vs. parameters
L(D) ∝ D^{-β} -- loss vs. data
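A power law like L(N) ∝ N^{-α} implies that doubling the parameter count shrinks the loss by a constant factor 2^{-α}. A sketch with α ≈ 0.076, the rough Kaplan et al. fit for the parameter exponent (the constant c is illustrative; only ratios matter):

```python
def power_law_loss(N: float, alpha: float = 0.076, c: float = 1.0) -> float:
    """Scaling-law sketch: L(N) = c * N^(-alpha).
    alpha ~ 0.076 is approximately the Kaplan et al. (2020) parameter exponent;
    c is an arbitrary illustrative constant, so only loss ratios are meaningful."""
    return c * N ** (-alpha)
```

With this exponent, doubling N multiplies the loss by 2^{-0.076} ≈ 0.95, i.e. each doubling of model size buys roughly a 5% loss reduction, which is why order-of-magnitude scale-ups are needed for large gains.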
Chinchilla scaling (Hoffmann et al., 2022): for a fixed compute budget, model size and training tokens should be scaled in roughly equal proportion (about 20 tokens per parameter); earlier large models were undertrained on data.
Modern LLMs: GPT-4, Claude 3/4, Gemini, Llama 3, Mistral.
Instruction Tuning and RLHF
Instruction tuning: fine-tune on (instruction, response) pairs → model follows instructions better.
RLHF (Reinforcement Learning from Human Feedback):
1. Collect human preference comparisons (response A vs. B)
2. Train a reward model R(x, y) from the preferences
3. Fine-tune the LLM with PPO to maximize R(x, y)
max_π E_x,y~π [R(x,y)] - β · KL(π || π_ref)
The KL penalty prevents the policy from deviating too far from the reference model (prevents mode collapse).
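For a toy discrete response set, the KL-penalized objective above can be computed directly. A sketch (function name and dict-based distributions are illustrative, not any RL library's API):

```python
import math

def kl_penalized_objective(policy, ref, reward, beta=0.1):
    """RLHF objective over a discrete response set:
    E_{y~policy}[R(y)] - beta * KL(policy || ref).
    policy/ref are dicts mapping responses to probabilities."""
    exp_reward = sum(policy[y] * reward[y] for y in policy)
    kl = sum(policy[y] * math.log(policy[y] / ref[y]) for y in policy)
    return exp_reward - beta * kl
```

When the policy equals the reference model the KL term vanishes and the objective is just the expected reward; as the policy concentrates on high-reward responses, the KL term grows and counteracts further drift.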
DPO (Direct Preference Optimization): skip the explicit reward model; fine-tune directly on preference pairs with a classification-style loss.
Retrieval-Augmented Generation (RAG)
LLM hallucinations → augment generation with retrieved relevant documents:
1. Query → retrieve relevant documents from a vector database
2. Prompt = query + retrieved docs → LLM generates a grounded answer
Retrieval components: DPR (Dense Passage Retrieval) provides learned query/document embeddings; FAISS provides fast (approximate) nearest-neighbor search over them.
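The retrieval step reduces to nearest-neighbor search in embedding space. A brute-force cosine-similarity sketch of what FAISS does at scale with approximate indexes (the tiny 2-d vectors are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_emb, doc_embs, k=2):
    """Return the ids of the k documents most similar to the query embedding.
    doc_embs maps doc id -> embedding; real systems use an ANN index instead."""
    ranked = sorted(doc_embs, key=lambda d: cosine(query_emb, doc_embs[d]),
                    reverse=True)
    return ranked[:k]
```

The retrieved documents are then concatenated into the prompt ahead of the query, so the LLM can condition its answer on them.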
Context Window and Long-Range Attention
Standard attention: O(n²) time and memory in sequence length n.
Efficient attention variants:
- Sparse attention (Longformer, BigBird): attend to a local window + a few global tokens → O(n)
- FlashAttention: reorders the computation for IO efficiency → exact same result, several times faster in practice
- Linear attention: kernel approximation of softmax → O(n)
- Mamba (SSM): state-space models as an RNN-like alternative → O(n) scaling
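The sparse-attention idea can be made concrete with a boolean mask: each position attends to its local window plus a few designated global tokens, so the number of allowed pairs grows linearly in n rather than quadratically. A Longformer-style sketch (parameter names are illustrative):

```python
def sparse_mask(n: int, window: int = 1, global_tokens=(0,)):
    """mask[i][j] is True iff position i may attend to position j:
    within the local window |i-j| <= window, or if i or j is a global token."""
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if abs(i - j) <= window or i in global_tokens or j in global_tokens:
                mask[i][j] = True
    return mask
```

Per non-global row the number of True entries is bounded by the window size plus the global-token count, independent of n, which is where the O(n) total cost comes from.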
Connection to DynamICCL
LLM training IS the key workload that DynamICCL optimizes:
- Data parallelism: each GPU holds a full model copy; NCCL AllReduce synchronizes gradients
- Tensor parallelism: model split across GPUs; NCCL AllReduce inside attention/MLP layers
- Pipeline parallelism: layers distributed across GPUs; NCCL Send/Recv between stages
- Flash attention: reduces memory, but creates different NCCL communication patterns
- Larger models (GPT-4 scale) = more nodes = more NCCL overhead = bigger DynamICCL impact