Natural Language Processing
Chapter 23, Natural Language Processing. Source: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.), pp. 868–910.
Language as a Formal System
Natural language is ambiguous, context-dependent, and compositional. NLP aims to understand and generate language computationally.
Core tasks:
- Language modeling: P(word | context)
- Parsing: syntactic structure of sentences
- Information extraction: extract structured facts
- Machine translation: translate between languages
- Question answering: answer questions given text
- Sentiment analysis: classify opinion/emotion
Language Models
A language model assigns probabilities to sequences:
P(w₁, w₂, ..., wₙ) = Π P(wₜ | w₁:ₜ₋₁)
N-gram model: approximate by fixed-length context:
P(wₜ | w₁:ₜ₋₁) ≈ P(wₜ | wₜ₋ₙ₊₁:ₜ₋₁)
Perplexity: the inverse probability of held-out text, normalized per word (a geometric mean); lower perplexity means the model predicts the text better:
PP(W) = P(w₁,...,wₙ)^{-1/n} = 2^{H(W)}
Smoothing: handle unseen n-grams (Laplace, Kneser-Ney, Good-Turing).
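A minimal sketch of these ideas, assuming a toy corpus and a bigram (n = 2) model: counts are collected from text, Laplace (add-one) smoothing keeps unseen bigrams from getting zero probability, and perplexity is computed as 2 raised to the per-word cross-entropy, matching the formula above.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect unigram and bigram counts, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams, vocab_size):
    # Laplace smoothing: add 1 to every count so unseen bigrams get P > 0.
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

def perplexity(sentence, unigrams, bigrams):
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = sum(
        math.log2(bigram_prob(prev, w, unigrams, bigrams, vocab_size))
        for prev, w in zip(tokens, tokens[1:])
    )
    n = len(tokens) - 1              # number of predicted tokens
    return 2 ** (-log_prob / n)      # PP = 2^(cross-entropy per word)

corpus = ["the dog chases the cat", "the cat chases the dog"]
uni, bi = train_bigram(corpus)
print(perplexity("the dog chases the cat", uni, bi))
```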
Word Representations
One-Hot Encoding
Each word = a |V|-dimensional vector with a single 1 (V = vocabulary); all distinct words are orthogonal, so the representation encodes no similarity between words.
Word Embeddings (Word2Vec, GloVe)
Dense low-dimensional representations learned from co-occurrence statistics:
- Words with similar meanings → nearby embeddings
- Arithmetic analogies: King - Man + Woman ≈ Queen
Word2Vec (Mikolov et al., 2013):
- CBOW: predict center word from context
- Skip-gram: predict context words from center
- Trained with negative sampling
GloVe: factorize co-occurrence matrix.
Evaluation: word similarity benchmarks, analogy tests.
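A minimal sketch of the analogy test, assuming made-up 4-dimensional toy vectors (real Word2Vec/GloVe embeddings are typically 100-300 dimensional): the answer to "man : king :: woman : ?" is the word whose vector is closest, by cosine similarity, to king - man + woman.

```python
import numpy as np

# Hypothetical toy embeddings, chosen only so the analogy works out.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.8, 0.2]),
    "man":   np.array([0.1, 0.9, 0.1, 0.3]),
    "woman": np.array([0.1, 0.1, 0.9, 0.3]),
    "dog":   np.array([0.2, 0.3, 0.3, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, exclude=()):
    # Solve a : b :: c : ?  by finding the word closest to b - a + c.
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("man", "king", "woman", exclude={"king", "man", "woman"}))  # queen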
Syntactic Parsing
Context-Free Grammar (CFG):
S → NP VP
NP → Det N | N
VP → V NP | V
Det → "the" | "a"
N → "dog" | "cat"
V → "chases"
CYK algorithm: O(n³) dynamic programming parser for CNF grammars.
Probabilistic CFG (PCFG): rule probabilities P(α → β); parse with Viterbi to find most probable parse.
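A minimal sketch of probabilistic CKY (the Viterbi parse of a PCFG), assuming the toy grammar above converted to Chomsky normal form and made-up rule probabilities: the table entry for a span stores the best log-probability of each nonterminal covering it, built bottom-up over span lengths in O(n³).

```python
import math
from collections import defaultdict

# Lexical rules (nonterminal -> word) with assumed probabilities.
lexicon = {
    ("Det", "the"): 0.5, ("Det", "a"): 0.5,
    ("N", "dog"): 0.5,   ("N", "cat"): 0.5,
    ("V", "chases"): 1.0,
    ("NP", "dog"): 0.15, ("NP", "cat"): 0.15,   # CNF stand-in for NP -> N
    ("VP", "chases"): 0.4,                      # CNF stand-in for VP -> V
}
# Binary rules (parent -> left right) with assumed probabilities.
binary = {
    ("S", "NP", "VP"): 1.0,
    ("NP", "Det", "N"): 0.7,
    ("VP", "V", "NP"): 0.6,
}

def cky_parse(words):
    n = len(words)
    best = defaultdict(dict)   # best[(i, j)][X] = max log P of X over words[i:j]
    back = {}                  # backpointers for recovering the parse tree
    for i, w in enumerate(words):
        for (X, word), p in lexicon.items():
            if word == w:
                best[(i, i + 1)][X] = math.log(p)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (X, Y, Z), p in binary.items():
                    if Y in best[(i, k)] and Z in best[(k, j)]:
                        score = math.log(p) + best[(i, k)][Y] + best[(k, j)][Z]
                        if score > best[(i, j)].get(X, float("-inf")):
                            best[(i, j)][X] = score
                            back[(i, j, X)] = (k, Y, Z)
    return best[(0, n)].get("S"), back

score, back = cky_parse("the dog chases the cat".split())
print(math.exp(score))   # probability of the most likely S parse
```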
Dependency parsing: arcs from head to dependent; transition-based algorithms (shift-reduce variants such as arc-standard and arc-eager) parse in linear time, O(n).
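A minimal sketch of arc-standard shift-reduce dependency parsing, driven here by a static oracle computed from gold heads (a trained parser would instead pick SHIFT / LEFT-ARC / RIGHT-ARC with a classifier). The example sentence and its head indices are assumptions for illustration; heads are 1-indexed with 0 = ROOT, and the tree must be projective for this oracle to succeed.

```python
def arc_standard(words, gold_heads):
    stack, buffer = [0], list(range(1, len(words) + 1))   # 0 is ROOT
    arcs = {}                                             # dependent -> head
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            s2, s1 = stack[-2], stack[-1]
            # LEFT-ARC: s1 is the gold head of s2 (and s2 is not ROOT).
            if s2 != 0 and gold_heads[s2] == s1:
                arcs[s2] = s1
                stack.pop(-2)
                continue
            # RIGHT-ARC: s2 is the gold head of s1 and s1 already has all its dependents.
            deps_done = all(d in arcs for d, h in gold_heads.items() if h == s1)
            if gold_heads[s1] == s2 and deps_done:
                arcs[s1] = s2
                stack.pop()
                continue
        stack.append(buffer.pop(0))                       # SHIFT
    return arcs

words = ["the", "dog", "chases", "the", "cat"]
heads = {1: 2, 2: 3, 3: 0, 4: 5, 5: 3}   # the->dog, dog->chases, chases->ROOT, ...
print(arc_standard(words, heads))        # recovers the gold arcs
```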
Sequence Models for NLP
RNNs process sequences left-to-right; limited by vanishing gradients.
LSTMs use gating to preserve gradients and handle longer dependencies.
Attention mechanism (pre-Transformer): allows model to focus on relevant input positions when producing each output.
context_t = Σ α_{t,s} · h_s
α_{t,s} = softmax_s(score(q_t, k_s))   (normalized over source positions s)
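A minimal sketch of the two equations above, assuming a dot-product score and random illustrative vectors: each decoder query scores every encoder hidden state, softmax turns the scores into weights α_{t,s}, and the context is the weighted sum of encoder states.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(query, keys, values):
    # score(q, k_s) is a dot product here; additive (Bahdanau) or scaled
    # dot-product scores plug into the same structure.
    scores = keys @ query                 # one score per source position s
    weights = softmax(scores)             # alpha_{t,s}, sums to 1 over s
    context = weights @ values            # sum_s alpha_{t,s} * h_s
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))  # S = 5 source positions, d = 8
query = rng.normal(size=8)                # decoder state at step t
context, weights = attention(query, encoder_states, encoder_states)
print(weights.round(3), context.shape)
```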
Machine Translation (seq2seq)
Encoder-decoder architecture:
1. Encoder: compress source sentence into context vector
2. Decoder: generate target sentence token by token
Attention in MT (Bahdanau, 2015): instead of fixed context, decoder attends to all encoder states.
BLEU score: precision-based n-gram overlap metric for MT evaluation.
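A minimal sketch of sentence-level BLEU with a single reference, under simplifying assumptions (no smoothing, no corpus-level aggregation; real tools such as sacrebleu handle both): clipped n-gram precision for n = 1..4, combined by a geometric mean and multiplied by a brevity penalty.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        if matched == 0:
            return 0.0                     # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat is on mat", "the cat is on the mat"))  # ≈ 0.58
```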
Information Extraction
- Named Entity Recognition (NER): identify person, organization, location mentions
- Relation extraction: (Apple, founder-of, Steve Jobs)
- Coreference resolution: “John said he was tired” → he = John
Typically modeled as sequence labeling (IOB tagging) with neural models.
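A minimal sketch of IOB (BIO) tagging for NER, with an illustrative sentence and hand-assigned tags: each token gets a B-/I-/O tag, and contiguous B-/I- runs are decoded back into entity spans.

```python
tokens = ["Steve", "Jobs", "founded", "Apple", "in", "Cupertino", "."]
tags   = ["B-PER", "I-PER", "O",       "B-ORG", "O",  "B-LOC",     "O"]

def decode_spans(tokens, tags):
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])          # start a new entity
        elif tag.startswith("I-") and current:
            current[1].append(token)              # continue the current entity
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

print(decode_spans(tokens, tags))
# [('PER', 'Steve Jobs'), ('ORG', 'Apple'), ('LOC', 'Cupertino')]
```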
Sentiment Analysis
Classify text as positive/negative/neutral.
- Lexicon-based: SentiWordNet, VADER
- Neural: fine-tune language model on labeled data
Aspect-based: “The food was great but the service was slow” → (food, +) and (service, -)
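A minimal sketch of the lexicon-based approach, assuming a tiny hand-made word-polarity lexicon (real lexicons such as SentiWordNet or VADER are far larger and also handle negation and intensifiers): sum per-word scores and threshold the total.

```python
# Hypothetical toy lexicon: word -> polarity score.
lexicon = {"great": 1.0, "good": 0.8, "slow": -0.6, "terrible": -1.0}

def sentiment(text):
    score = sum(lexicon.get(w, 0.0) for w in text.lower().split())
    if score > 0.1:
        return "positive"
    if score < -0.1:
        return "negative"
    return "neutral"

print(sentiment("The food was great"))      # positive
print(sentiment("The service was slow"))    # negative
```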
Connection to DynamICCL
NLP is not directly relevant to NCCL optimization, but: - Log parsing: NLP techniques extract patterns from distributed training logs - RLHF: alignment technique that combines RL with human feedback — same RL infrastructure as DynamICCL - Language model training IS the workload being optimized by NCCL (LLM training uses NCCL for tensor parallelism + data parallelism)