Natural Language Processing
Chapter 23, Natural Language Processing. Source: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.), pp. 868–910.
Language as a Formal System
Natural language is ambiguous, context-dependent, and compositional. NLP aims to understand and generate language computationally.
Core tasks:
- Language modeling: P(word | context)
- Parsing: syntactic structure of sentences
- Information extraction: extract structured facts
- Machine translation: translate between languages
- Question answering: answer questions given text
- Sentiment analysis: classify opinion/emotion
Language Models
A language model assigns probabilities to sequences:
P(w₁, w₂, ..., wₙ) = Π P(wₜ | w₁:ₜ₋₁)
N-gram model: approximate by fixed-length context:
P(wₜ | w₁:ₜ₋₁) ≈ P(wₜ | wₜ₋ₙ₊₁:ₜ₋₁)
Perplexity: the inverse probability of held-out text, normalized per word (a geometric mean); lower perplexity means the model predicts the text better:
PP(W) = P(w₁,...,wₙ)^{-1/n} = 2^{H(W)}
Smoothing: handle unseen n-grams (Laplace, Kneser-Ney, Good-Turing).
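A minimal sketch of these ideas, assuming a toy corpus and a bigram (n = 2) model: counts are collected from text, Laplace (add-one) smoothing keeps unseen bigrams from getting zero probability, and perplexity is computed as 2 raised to the per-word cross-entropy, matching the formula above.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect unigram and bigram counts, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams, vocab_size):
    # Laplace smoothing: add 1 to every count so unseen bigrams get P > 0.
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

def perplexity(sentence, unigrams, bigrams):
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = sum(
        math.log2(bigram_prob(prev, w, unigrams, bigrams, vocab_size))
        for prev, w in zip(tokens, tokens[1:])
    )
    n = len(tokens) - 1              # number of predicted tokens
    return 2 ** (-log_prob / n)      # PP = 2^(cross-entropy per word)

corpus = ["the dog chases the cat", "the cat chases the dog"]
uni, bi = train_bigram(corpus)
print(perplexity("the dog chases the cat", uni, bi))
```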
Word Representations
One-Hot Encoding
Each word = a |V|-dimensional vector with a single 1 (V = vocabulary); all distinct words are orthogonal, so the representation encodes no similarity between words.
Word Embeddings (Word2Vec, GloVe)
Dense low-dimensional representations learned from co-occurrence statistics:
- Words with similar meanings → nearby embeddings
- Arithmetic analogies: King - Man + Woman ≈ Queen
Word2Vec (Mikolov et al., 2013):
- CBOW: predict center word from context
- Skip-gram: predict context words from center
- Trained with negative sampling
GloVe: factorize co-occurrence matrix.
Evaluation: word similarity benchmarks, analogy tests.
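A minimal sketch of the analogy test, assuming made-up 4-dimensional toy vectors (real Word2Vec/GloVe embeddings are typically 100-300 dimensional): the answer to "man : king :: woman : ?" is the word whose vector is closest, by cosine similarity, to king - man + woman.

```python
import numpy as np

# Hypothetical toy embeddings, chosen only so the analogy works out.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.8, 0.2]),
    "man":   np.array([0.1, 0.9, 0.1, 0.3]),
    "woman": np.array([0.1, 0.1, 0.9, 0.3]),
    "dog":   np.array([0.2, 0.3, 0.3, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, exclude=()):
    # Solve a : b :: c : ?  by finding the word closest to b - a + c.
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("man", "king", "woman", exclude={"king", "man", "woman"}))  # queen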
Syntactic Parsing
Context-Free Grammar (CFG):
S → NP VP
NP → Det N | N
VP → V NP | V
Det → "the" | "a"
N → "dog" | "cat"
V → "chases"
CYK algorithm: O(n³) dynamic programming parser for CNF grammars.
Probabilistic CFG (PCFG): rule probabilities P(α → β); parse with Viterbi to find most probable parse.
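A minimal sketch of probabilistic CKY (the Viterbi parse of a PCFG), assuming the toy grammar above converted to Chomsky normal form and made-up rule probabilities: the table entry for a span stores the best log-probability of each nonterminal covering it, built bottom-up over span lengths in O(n³).

```python
import math
from collections import defaultdict

# Lexical rules (nonterminal -> word) with assumed probabilities.
lexicon = {
    ("Det", "the"): 0.5, ("Det", "a"): 0.5,
    ("N", "dog"): 0.5,   ("N", "cat"): 0.5,
    ("V", "chases"): 1.0,
    ("NP", "dog"): 0.15, ("NP", "cat"): 0.15,   # CNF stand-in for NP -> N
    ("VP", "chases"): 0.4,                      # CNF stand-in for VP -> V
}
# Binary rules (parent -> left right) with assumed probabilities.
binary = {
    ("S", "NP", "VP"): 1.0,
    ("NP", "Det", "N"): 0.7,
    ("VP", "V", "NP"): 0.6,
}

def cky_parse(words):
    n = len(words)
    best = defaultdict(dict)   # best[(i, j)][X] = max log P of X over words[i:j]
    back = {}                  # backpointers for recovering the parse tree
    for i, w in enumerate(words):
        for (X, word), p in lexicon.items():
            if word == w:
                best[(i, i + 1)][X] = math.log(p)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (X, Y, Z), p in binary.items():
                    if Y in best[(i, k)] and Z in best[(k, j)]:
                        score = math.log(p) + best[(i, k)][Y] + best[(k, j)][Z]
                        if score > best[(i, j)].get(X, float("-inf")):
                            best[(i, j)][X] = score
                            back[(i, j, X)] = (k, Y, Z)
    return best[(0, n)].get("S"), back

score, back = cky_parse("the dog chases the cat".split())
print(math.exp(score))   # probability of the most likely S parse
```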
Dependency parsing: arcs from head to dependent; transition-based algorithms (shift-reduce variants such as arc-standard and arc-eager) parse in linear time, O(n).
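A minimal sketch of arc-standard shift-reduce dependency parsing, driven here by a static oracle computed from gold heads (a trained parser would instead pick SHIFT / LEFT-ARC / RIGHT-ARC with a classifier). The example sentence and its head indices are assumptions for illustration; heads are 1-indexed with 0 = ROOT, and the tree must be projective for this oracle to succeed.

```python
def arc_standard(words, gold_heads):
    stack, buffer = [0], list(range(1, len(words) + 1))   # 0 is ROOT
    arcs = {}                                             # dependent -> head
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            s2, s1 = stack[-2], stack[-1]
            # LEFT-ARC: s1 is the gold head of s2 (and s2 is not ROOT).
            if s2 != 0 and gold_heads[s2] == s1:
                arcs[s2] = s1
                stack.pop(-2)
                continue
            # RIGHT-ARC: s2 is the gold head of s1 and s1 already has all its dependents.
            deps_done = all(d in arcs for d, h in gold_heads.items() if h == s1)
            if gold_heads[s1] == s2 and deps_done:
                arcs[s1] = s2
                stack.pop()
                continue
        stack.append(buffer.pop(0))                       # SHIFT
    return arcs

words = ["the", "dog", "chases", "the", "cat"]
heads = {1: 2, 2: 3, 3: 0, 4: 5, 5: 3}   # the->dog, dog->chases, chases->ROOT, ...
print(arc_standard(words, heads))        # recovers the gold arcs
```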
Sequence Models for NLP
RNNs process sequences left-to-right; limited by vanishing gradients.
LSTMs use gating to preserve gradients and handle longer dependencies.
Attention mechanism (pre-Transformer): allows model to focus on relevant input positions when producing each output.
context_t = Σ α_{t,s} · h_s
α_{t,s} = softmax_s(score(q_t, k_s))   (normalized over source positions s)
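A minimal sketch of the two equations above, assuming a dot-product score and random illustrative vectors: each decoder query scores every encoder hidden state, softmax turns the scores into weights α_{t,s}, and the context is the weighted sum of encoder states.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(query, keys, values):
    # score(q, k_s) is a dot product here; additive (Bahdanau) or scaled
    # dot-product scores plug into the same structure.
    scores = keys @ query                 # one score per source position s
    weights = softmax(scores)             # alpha_{t,s}, sums to 1 over s
    context = weights @ values            # sum_s alpha_{t,s} * h_s
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))  # S = 5 source positions, d = 8
query = rng.normal(size=8)                # decoder state at step t
context, weights = attention(query, encoder_states, encoder_states)
print(weights.round(3), context.shape)
```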
Machine Translation (seq2seq)
Encoder-decoder architecture:
1. Encoder: compress source sentence into context vector
2. Decoder: generate target sentence token by token
Attention in MT (Bahdanau, 2015): instead of fixed context, decoder attends to all encoder states.
BLEU score: precision-based n-gram overlap metric for MT evaluation.
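A minimal sketch of sentence-level BLEU with a single reference, under simplifying assumptions (no smoothing, no corpus-level aggregation; real tools such as sacrebleu handle both): clipped n-gram precision for n = 1..4, combined by a geometric mean and multiplied by a brevity penalty.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        if matched == 0:
            return 0.0                     # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat is on mat", "the cat is on the mat"))  # ≈ 0.58
```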
Information Extraction
- Named Entity Recognition (NER): identify person, organization, location mentions
- Relation extraction: (Apple, founder-of, Steve Jobs)
- Coreference resolution: “John said he was tired” → he = John
Typically modeled as sequence labeling (IOB tagging) with neural models.
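A minimal sketch of IOB (BIO) tagging for NER, with an illustrative sentence and hand-assigned tags: each token gets a B-/I-/O tag, and contiguous B-/I- runs are decoded back into entity spans.

```python
tokens = ["Steve", "Jobs", "founded", "Apple", "in", "Cupertino", "."]
tags   = ["B-PER", "I-PER", "O",       "B-ORG", "O",  "B-LOC",     "O"]

def decode_spans(tokens, tags):
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])          # start a new entity
        elif tag.startswith("I-") and current:
            current[1].append(token)              # continue the current entity
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

print(decode_spans(tokens, tags))
# [('PER', 'Steve Jobs'), ('ORG', 'Apple'), ('LOC', 'Cupertino')]
```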
Sentiment Analysis
Classify text as positive/negative/neutral.
- Lexicon-based: SentiWordNet, VADER
- Neural: fine-tune language model on labeled data
Aspect-based: “The food was great but the service was slow” → (food, +) and (service, -)
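A minimal sketch of the lexicon-based approach, assuming a tiny hand-made word-polarity lexicon (real lexicons such as SentiWordNet or VADER are far larger and also handle negation and intensifiers): sum per-word scores and threshold the total.

```python
# Hypothetical toy lexicon: word -> polarity score.
lexicon = {"great": 1.0, "good": 0.8, "slow": -0.6, "terrible": -1.0}

def sentiment(text):
    score = sum(lexicon.get(w, 0.0) for w in text.lower().split())
    if score > 0.1:
        return "positive"
    if score < -0.1:
        return "negative"
    return "neutral"

print(sentiment("The food was great"))      # positive
print(sentiment("The service was slow"))    # negative
```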
Connection to DynamICCL
NLP is not directly relevant to NCCL optimization, but: - Log parsing: NLP techniques extract patterns from distributed training logs - RLHF: alignment technique that combines RL with human feedback — same RL infrastructure as DynamICCL - Language model training IS the workload being optimized by NCCL (LLM training uses NCCL for tensor parallelism + data parallelism)