Computer Vision
Chapter 25 — Computer Vision
Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.), pages 954–1003
The Computer Vision Problem
Transform raw pixel data (images, video) into high-level understanding:
- Image classification: what is in this image?
- Object detection: where are objects? (bounding boxes + class labels)
- Segmentation: pixel-level labeling
- 3D reconstruction: infer 3D structure from 2D images
- Image generation: synthesize realistic images
Image Representation
Pixel grid: W × H × C where C=3 (RGB), values 0-255.
Preprocessing: normalize to [0,1] or [-1,1]; resize to standard size.
Data augmentation: random crop, flip, color jitter, rotation → increases the effective size of the training set.
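A minimal preprocessing + augmentation pipeline, assuming PyTorch/torchvision; the transform choices and sizes are illustrative, not taken from the text:

```python
import torchvision.transforms as T

# Typical ImageNet-style training pipeline: augment, convert to tensor, normalize.
train_transform = T.Compose([
    T.RandomResizedCrop(224),              # random crop, resized to a standard size
    T.RandomHorizontalFlip(),              # random flip
    T.ColorJitter(0.4, 0.4, 0.4),          # color jitter (brightness, contrast, saturation)
    T.ToTensor(),                          # HWC uint8 in [0, 255] -> CHW float in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],  # per-channel ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])
```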
CNN Architectures for Vision
LeNet (1998): first practical CNN
Conv → Pool → Conv → Pool → Dense → Output
AlexNet (2012): deep learning breakthrough
8 layers; ReLU; Dropout; data augmentation; GPUs. Top-5 error on ImageNet: 15% (vs 26% for hand-crafted features).
VGG (2014): deeper with small 3×3 filters
16-19 layers; 3×3 convs only; very regular structure.
ResNet (2015): skip connections
y = F(x, {Wᵢ}) + x -- residual block
Up to 152 layers. Skip connections let gradients flow directly through the identity path, mitigating the vanishing-gradient problem.
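A minimal residual block matching y = F(x) + x, sketched in PyTorch; the channel count and exact layer choices are illustrative (keeping input and output channels equal keeps the identity shortcut simple):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = F(x) + x, with F = conv-BN-ReLU-conv-BN."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                              # identity shortcut
        out = self.relu(self.bn1(self.conv1(x)))  # F(x): conv-BN-ReLU
        out = self.bn2(self.conv2(out))           # conv-BN
        return self.relu(out + residual)          # add the skip connection, then ReLU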
Modern: EfficientNet, ConvNeXt, ViT (Vision Transformer)
ViT (2020): split image into patches → treat as sequence → apply transformer. Outperforms CNNs at large scale; requires more data.
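A sketch of the patch-to-sequence step ViT relies on, in PyTorch; the image size, patch size, and embedding dimension are illustrative (a strided convolution extracts and projects each patch in one step):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Conv with stride == kernel size: one output position per patch.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)     # (B, 196, 768): a sequence of patch tokens
```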
Object Detection
Region-Based: R-CNN family
- R-CNN: selective search proposals → CNN features → SVM classifier + regressor
- Fast R-CNN: share CNN computation; RoI pooling
- Faster R-CNN: Region Proposal Network (RPN) replaces selective search → end-to-end
Single-Shot: YOLO, SSD
Process the image once; predict all boxes + classes simultaneously.
- Much faster (real-time detection)
- Slightly lower accuracy for small objects
Anchor-free: FCOS, CenterNet
Predict object center + size directly without predefined anchors.
Semantic and Instance Segmentation
Semantic segmentation: label each pixel (no instance separation).
- FCN: fully convolutional network; upsampling via transposed convolution (sketched below)
- U-Net: encoder-decoder with skip connections; great for medical imaging
- DeepLab: atrous/dilated convolutions for multiscale context
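A sketch of FCN-style transposed-convolution upsampling, assuming PyTorch; the class count, feature sizes, and stride are illustrative:

```python
import torch
import torch.nn as nn

num_classes = 21                                # illustrative, e.g. PASCAL VOC-style labels
# A 1x1 conv maps encoder features to per-class scores; a transposed conv upsamples
# them back toward input resolution so every pixel gets a class prediction.
head = nn.Sequential(
    nn.Conv2d(512, num_classes, kernel_size=1),
    nn.ConvTranspose2d(num_classes, num_classes, kernel_size=16, stride=8, padding=4),
)

features = torch.randn(1, 512, 28, 28)          # encoder output at 1/8 input resolution
logits = head(features)                         # (1, 21, 224, 224): per-pixel class scores
```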
Instance segmentation: label each pixel AND separate individual instances.
- Mask R-CNN: Faster R-CNN + a per-RoI mask branch; cost grows roughly linearly with the number of detected instances
Panoptic segmentation: combines semantic + instance.
3D Vision
Stereo vision: two cameras → disparity → depth.
Structure from Motion (SfM): multiple views → 3D point cloud + camera poses.
SLAM (Simultaneous Localization and Mapping): real-time 3D reconstruction while moving.
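For a rectified stereo pair, the disparity-to-depth relation is Z = f·B/d (focal length f, baseline B, disparity d). A small sketch with illustrative numbers:

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair.

    disparity_px: horizontal shift between left and right images (pixels)
    focal_px:     focal length expressed in pixels
    baseline_m:   distance between the two camera centers (meters)
    """
    disparity_px = np.asarray(disparity_px, dtype=float)
    return focal_px * baseline_m / np.maximum(disparity_px, 1e-6)  # avoid divide-by-zero

# Example: 720-pixel focal length, 12 cm baseline, 24-pixel disparity -> 3.6 m away.
print(depth_from_disparity(24.0, focal_px=720.0, baseline_m=0.12))
```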
Depth estimation from a single image: relies on priors learned from training data.
NeRF (Neural Radiance Fields, 2020): neural network represents 3D scene as implicit function; synthesize novel views.
Image Generation
VAEs: generate blurry images (due to pixel-wise loss).
GANs: sharp images; mode collapse problem; hard to train.
Diffusion models (2020–present):
- Forward process: gradually add Gaussian noise (a sketch follows this list)
- Reverse process: learn to denoise
- Guidance: text-conditioned with CLIP embeddings
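A sketch of the DDPM-style forward noising step, x_t = √(ᾱ_t)·x_0 + √(1−ᾱ_t)·ε; the schedule and step count are illustrative:

```python
import torch

num_steps = 1000                                     # number of diffusion steps (illustrative)
betas = torch.linspace(1e-4, 0.02, num_steps)        # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # cumulative products, i.e. ᾱ_t

def q_sample(x0, t, noise=None):
    """Forward process: sample x_t by mixing the clean image x0 with Gaussian noise."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]                        # scalar ᾱ_t for this timestep
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.randn(1, 3, 64, 64)                       # stand-in for a clean image
x_noisy = q_sample(x0, t=500)                        # heavily noised version at step 500
```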
DALL-E 2, Stable Diffusion, Midjourney: text-to-image via diffusion.
CLIP (Contrastive Language-Image Pretraining)
Train an image encoder and a text encoder jointly so that matching (image, text) pairs score high and mismatched pairs score low:
Loss = -Σᵢ log( exp(sim(iᵢ, tᵢ)/τ) / Σⱼ exp(sim(iᵢ, tⱼ)/τ) )
where τ is a learned temperature; in practice the symmetric text→image term is added as well.
Result: zero-shot image classification — classify by comparing to text descriptions.
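A sketch of this contrastive loss, assuming PyTorch; it averages the image→text and text→image directions as CLIP does, and the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """InfoNCE over a batch where row i of image_emb matches row i of text_emb."""
    image_emb = F.normalize(image_emb, dim=-1)        # cosine similarity = dot of unit vectors
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (N, N) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)  # matching pairs on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```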
Connection to DynamICCL
Computer vision is not directly related to NCCL, but:
- CNN training is a major DynamICCL workload (ResNet, ViT training on GPU clusters)
- Different model architectures have different communication patterns:
  - Data-parallel CNN: AllReduce of all gradients after each batch (see the sketch below)
  - Distributed ViT with tensor parallelism: more complex AllReduce patterns per attention layer
- The communication volume scales with model size and batch size — DynamICCL must adapt
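A minimal sketch of the data-parallel gradient AllReduce pattern mentioned above, assuming PyTorch with an already-initialized NCCL process group; DynamICCL itself is not shown, this is only the communication pattern it would service:

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model):
    """Data-parallel step: after the local backward pass, average gradients across ranks.

    Each AllReduce moves a tensor the size of the corresponding parameter, so total
    communication volume per step scales with model size (and frequency with batch rate).
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum over ranks (e.g. via NCCL)
            param.grad /= world_size                           # sum -> mean
```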