Computer Vision

Chapter 25 — Computer Vision
Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.)
Pages: 954–1003


The Computer Vision Problem

Transform raw pixel data (images, video) into high-level understanding:

- Image classification: what is in this image?
- Object detection: where are objects? (bounding boxes + class labels)
- Segmentation: pixel-level labeling
- 3D reconstruction: infer 3D structure from 2D images
- Image generation: synthesize realistic images


Image Representation

Pixel grid: W × H × C where C=3 (RGB), values 0-255.

Preprocessing: normalize to [0,1] or [-1,1]; resize to standard size.

Data augmentation: random crop, flip, color jitter, rotation → increases effective training set.
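A minimal sketch of this preprocessing + augmentation pipeline using torchvision; the crop size, jitter strengths, and rotation range are illustrative choices, not fixed recommendations:

    import torchvision.transforms as T

    # Illustrative augmentation values; tune per dataset.
    train_transforms = T.Compose([
        T.RandomResizedCrop(224),        # random crop, resized to a standard size
        T.RandomHorizontalFlip(),        # random flip
        T.ColorJitter(0.4, 0.4, 0.4),    # color jitter
        T.RandomRotation(15),            # small random rotation
        T.ToTensor(),                    # HWC uint8 [0, 255] -> CHW float [0, 1]
        T.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet statistics
                    std=[0.229, 0.224, 0.225]),
    ])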


CNN Architectures for Vision

LeNet (1998): first practical CNN

Conv → Pool → Conv → Pool → Dense → Output
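A compact PyTorch sketch of this Conv → Pool → Conv → Pool → Dense pattern; the layer sizes follow the classic LeNet-5 for 32×32 grayscale input, but treat them as illustrative:

    import torch
    import torch.nn as nn

    # LeNet-style network for 32x32 grayscale input (e.g., padded MNIST).
    lenet = nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5),   # -> 6 x 28 x 28
        nn.Tanh(),
        nn.AvgPool2d(2),                  # -> 6 x 14 x 14
        nn.Conv2d(6, 16, kernel_size=5),  # -> 16 x 10 x 10
        nn.Tanh(),
        nn.AvgPool2d(2),                  # -> 16 x 5 x 5
        nn.Flatten(),
        nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
        nn.Linear(120, 84), nn.Tanh(),
        nn.Linear(84, 10),                # 10-class output
    )

    logits = lenet(torch.randn(1, 1, 32, 32))  # shape: (1, 10)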

AlexNet (2012): deep learning breakthrough

8 layers; ReLU activations; dropout; data augmentation; trained on GPUs. Top-5 error on ImageNet: ~15% (vs ~26% for the best hand-crafted-feature pipeline).

VGG (2014): deeper with small 3×3 filters

16-19 layers; 3×3 convs only; very regular structure.

ResNet (2015): skip connections

y = F(x, {Wᵢ}) + x    -- residual block

152 layers! Skip connections let gradients flow directly through the identity path → mitigates the vanishing-gradient problem.
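A minimal PyTorch sketch of a basic residual block implementing y = F(x) + x; real ResNets also handle stride and channel changes on the skip path, which this sketch omits:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Basic residual block: y = F(x, {W_i}) + x."""
        def __init__(self, channels):
            super().__init__()
            self.f = nn.Sequential(       # F: two 3x3 convs with batch norm
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
            )

        def forward(self, x):
            return torch.relu(self.f(x) + x)   # identity skip connection

    y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))  # output shape = input shape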

Modern: EfficientNet, ConvNeXt, ViT (Vision Transformer)

ViT (2020): split image into patches → treat as sequence → apply transformer. Outperforms CNNs at large scale; requires more data.
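A sketch of the patch-embedding step; using a strided Conv2d to form non-overlapping patches is a common implementation trick, and the patch size and embedding dimension here (16 and 768, as in ViT-Base) are illustrative:

    import torch
    import torch.nn as nn

    img = torch.randn(1, 3, 224, 224)
    patch, dim = 16, 768

    # Split into 16x16 patches and linearly project each to a dim-d token.
    to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
    tokens = to_tokens(img).flatten(2).transpose(1, 2)   # (1, 196, 768)

    # The token sequence (plus a class token and position embeddings)
    # then passes through a standard transformer encoder.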


Object Detection

Region-Based: R-CNN family

  1. R-CNN: selective search proposals → CNN features → SVM classifier + regressor
  2. Fast R-CNN: share CNN computation; RoI pooling
  3. Faster R-CNN: Region Proposal Network (RPN) replaces selective search → end-to-end

Single-Shot: YOLO, SSD

Process the image once; predict all boxes + classes simultaneously.

- Much faster (real-time detection)
- Slightly lower accuracy for small objects
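Detectors in both families typically post-process overlapping predictions with non-maximum suppression (NMS). A plain-NumPy sketch (the IoU threshold is illustrative):

    import numpy as np

    def iou(box, boxes):
        """IoU of one box [x1, y1, x2, y2] against an array of boxes."""
        x1 = np.maximum(box[0], boxes[:, 0])
        y1 = np.maximum(box[1], boxes[:, 1])
        x2 = np.minimum(box[2], boxes[:, 2])
        y2 = np.minimum(box[3], boxes[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
        return inter / (area(box) + area(boxes) - inter)

    def nms(boxes, scores, iou_thresh=0.5):
        """Greedily keep the highest-scoring box; drop boxes overlapping it."""
        order = np.argsort(scores)[::-1]
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            rest = order[1:]
            order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
        return keep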

Anchor-free: FCOS, CenterNet

Predict object center + size directly without predefined anchors.


Semantic and Instance Segmentation

Semantic segmentation: label each pixel (no instance separation).

- FCN: fully convolutional network; upsampling via transposed convolution (see the sketch below)
- U-Net: encoder-decoder with skip connections; great for medical imaging
- DeepLab: atrous/dilated convolutions for multiscale context
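For intuition on the upsampling step: a transposed convolution with stride 2 doubles spatial resolution. The kernel/stride choice below is one common configuration, not the only one:

    import torch
    import torch.nn as nn

    feat = torch.randn(1, 256, 28, 28)    # low-resolution feature map

    # Stride-2 transposed conv doubles H and W: 28x28 -> 56x56.
    up = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
    print(up(feat).shape)                  # torch.Size([1, 128, 56, 56])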

Instance segmentation: label each pixel AND separate instances.

- Mask R-CNN: Faster R-CNN + a mask branch that predicts a binary mask per detected instance (cost scales with the number of instances, O(n))

Panoptic segmentation: combines semantic + instance.


3D Vision

Stereo vision: two cameras → disparity → depth.

Structure from Motion (SfM): multiple views → 3D point cloud + camera poses.

SLAM (Simultaneous Localization and Mapping): real-time 3D reconstruction while moving.
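For a rectified stereo pair, depth follows from disparity as Z = f·B / d (focal length f in pixels, baseline B, disparity d). A tiny worked example with illustrative numbers:

    # Rectified stereo: depth Z = f * B / d
    f = 700.0    # focal length in pixels (illustrative)
    B = 0.12     # camera baseline in meters (illustrative)
    d = 21.0     # measured disparity in pixels

    Z = f * B / d
    print(f"depth = {Z:.2f} m")   # depth = 4.00 m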

Depth estimation from a single image: relies on priors learned from training data.

NeRF (Neural Radiance Fields, 2020): neural network represents 3D scene as implicit function; synthesize novel views.
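Schematically, the implicit function NeRF learns is an MLP from a 3D point and viewing direction to density and color; the layer sizes below are illustrative, and a real NeRF also applies a positional encoding to its inputs:

    import torch
    import torch.nn as nn

    # F(x, y, z, view_dir) -> (density sigma, RGB color)
    nerf_mlp = nn.Sequential(
        nn.Linear(3 + 3, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 4),               # [sigma, r, g, b]
    )

    points_and_dirs = torch.randn(1024, 6)   # samples along camera rays
    sigma_rgb = nerf_mlp(points_and_dirs)    # (1024, 4)

    # Novel views are rendered by volume-integrating density/color along rays.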


Image Generation

VAEs: generate blurry images (due to pixel-wise loss).

GANs: sharp images; mode collapse problem; hard to train.

Diffusion models (2020-present):

- Forward process: gradually add Gaussian noise (closed form sketched below)
- Reverse process: learn to denoise
- Guidance: text-conditioned with CLIP embeddings
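The forward process has a closed form, x_t = √(ᾱ_t)·x₀ + √(1−ᾱ_t)·ε with ε ~ N(0, I), so any noise level can be sampled in one step. A sketch using a linear beta schedule (one common choice):

    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)           # noise schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product ᾱ_t

    def q_sample(x0, t):
        """Sample x_t directly from x_0 via the closed-form forward process."""
        eps = torch.randn_like(x0)
        xt = alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps
        return xt, eps    # the denoiser is trained to predict eps from (xt, t)

    x0 = torch.randn(1, 3, 64, 64)    # stand-in for a training image
    xt, eps = q_sample(x0, t=500)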

DALL-E 2, Stable Diffusion, Midjourney: text-to-image via diffusion.


CLIP (Contrastive Language-Image Pretraining)

Train an image encoder + text encoder to maximize the similarity of matching pairs, via a contrastive loss with a learned temperature τ:

Loss = -Σᵢ log( exp(sim(iᵢ,tᵢ)/τ) / Σⱼ exp(sim(iᵢ,tⱼ)/τ) )

(CLIP symmetrizes this, averaging the image→text and text→image directions.)

Result: zero-shot image classification — classify by comparing to text descriptions.
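A minimal sketch of that symmetric loss over a batch of matched pairs, assuming the image and text embeddings are already computed:

    import torch
    import torch.nn.functional as F

    def clip_loss(img_emb, txt_emb, tau=0.07):
        """Symmetric InfoNCE loss; row i of each tensor is a matching pair."""
        img = F.normalize(img_emb, dim=-1)
        txt = F.normalize(txt_emb, dim=-1)
        logits = img @ txt.t() / tau            # cosine similarities / temperature
        labels = torch.arange(len(logits))      # positives sit on the diagonal
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.t(), labels)) / 2

    loss = clip_loss(torch.randn(32, 512), torch.randn(32, 512))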


Connection to DynamICCL

Computer vision is not directly related to NCCL, but:

- CNN training is a major DynamICCL workload (ResNet, ViT training on GPU clusters)
- Different model architectures have different communication patterns:
  - Data-parallel CNN: AllReduce of all gradients after each batch
  - Distributed ViT with tensor parallelism: more complex AllReduce patterns per attention layer
- Communication volume scales with model size and batch size, so DynamICCL must adapt (see the estimate below)
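A back-of-the-envelope sketch of that scaling, using the standard ring-AllReduce cost of roughly 2·(n−1)/n · S bytes moved per GPU for S bytes of gradients; the parameter counts are approximate:

    # Per-step gradient AllReduce volume in data-parallel training.
    def ring_allreduce_bytes_per_gpu(param_count, bytes_per_param=4, n_gpus=8):
        S = param_count * bytes_per_param          # gradient size (fp32 here)
        return 2 * (n_gpus - 1) / n_gpus * S       # ring AllReduce traffic per GPU

    for name, params in [("ResNet-50", 25e6), ("ViT-L/16", 300e6)]:
        gb = ring_allreduce_bytes_per_gpu(params) / 1e9
        print(f"{name}: ~{gb:.2f} GB moved per GPU per training step")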