Local Search in Continuous Spaces
The Problem
Many optimization problems have continuous state spaces (real-valued parameters). Discrete neighborhood moves don’t apply directly.
Examples:
- Minimizing loss in neural network training
- Optimal robot joint angles
- Economic optimization models
Gradient Descent (Steepest Ascent/Descent)
For a differentiable objective function f(x), the gradient ∇f(x) points in the direction of steepest ascent.
Gradient ascent (maximization):
x ← x + α · ∇f(x)
Gradient descent (minimization):
x ← x - α · ∇f(x)
Where α is the step size (learning rate).
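A minimal sketch of this update loop in Python; the quadratic objective, step size, and iteration count are illustrative assumptions, not prescriptions:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, steps=100):
    """Plain gradient descent: x <- x - alpha * grad_f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - alpha * grad_f(x)
    return x

# Illustrative objective: f(x) = ||x - 3||^2, whose gradient is 2(x - 3).
grad = lambda x: 2.0 * (x - 3.0)
print(gradient_descent(grad, x0=[0.0, 0.0]))  # converges toward [3, 3]
```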
Step Size Choice
| Step size | Behavior |
|---|---|
| α too large | Overshoots; may diverge |
| α too small | Slow convergence |
| α adaptive | Line search: find α that maximizes f along the gradient direction |
Line search: at each step, find the optimal α by 1D optimization along the gradient direction. More accurate but more expensive per step.
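One standard way to pick α adaptively is backtracking line search with the Armijo sufficient-decrease rule; a sketch follows, with conventional (assumed) values for the shrink factor and the constant c:

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, alpha0=1.0, shrink=0.5, c=1e-4):
    """Shrink alpha until f decreases enough along -grad_f(x) (Armijo rule)."""
    g = grad_f(x)
    alpha = alpha0
    # Sufficient decrease: f(x - alpha*g) <= f(x) - c * alpha * ||g||^2
    while f(x - alpha * g) > f(x) - c * alpha * np.dot(g, g):
        alpha *= shrink
    return alpha

# Illustrative quadratic from the sketch above.
f = lambda x: float(np.dot(x - 3.0, x - 3.0))
grad = lambda x: 2.0 * (x - 3.0)
x = np.array([0.0, 0.0])
alpha = backtracking_line_search(f, grad, x)
print(alpha, x - alpha * grad(x))  # one line-search step lands at [3, 3]
```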
Newton-Raphson Method
Uses second-order information (Hessian H = ∇²f) for much faster convergence near the optimum:
x ← x - H⁻¹(x) · ∇f(x)
- Converges quadratically (vs. linear for gradient descent) near the optimum
- Cost: computing and inverting H is O(n³) for n parameters — expensive in high dimensions
- Fix: Quasi-Newton methods (L-BFGS) approximate H⁻¹ using gradient history — O(n) memory
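A sketch of a single Newton-Raphson step; it solves the linear system H·Δ = ∇f rather than forming H⁻¹ explicitly, which is the standard, more stable route. The quadratic test function is an assumption for illustration:

```python
import numpy as np

def newton_step(grad_f, hess_f, x):
    """Newton-Raphson update: x <- x - H(x)^{-1} grad_f(x).
    Solving H d = grad is cheaper and more stable than inverting H."""
    g = grad_f(x)
    H = hess_f(x)
    return x - np.linalg.solve(H, g)

# Illustrative quadratic f(x) = 0.5 x^T A x - b^T x, minimized where Ax = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
hess = lambda x: A
x = newton_step(grad, hess, np.zeros(2))
print(x, np.allclose(A @ x, b))  # exact minimum in one step for a quadratic
```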
Stochastic Gradient Descent (SGD)
In ML: f(x) = Σᵢ loss(x, example_i). Computing the full gradient requires all data.
SGD: estimate the gradient from a single randomly sampled example (or a mini-batch):
x ← x - α · ∇loss(x, sample)
- Much cheaper per step
- Noisy gradient = implicit exploration (can escape local minima)
- Converges to a local minimum provided the step sizes diminish appropriately (Robbins-Monro conditions: Σₜ αₜ = ∞, Σₜ αₜ² < ∞)
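A sketch of mini-batch SGD on least-squares regression with a diminishing step size α/√t; the synthetic data, batch size, and hyperparameters are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # synthetic inputs
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=1000)   # noisy targets

w = np.zeros(3)
alpha = 0.1
for t in range(1, 501):
    idx = rng.choice(len(X), size=32, replace=False)  # random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)      # noisy gradient estimate
    w -= (alpha / np.sqrt(t)) * grad                   # diminishing step size
print(w)  # approaches w_true
```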
Constrained Optimization
When variables must satisfy constraints g(x) = 0 or h(x) ≤ 0:
Lagrange Multipliers (equality constraints)
Maximize f(x) subject to g(x) = 0:
Solve: ∇f(x) = λ · ∇g(x)
The gradient of f must be parallel to the gradient of g at the optimum.
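A worked instance of this condition (the objective and constraint are assumed for illustration): maximize f(x, y) = x + y on the unit circle g(x, y) = x² + y² - 1 = 0, solved symbolically with SymPy:

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
f = x + y
g = x**2 + y**2 - 1

# Stationarity ∇f - λ∇g = 0 in each variable, plus the constraint g = 0.
eqs = [sp.diff(f, v) - lam * sp.diff(g, v) for v in (x, y)] + [g]
solutions = sp.solve(eqs, (x, y, lam), dict=True)
print(solutions)  # (±√2/2, ±√2/2); the maximum of f is at (√2/2, √2/2)
```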
Linear Programming (LP)
Linear objective + linear inequality constraints. Solvable in polynomial time via interior-point methods; the simplex method has exponential worst-case time but is very fast in practice.
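A sketch of a toy LP solved with scipy.optimize.linprog (the objective and constraints are assumptions; note that linprog minimizes, so a maximization objective is negated):

```python
from scipy.optimize import linprog

# Toy LP: maximize 3x + 2y subject to x + y <= 4, x + 3y <= 6, x, y >= 0.
c = [-3.0, -2.0]                 # negate: linprog minimizes c^T x
A_ub = [[1.0, 1.0], [1.0, 3.0]]
b_ub = [4.0, 6.0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)],
              method="highs")    # HiGHS backend (simplex / interior-point)
print(res.x, -res.fun)           # optimum at x=4, y=0, value 12
```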
Properties Summary
| Method | Derivative required | Convergence | Cost/step |
|---|---|---|---|
| Gradient descent | 1st | Linear | O(n) |
| Newton-Raphson | 1st + 2nd | Quadratic | O(n³) |
| L-BFGS | 1st | Super-linear | O(n) |
| SGD | 1st (approx.) | Sub-linear | O(n) per sample (independent of dataset size) |
Connection to RL
- Policy gradient methods (REINFORCE, PPO, A3C) ARE gradient ascent on expected reward J(θ) (see the sketch after this list)
- Actor-Critic methods approximate both ∇J(θ) and the value function
- DynamICCL’s RL component uses policy gradient to optimize NCCL communication parameters, directly applying gradient methods in continuous (or continuously relaxed) parameter spaces
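A minimal sketch of this connection, assuming a toy 3-armed bandit with a softmax policy; the REINFORCE update below is literally stochastic gradient ascent on J(θ):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([1.0, 2.0, 3.0])   # hypothetical arm reward means
theta = np.zeros(3)                      # policy parameters (softmax logits)
alpha = 0.05

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(5000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)               # sample action from π(·|θ)
    r = true_means[a] + rng.normal()         # noisy reward
    grad_log_pi = -probs                     # ∇θ log π(a|θ) for softmax:
    grad_log_pi[a] += 1.0                    #   one_hot(a) - probs
    theta += alpha * r * grad_log_pi         # gradient ASCENT on J(θ)

print(softmax(theta))  # mass concentrates on the best arm (index 2)
```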