Local Search in Continuous Spaces

The Problem

Many optimization problems have continuous state spaces (real-valued parameters). Discrete neighborhood moves don’t apply directly.

Examples:
- Minimizing loss in neural network training
- Optimal robot joint angles
- Economic optimization models


Gradient Descent (Steepest Ascent/Descent)

For a differentiable objective function f(x), the gradient ∇f(x) points in the direction of steepest ascent.

Gradient ascent (maximization):

x ← x + α · ∇f(x)

Gradient descent (minimization):

x ← x - α · ∇f(x)

Where α is the step size (learning rate).
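
As a concrete sketch (the toy objective, step size, and iteration count below are invented purely for illustration), the descent update can be written in a few lines of Python:

```python
def gradient_descent(grad_f, x0, alpha=0.1, steps=100):
    """Repeatedly step against the gradient: x <- x - alpha * grad_f(x)."""
    x = x0
    for _ in range(steps):
        x = x - alpha * grad_f(x)
    return x

# Toy 1-D objective f(x) = (x - 3)^2, whose gradient is 2*(x - 3); minimum at x = 3.
print(gradient_descent(lambda x: 2 * (x - 3.0), x0=0.0))  # ≈ 3.0
```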

Step Size Choice

- α too large: overshoots, may diverge
- α too small: slow convergence
- α adaptive: line search, i.e. find the α that maximizes f along the gradient direction

Line search: at each step, find the optimal α by 1D optimization along the gradient direction. More accurate but more expensive per step.
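
One practical, inexact variant of this idea is backtracking (Armijo) line search, which shrinks α until the step gives a sufficient decrease. The sketch below is a generic illustration; the constants and the test function are conventional placeholders, not taken from the text:

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha0=1.0, beta=0.5, c=1e-4):
    """Shrink alpha until the Armijo sufficient-decrease condition holds, then step."""
    g = grad_f(x)
    alpha = alpha0
    # Accept alpha once f decreases by at least c * alpha * ||g||^2.
    while f(x - alpha * g) > f(x) - c * alpha * np.dot(g, g):
        alpha *= beta  # otherwise shrink the step and try again
    return x - alpha * g

# One step on f(x) = x^T x starting from x = (2, -1).
f = lambda x: float(np.dot(x, x))
grad = lambda x: 2 * x
print(backtracking_step(f, grad, np.array([2.0, -1.0])))
```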


Newton-Raphson Method

Uses second-order information (the Hessian H = ∇²f) to solve ∇f(x) = 0 directly, giving much faster convergence near the optimum:

x ← x - H⁻¹(x) · ∇f(x)
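
A sketch of a single Newton step, assuming the gradient and Hessian are available in closed form (the quadratic test problem is made up); solving the linear system avoids forming H⁻¹ explicitly:

```python
import numpy as np

def newton_step(grad, hess, x):
    """x <- x - H(x)^-1 * grad f(x), via a linear solve rather than an explicit inverse."""
    return x - np.linalg.solve(hess(x), grad(x))

# Toy quadratic f(x) = 0.5 * x^T A x - b^T x with A positive definite.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
hess = lambda x: A
print(newton_step(grad, hess, np.zeros(2)))  # one step lands exactly on the minimizer A^-1 b
```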

Stochastic Gradient Descent (SGD)

In ML: f(x) = Σᵢ loss(x, example_i). Computing the full gradient requires all data.

SGD: estimate the gradient from a single randomly sampled example (or a small mini-batch):

x ← x - α · ∇loss(x, sample)
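
A minimal mini-batch SGD sketch for a synthetic least-squares problem; the data, batch size, and learning rate are all placeholder choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                    # synthetic inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)     # noisy targets

w = np.zeros(3)
alpha, batch_size = 0.05, 32
for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)    # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size      # gradient of the mean squared error on the batch
    w -= alpha * grad                                 # x <- x - alpha * grad loss(x, sample)
print(w)  # ≈ true_w
```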

Constrained Optimization

When variables must satisfy constraints g(x) = 0 or h(x) ≤ 0:

Lagrange Multipliers (equality constraints)

Maximize f(x) subject to g(x) = 0:

Solve: ∇f(x) = λ · ∇g(x), together with g(x) = 0

The gradient of f must be parallel to the gradient of g at the optimum.
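
As a worked illustration (the objective f(x, y) = x·y and constraint x + y = 1 are invented for the example), the stationarity condition ∇f = λ·∇g together with g = 0 can be solved symbolically, e.g. with SymPy:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam')
f = x * y            # objective to maximize
g = x + y - 1        # constraint g(x, y) = 0

# Stationarity: each component of grad f equals lam times that component of grad g,
# plus the constraint itself.
eqs = [sp.diff(f, x) - lam * sp.diff(g, x),
       sp.diff(f, y) - lam * sp.diff(g, y),
       g]
print(sp.solve(eqs, [x, y, lam]))   # [(1/2, 1/2, 1/2)]
```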

Linear Programming (LP)

Linear objective + linear inequality constraints. Solvable in polynomial time via interior-point methods; the simplex method is exponential in the worst case but usually very fast in practice.
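
A small made-up LP solved with SciPy's linprog; since linprog minimizes, the objective is negated to maximize:

```python
from scipy.optimize import linprog

# Maximize 3x + 2y  subject to  x + y <= 4,  x + 3y <= 6,  x, y >= 0.
c = [-3, -2]                       # negated objective, because linprog minimizes
A_ub = [[1, 1], [1, 3]]
b_ub = [4, 6]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method='highs')
print(res.x, -res.fun)             # optimum at x = 4, y = 0 with objective value 12
```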


Properties Summary

Method             Derivative required   Convergence    Cost/step
Gradient descent   1st                   Linear         O(n)
Newton-Raphson     1st + 2nd             Quadratic      O(n³)
L-BFGS             1st                   Super-linear   O(n)
SGD                1st (approx.)         Sub-linear     O(1)

Connection to RL