Local Search in Continuous Spaces
The Problem
Many optimization problems have continuous state spaces (real-valued parameters). Discrete neighborhood moves don’t apply directly.
Examples:
- Minimizing loss in neural network training
- Optimal robot joint angles
- Economic optimization models
Gradient Descent (Steepest Ascent/Descent)
For a differentiable objective function f(x), the gradient ∇f(x) points in the direction of steepest ascent.
Gradient ascent (maximization):
x ← x + α · ∇f(x)
Gradient descent (minimization):
x ← x - α · ∇f(x)
Where α is the step size (learning rate).
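A minimal sketch of this update loop in Python; the quadratic objective, step size, and iteration count are illustrative assumptions, not prescriptions:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, steps=100):
    """Plain gradient descent: x <- x - alpha * grad_f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - alpha * grad_f(x)
    return x

# Illustrative objective: f(x) = ||x - 3||^2, whose gradient is 2(x - 3).
grad = lambda x: 2.0 * (x - 3.0)
print(gradient_descent(grad, x0=[0.0, 0.0]))  # converges toward [3, 3]
```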
Step Size Choice
| Step size | Behavior |
|---|---|
| α too large | Overshoots; may diverge |
| α too small | Slow convergence |
| α adaptive | Line search: find α that maximizes f along the gradient direction |
Line search: at each step, find the optimal α by 1D optimization along the gradient direction. More accurate but more expensive per step.
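One standard way to pick α adaptively is backtracking line search with the Armijo sufficient-decrease rule; a sketch follows, with conventional (assumed) values for the shrink factor and the constant c:

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, alpha0=1.0, shrink=0.5, c=1e-4):
    """Shrink alpha until f decreases enough along -grad_f(x) (Armijo rule)."""
    g = grad_f(x)
    alpha = alpha0
    # Sufficient decrease: f(x - alpha*g) <= f(x) - c * alpha * ||g||^2
    while f(x - alpha * g) > f(x) - c * alpha * np.dot(g, g):
        alpha *= shrink
    return alpha

# Illustrative quadratic from the sketch above.
f = lambda x: float(np.dot(x - 3.0, x - 3.0))
grad = lambda x: 2.0 * (x - 3.0)
x = np.array([0.0, 0.0])
alpha = backtracking_line_search(f, grad, x)
print(alpha, x - alpha * grad(x))  # one line-search step lands at [3, 3]
```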
Newton-Raphson Method
Uses second-order information (Hessian H = ∇²f) for much faster convergence near the optimum:
x ← x - H⁻¹(x) · ∇f(x)
- Converges quadratically (vs. linear for gradient descent) near the optimum
- Cost: computing and inverting H is O(n³) for n parameters — expensive in high dimensions
- Fix: Quasi-Newton methods (L-BFGS) approximate H⁻¹ using gradient history — O(n) memory
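A sketch of a single Newton-Raphson step; it solves the linear system H·Δ = ∇f rather than forming H⁻¹ explicitly, which is the standard, more stable route. The quadratic test function is an assumption for illustration:

```python
import numpy as np

def newton_step(grad_f, hess_f, x):
    """Newton-Raphson update: x <- x - H(x)^{-1} grad_f(x).
    Solving H d = grad is cheaper and more stable than inverting H."""
    g = grad_f(x)
    H = hess_f(x)
    return x - np.linalg.solve(H, g)

# Illustrative quadratic f(x) = 0.5 x^T A x - b^T x, minimized where Ax = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
hess = lambda x: A
x = newton_step(grad, hess, np.zeros(2))
print(x, np.allclose(A @ x, b))  # exact minimum in one step for a quadratic
```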
Stochastic Gradient Descent (SGD)
In ML: f(x) = Σᵢ loss(x, example_i). Computing the full gradient requires all data.
SGD: estimate the gradient from a single randomly sampled example (or a mini-batch):
x ← x - α · ∇loss(x, sample)
- Much cheaper per step
- Noisy gradient = implicit exploration (can escape local minima)
- Converges to a local minimum provided the step sizes diminish appropriately (Robbins-Monro conditions: Σₜ αₜ = ∞, Σₜ αₜ² < ∞)
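A sketch of mini-batch SGD on least-squares regression with a diminishing step size α/√t; the synthetic data, batch size, and hyperparameters are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # synthetic inputs
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=1000)   # noisy targets

w = np.zeros(3)
alpha = 0.1
for t in range(1, 501):
    idx = rng.choice(len(X), size=32, replace=False)  # random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)      # noisy gradient estimate
    w -= (alpha / np.sqrt(t)) * grad                   # diminishing step size
print(w)  # approaches w_true
```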
Constrained Optimization
When variables must satisfy constraints g(x) = 0 or h(x) ≤ 0:
Lagrange Multipliers (equality constraints)
Maximize f(x) subject to g(x) = 0:
Solve: ∇f(x) = λ · ∇g(x)
The gradient of f must be parallel to the gradient of g at the optimum.
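A worked instance of this condition (the objective and constraint are assumed for illustration): maximize f(x, y) = x + y on the unit circle g(x, y) = x² + y² - 1 = 0, solved symbolically with SymPy:

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
f = x + y
g = x**2 + y**2 - 1

# Stationarity ∇f - λ∇g = 0 in each variable, plus the constraint g = 0.
eqs = [sp.diff(f, v) - lam * sp.diff(g, v) for v in (x, y)] + [g]
solutions = sp.solve(eqs, (x, y, lam), dict=True)
print(solutions)  # (±√2/2, ±√2/2); the maximum of f is at (√2/2, √2/2)
```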
Linear Programming (LP)
Linear objective + linear inequality constraints. Solvable in polynomial time via interior-point methods; the simplex method has exponential worst-case time but is very fast in practice.
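A sketch of a toy LP solved with scipy.optimize.linprog (the objective and constraints are assumptions; note that linprog minimizes, so a maximization objective is negated):

```python
from scipy.optimize import linprog

# Toy LP: maximize 3x + 2y subject to x + y <= 4, x + 3y <= 6, x, y >= 0.
c = [-3.0, -2.0]                 # negate: linprog minimizes c^T x
A_ub = [[1.0, 1.0], [1.0, 3.0]]
b_ub = [4.0, 6.0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)],
              method="highs")    # HiGHS backend (simplex / interior-point)
print(res.x, -res.fun)           # optimum at x=4, y=0, value 12
```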
Properties Summary
| Method | Derivative required | Convergence | Cost/step |
|---|---|---|---|
| Gradient descent | 1st | Linear | O(n) |
| Newton-Raphson | 1st + 2nd | Quadratic | O(n³) |
| L-BFGS | 1st | Super-linear | O(n) |
| SGD | 1st (approx.) | Sub-linear | O(n) per sample (independent of dataset size) |
Connection to RL
- Policy gradient methods (REINFORCE, PPO, A3C) ARE gradient ascent on expected reward J(θ) (see the sketch after this list)
- Actor-Critic methods approximate both ∇J(θ) and the value function
- DynamICCL’s RL component uses policy gradient to optimize NCCL communication parameters, directly applying gradient methods in continuous (or continuously relaxed) parameter spaces
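A minimal sketch of this connection, assuming a toy 3-armed bandit with a softmax policy; the REINFORCE update below is literally stochastic gradient ascent on J(θ):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([1.0, 2.0, 3.0])   # hypothetical arm reward means
theta = np.zeros(3)                      # policy parameters (softmax logits)
alpha = 0.05

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(5000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)               # sample action from π(·|θ)
    r = true_means[a] + rng.normal()         # noisy reward
    grad_log_pi = -probs                     # ∇θ log π(a|θ) for softmax:
    grad_log_pi[a] += 1.0                    #   one_hot(a) - probs
    theta += alpha * r * grad_log_pi         # gradient ASCENT on J(θ)

print(softmax(theta))  # mass concentrates on the best arm (index 2)
```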