📚 Chapter V - Part 1: Gradient & Direct Search Methods

Course Author: Dr. Mohamed Abdou SOUIDI
Academic Year: 2025-2026

🎯 Introduction to Unconstrained Optimization

What is Unconstrained Optimization?

Unconstrained optimization is the mathematical problem of finding the minimum (or maximum) of a function without any constraints on the variables. The general formulation is:

\[\min_{x \in \mathbb{R}^n} f(x)\]

where \(f: \mathbb{R}^n \rightarrow \mathbb{R}\) is the objective function we want to minimize.

πŸ”‘ Key Concept

At a local minimum \(x^*\), the gradient must be zero: \(\nabla f(x^*) = 0\) (first-order necessary condition).

Real-World Applications

πŸ€– Machine Learning

Training neural networks by minimizing loss functions (cross-entropy, MSE)

πŸ“ˆ Finance

Portfolio optimization to maximize returns while managing risk

🏭 Engineering Design

Minimizing cost, weight, or material usage in structural design

πŸ”¬ Scientific Computing

Parameter estimation and curve fitting in data analysis

Key Mathematical Concepts

1. Gradient (βˆ‡f)

The gradient is a vector of partial derivatives indicating the direction of steepest ascent:

\[\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}\]

2. Hessian (βˆ‡Β²f)

The Hessian is a matrix of second-order partial derivatives describing the curvature:

\[\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}\]

Optimality Conditions

βœ… First-Order Necessary Condition

If \(x^*\) is a local minimum, then \(\nabla f(x^*) = 0\)

βœ… Second-Order Sufficient Condition

If \(\nabla f(x^*) = 0\) and \(\nabla^2 f(x^*)\) is positive definite, then \(x^*\) is a strict local minimum

📊 Gradient Methods

2.1 Gradient Descent

Gradient descent is the most fundamental optimization algorithm. It iteratively moves in the direction opposite to the gradient:

\[x_{k+1} = x_k - \alpha \nabla f(x_k)\]

where \(\alpha > 0\) is the learning rate (step size).

Algorithm: Gradient Descent
1. Initialize \(x_0\), set learning rate \(\alpha\), tolerance \(\epsilon\)
2. for k = 0, 1, 2, ... do
3. Compute gradient \(g_k = \nabla f(x_k)\)
4. if \(\|g_k\| < \epsilon\) then
5. return \(x_k\) (converged)
6. end if
7. Update: \(x_{k+1} = x_k - \alpha g_k\)
8. end for
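The loop above translates almost line for line into Python. This is a minimal 1-D sketch (step size and tolerance chosen for illustration), using the quadratic \(f(x) = x^2 - 4x + 3\) that serves as the running example in this chapter:

```python
def gradient_descent(f_grad, x0, alpha=0.2, eps=1e-6, max_iter=1000):
    """Minimize a 1-D function given its gradient, following the loop above."""
    x = x0
    for _ in range(max_iter):
        g = f_grad(x)
        if abs(g) < eps:      # converged: gradient (nearly) zero
            break
        x = x - alpha * g     # step against the gradient
    return x

# f(x) = x^2 - 4x + 3 has gradient f'(x) = 2x - 4 and minimizer x* = 2
x_star = gradient_descent(lambda x: 2 * x - 4, x0=5.0)
```

The same function works for any differentiable 1-D objective; only the gradient callable changes.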

Interactive Visualization: Simple 1D Function

Explore the function: \(f(x) = x^2 - 4x + 3\)

Function Analysis

Function: f(x) = x² - 4x + 3
Derivative (gradient): f'(x) = 2x - 4
At the sample point x = 2.5: f(2.5) = -0.75 and f'(2.5) = 1.00, so the gradient is positive and the descent direction points left (distance to the optimum: 0.50).
✓ Optimal point: x* = 2, f(x*) = -1; the gradient is zero at the minimum: f'(2) = 0.

💡 Tip: as x moves along the axis, the gradient changes sign and magnitude. At the minimum (x = 2), the gradient equals zero.

Numerical Example: Step-by-Step Calculation

Problem: Minimize \(f(x) = x^2 - 4x + 3\) with \(f'(x) = 2x - 4\)

Parameters: \(x_0 = 5.0\), \(\alpha = 0.2\) (learning rate)

πŸ’‘ Step-by-Step Calculations

Iteration 0:

  • \(x_0 = 5.0\)
  • \(f(x_0) = 5^2 - 4(5) + 3 = 25 - 20 + 3 = 8.0\)
  • \(\nabla f(x_0) = 2(5) - 4 = 6.0\)
  • \(x_1 = x_0 - \alpha \nabla f(x_0) = 5.0 - 0.2(6.0) = 5.0 - 1.2 = 3.8\)

Iteration 1:

  • \(x_1 = 3.8\)
  • \(f(x_1) = 3.8^2 - 4(3.8) + 3 = 14.44 - 15.2 + 3 = 1.76\)
  • \(\nabla f(x_1) = 2(3.8) - 4 = 3.6\)
  • \(x_2 = 3.8 - 0.2(3.6) = 3.8 - 0.72 = 3.08\)

| Iteration k | \(x_k\) | \(f(x_k)\) | \(\nabla f(x_k)\) | \(x_{k+1}\) |
|---|---|---|---|---|
| 0 | 5.000 | 8.000 | 6.000 | 3.800 |
| 1 | 3.800 | 2.240 | 3.600 | 3.080 |
| 2 | 3.080 | 0.166 | 2.160 | 2.648 |
| 3 | 2.648 | -0.580 | 1.296 | 2.389 |
| 4 | 2.389 | -0.849 | 0.778 | 2.233 |

βœ“ Converges to optimal solution: \(x^* = 2.0\), \(f(x^*) = -1.0\)

Interactive Demo: Effect of Learning Rate

Function: \(f(x) = x^2 - 4x + 3\), Starting point: \(x_0 = 5.0\)

Algorithm Status

Learning rate: α = 0.20 — ✓ converged in 13 iterations to x = 2.00 with f(x) = -1.00.
📊 Recommendation: this rate gives a good balance between speed and stability.

Convergence Comparison

| α value | Iterations | Behavior | Status |
|---|---|---|---|
| 0.05 (too small) | ~30 | Very slow | ⚠ Inefficient |
| 0.20 (optimal) | 13 | Fast, stable | ✓ Recommended |
| 0.60 (too large) | ~4 | Oscillates | ⚠ Unstable |

πŸ”‘ Key Insights about Learning Rate

  • Too small (Ξ± < 0.1): Slow convergence, many iterations needed
  • Optimal (Ξ± β‰ˆ 0.2): Fast, stable convergence
  • Too large (Ξ± > 0.5): Oscillations or divergence


2.2 Gradient Descent with Momentum

Momentum accelerates gradient descent by accumulating a velocity vector that dampens oscillations and speeds up convergence in consistent directions.

\[v_{k+1} = \beta v_k - \alpha \nabla f(x_k)\] \[x_{k+1} = x_k + v_{k+1}\]

where \(\beta \in [0, 1)\) is the momentum coefficient (typically 0.9).

Algorithm: Momentum-based Gradient Descent
1. Initialize \(x_0\), \(v_0 = 0\), set \(\alpha\), \(\beta\)
2. for k = 0, 1, 2, ... do
3. Compute gradient \(g_k = \nabla f(x_k)\)
4. Update velocity: \(v_{k+1} = \beta v_k - \alpha g_k\)
5. Update position: \(x_{k+1} = x_k + v_{k+1}\)
6. end for
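The two-line update translates directly into Python. This is a minimal 1-D sketch; the momentum coefficient is set to β = 0.3 because a well-conditioned 1-D quadratic overshoots with the usual β = 0.9, which is better suited to ill-conditioned problems:

```python
def momentum_gd(f_grad, x0, alpha=0.2, beta=0.3, n_steps=20):
    """Gradient descent with momentum (heavy ball), following the loop above."""
    x, v = x0, 0.0
    for _ in range(n_steps):
        v = beta * v - alpha * f_grad(x)  # accumulate velocity
        x = x + v                         # move by the velocity
    return x

# Same test problem: f(x) = x^2 - 4x + 3 with gradient 2x - 4
x_star = momentum_gd(lambda x: 2 * x - 4, x0=5.0)
```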

βœ… Advantages of Momentum

  • Accelerates convergence in ravine-like regions
  • Dampens oscillations across narrow valleys
  • Helps escape shallow local minima
  • Typically uses \(\beta = 0.9\) for good results

Numerical Example: Momentum vs Standard GD

Problem: same function \(f(x) = x^2 - 4x + 3\), \(x_0 = 5.0\), \(\alpha = 0.2\). Because this 1-D problem is well conditioned, a large momentum coefficient such as \(\beta = 0.9\) would overshoot and oscillate, so \(\beta = 0.3\) is used here.

| k | Standard GD \(x_k\) | Standard GD \(f(x_k)\) | Momentum \(x_k\) | Momentum \(v_k\) | Momentum \(f(x_k)\) |
|---|---|---|---|---|---|
| 0 | 5.000 | 8.000 | 5.000 | 0.000 | 8.000 |
| 1 | 3.800 | 2.240 | 3.800 | -1.200 | 2.240 |
| 2 | 3.080 | 0.166 | 2.720 | -1.080 | -0.482 |
| 3 | 2.648 | -0.580 | 2.108 | -0.612 | -0.988 |
| 4 | 2.389 | -0.849 | 1.881 | -0.227 | -0.986 |

🚀 Momentum converges roughly twice as fast here: it comes within 0.02 of the minimum value \(f^* = -1\) by iteration 3, versus iteration 6 for standard GD.

2.3 Nesterov Accelerated Gradient (NAG)

NAG improves upon classical momentum by computing the gradient at the "lookahead" position:

\[v_{k+1} = \beta v_k - \alpha \nabla f(x_k + \beta v_k)\] \[x_{k+1} = x_k + v_{k+1}\]

πŸ”‘ The "Lookahead" Concept

NAG evaluates the gradient not at the current position, but at the position where momentum would take us. This provides better anticipation of the landscape ahead, often leading to faster convergence.

2.4 Modern Adaptive Methods

Adam (Adaptive Moment Estimation)

Combines momentum with adaptive learning rates for each parameter:

\[m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t\] \[v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2\] \[\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}\] \[x_{t+1} = x_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\]

where \(\hat{m}_t\) and \(\hat{v}_t\) are the bias-corrected moment estimates.

Default parameters: \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\)
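A 1-D Python sketch of the update rules above. Here α = 0.1 is chosen for this toy problem rather than the common deep-learning default of 0.001, which would be needlessly slow:

```python
import math

def adam(f_grad, x0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=5000):
    """Adam in one dimension, following the update rules above."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, n_steps + 1):
        g = f_grad(x)
        m = beta1 * m + (1 - beta1) * g        # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (mean of squares)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Same test problem: f(x) = x^2 - 4x + 3 with gradient 2x - 4
x_star = adam(lambda x: 2 * x - 4, x0=5.0)
```

On this deterministic quadratic Adam is not the fastest choice; its strengths show on noisy, sparse, high-dimensional gradients.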

RMSprop (Root Mean Square Propagation)

Adapts learning rate based on recent gradient magnitudes:

\[E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta)g_t^2\] \[x_{t+1} = x_t - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} g_t\]

Typical value: \(\beta = 0.9\)

AdaGrad (Adaptive Gradient)

Accumulates squared gradients for parameter-specific learning rates:

\[G_t = G_{t-1} + g_t^2\] \[x_{t+1} = x_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} g_t\]

Note: Learning rate decreases over time; good for sparse data

When to Use Each Method?

| Method | Best Use Case | Pros | Cons |
|---|---|---|---|
| Gradient Descent | Simple, smooth functions | Simple, guaranteed convergence | Slow, sensitive to learning rate |
| Momentum | Ill-conditioned problems | Faster than GD, dampens oscillations | Adds hyperparameter \(\beta\) |
| NAG | When momentum is too aggressive | Better anticipation, faster convergence | Slightly more complex |
| Adam | Deep learning (default choice) | Adaptive, robust, little tuning needed | May not converge in some cases |

πŸ”Direct Methods (Derivative-Free)

πŸ“– What Are Direct Methods?

Direct methods (also called derivative-free or black-box methods) optimize functions without computing gradients. They only require function evaluations: f(x). This is essential when:

  • Gradients are unavailable or expensive to compute
  • Function is non-smooth or noisy
  • Function is a black-box simulation
  • Automatic differentiation is not feasible

3.1 Nelder-Mead Simplex Method

Core Concept:

πŸ”Ί What is a Simplex?

A simplex is a geometric shape with n+1 vertices in n-dimensional space:

  • 1D: 2 points on a line
  • 2D: Triangle with 3 vertices
  • 3D: Tetrahedron with 4 vertices

The simplex evolves through the search space using 4 operations: Reflection, Expansion, Contraction, and Shrink.

The Four Operations (DETAILED EXPLANATION):

1️⃣ REFLECTION (Ξ± = 1.0)

\[x_r = \bar{x} + \alpha(\bar{x} - x_{n+1})\] where xΜ„ is the centroid of the best n points

Interpretation: Move the worst point through the centroid in the opposite direction

When to use: Always try first

Success condition: f(x_best) ≀ f(x_r) < f(x_second_worst)

Action if successful: Replace worst vertex with reflected point

2️⃣ EXPANSION (Ξ³ = 2.0)

\[x_e = \bar{x} + \gamma(x_r - \bar{x})\]

Interpretation: Reflection was very good, so go even further in that direction

When to use: If reflected point is better than current best (f(x_r) < f(x_best))

Success condition: f(x_e) < f(x_r)

Action if successful: Replace worst with expanded point; otherwise use reflected point

3️⃣ CONTRACTION (ρ = 0.5)

\[x_c = \bar{x} + \rho(x_{n+1} - \bar{x})\]

Interpretation: Reflection failed, so move closer to centroid but not full retreat

When to use: If reflected point is poor (f(x_r) β‰₯ f(x_second_worst))

Success condition: f(x_c) < f(x_{n+1})

Action if successful: Replace worst with contracted point

4️⃣ SHRINK (Οƒ = 0.5)

\[x_i = x_{best} + \sigma(x_i - x_{best}) \quad \text{for all } i\]

Interpretation: Nothing worked, so scale the entire simplex toward the best point

When to use: Contraction failed (last resort)

Result: All vertices move closer to the best vertex

Effect: Reduces simplex size, enabling finer search

Complete Pseudocode:

Algorithm: Nelder-Mead Simplex Method
1. INITIALIZATION:
     Create simplex S = {x₁, x₂, ..., x_{n+1}}
     Evaluate f(x_i) for all vertices
     Set parameters: α = 1.0, γ = 2.0, ρ = 0.5, σ = 0.5
2. MAIN LOOP (while not converged):
     a. SORT: order vertices by f-value so that
          x₁ = best (lowest f), x_{n+1} = worst (highest f)
     b. CENTROID: compute x̄ = (1/n)·Σ(x₁ ... x_n)
     c. REFLECTION: x_r = x̄ + α(x̄ - x_{n+1}); evaluate f(x_r)
     d. DECISION LOGIC:
          IF f(x₁) ≤ f(x_r) < f(x_n):
               replace x_{n+1} with x_r, CONTINUE
          ELSE IF f(x_r) < f(x₁):   (very good!)
               EXPANSION: x_e = x̄ + γ(x_r - x̄)
               IF f(x_e) < f(x_r): replace x_{n+1} with x_e
               ELSE: replace x_{n+1} with x_r
          ELSE:   (reflection failed)
               CONTRACTION: x_c = x̄ + ρ(x_{n+1} - x̄)
               IF f(x_c) < f(x_{n+1}): replace x_{n+1} with x_c
               ELSE: SHRINK all vertices toward x₁:
                    x_i ← x₁ + σ(x_i - x₁) for i = 2 ... n+1
     e. CONVERGENCE CHECK: IF max_i |f(x_i) - f(x̄)| < ε, STOP
3. OUTPUT: return x₁ (best vertex)

Detailed Numerical Example: Step-by-Step Walkthrough

Problem: Minimize \(f(x_1, x_2) = (x_1-2)^2 + (x_2-1)^2\)
Optimal Point: \((2, 1)\) with \(f(2,1) = 0\)
Algorithm: Nelder-Mead Simplex Method

πŸ”’ INITIALIZATION: Setting up Initial Simplex

Goal: Start with 3 vertices representing candidate solutions in \(\mathbb{R}^2\)

| Vertex | Position | \(f(x)\) | Status |
|---|---|---|---|
| A | \((0.0, 0.0)\) | 5.0 | ✗ Worst |
| B | \((1.0, 0.0)\) | 2.0 | ✓ Best |
| C | \((0.0, 1.0)\) | 4.0 | Middle |

📍 Initial observation: vertex B at \((1.0, 0.0)\) gives the lowest function value (2.0), making it the current best. Vertex A at \((0.0, 0.0)\) is the worst with \(f = 5.0\), so it will be moved in the next iteration.

πŸ”„ ITERATION 1: Reflection and Expansion

Step 1️⃣ - Sort Vertices by Function Value:
Best: \(B(1.0, 0.0)\) β†’ \(f = 2.0\) βœ“
Middle: \(C(0.0, 1.0)\) β†’ \(f = 4.0\)
Worst: \(A(0.0, 0.0)\) β†’ \(f = 5.0\) βœ—
Step 2️⃣ - Compute Centroid of Best Two Vertices:
\(\bar{x} = \frac{B + C}{2}\)
\(\bar{x} = \frac{(1.0, 0.0) + (0.0, 1.0)}{2}\)
\(\bar{x} = \frac{(1.0, 1.0)}{2}\)
\(\bar{x} = (0.5, 0.5)\)
Step 3️⃣ - REFLECTION: Move worst point through centroid
\(x_r = \bar{x} + \alpha(\bar{x} - A)\) where \(\alpha = 1.0\)
\(x_r = (0.5, 0.5) + 1.0 \cdot [(0.5, 0.5) - (0.0, 0.0)]\)
\(x_r = (0.5, 0.5) + (0.5, 0.5)\)
\(x_r = (1.0, 1.0)\)

Evaluate: \(f(x_r) = (1.0-2)^2 + (1.0-1)^2\)
\(f(x_r) = 1 + 0 = 1.0\)
βœ“ REFLECTION ACCEPTED! \((1.0 < 2.0)\) Reflected point is better than best!
Step 4️⃣ - EXPANSION: Try going further in this direction
\(x_e = \bar{x} + \gamma(x_r - \bar{x})\) where \(\gamma = 2.0\)
\(x_e = (0.5, 0.5) + 2.0 \cdot [(1.0, 1.0) - (0.5, 0.5)]\)
\(x_e = (0.5, 0.5) + 2.0 \cdot (0.5, 0.5)\)
\(x_e = (0.5, 0.5) + (1.0, 1.0)\)
\(x_e = (1.5, 1.5)\)

Evaluate: \(f(x_e) = (1.5-2)^2 + (1.5-1)^2\)
\(f(x_e) = 0.25 + 0.25\)
\(f(x_e) = 0.5\)
βœ“βœ“ EXPANSION ACCEPTED! \((0.5 < 1.0)\) Expansion is even better!
πŸ“Š ITERATION 1 RESULT:
Old worst vertex \(A(0.0, 0.0)\) with \(f=5.0\)
β†’ Replaced with \(E(1.5, 1.5)\) with \(f=0.5\)

New Simplex:
β€’ \(E(1.5, 1.5)\) \(f=0.5\) βœ“ NEW BEST
β€’ \(B(1.0, 0.0)\) \(f=2.0\)
β€’ \(C(0.0, 1.0)\) \(f=4.0\)
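The reflection and expansion arithmetic of Iteration 1 can be checked in a few lines of Python:

```python
# Objective from the walkthrough: f(x1, x2) = (x1-2)^2 + (x2-1)^2
def f(p):
    return (p[0] - 2) ** 2 + (p[1] - 1) ** 2

A, B, C = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)   # worst, best, middle vertices
centroid = ((B[0] + C[0]) / 2, (B[1] + C[1]) / 2)        # -> (0.5, 0.5)

# Reflection (alpha = 1): move the worst point A through the centroid
x_r = tuple(c + 1.0 * (c - a) for c, a in zip(centroid, A))  # -> (1.0, 1.0), f = 1.0

# Expansion (gamma = 2): the reflected point beat the best, so go further
x_e = tuple(c + 2.0 * (r - c) for c, r in zip(centroid, x_r))  # -> (1.5, 1.5), f = 0.5
```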

πŸ”„ ITERATION 2: Continuing Convergence

Current Simplex State:
Best: \(E(1.5, 1.5)\) \(f = 0.5\)
Middle: \(B(1.0, 0.0)\) \(f = 2.0\)
Worst: \(C(0.0, 1.0)\) \(f = 4.0\)
Reflection Calculation:
Centroid of E and B:
\(\bar{x} = \frac{(1.5, 1.5) + (1.0, 0.0)}{2} = (1.25, 0.75)\)

\(x_r = (1.25, 0.75) + 1.0 \cdot [(1.25, 0.75) - (0.0, 1.0)]\)
\(x_r = (1.25, 0.75) + (1.25, -0.25)\)
\(x_r = (2.5, 0.5)\)

\(f(x_r) = (2.5-2)^2 + (0.5-1)^2\)
\(f(x_r) = 0.25 + 0.25\)
\(f(x_r) = 0.5\)
⚑ ITERATION 2 DECISION:
\(f(x_r) = 0.5 = f(\text{best})\)
Reflection equals current best β†’ βœ“ Accept reflection

Simplex continues to improve toward \((2.0, 1.0)\)

βœ… CONVERGENCE: After 5-8 Iterations

| Iteration | Best Vertex | \(f(x)\) Best | Distance to Optimum |
|---|---|---|---|
| 0 | \((1.0, 0.0)\) | 2.0 | 1.414 |
| 1 | \((1.5, 1.5)\) | 0.5 | 0.707 |
| 2-3 | \(\approx(1.8, 0.9)\) | 0.05 | 0.2 |
| 5-8 | \((2.0, 1.0)\) | 0.0 | 0.0 |
🎯 OPTIMAL SOLUTION FOUND!
Final Solution: \(x^* = (2.0, 1.0)\)
Minimum Value: \(f(x^*) = 0.0\)
Convergence: 5-8 iterations
All vertices clustered around optimal point
πŸ’‘ How Simplex Evolves: Triangle shrinks and moves toward optimum through systematic reflection, expansion, and contraction. Each operation strategically repositions worst vertex to explore function landscape efficiently.
⚡ Computational Efficiency: no gradients required, only function evaluations. Well suited to non-smooth, black-box optimization, although efficiency degrades quickly as the dimension grows.

Advantages & Disadvantages:

βœ… Advantages

  • No gradients needed (black-box)
  • Simple logic, easy to implement
  • Works on non-smooth functions
  • Handles noisy functions
  • Intuitive geometric interpretation

⚠️ Disadvantages

  • Slow in high dimensions (n > 50)
  • May converge to non-stationary points
  • No convergence rate guarantees
  • Sensitive to initial simplex
  • Can degenerate (simplex collapses)

Interactive Visualization: Simplex Movement

Objective: Minimize \(f(x_1, x_2) = (x_1-2)^2 + (x_2-1)^2\)


3.2 Hooke-Jeeves Pattern Search

Pattern search alternates between exploratory moves (coordinate search) and pattern moves (exploiting successful directions).

Algorithm: Hooke-Jeeves
1. Initialize base point \(x_0\), step size \(\Delta\)
2. for k = 0, 1, 2, ... do
3.   Exploratory search: for each coordinate i = 1 to n,
       try \(x + \Delta e_i\) and \(x - \Delta e_i\), keeping the best improvement
4.   Let \(x_{new}\) be the result
5.   if improvement found then
6.     Pattern move: \(x_{next} = x_{new} + (x_{new} - x_k)\)
7.     Set \(x_{k+1} = x_{next}\)
8.   else
9.     Reduce step size: \(\Delta \leftarrow \Delta/2\)
10.  end if
11. end for
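A compact Python sketch of the two-phase strategy, in a simplified variant where the pattern move is kept only if a follow-up exploration around it improves on the exploratory point:

```python
def explore(f, x, delta):
    """Exploratory phase: try +/- delta along each coordinate, keep improvements."""
    x = list(x)
    for i in range(len(x)):
        for step in (delta, -delta):
            cand = list(x)
            cand[i] += step
            if f(cand) < f(x):
                x = cand
                break
    return x

def hooke_jeeves(f, x0, delta=0.5, eps=1e-6, max_iter=10000):
    """Hooke-Jeeves pattern search: exploratory moves plus pattern moves."""
    base = list(x0)
    while delta > eps and max_iter > 0:
        max_iter -= 1
        trial = explore(f, base, delta)
        if f(trial) < f(base):
            # Pattern move: jump along the successful direction, then re-explore
            pattern = [t + (t - b) for t, b in zip(trial, base)]
            probe = explore(f, pattern, delta)
            base = probe if f(probe) < f(trial) else trial
        else:
            delta /= 2  # no improvement: refine the mesh
    return base

# Test problem from the demo above: f(x1, x2) = (x1-2)^2 + 4(x2-1)^2
x_min = hooke_jeeves(lambda p: (p[0] - 2) ** 2 + 4 * (p[1] - 1) ** 2, [0.0, 0.0])
```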

Interactive Visualization: Hooke-Jeeves Pattern Search

Objective: Minimize \(f(x_1, x_2) = (x_1-2)^2 + 4(x_2-1)^2\)

📊 Setup: Hooke-Jeeves starts at the origin \((0.00, 0.00)\), where \(f = 8.0000\), with initial step size \(\Delta = 0.5\); the distance from the start to the optimum \((2, 1)\) is 2.24.

πŸ“ Hooke-Jeeves Algorithm: Two-Phase Strategy

⭐ Phase 1: Exploratory
  • Try movement Β±\(\Delta\) for each coordinate
  • Test: \(x \pm \Delta \cdot e_i\)
  • Keep best improvement found
  • Success β†’ triggers pattern phase
⭐ Phase 2: Pattern
  • Pattern direction: \(p = x_{\text{new}} - x_{\text{base}}\)
  • Accelerate: try \(x + p\)
  • If better β†’ accept, continue
  • If worse β†’ reduce \(\Delta\)

βœ… When to Use Direct Methods

  • Black-box functions: Simulations, external codes
  • Non-smooth functions: Gradient undefined or discontinuous
  • Noisy functions: Stochastic simulations, experiments
  • Low dimensions: Generally n < 20 for good performance

Comparison of Direct Methods

| Method | Strategy | Convergence | Best For |
|---|---|---|---|
| Nelder-Mead | Simplex transformation | No guarantee | Smooth, low-dim (n < 10) |
| Hooke-Jeeves | Coordinate + pattern search | Convergent under conditions | Medium-dim (n < 20) |
| Powell's Method | Conjugate directions | Exact in n steps on quadratics | Quadratic-like functions |

🎓 Second-Order Methods

Second-order methods use curvature information (Hessian matrix) to achieve faster convergence than first-order gradient methods.

4.1 Newton's Method

Newton's method uses both gradient and Hessian to make quadratically convergent steps:

\[x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)\]
Algorithm: Newton's Method
1. Initialize \(x_0\), set tolerance \(\epsilon\)
2. for k = 0, 1, 2, ... do
3. Compute gradient: \(g_k = \nabla f(x_k)\)
4. Compute Hessian: \(H_k = \nabla^2 f(x_k)\)
5. Solve linear system: \(H_k p_k = -g_k\)
6. Update: \(x_{k+1} = x_k + p_k\)
7. if \(\|g_k\| < \epsilon\) then return \(x_k\)
8. end for
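As a minimal 1-D sketch of the algorithm above: divide the gradient by the curvature instead of multiplying by a fixed learning rate. On the quadratic example of this section the very first step lands on the minimizer:

```python
def newton_1d(grad, hess, x0, eps=1e-8, max_iter=50):
    """Newton's method in one dimension: step = -gradient / curvature."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < eps:
            break
        x -= g / hess(x)   # full Newton step, no learning rate to tune
    return x

# f(x) = x^2 - 4x + 3: f'(x) = 2x - 4, f''(x) = 2
x_star = newton_1d(lambda x: 2 * x - 4, lambda x: 2.0, 5.0)  # -> 2.0 in one step
```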

Numerical Example: Newton's Quadratic Convergence

Problem: \(f(x) = x^2 - 4x + 3\)

Derivatives: \(f'(x) = 2x - 4\), \(f''(x) = 2\)

| Iteration k | \(x_k\) | \(f(x_k)\) | \(\nabla f(x_k)\) | \(\nabla^2 f(x_k)\) | \(x_{k+1}\) |
|---|---|---|---|---|---|
| 0 | 5.000 | 8.000 | 6.000 | 2.000 | 2.000 |
| 1 | 2.000 | -1.000 | 0.000 | 2.000 | 2.000 (converged) |

🎯 Calculation Detail for Iteration 0:

\(x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)} = 5.0 - \frac{6.0}{2.0} = 5.0 - 3.0 = 2.0\)

Result: Newton's method converges in just 1 iteration for quadratic functions!

βœ… Advantages of Newton's Method

  • Quadratic convergence: Error decreases as \(\|e_{k+1}\| \approx C\|e_k\|^2\)
  • One-step convergence for quadratic functions
  • No step size tuning: Uses full Newton step
  • Scale invariant: Not affected by coordinate scaling

⚠️ Computational Challenges

  • O(nΒ³) cost to compute and invert Hessian
  • O(nΒ²) memory to store Hessian
  • Hessian may not be positive definite far from minimum
  • Requires exact second derivatives

Example: Newton vs Gradient Descent

Function: \(f(x) = 0.5x^2 - 2x\) | Derivative: \(f'(x) = x - 2\) | Second Derivative: \(f''(x) = 1\)

📊 Convergence speed: starting from x = 0 (where f(0) = 0), Newton's method jumps to the minimum x = 2, f(2) = -2 in a single iteration, while gradient descent with α = 0.2 needs about 8 iterations to get close.

⚑ Newton's Method

Update Formula:
\(x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}\)

Convergence Rate:
Quadratic (very fast)

Cost per Iteration:
\(O(n^3)\) - Compute Hessian inverse

Requirements:
First AND second derivatives needed

πŸ“‰ Gradient Descent

Update Formula:
\(x_{k+1} = x_k - \alpha \cdot f'(x_k)\)

Convergence Rate:
Linear (slower)

Cost per Iteration:
\(O(n)\) - Only compute gradient

Requirements:
Only first derivative needed

4.2 Quasi-Newton Methods: BFGS

BFGS (Broyden-Fletcher-Goldfarb-Shanno) approximates the Hessian inverse using gradient information, avoiding expensive Hessian computation.

\[x_{k+1} = x_k - \alpha_k B_k^{-1} \nabla f(x_k)\]

where \(B_k\) is updated using the BFGS formula:

\[B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}\]

with \(s_k = x_{k+1} - x_k\) and \(y_k = \nabla f(x_{k+1}) - \nabla f(x_k)\).

Algorithm: BFGS
1. Initialize \(x_0\), \(B_0 = I\) (identity matrix)
2. for k = 0, 1, 2, ... do
3. Compute search direction: \(p_k = -B_k^{-1} \nabla f(x_k)\)
4. Line search: find \(\alpha_k\) satisfying Wolfe conditions
5. Update: \(x_{k+1} = x_k + \alpha_k p_k\)
6. Compute \(s_k = x_{k+1} - x_k\), \(y_k = \nabla f(x_{k+1}) - \nabla f(x_k)\)
7. Update BFGS approximation: \(B_{k+1}\)
8. end for

πŸ”‘ Why BFGS is Popular

  • Superlinear convergence: Faster than gradient methods
  • O(nΒ²) complexity: Much cheaper than Newton
  • Automatically maintains positive definiteness
  • Default choice for medium-scale unconstrained optimization

4.3 Conjugate Gradient Method

The Conjugate Gradient (CG) method generates search directions that are conjugate with respect to the Hessian, achieving optimal convergence for quadratic functions.

\[x_{k+1} = x_k + \alpha_k p_k\] \[p_{k+1} = -\nabla f(x_{k+1}) + \beta_k p_k\]

Common choices for \(\beta_k\):

\[\beta_k^{FR} = \frac{\|\nabla f(x_{k+1})\|^2}{\|\nabla f(x_k)\|^2} \quad \text{(Fletcher-Reeves)} \qquad \beta_k^{PR} = \frac{\nabla f(x_{k+1})^T \left(\nabla f(x_{k+1}) - \nabla f(x_k)\right)}{\|\nabla f(x_k)\|^2} \quad \text{(Polak-Ribière)}\]

βœ… Conjugate Gradient Advantages

  • Exact convergence in n steps for n-dimensional quadratic functions
  • O(n) memory: Only stores vectors, not matrices
  • Ideal for large-scale problems (n > 1000)
  • No matrix operations: Only needs gradient evaluations

📊 Method Comparison & Selection Guide

Choosing the right optimization method is critical for success. This section provides comprehensive guidance on selecting among the methods covered in Part 1.

Comprehensive Algorithm Comparison

| Method | Gradient | Hessian | Convergence | Cost/Iter | Memory | Best For |
|---|---|---|---|---|---|---|
| Gradient Descent | ✓ | ✗ | Linear (O(1/k) convex) | O(n) | O(n) | Large-scale, smooth, convex |
| Momentum | ✓ | ✗ | Linear (faster) | O(n) | O(n) | Ill-conditioned, ravines |
| NAG | ✓ | ✗ | Optimal O(1/k²) | O(n) | O(n) | Convex, high performance |
| Adam | ✓ | ✗ | Adaptive | O(n) | O(n) | Deep learning, sparse data |
| Newton | ✓ | ✓ | Quadratic | O(n³) | O(n²) | Small n, high accuracy |
| BFGS | ✓ | Approx. | Superlinear | O(n²) | O(n²) | Medium-scale (n = 100-10,000) |
| Conjugate Gradient | ✓ | Via products | n steps (quadratic f) | O(n) | O(n) | Large quadratic systems |
| Nelder-Mead | ✗ | ✗ | Sublinear | O(n) | O(n) | Black-box, n < 20, non-smooth |
| Hooke-Jeeves | ✗ | ✗ | Sublinear | O(n) | O(n) | Separable, n < 50 |

Decision Flowchart: Which Method to Use?

START: Analyze your problem.

1. Can you compute gradients efficiently?
   • NO → use direct methods:
     - n < 10: Nelder-Mead
     - n < 50: Hooke-Jeeves
     - otherwise: Part 2 methods
   • YES → continue with gradient-based methods.
2. What is the problem dimension n?
   • Small (n < 1,000): Newton or BFGS (if the Hessian is cheap)
   • Medium (1K-10K): BFGS or Conjugate Gradient
   • Large (n > 10K): Gradient Descent or Momentum
3. Is the function convex?
   • YES: NAG for the fastest rate, or any gradient method
   • NO: Momentum/Adam, or try Part 2 methods

Quick Selection Guidelines

βœ… Use Gradient Methods When:

  • Function is smooth and differentiable
  • Gradients are cheap to compute (analytic derivatives)
  • Need guaranteed convergence to local minimum
  • Problem is large-scale (n > 100)
  • Convex objective (global optimum guaranteed)

⚠️ Use Direct Methods When:

  • Gradients unavailable or very expensive
  • Function is noisy or non-smooth
  • Black-box optimization (simulation)
  • Small dimension (n < 20)
  • Quick prototyping needed

Practical Recommendations

πŸ”‘ Key Recommendations

  1. Start simple: Try gradient descent first if gradients available
  2. Add momentum: If convergence is slow or oscillatory
  3. Use Adam for ML: Default choice for training neural networks
  4. BFGS for medium-scale: When n = 100-10,000 and you need fast convergence
  5. Nelder-Mead for black-box: When n < 20 and no gradients
  6. Multiple restarts: Run from different initial points to avoid local minima
  7. Monitor convergence: Always plot objective vs iteration
  8. For global optimization: See Part 2 for population-based methods

⚠️ Common Pitfalls to Avoid

  • Learning rate too large: Causes divergence in gradient methods
  • Learning rate too small: Wastes time with slow convergence
  • Ignoring scaling: Normalize inputs for better conditioning
  • No convergence check: Always test stopping criteria
  • Wrong method for problem type: Use decision flowchart above
  • Single initial point: May find poor local minimum in non-convex problems

Typical Hyperparameters

| Method | Parameter | Typical Range | Recommended Start |
|---|---|---|---|
| Gradient Descent | Learning rate α | 0.001 - 0.1 | 0.01 (tune with experiments) |
| Momentum | β (momentum) | 0.5 - 0.99 | 0.9 |
| NAG | β (momentum) | 0.5 - 0.99 | 0.9 |
| Adam | β₁, β₂ | — | 0.9, 0.999 (rarely changed) |
| BFGS | Line search tol | 10⁻⁴ - 10⁻⁸ | 10⁻⁶ |
| Nelder-Mead | α, γ, ρ, σ | — | 1.0, 2.0, 0.5, 0.5 (standard) |

✍️ Practice Exercises

Apply what you've learned with these interactive exercises. Experiment with parameters and observe how different methods behave.

Exercise 1: Rosenbrock's Banana Function

The Rosenbrock function is a classic optimization benchmark with a curved valley:

\(f(x, y) = (1-x)^2 + 100(y-x^2)^2\)

Minimum at: \((1, 1)\) with \(f(1, 1) = 0\)
Challenge: The curved valley makes this hard for gradient descent

Try Different Methods:


Exercise 2: Learning Rate Tuning Challenge

Find the optimal learning rate for gradient descent on:

\(f(x) = x^4 - 3x^3 + 2x^2\)

Optimal learning rate: ~0.05-0.15 for smooth convergence

💡 Hint: too small → slow convergence; too large → divergence.

Key Takeaways from Exercises

πŸ”‘ What You Should Have Learned:

  1. Rosenbrock: Curved valleys challenge gradient methods; momentum helps significantly
  2. Parameter tuning: Learning rate dramatically affects convergence; too large causes divergence
  3. No universal winner: Different problems require different methods

βœ… Practice Recommendations

Summary: What You've Learned in Part 1

✓ Gradient-Based Methods: gradient descent, momentum, NAG, and the adaptive methods Adam, RMSprop, and AdaGrad

✓ Direct Search Methods: Nelder-Mead simplex and Hooke-Jeeves pattern search, which require only function evaluations

✓ Second-Order Methods: Newton's method, BFGS, and conjugate gradient, which exploit curvature information

➜ What's missing? Constrained and global optimization, including population-based methods, covered in Part 2.