📚 Chapter V - Part 1: Gradient & Direct Search Methods

Course Author: Dr. Mohamed Abdou SOUIDI
Academic Year: 2025-2026

🎯 Introduction to Unconstrained Optimization

What is Unconstrained Optimization?

Unconstrained optimization is the mathematical problem of finding the minimum (or maximum) of a function without any constraints on the variables. The general formulation is:

\[\min_{x \in \mathbb{R}^n} f(x)\]

where \(f: \mathbb{R}^n \rightarrow \mathbb{R}\) is the objective function we want to minimize.

πŸ”‘ Key Concept

At a local minimum \(x^*\), the gradient must be zero: \(\nabla f(x^*) = 0\) (first-order necessary condition).

Real-World Applications

πŸ€– Machine Learning

Training neural networks by minimizing loss functions (cross-entropy, MSE)

πŸ“ˆ Finance

Portfolio optimization to maximize returns while managing risk

🏭 Engineering Design

Minimizing cost, weight, or material usage in structural design

πŸ”¬ Scientific Computing

Parameter estimation and curve fitting in data analysis

Key Mathematical Concepts

1. Gradient (βˆ‡f)

The gradient is a vector of partial derivatives indicating the direction of steepest ascent:

\[\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}\]

2. Hessian (βˆ‡Β²f)

The Hessian is a matrix of second-order partial derivatives describing the curvature:

\[\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}\]

Optimality Conditions

βœ… First-Order Necessary Condition

If \(x^*\) is a local minimum, then \(\nabla f(x^*) = 0\)

βœ… Second-Order Sufficient Condition

If \(\nabla f(x^*) = 0\) and \(\nabla^2 f(x^*)\) is positive definite, then \(x^*\) is a strict local minimum

📊 Gradient Methods

2.1 Gradient Descent

Gradient descent is the most fundamental optimization algorithm. It iteratively moves in the direction opposite to the gradient:

\[x_{k+1} = x_k - \alpha \nabla f(x_k)\]

where \(\alpha > 0\) is the learning rate (step size).

Algorithm: Gradient Descent
1. Initialize \(x_0\), set learning rate \(\alpha\), tolerance \(\epsilon\)
2. for k = 0, 1, 2, ... do
3. Compute gradient \(g_k = \nabla f(x_k)\)
4. if \(\|g_k\| < \epsilon\) then
5. return \(x_k\) (converged)
6. end if
7. Update: \(x_{k+1} = x_k - \alpha g_k\)
8. end for
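The loop above translates almost line for line into Python. This is a minimal 1-D sketch (step size and tolerance chosen for illustration), using the quadratic \(f(x) = x^2 - 4x + 3\) that serves as the running example in this chapter:

```python
def gradient_descent(f_grad, x0, alpha=0.2, eps=1e-6, max_iter=1000):
    """Minimize a 1-D function given its gradient, following the loop above."""
    x = x0
    for _ in range(max_iter):
        g = f_grad(x)
        if abs(g) < eps:      # converged: gradient (nearly) zero
            break
        x = x - alpha * g     # step against the gradient
    return x

# f(x) = x^2 - 4x + 3 has gradient f'(x) = 2x - 4 and minimizer x* = 2
x_star = gradient_descent(lambda x: 2 * x - 4, x0=5.0)
```

The same function works for any differentiable 1-D objective; only the gradient callable changes.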

Interactive Visualization: Simple 1D Function

Explore the function: \(f(x) = x^2 - 4x + 3\)

Function Analysis

Function: f(x) = x² - 4x + 3
Derivative (gradient): f'(x) = 2x - 4
At the sample point x = 2.5: f(2.5) = -0.75 and f'(2.5) = 1.00, so the gradient is positive and the descent direction points left (distance to the optimum: 0.50).
✓ Optimal point: x* = 2, f(x*) = -1; the gradient is zero at the minimum: f'(2) = 0.

💡 Tip: as x moves along the axis, the gradient changes sign and magnitude. At the minimum (x = 2), the gradient equals zero.

Numerical Example: Step-by-Step Calculation

Problem: Minimize \(f(x) = x^2 - 4x + 3\) with \(f'(x) = 2x - 4\)

Parameters: \(x_0 = 5.0\), \(\alpha = 0.2\) (learning rate)

πŸ’‘ Step-by-Step Calculations

Iteration 0:

  • \(x_0 = 5.0\)
  • \(f(x_0) = 5^2 - 4(5) + 3 = 25 - 20 + 3 = 8.0\)
  • \(\nabla f(x_0) = 2(5) - 4 = 6.0\)
  • \(x_1 = x_0 - \alpha \nabla f(x_0) = 5.0 - 0.2(6.0) = 5.0 - 1.2 = 3.8\)

Iteration 1:

  • \(x_1 = 3.8\)
  • \(f(x_1) = 3.8^2 - 4(3.8) + 3 = 14.44 - 15.2 + 3 = 1.76\)
  • \(\nabla f(x_1) = 2(3.8) - 4 = 3.6\)
  • \(x_2 = 3.8 - 0.2(3.6) = 3.8 - 0.72 = 3.08\)

| Iteration k | \(x_k\) | \(f(x_k)\) | \(\nabla f(x_k)\) | \(x_{k+1}\) |
|---|---|---|---|---|
| 0 | 5.000 | 8.000 | 6.000 | 3.800 |
| 1 | 3.800 | 2.240 | 3.600 | 3.080 |
| 2 | 3.080 | 0.166 | 2.160 | 2.648 |
| 3 | 2.648 | -0.580 | 1.296 | 2.389 |
| 4 | 2.389 | -0.849 | 0.778 | 2.233 |

βœ“ Converges to optimal solution: \(x^* = 2.0\), \(f(x^*) = -1.0\)

Interactive Demo: Effect of Learning Rate

Function: \(f(x) = x^2 - 4x + 3\), Starting point: \(x_0 = 5.0\)

Algorithm Status

Learning rate: α = 0.20 — ✓ converged in 13 iterations to x = 2.00 with f(x) = -1.00.
📊 Recommendation: this rate gives a good balance between speed and stability.

Convergence Comparison

| α value | Iterations | Behavior | Status |
|---|---|---|---|
| 0.05 (too small) | ~30 | Very slow | ⚠ Inefficient |
| 0.20 (optimal) | 13 | Fast, stable | ✓ Recommended |
| 0.60 (too large) | ~4 | Oscillates | ⚠ Unstable |

πŸ”‘ Key Insights about Learning Rate

  • Too small (Ξ± < 0.1): Slow convergence, many iterations needed
  • Optimal (Ξ± β‰ˆ 0.2): Fast, stable convergence
  • Too large (Ξ± > 0.5): Oscillations or divergence


2.2 Gradient Descent with Momentum

Momentum accelerates gradient descent by accumulating a velocity vector that dampens oscillations and speeds up convergence in consistent directions.

\[v_{k+1} = \beta v_k - \alpha \nabla f(x_k)\] \[x_{k+1} = x_k + v_{k+1}\]

where \(\beta \in [0, 1)\) is the momentum coefficient (typically 0.9).

Algorithm: Momentum-based Gradient Descent
1. Initialize \(x_0\), \(v_0 = 0\), set \(\alpha\), \(\beta\)
2. for k = 0, 1, 2, ... do
3. Compute gradient \(g_k = \nabla f(x_k)\)
4. Update velocity: \(v_{k+1} = \beta v_k - \alpha g_k\)
5. Update position: \(x_{k+1} = x_k + v_{k+1}\)
6. end for
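The two-line update translates directly into Python. This is a minimal 1-D sketch; the momentum coefficient is set to β = 0.3 because a well-conditioned 1-D quadratic overshoots with the usual β = 0.9, which is better suited to ill-conditioned problems:

```python
def momentum_gd(f_grad, x0, alpha=0.2, beta=0.3, n_steps=20):
    """Gradient descent with momentum (heavy ball), following the loop above."""
    x, v = x0, 0.0
    for _ in range(n_steps):
        v = beta * v - alpha * f_grad(x)  # accumulate velocity
        x = x + v                         # move by the velocity
    return x

# Same test problem: f(x) = x^2 - 4x + 3 with gradient 2x - 4
x_star = momentum_gd(lambda x: 2 * x - 4, x0=5.0)
```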

βœ… Advantages of Momentum

  • Accelerates convergence in ravine-like regions
  • Dampens oscillations across narrow valleys
  • Helps escape shallow local minima
  • Typically uses \(\beta = 0.9\) for good results

Numerical Example: Momentum vs Standard GD

Problem: same function \(f(x) = x^2 - 4x + 3\), \(x_0 = 5.0\), \(\alpha = 0.2\). Because this 1-D problem is well conditioned, a large momentum coefficient such as \(\beta = 0.9\) would overshoot and oscillate, so \(\beta = 0.3\) is used here.

| k | Standard GD \(x_k\) | Standard GD \(f(x_k)\) | Momentum \(x_k\) | Momentum \(v_k\) | Momentum \(f(x_k)\) |
|---|---|---|---|---|---|
| 0 | 5.000 | 8.000 | 5.000 | 0.000 | 8.000 |
| 1 | 3.800 | 2.240 | 3.800 | -1.200 | 2.240 |
| 2 | 3.080 | 0.166 | 2.720 | -1.080 | -0.482 |
| 3 | 2.648 | -0.580 | 2.108 | -0.612 | -0.988 |
| 4 | 2.389 | -0.849 | 1.881 | -0.227 | -0.986 |

🚀 Momentum converges roughly twice as fast here: it comes within 0.02 of the minimum value \(f^* = -1\) by iteration 3, versus iteration 6 for standard GD.

2.3 Nesterov Accelerated Gradient (NAG)

NAG improves upon classical momentum by computing the gradient at the "lookahead" position:

\[v_{k+1} = \beta v_k - \alpha \nabla f(x_k + \beta v_k)\] \[x_{k+1} = x_k + v_{k+1}\]

πŸ”‘ The "Lookahead" Concept

NAG evaluates the gradient not at the current position, but at the position where momentum would take us. This provides better anticipation of the landscape ahead, often leading to faster convergence.

2.4 Modern Adaptive Methods

Adam (Adaptive Moment Estimation)

Combines momentum with adaptive learning rates for each parameter:

\[m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t\] \[v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2\] \[\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}\] \[x_{t+1} = x_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\]

where \(\hat{m}_t\) and \(\hat{v}_t\) are the bias-corrected moment estimates.

Default parameters: \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\)
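A 1-D Python sketch of the update rules above. Here α = 0.1 is chosen for this toy problem rather than the common deep-learning default of 0.001, which would be needlessly slow:

```python
import math

def adam(f_grad, x0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=5000):
    """Adam in one dimension, following the update rules above."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, n_steps + 1):
        g = f_grad(x)
        m = beta1 * m + (1 - beta1) * g        # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (mean of squares)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Same test problem: f(x) = x^2 - 4x + 3 with gradient 2x - 4
x_star = adam(lambda x: 2 * x - 4, x0=5.0)
```

On this deterministic quadratic Adam is not the fastest choice; its strengths show on noisy, sparse, high-dimensional gradients.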

RMSprop (Root Mean Square Propagation)

Adapts learning rate based on recent gradient magnitudes:

\[E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta)g_t^2\] \[x_{t+1} = x_t - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} g_t\]

Typical value: \(\beta = 0.9\)

AdaGrad (Adaptive Gradient)

Accumulates squared gradients for parameter-specific learning rates:

\[G_t = G_{t-1} + g_t^2\] \[x_{t+1} = x_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} g_t\]

Note: Learning rate decreases over time; good for sparse data

When to Use Each Method?

| Method | Best Use Case | Pros | Cons |
|---|---|---|---|
| Gradient Descent | Simple, smooth functions | Simple, guaranteed convergence | Slow, sensitive to learning rate |
| Momentum | Ill-conditioned problems | Faster than GD, dampens oscillations | Adds hyperparameter \(\beta\) |
| NAG | When momentum is too aggressive | Better anticipation, faster convergence | Slightly more complex |
| Adam | Deep learning (default choice) | Adaptive, robust, little tuning needed | May not converge in some cases |

πŸ”Direct Methods (Derivative-Free)

πŸ“– What Are Direct Methods?

Direct methods (also called derivative-free or black-box methods) optimize functions without computing gradients. They only require function evaluations: f(x). This is essential when:

  • Gradients are unavailable or expensive to compute
  • Function is non-smooth or noisy
  • Function is a black-box simulation
  • Automatic differentiation is not feasible

3.1 Nelder-Mead Simplex Method

Core Concept:

πŸ”Ί What is a Simplex?

A simplex is a geometric shape with n+1 vertices in n-dimensional space:

  • 1D: 2 points on a line
  • 2D: Triangle with 3 vertices
  • 3D: Tetrahedron with 4 vertices

The simplex evolves through the search space using 4 operations: Reflection, Expansion, Contraction, and Shrink.

The Four Operations (DETAILED EXPLANATION):

1️⃣ REFLECTION (Ξ± = 1.0)

\[x_r = \bar{x} + \alpha(\bar{x} - x_{n+1})\] where xΜ„ is the centroid of the best n points

Interpretation: Move the worst point through the centroid in the opposite direction

When to use: Always try first

Success condition: f(x_best) ≀ f(x_r) < f(x_second_worst)

Action if successful: Replace worst vertex with reflected point

2️⃣ EXPANSION (Ξ³ = 2.0)

\[x_e = \bar{x} + \gamma(x_r - \bar{x})\]

Interpretation: Reflection was very good, so go even further in that direction

When to use: If reflected point is better than current best (f(x_r) < f(x_best))

Success condition: f(x_e) < f(x_r)

Action if successful: Replace worst with expanded point; otherwise use reflected point

3️⃣ CONTRACTION (ρ = 0.5)

\[x_c = \bar{x} + \rho(x_{n+1} - \bar{x})\]

Interpretation: Reflection failed, so move closer to centroid but not full retreat

When to use: If reflected point is poor (f(x_r) β‰₯ f(x_second_worst))

Success condition: f(x_c) < f(x_{n+1})

Action if successful: Replace worst with contracted point

4️⃣ SHRINK (Οƒ = 0.5)

\[x_i = x_{best} + \sigma(x_i - x_{best}) \quad \text{for all } i\]

Interpretation: Nothing worked, so scale the entire simplex toward the best point

When to use: Contraction failed (last resort)

Result: All vertices move closer to the best vertex

Effect: Reduces simplex size, enabling finer search

Complete Pseudocode:

Algorithm: Nelder-Mead Simplex Method
1. INITIALIZATION:
     Create simplex S = {x₁, x₂, ..., x_{n+1}}
     Evaluate f(x_i) for all vertices
     Set parameters: α = 1.0, γ = 2.0, ρ = 0.5, σ = 0.5
2. MAIN LOOP (while not converged):
     a. SORT: order vertices by f-value so that
          x₁ = best (lowest f), x_{n+1} = worst (highest f)
     b. CENTROID: compute x̄ = (1/n)·Σ(x₁ ... x_n)
     c. REFLECTION: x_r = x̄ + α(x̄ - x_{n+1}); evaluate f(x_r)
     d. DECISION LOGIC:
          IF f(x₁) ≤ f(x_r) < f(x_n):
               replace x_{n+1} with x_r, CONTINUE
          ELSE IF f(x_r) < f(x₁):   (very good!)
               EXPANSION: x_e = x̄ + γ(x_r - x̄)
               IF f(x_e) < f(x_r): replace x_{n+1} with x_e
               ELSE: replace x_{n+1} with x_r
          ELSE:   (reflection failed)
               CONTRACTION: x_c = x̄ + ρ(x_{n+1} - x̄)
               IF f(x_c) < f(x_{n+1}): replace x_{n+1} with x_c
               ELSE: SHRINK all vertices toward x₁:
                    x_i ← x₁ + σ(x_i - x₁) for i = 2 ... n+1
     e. CONVERGENCE CHECK: IF max_i |f(x_i) - f(x̄)| < ε, STOP
3. OUTPUT: return x₁ (best vertex)

Detailed Numerical Example: Step-by-Step Walkthrough

Problem: Minimize \(f(x_1, x_2) = (x_1-2)^2 + (x_2-1)^2\)
Optimal Point: \((2, 1)\) with \(f(2,1) = 0\)
Algorithm: Nelder-Mead Simplex Method

πŸ”’ INITIALIZATION: Setting up Initial Simplex

Goal: Start with 3 vertices representing candidate solutions in \(\mathbb{R}^2\)

| Vertex | Position | \(f(x)\) | Status |
|---|---|---|---|
| A | \((0.0, 0.0)\) | 5.0 | ✗ Worst |
| B | \((1.0, 0.0)\) | 2.0 | ✓ Best |
| C | \((0.0, 1.0)\) | 4.0 | Middle |

📍 Initial observation: vertex B at \((1.0, 0.0)\) gives the lowest function value (2.0), making it the current best. Vertex A at \((0.0, 0.0)\) is the worst with \(f = 5.0\), so it will be moved in the next iteration.

πŸ”„ ITERATION 1: Reflection and Expansion

Step 1️⃣ - Sort Vertices by Function Value:
Best: \(B(1.0, 0.0)\) β†’ \(f = 2.0\) βœ“
Middle: \(C(0.0, 1.0)\) β†’ \(f = 4.0\)
Worst: \(A(0.0, 0.0)\) β†’ \(f = 5.0\) βœ—
Step 2️⃣ - Compute Centroid of Best Two Vertices:
\(\bar{x} = \frac{B + C}{2}\)
\(\bar{x} = \frac{(1.0, 0.0) + (0.0, 1.0)}{2}\)
\(\bar{x} = \frac{(1.0, 1.0)}{2}\)
\(\bar{x} = (0.5, 0.5)\)
Step 3️⃣ - REFLECTION: Move worst point through centroid
\(x_r = \bar{x} + \alpha(\bar{x} - A)\) where \(\alpha = 1.0\)
\(x_r = (0.5, 0.5) + 1.0 \cdot [(0.5, 0.5) - (0.0, 0.0)]\)
\(x_r = (0.5, 0.5) + (0.5, 0.5)\)
\(x_r = (1.0, 1.0)\)

Evaluate: \(f(x_r) = (1.0-2)^2 + (1.0-1)^2\)
\(f(x_r) = 1 + 0 = 1.0\)
βœ“ REFLECTION ACCEPTED! \((1.0 < 2.0)\) Reflected point is better than best!
Step 4️⃣ - EXPANSION: Try going further in this direction
\(x_e = \bar{x} + \gamma(x_r - \bar{x})\) where \(\gamma = 2.0\)
\(x_e = (0.5, 0.5) + 2.0 \cdot [(1.0, 1.0) - (0.5, 0.5)]\)
\(x_e = (0.5, 0.5) + 2.0 \cdot (0.5, 0.5)\)
\(x_e = (0.5, 0.5) + (1.0, 1.0)\)
\(x_e = (1.5, 1.5)\)

Evaluate: \(f(x_e) = (1.5-2)^2 + (1.5-1)^2\)
\(f(x_e) = 0.25 + 0.25\)
\(f(x_e) = 0.5\)
βœ“βœ“ EXPANSION ACCEPTED! \((0.5 < 1.0)\) Expansion is even better!
πŸ“Š ITERATION 1 RESULT:
Old worst vertex \(A(0.0, 0.0)\) with \(f=5.0\)
β†’ Replaced with \(E(1.5, 1.5)\) with \(f=0.5\)

New Simplex:
β€’ \(E(1.5, 1.5)\) \(f=0.5\) βœ“ NEW BEST
β€’ \(B(1.0, 0.0)\) \(f=2.0\)
β€’ \(C(0.0, 1.0)\) \(f=4.0\)
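The reflection and expansion arithmetic of Iteration 1 can be checked in a few lines of Python:

```python
# Objective from the walkthrough: f(x1, x2) = (x1-2)^2 + (x2-1)^2
def f(p):
    return (p[0] - 2) ** 2 + (p[1] - 1) ** 2

A, B, C = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)   # worst, best, middle vertices
centroid = ((B[0] + C[0]) / 2, (B[1] + C[1]) / 2)        # -> (0.5, 0.5)

# Reflection (alpha = 1): move the worst point A through the centroid
x_r = tuple(c + 1.0 * (c - a) for c, a in zip(centroid, A))  # -> (1.0, 1.0), f = 1.0

# Expansion (gamma = 2): the reflected point beat the best, so go further
x_e = tuple(c + 2.0 * (r - c) for c, r in zip(centroid, x_r))  # -> (1.5, 1.5), f = 0.5
```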

πŸ”„ ITERATION 2: Continuing Convergence

Current Simplex State:
Best: \(E(1.5, 1.5)\) \(f = 0.5\)
Middle: \(B(1.0, 0.0)\) \(f = 2.0\)
Worst: \(C(0.0, 1.0)\) \(f = 4.0\)
Reflection Calculation:
Centroid of E and B:
\(\bar{x} = \frac{(1.5, 1.5) + (1.0, 0.0)}{2} = (1.25, 0.75)\)

\(x_r = (1.25, 0.75) + 1.0 \cdot [(1.25, 0.75) - (0.0, 1.0)]\)
\(x_r = (1.25, 0.75) + (1.25, -0.25)\)
\(x_r = (2.5, 0.5)\)

\(f(x_r) = (2.5-2)^2 + (0.5-1)^2\)
\(f(x_r) = 0.25 + 0.25\)
\(f(x_r) = 0.5\)
⚑ ITERATION 2 DECISION:
\(f(x_r) = 0.5 = f(\text{best})\)
Reflection equals current best β†’ βœ“ Accept reflection

Simplex continues to improve toward \((2.0, 1.0)\)

βœ… CONVERGENCE: After 5-8 Iterations

| Iteration | Best Vertex | \(f(x)\) Best | Distance to Optimum |
|---|---|---|---|
| 0 | \((1.0, 0.0)\) | 2.0 | 1.414 |
| 1 | \((1.5, 1.5)\) | 0.5 | 0.707 |
| 2-3 | \(\approx(1.8, 0.9)\) | 0.05 | 0.2 |
| 5-8 | \((2.0, 1.0)\) | 0.0 | 0.0 |
🎯 OPTIMAL SOLUTION FOUND!
Final Solution: \(x^* = (2.0, 1.0)\)
Minimum Value: \(f(x^*) = 0.0\)
Convergence: 5-8 iterations
All vertices clustered around optimal point
πŸ’‘ How Simplex Evolves: Triangle shrinks and moves toward optimum through systematic reflection, expansion, and contraction. Each operation strategically repositions worst vertex to explore function landscape efficiently.
⚡ Computational Efficiency: no gradients required, only function evaluations. Well suited to non-smooth, black-box optimization, although efficiency degrades quickly as the dimension grows.

Advantages & Disadvantages:

βœ… Advantages

  • No gradients needed (black-box)
  • Simple logic, easy to implement
  • Works on non-smooth functions
  • Handles noisy functions
  • Intuitive geometric interpretation

⚠️ Disadvantages

  • Slow in high dimensions (n > 50)
  • May converge to non-stationary points
  • No convergence rate guarantees
  • Sensitive to initial simplex
  • Can degenerate (simplex collapses)

Interactive Visualization: Simplex Movement

Objective: Minimize \(f(x_1, x_2) = (x_1-2)^2 + (x_2-1)^2\)


3.2 Hooke-Jeeves Pattern Search

Pattern search alternates between exploratory moves (coordinate search) and pattern moves (exploiting successful directions).

Algorithm: Hooke-Jeeves
1. Initialize base point \(x_0\), step size \(\Delta\)
2. for k = 0, 1, 2, ... do
3.   Exploratory search: for each coordinate i = 1 to n,
       try \(x + \Delta e_i\) and \(x - \Delta e_i\), keeping the best improvement
4.   Let \(x_{new}\) be the result
5.   if improvement found then
6.     Pattern move: \(x_{next} = x_{new} + (x_{new} - x_k)\)
7.     Set \(x_{k+1} = x_{next}\)
8.   else
9.     Reduce step size: \(\Delta \leftarrow \Delta/2\)
10.  end if
11. end for
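A compact Python sketch of the two-phase strategy, in a simplified variant where the pattern move is kept only if a follow-up exploration around it improves on the exploratory point:

```python
def explore(f, x, delta):
    """Exploratory phase: try +/- delta along each coordinate, keep improvements."""
    x = list(x)
    for i in range(len(x)):
        for step in (delta, -delta):
            cand = list(x)
            cand[i] += step
            if f(cand) < f(x):
                x = cand
                break
    return x

def hooke_jeeves(f, x0, delta=0.5, eps=1e-6, max_iter=10000):
    """Hooke-Jeeves pattern search: exploratory moves plus pattern moves."""
    base = list(x0)
    while delta > eps and max_iter > 0:
        max_iter -= 1
        trial = explore(f, base, delta)
        if f(trial) < f(base):
            # Pattern move: jump along the successful direction, then re-explore
            pattern = [t + (t - b) for t, b in zip(trial, base)]
            probe = explore(f, pattern, delta)
            base = probe if f(probe) < f(trial) else trial
        else:
            delta /= 2  # no improvement: refine the mesh
    return base

# Test problem from the demo above: f(x1, x2) = (x1-2)^2 + 4(x2-1)^2
x_min = hooke_jeeves(lambda p: (p[0] - 2) ** 2 + 4 * (p[1] - 1) ** 2, [0.0, 0.0])
```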

Interactive Visualization: Hooke-Jeeves Pattern Search

Objective: Minimize \(f(x_1, x_2) = (x_1-2)^2 + 4(x_2-1)^2\)

📊 Setup: Hooke-Jeeves starts at the origin \((0.00, 0.00)\), where \(f = 8.0000\), with initial step size \(\Delta = 0.5\); the distance from the start to the optimum \((2, 1)\) is 2.24.

πŸ“ Hooke-Jeeves Algorithm: Two-Phase Strategy

⭐ Phase 1: Exploratory
  • Try movement Β±\(\Delta\) for each coordinate
  • Test: \(x \pm \Delta \cdot e_i\)
  • Keep best improvement found
  • Success β†’ triggers pattern phase
⭐ Phase 2: Pattern
  • Pattern direction: \(p = x_{\text{new}} - x_{\text{base}}\)
  • Accelerate: try \(x + p\)
  • If better β†’ accept, continue
  • If worse β†’ reduce \(\Delta\)

βœ… When to Use Direct Methods

  • Black-box functions: Simulations, external codes
  • Non-smooth functions: Gradient undefined or discontinuous
  • Noisy functions: Stochastic simulations, experiments
  • Low dimensions: Generally n < 20 for good performance

Comparison of Direct Methods

| Method | Strategy | Convergence | Best For |
|---|---|---|---|
| Nelder-Mead | Simplex transformation | No guarantee | Smooth, low-dim (n < 10) |
| Hooke-Jeeves | Coordinate + pattern search | Convergent under conditions | Medium-dim (n < 20) |
| Powell's Method | Conjugate directions | Exact in n steps on quadratics | Quadratic-like functions |

🎓 Second-Order Methods

Second-order methods use curvature information (Hessian matrix) to achieve faster convergence than first-order gradient methods.

4.1 Newton's Method

Newton's method uses both gradient and Hessian to make quadratically convergent steps:

\[x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)\]
Algorithm: Newton's Method
1. Initialize \(x_0\), set tolerance \(\epsilon\)
2. for k = 0, 1, 2, ... do
3. Compute gradient: \(g_k = \nabla f(x_k)\)
4. Compute Hessian: \(H_k = \nabla^2 f(x_k)\)
5. Solve linear system: \(H_k p_k = -g_k\)
6. Update: \(x_{k+1} = x_k + p_k\)
7. if \(\|g_k\| < \epsilon\) then return \(x_k\)
8. end for
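As a minimal 1-D sketch of the algorithm above: divide the gradient by the curvature instead of multiplying by a fixed learning rate. On the quadratic example of this section the very first step lands on the minimizer:

```python
def newton_1d(grad, hess, x0, eps=1e-8, max_iter=50):
    """Newton's method in one dimension: step = -gradient / curvature."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < eps:
            break
        x -= g / hess(x)   # full Newton step, no learning rate to tune
    return x

# f(x) = x^2 - 4x + 3: f'(x) = 2x - 4, f''(x) = 2
x_star = newton_1d(lambda x: 2 * x - 4, lambda x: 2.0, 5.0)  # -> 2.0 in one step
```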

Numerical Example: Newton's Quadratic Convergence

Problem: \(f(x) = x^2 - 4x + 3\)

Derivatives: \(f'(x) = 2x - 4\), \(f''(x) = 2\)

| Iteration k | \(x_k\) | \(f(x_k)\) | \(\nabla f(x_k)\) | \(\nabla^2 f(x_k)\) | \(x_{k+1}\) |
|---|---|---|---|---|---|
| 0 | 5.000 | 8.000 | 6.000 | 2.000 | 2.000 |
| 1 | 2.000 | -1.000 | 0.000 | 2.000 | 2.000 (converged) |

🎯 Calculation Detail for Iteration 0:

\(x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)} = 5.0 - \frac{6.0}{2.0} = 5.0 - 3.0 = 2.0\)

Result: Newton's method converges in just 1 iteration for quadratic functions!

βœ… Advantages of Newton's Method

  • Quadratic convergence: Error decreases as \(\|e_{k+1}\| \approx C\|e_k\|^2\)
  • One-step convergence for quadratic functions
  • No step size tuning: Uses full Newton step
  • Scale invariant: Not affected by coordinate scaling

⚠️ Computational Challenges

  • O(nΒ³) cost to compute and invert Hessian
  • O(nΒ²) memory to store Hessian
  • Hessian may not be positive definite far from minimum
  • Requires exact second derivatives

Example: Newton vs Gradient Descent

Function: \(f(x) = 0.5x^2 - 2x\) | Derivative: \(f'(x) = x - 2\) | Second Derivative: \(f''(x) = 1\)

📊 Convergence speed: starting from x = 0 (where f(0) = 0), Newton's method jumps to the minimum x = 2, f(2) = -2 in a single iteration, while gradient descent with α = 0.2 needs about 8 iterations to get close.

⚑ Newton's Method

Update Formula:
\(x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}\)

Convergence Rate:
Quadratic (very fast)

Cost per Iteration:
\(O(n^3)\) - Compute Hessian inverse

Requirements:
First AND second derivatives needed

πŸ“‰ Gradient Descent

Update Formula:
\(x_{k+1} = x_k - \alpha \cdot f'(x_k)\)

Convergence Rate:
Linear (slower)

Cost per Iteration:
\(O(n)\) - Only compute gradient

Requirements:
Only first derivative needed

4.2 Quasi-Newton Methods: BFGS

BFGS (Broyden-Fletcher-Goldfarb-Shanno) approximates the Hessian inverse using gradient information, avoiding expensive Hessian computation.

\[x_{k+1} = x_k - \alpha_k B_k^{-1} \nabla f(x_k)\]

where \(B_k\) is updated using the BFGS formula:

\[B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}\]

with \(s_k = x_{k+1} - x_k\) and \(y_k = \nabla f(x_{k+1}) - \nabla f(x_k)\).

Algorithm: BFGS
1. Initialize \(x_0\), \(B_0 = I\) (identity matrix)
2. for k = 0, 1, 2, ... do
3. Compute search direction: \(p_k = -B_k^{-1} \nabla f(x_k)\)
4. Line search: find \(\alpha_k\) satisfying Wolfe conditions
5. Update: \(x_{k+1} = x_k + \alpha_k p_k\)
6. Compute \(s_k = x_{k+1} - x_k\), \(y_k = \nabla f(x_{k+1}) - \nabla f(x_k)\)
7. Update BFGS approximation: \(B_{k+1}\)
8. end for

πŸ”‘ Why BFGS is Popular

  • Superlinear convergence: Faster than gradient methods
  • O(nΒ²) complexity: Much cheaper than Newton
  • Automatically maintains positive definiteness
  • Default choice for medium-scale unconstrained optimization

4.3 Conjugate Gradient Method

The Conjugate Gradient (CG) method generates search directions that are conjugate with respect to the Hessian, achieving optimal convergence for quadratic functions.

\[x_{k+1} = x_k + \alpha_k p_k\] \[p_{k+1} = -\nabla f(x_{k+1}) + \beta_k p_k\]

Common choices for \(\beta_k\):

\[\beta_k^{FR} = \frac{\|\nabla f(x_{k+1})\|^2}{\|\nabla f(x_k)\|^2} \quad \text{(Fletcher-Reeves)} \qquad \beta_k^{PR} = \frac{\nabla f(x_{k+1})^T \left(\nabla f(x_{k+1}) - \nabla f(x_k)\right)}{\|\nabla f(x_k)\|^2} \quad \text{(Polak-Ribière)}\]

βœ… Conjugate Gradient Advantages

  • Exact convergence in n steps for n-dimensional quadratic functions
  • O(n) memory: Only stores vectors, not matrices
  • Ideal for large-scale problems (n > 1000)
  • No matrix operations: Only needs gradient evaluations

📊 Method Comparison & Selection Guide

Choosing the right optimization method is critical for success. This section provides comprehensive guidance on selecting among the methods covered in Part 1.

Comprehensive Algorithm Comparison

| Method | Gradient | Hessian | Convergence | Cost/Iter | Memory | Best For |
|---|---|---|---|---|---|---|
| Gradient Descent | ✓ | ✗ | Linear (O(1/k) convex) | O(n) | O(n) | Large-scale, smooth, convex |
| Momentum | ✓ | ✗ | Linear (faster) | O(n) | O(n) | Ill-conditioned, ravines |
| NAG | ✓ | ✗ | Optimal O(1/k²) | O(n) | O(n) | Convex, high performance |
| Adam | ✓ | ✗ | Adaptive | O(n) | O(n) | Deep learning, sparse data |
| Newton | ✓ | ✓ | Quadratic | O(n³) | O(n²) | Small n, high accuracy |
| BFGS | ✓ | Approx. | Superlinear | O(n²) | O(n²) | Medium-scale (n = 100-10,000) |
| Conjugate Gradient | ✓ | Via products | n steps (quadratic f) | O(n) | O(n) | Large quadratic systems |
| Nelder-Mead | ✗ | ✗ | Sublinear | O(n) | O(n) | Black-box, n < 20, non-smooth |
| Hooke-Jeeves | ✗ | ✗ | Sublinear | O(n) | O(n) | Separable, n < 50 |

Decision Flowchart: Which Method to Use?

START: Analyze your problem.

1. Can you compute gradients efficiently?
   • NO → use direct methods:
     - n < 10: Nelder-Mead
     - n < 50: Hooke-Jeeves
     - otherwise: Part 2 methods
   • YES → continue with gradient-based methods.
2. What is the problem dimension n?
   • Small (n < 1,000): Newton or BFGS (if the Hessian is cheap)
   • Medium (1K-10K): BFGS or Conjugate Gradient
   • Large (n > 10K): Gradient Descent or Momentum
3. Is the function convex?
   • YES: NAG for the fastest rate, or any gradient method
   • NO: Momentum/Adam, or try Part 2 methods

Quick Selection Guidelines

βœ… Use Gradient Methods When:

  • Function is smooth and differentiable
  • Gradients are cheap to compute (analytic derivatives)
  • Need guaranteed convergence to local minimum
  • Problem is large-scale (n > 100)
  • Convex objective (global optimum guaranteed)

⚠️ Use Direct Methods When:

  • Gradients unavailable or very expensive
  • Function is noisy or non-smooth
  • Black-box optimization (simulation)
  • Small dimension (n < 20)
  • Quick prototyping needed

Practical Recommendations

πŸ”‘ Key Recommendations

  1. Start simple: Try gradient descent first if gradients available
  2. Add momentum: If convergence is slow or oscillatory
  3. Use Adam for ML: Default choice for training neural networks
  4. BFGS for medium-scale: When n = 100-10,000 and you need fast convergence
  5. Nelder-Mead for black-box: When n < 20 and no gradients
  6. Multiple restarts: Run from different initial points to avoid local minima
  7. Monitor convergence: Always plot objective vs iteration
  8. For global optimization: See Part 2 for population-based methods

⚠️ Common Pitfalls to Avoid

  • Learning rate too large: Causes divergence in gradient methods
  • Learning rate too small: Wastes time with slow convergence
  • Ignoring scaling: Normalize inputs for better conditioning
  • No convergence check: Always test stopping criteria
  • Wrong method for problem type: Use decision flowchart above
  • Single initial point: May find poor local minimum in non-convex problems

Typical Hyperparameters

| Method | Parameter | Typical Range | Recommended Start |
|---|---|---|---|
| Gradient Descent | Learning rate α | 0.001 - 0.1 | 0.01 (tune with experiments) |
| Momentum | β (momentum) | 0.5 - 0.99 | 0.9 |
| NAG | β (momentum) | 0.5 - 0.99 | 0.9 |
| Adam | β₁, β₂ | — | 0.9, 0.999 (rarely changed) |
| BFGS | Line search tol | 10⁻⁴ - 10⁻⁸ | 10⁻⁶ |
| Nelder-Mead | α, γ, ρ, σ | — | 1.0, 2.0, 0.5, 0.5 (standard) |

✍️ Practice Exercises

Apply what you've learned with these interactive exercises. Experiment with parameters and observe how different methods behave.

Exercise 1: Rosenbrock's Banana Function

The Rosenbrock function is a classic optimization benchmark with a curved valley:

\(f(x, y) = (1-x)^2 + 100(y-x^2)^2\)

Minimum at: \((1, 1)\) with \(f(1, 1) = 0\)
Challenge: The curved valley makes this hard for gradient descent

Try Different Methods:


Exercise 2: Learning Rate Tuning Challenge

Find the optimal learning rate for gradient descent on:

\(f(x) = x^4 - 3x^3 + 2x^2\)

Optimal learning rate: ~0.05-0.15 for smooth convergence

💡 Hint: too small → slow convergence; too large → divergence.

Key Takeaways from Exercises

πŸ”‘ What You Should Have Learned:

  1. Rosenbrock: Curved valleys challenge gradient methods; momentum helps significantly
  2. Parameter tuning: Learning rate dramatically affects convergence; too large causes divergence
  3. No universal winner: Different problems require different methods

βœ… Practice Recommendations

Summary: What You've Learned in Part 1

✓ Gradient-Based Methods: gradient descent, momentum, NAG, and the adaptive methods Adam, RMSprop, and AdaGrad

✓ Direct Search Methods: Nelder-Mead simplex and Hooke-Jeeves pattern search, which require only function evaluations

✓ Second-Order Methods: Newton's method, BFGS, and conjugate gradient, which exploit curvature information

➜ What's missing? Constrained and global optimization, including population-based methods, covered in Part 2.