Introduction to Unconstrained Optimization
What is Unconstrained Optimization?
Unconstrained optimization is the mathematical problem of finding the minimum (or maximum) of a function without any constraints on the variables. The general formulation is:
\[\min_{x \in \mathbb{R}^n} f(x)\]
where \(f: \mathbb{R}^n \rightarrow \mathbb{R}\) is the objective function we want to minimize.
Key Concept
At a local minimum \(x^*\), the gradient must be zero: \(\nabla f(x^*) = 0\) (first-order necessary condition).
Real-World Applications
- Machine Learning: Training neural networks by minimizing loss functions (cross-entropy, MSE)
- Finance: Portfolio optimization to maximize returns while managing risk
- Engineering Design: Minimizing cost, weight, or material usage in structural design
- Scientific Computing: Parameter estimation and curve fitting in data analysis
Key Mathematical Concepts
1. Gradient (∇f)
The gradient is a vector of partial derivatives indicating the direction of steepest ascent:
\[\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}\]
2. Hessian (∇²f)
The Hessian is a matrix of second-order partial derivatives describing the curvature:
\[\nabla^2 f(x) = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\
\vdots & \vdots & \ddots
\end{bmatrix}\]
Optimality Conditions
First-Order Necessary Condition
If \(x^*\) is a local minimum, then \(\nabla f(x^*) = 0\)
Second-Order Sufficient Condition
If \(\nabla f(x^*) = 0\) and \(\nabla^2 f(x^*)\) is positive definite, then \(x^*\) is a strict local minimum
Gradient Methods
2.1 Gradient Descent
Gradient descent is the most fundamental optimization algorithm. It iteratively moves in the direction opposite to the gradient:
\[x_{k+1} = x_k - \alpha \nabla f(x_k)\]
where \(\alpha > 0\) is the learning rate (step size).
Algorithm: Gradient Descent
1. Initialize \(x_0\), set learning rate \(\alpha\), tolerance \(\epsilon\)
2. for k = 0, 1, 2, ... do
3. Compute gradient \(g_k = \nabla f(x_k)\)
4. if \(\|g_k\| < \epsilon\) then
5. return \(x_k\) (converged)
6. end if
7. Update: \(x_{k+1} = x_k - \alpha g_k\)
8. end for
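The loop above translates almost line for line into Python. This is a minimal sketch assuming NumPy; `gradient_descent` is an illustrative name, not a library routine, and the test function is the same \(f(x) = x^2 - 4x + 3\) used in the worked examples.

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.2, tol=1e-6, max_iter=1000):
    """Minimize f by stepping against its gradient; grad maps an array to an array."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # first-order condition (nearly) satisfied
            break
        x = x - alpha * g             # step opposite the gradient
    return x

# f(x) = x^2 - 4x + 3 has f'(x) = 2x - 4 and minimizer x* = 2
x_star = gradient_descent(lambda x: 2 * x - 4, x0=[5.0])
```

With \(\alpha = 0.2\) the error contracts by a factor \(|1 - 2\alpha| = 0.6\) per iteration, so the loop terminates well within the iteration budget.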
Interactive Visualization: Simple 1D Function
Explore the function \(f(x) = x^2 - 4x + 3\), whose derivative is \(f'(x) = 2x - 4\).
Optimal point: \(x^* = 2\), \(f(x^*) = -1\); the gradient is zero at the minimum, \(f'(2) = 0\).
Tip: as you move along the x-axis, the gradient changes in sign and magnitude. A positive slope means descend to the left, a negative slope means descend to the right, and the gradient shrinks to zero as you approach \(x = 2\).
Numerical Example: Step-by-Step Calculation
Problem: Minimize \(f(x) = x^2 - 4x + 3\) with \(f'(x) = 2x - 4\)
Parameters: \(x_0 = 5.0\), \(\alpha = 0.2\) (learning rate)
Step-by-Step Calculations
Iteration 0:
- \(x_0 = 5.0\)
- \(f(x_0) = 5^2 - 4(5) + 3 = 25 - 20 + 3 = 8.0\)
- \(\nabla f(x_0) = 2(5) - 4 = 6.0\)
- \(x_1 = x_0 - \alpha \nabla f(x_0) = 5.0 - 0.2(6.0) = 5.0 - 1.2 = 3.8\)
Iteration 1:
- \(x_1 = 3.8\)
- \(f(x_1) = 3.8^2 - 4(3.8) + 3 = 14.44 - 15.2 + 3 = 2.24\)
- \(\nabla f(x_1) = 2(3.8) - 4 = 3.6\)
- \(x_2 = 3.8 - 0.2(3.6) = 3.8 - 0.72 = 3.08\)
| Iteration k | \(x_k\) | \(f(x_k)\) | \(\nabla f(x_k)\) | \(x_{k+1}\) |
|---|---|---|---|---|
| 0 | 5.000 | 8.000 | 6.000 | 3.800 |
| 1 | 3.800 | 2.240 | 3.600 | 3.080 |
| 2 | 3.080 | 0.166 | 2.160 | 2.648 |
| 3 | 2.648 | -0.580 | 1.296 | 2.389 |
| 4 | 2.389 | -0.849 | 0.778 | 2.233 |
Converges to the optimal solution: \(x^* = 2.0\), \(f(x^*) = -1.0\)
Interactive Demo: Effect of Learning Rate
Function: \(f(x) = x^2 - 4x + 3\), starting point \(x_0 = 5.0\)
With \(\alpha = 0.20\) the iteration converges in 13 steps, a good balance between speed and stability.
Convergence Comparison

| \(\alpha\) Value | Iterations | Behavior | Verdict |
|---|---|---|---|
| 0.05 (too small) | ~30 | Very slow | Inefficient |
| 0.20 (good) | 13 | Fast, stable | Recommended |
| 0.60 (too large) | ~4 | Oscillates around the minimum | Risky |

Key Insights about Learning Rate
- Too small (\(\alpha < 0.1\)): Slow convergence, many iterations needed
- Optimal (\(\alpha \approx 0.2\)): Fast, stable convergence
- Too large (\(\alpha > 0.5\)): Oscillation; for this function the update factor is \(1 - 2\alpha\), so any \(\alpha > 1\) diverges outright
2.2 Gradient Descent with Momentum
Momentum accelerates gradient descent by accumulating a velocity vector that dampens oscillations and speeds up convergence in consistent directions.
\[v_{k+1} = \beta v_k - \alpha \nabla f(x_k)\]
\[x_{k+1} = x_k + v_{k+1}\]
where \(\beta \in [0, 1)\) is the momentum coefficient (typically 0.9).
Algorithm: Momentum-based Gradient Descent
1. Initialize \(x_0\), \(v_0 = 0\), set \(\alpha\), \(\beta\)
2. for k = 0, 1, 2, ... do
3. Compute gradient \(g_k = \nabla f(x_k)\)
4. Update velocity: \(v_{k+1} = \beta v_k - \alpha g_k\)
5. Update position: \(x_{k+1} = x_k + v_{k+1}\)
6. end for
Advantages of Momentum
- Accelerates convergence in ravine-like regions
- Dampens oscillations across narrow valleys
- Helps escape shallow local minima
- Typically uses \(\beta = 0.9\) for good results
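The two-line update above can be sketched directly in Python. This is a minimal NumPy version; `momentum_gd` is an illustrative name, not a library function.

```python
import numpy as np

def momentum_gd(grad, x0, alpha=0.2, beta=0.9, tol=1e-6, max_iter=1000):
    """Gradient descent with classical (heavy-ball) momentum."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)                # velocity starts at zero
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:     # stationary point reached
            break
        v = beta * v - alpha * g        # accumulate velocity
        x = x + v                       # move by the velocity
    return x

# Same test problem as before: f(x) = x^2 - 4x + 3, minimizer x* = 2
x_star = momentum_gd(lambda x: 2 * x - 4, x0=[5.0])
```

Note the only state carried between iterations besides \(x\) is the velocity \(v\), which is what distinguishes momentum from plain gradient descent.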
Numerical Example: Momentum vs Standard GD
Problem: Same function \(f(x) = x^2 - 4x + 3\), \(x_0 = 5.0\), \(\alpha = 0.2\), and \(\beta = 0.3\) for this example (the typical \(\beta = 0.9\) carries so much momentum that it overshoots this well-conditioned 1D quadratic)

| Iteration k | GD \(x_k\) | GD \(f(x_k)\) | Momentum \(x_k\) | Momentum \(v_k\) | Momentum \(f(x_k)\) |
|---|---|---|---|---|---|
| 0 | 5.000 | 8.000 | 5.000 | 0.000 | 8.000 |
| 1 | 3.800 | 2.240 | 3.800 | -1.200 | 2.240 |
| 2 | 3.080 | 0.166 | 2.720 | -1.080 | -0.482 |
| 3 | 2.648 | -0.580 | 2.108 | -0.612 | -0.988 |
| 4 | 2.389 | -0.849 | 1.881 | -0.227 | -0.986 |
Momentum converges noticeably faster here: by iteration 3 its error \(|x - x^*|\) is 0.108, versus 0.648 for standard GD.
2.3 Nesterov Accelerated Gradient (NAG)
NAG improves upon classical momentum by computing the gradient at the "lookahead" position:
\[v_{k+1} = \beta v_k - \alpha \nabla f(x_k + \beta v_k)\]
\[x_{k+1} = x_k + v_{k+1}\]
The "Lookahead" Concept
NAG evaluates the gradient not at the current position, but at the position where momentum would take us. This provides better anticipation of the landscape ahead, often leading to faster convergence.
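A sketch of NAG, differing from classical momentum only in where the gradient is evaluated. Illustrative code assuming NumPy; `nag` is a name chosen here, not a standard API.

```python
import numpy as np

def nag(grad, x0, alpha=0.2, beta=0.9, tol=1e-6, max_iter=1000):
    """Nesterov accelerated gradient: evaluate the gradient at the lookahead point."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(max_iter):
        if np.linalg.norm(grad(x)) < tol:
            break
        g = grad(x + beta * v)     # gradient at the lookahead position x + beta*v
        v = beta * v - alpha * g
        x = x + v
    return x

# f(x) = x^2 - 4x + 3 again; the lookahead damps the overshoot that plain momentum shows
x_star = nag(lambda x: 2 * x - 4, x0=[5.0])
```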
Direct Methods (Derivative-Free)
What Are Direct Methods?
Direct methods (also called derivative-free or black-box methods) optimize functions without computing gradients. They only require function evaluations: f(x). This is essential when:
- Gradients are unavailable or expensive to compute
- Function is non-smooth or noisy
- Function is a black-box simulation
- Automatic differentiation is not feasible
3.1 Nelder-Mead Simplex Method
Core Concept:
What is a Simplex?
A simplex is a geometric shape with n+1 vertices in n-dimensional space:
- 1D: 2 points on a line
- 2D: Triangle with 3 vertices
- 3D: Tetrahedron with 4 vertices
The simplex evolves through the search space using 4 operations: Reflection, Expansion, Contraction, and Shrink.
The Four Operations (Detailed Explanation):
1. REFLECTION (\(\alpha = 1.0\))
\[x_r = \bar{x} + \alpha(\bar{x} - x_{n+1})\]
where \(\bar{x}\) is the centroid of the best n points
Interpretation: Move the worst point through the centroid in the opposite direction
When to use: Always try first
Success condition: \(f(x_{best}) \le f(x_r) < f(x_{second\,worst})\)
Action if successful: Replace worst vertex with reflected point
2. EXPANSION (\(\gamma = 2.0\))
\[x_e = \bar{x} + \gamma(x_r - \bar{x})\]
Interpretation: Reflection was very good, so go even further in that direction
When to use: If the reflected point is better than the current best (\(f(x_r) < f(x_{best})\))
Success condition: \(f(x_e) < f(x_r)\)
Action if successful: Replace worst with expanded point; otherwise use reflected point
3. CONTRACTION (\(\rho = 0.5\))
\[x_c = \bar{x} + \rho(x_{n+1} - \bar{x})\]
Interpretation: Reflection failed, so move toward the centroid rather than retreating fully
When to use: If the reflected point is poor (\(f(x_r) \ge f(x_{second\,worst})\))
Success condition: \(f(x_c) < f(x_{n+1})\)
Action if successful: Replace worst with contracted point
4. SHRINK (\(\sigma = 0.5\))
\[x_i = x_{best} + \sigma(x_i - x_{best}) \quad \text{for all } i\]
Interpretation: Nothing worked, so scale the entire simplex toward the best point
When to use: Contraction failed (last resort)
Result: All vertices move closer to the best vertex
Effect: Reduces simplex size, enabling finer search
Complete Pseudocode:
Algorithm: Nelder-Mead Simplex Method
1. INITIALIZATION:
   Create simplex S = {x₁, x₂, ..., x_{n+1}}
   Evaluate f(x_i) for all vertices
   Set parameters: α=1.0, γ=2.0, ρ=0.5, σ=0.5
2. MAIN LOOP (while not converged):
   a. SORT: Order vertices by f-value
      x₁ = best (lowest f), x_{n+1} = worst (highest f)
   b. CENTROID: Compute x̄ = (1/n)·Σ(x₁...x_n)
   c. REFLECTION:
      x_r = x̄ + α(x̄ - x_{n+1})
      Evaluate f(x_r)
   d. DECISION LOGIC:
      IF f(x₁) ≤ f(x_r) < f(x_n):
         Replace x_{n+1} with x_r, CONTINUE
      ELSE IF f(x_r) < f(x₁): (Very good!)
         EXPANSION: x_e = x̄ + γ(x_r - x̄)
         IF f(x_e) < f(x_r): Replace x_{n+1} with x_e
         ELSE: Replace x_{n+1} with x_r
      ELSE: (Reflection failed)
         CONTRACTION: x_c = x̄ + ρ(x_{n+1} - x̄)
         IF f(x_c) < f(x_{n+1}): Replace x_{n+1} with x_c
         ELSE: SHRINK all vertices toward x₁
               x_i ← x₁ + σ(x_i - x₁) for i = 2...n+1
3. CONVERGENCE CHECK:
   IF max_i |f(x_i) - f(x̄)| < ε: STOP
4. OUTPUT: Return x₁ (best vertex)
Detailed Numerical Example: Step-by-Step Walkthrough
Problem: Minimize \(f(x_1, x_2) = (x_1-2)^2 + (x_2-1)^2\)
Optimal Point: \((2, 1)\) with \(f(2,1) = 0\)
Algorithm: Nelder-Mead Simplex Method
INITIALIZATION: Setting up the Initial Simplex
Goal: Start with 3 vertices representing candidate solutions in \(\mathbb{R}^2\)

| Vertex | Position | \(f(x)\) | Status |
|---|---|---|---|
| A | \((0.0, 0.0)\) | 5.0 | Worst |
| B | \((1.0, 0.0)\) | 2.0 | Best |
| C | \((0.0, 1.0)\) | 4.0 | Middle |

Initial Observation:
Vertex B at \((1.0, 0.0)\) gives the lowest function value (2.0), making it the current best.
Vertex A at \((0.0, 0.0)\) is the worst with \(f = 5.0\), so it will be moved in the next iteration.
ITERATION 1: Reflection and Expansion
Step 1 - Sort Vertices by Function Value:
Best: \(B(1.0, 0.0)\) with \(f = 2.0\)
Middle: \(C(0.0, 1.0)\) with \(f = 4.0\)
Worst: \(A(0.0, 0.0)\) with \(f = 5.0\)
Step 2 - Compute Centroid of Best Two Vertices:
\(\bar{x} = \frac{B + C}{2}\)
\(\bar{x} = \frac{(1.0, 0.0) + (0.0, 1.0)}{2}\)
\(\bar{x} = \frac{(1.0, 1.0)}{2}\)
\(\bar{x} = (0.5, 0.5)\)
Step 3 - REFLECTION: Move worst point through centroid
\(x_r = \bar{x} + \alpha(\bar{x} - A)\) where \(\alpha = 1.0\)
\(x_r = (0.5, 0.5) + 1.0 \cdot [(0.5, 0.5) - (0.0, 0.0)]\)
\(x_r = (0.5, 0.5) + (0.5, 0.5)\)
\(x_r = (1.0, 1.0)\)
Evaluate: \(f(x_r) = (1.0-2)^2 + (1.0-1)^2\)
\(f(x_r) = 1 + 0 = 1.0\)
Reflected point beats the current best \((1.0 < 2.0)\), so the algorithm tries expansion next.
Step 4 - EXPANSION: Try going further in this direction
\(x_e = \bar{x} + \gamma(x_r - \bar{x})\) where \(\gamma = 2.0\)
\(x_e = (0.5, 0.5) + 2.0 \cdot [(1.0, 1.0) - (0.5, 0.5)]\)
\(x_e = (0.5, 0.5) + 2.0 \cdot (0.5, 0.5)\)
\(x_e = (0.5, 0.5) + (1.0, 1.0)\)
\(x_e = (1.5, 1.5)\)
Evaluate: \(f(x_e) = (1.5-2)^2 + (1.5-1)^2\)
\(f(x_e) = 0.25 + 0.25\)
\(f(x_e) = 0.5\)
EXPANSION ACCEPTED \((0.5 < 1.0)\): the expanded point is even better!
ITERATION 1 RESULT:
Old worst vertex \(A(0.0, 0.0)\) with \(f=5.0\)
replaced with \(E(1.5, 1.5)\) with \(f=0.5\)
New Simplex:
- \(E(1.5, 1.5)\), \(f=0.5\) (new best)
- \(B(1.0, 0.0)\), \(f=2.0\)
- \(C(0.0, 1.0)\), \(f=4.0\)
ITERATION 2: Continuing Convergence
Current Simplex State:

| Rank | Vertex | \(f\) |
|---|---|---|
| Best | \(E(1.5, 1.5)\) | 0.5 |
| Middle | \(B(1.0, 0.0)\) | 2.0 |
| Worst | \(C(0.0, 1.0)\) | 4.0 |
Reflection Calculation:
Centroid of E and B:
\(\bar{x} = \frac{(1.5, 1.5) + (1.0, 0.0)}{2} = (1.25, 0.75)\)
\(x_r = (1.25, 0.75) + 1.0 \cdot [(1.25, 0.75) - (0.0, 1.0)]\)
\(x_r = (1.25, 0.75) + (1.25, -0.25)\)
\(x_r = (2.5, 0.5)\)
\(f(x_r) = (2.5-2)^2 + (0.5-1)^2\)
\(f(x_r) = 0.25 + 0.25\)
\(f(x_r) = 0.5\)
ITERATION 2 DECISION:
\(f(x_r) = 0.5\), which ties the current best. Since \(f(\text{best}) \le f(x_r) < f(\text{middle})\), the reflection is accepted and replaces the worst vertex.
The simplex continues to improve toward \((2.0, 1.0)\).
CONVERGENCE: After 5-8 Iterations

| Iteration | Best Vertex | Best \(f(x)\) | Distance to Optimum |
|---|---|---|---|
| 0 | \((1.0, 0.0)\) | 2.0 | 1.414 |
| 1 | \((1.5, 1.5)\) | 0.5 | 0.707 |
| 2-3 | \(\approx(1.8, 0.9)\) | 0.05 | 0.2 |
| 5-8 | \((2.0, 1.0)\) | 0.0 | 0.0 |
OPTIMAL SOLUTION FOUND
Final Solution: \(x^* = (2.0, 1.0)\)
Minimum Value: \(f(x^*) = 0.0\)
Convergence: 5-8 iterations, with all vertices clustered around the optimal point
How the Simplex Evolves:
The triangle shrinks and moves toward the optimum through systematic reflection, expansion, and contraction. Each operation strategically repositions the worst vertex to explore the function landscape efficiently.
Computational Efficiency:
No gradients required: only function evaluations are needed, which makes the method well suited to non-smooth, black-box optimization. It works in any dimension, though performance degrades as the dimension grows.
Advantages & Disadvantages:
Advantages
- No gradients needed (black-box)
- Simple logic, easy to implement
- Works on non-smooth functions
- Handles noisy functions
- Intuitive geometric interpretation

Disadvantages
- Slow in high dimensions (n > 50)
- May converge to non-stationary points
- No convergence rate guarantees
- Sensitive to initial simplex
- Can degenerate (simplex collapses)
Interactive Visualization: Simplex Movement
Objective: Minimize \(f(x_1, x_2) = (x_1-2)^2 + (x_2-1)^2\). The visualization steps through each phase (initialization, reflection, expansion, contraction), showing the operation performed, the best \(f(x)\) so far, and the decision taken at each step.
3.2 Hooke-Jeeves Pattern Search
Pattern search alternates between exploratory moves (coordinate search) and pattern moves (exploiting successful directions).
Algorithm: Hooke-Jeeves
1. Initialize base point \(x_0\), step size \(\Delta\)
2. for k = 0, 1, 2, ... do
3.   Exploratory Search:
4.   for each coordinate i = 1 to n do
5.     Try \(x + \Delta e_i\) and \(x - \Delta e_i\); keep the best improvement
6.   end for
7.   Let \(x_{new}\) be the result
8.   if improvement found then
9.     Pattern Move: \(x_{next} = x_{new} + (x_{new} - x_k)\)
10.    Set \(x_{k+1} = x_{next}\)
11.  else
12.    Reduce step size: \(\Delta \leftarrow \Delta/2\)
13.  end if
14. end for
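A minimal Python sketch of the two-phase strategy, assuming NumPy. One variant choice made here (and labeled as such): the pattern point is accepted only if it actually improves on the exploratory result, which is a common safeguard rather than part of the bare pseudocode.

```python
import numpy as np

def hooke_jeeves(f, x0, delta=0.5, tol=1e-6, max_iter=1000):
    """Pattern search: exploratory coordinate moves plus pattern (acceleration) moves."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        if delta < tol:                       # mesh refined to tolerance: stop
            break
        # Phase 1: exploratory search around the current base point
        new = x.copy()
        for i in range(len(x)):
            for step in (delta, -delta):
                trial = new.copy()
                trial[i] += step
                if f(trial) < f(new):
                    new = trial
                    break
        if f(new) < f(x):
            # Phase 2: pattern move along the successful direction
            pattern = new + (new - x)
            x = pattern if f(pattern) < f(new) else new
        else:
            delta /= 2                        # no improvement: shrink step size
    return x

# Same objective as the visualization: f(x1, x2) = (x1-2)^2 + 4*(x2-1)^2 from (0, 0)
f = lambda x: (x[0] - 2) ** 2 + 4 * (x[1] - 1) ** 2
x_star = hooke_jeeves(f, [0.0, 0.0])
```

On this problem the pattern moves land exactly on the minimizer \((2, 1)\) after two iterations, after which the step size shrinks until the tolerance is met.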
Interactive Visualization: Hooke-Jeeves Pattern Search
Objective: Minimize \(f(x_1, x_2) = (x_1-2)^2 + 4(x_2-1)^2\)
The search starts at the origin \((0, 0)\), where \(f = 8\), with initial step size \(\Delta = 0.5\).
Hooke-Jeeves Algorithm: Two-Phase Strategy
Phase 1: Exploratory
- Try moving \(\pm\Delta\) along each coordinate: test \(x \pm \Delta \cdot e_i\)
- Keep the best improvement found
- Success triggers the pattern phase

Phase 2: Pattern
- Pattern direction: \(p = x_{\text{new}} - x_{\text{base}}\)
- Accelerate: try \(x + p\)
- If better, accept and continue; if worse, reduce \(\Delta\)
When to Use Direct Methods
- Black-box functions: Simulations, external codes
- Non-smooth functions: Gradient undefined or discontinuous
- Noisy functions: Stochastic simulations, experiments
- Low dimensions: Generally n < 20 for good performance
Comparison of Direct Methods

| Method | Strategy | Convergence | Best For |
|---|---|---|---|
| Nelder-Mead | Simplex transformation | No guarantee | Smooth, low-dim (n<10) |
| Hooke-Jeeves | Coordinate + pattern search | Convergent under conditions | Medium-dim (n<20) |
| Powell's Method | Conjugate directions | Exact on quadratics in finitely many steps | Quadratic-like functions |
Second-Order Methods
Second-order methods use curvature information (Hessian matrix) to achieve faster convergence than first-order gradient methods.
4.1 Newton's Method
Newton's method uses both gradient and Hessian to make quadratically convergent steps:
\[x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)\]
Algorithm: Newton's Method
1. Initialize \(x_0\), set tolerance \(\epsilon\)
2. for k = 0, 1, 2, ... do
3.   Compute gradient: \(g_k = \nabla f(x_k)\)
4.   if \(\|g_k\| < \epsilon\) then return \(x_k\)
5.   Compute Hessian: \(H_k = \nabla^2 f(x_k)\)
6.   Solve linear system: \(H_k p_k = -g_k\)
7.   Update: \(x_{k+1} = x_k + p_k\)
8. end for
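A compact Newton iteration in Python (an illustrative sketch assuming NumPy). Note that it solves the linear system \(H p = -g\) rather than forming the Hessian inverse explicitly, which is cheaper and numerically safer.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-8, max_iter=50):
    """Newton's method: solve H p = -g at each step (never invert H explicitly)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = np.atleast_1d(grad(x))
        if np.linalg.norm(g) < tol:
            break
        H = np.atleast_2d(hess(x))
        p = np.linalg.solve(H, -g)   # Newton step
        x = x + p
    return x

# f(x) = x^2 - 4x + 3: f'(x) = 2x - 4, f''(x) = 2; one step from any start lands on x* = 2
x_star = newton(lambda x: 2 * x - 4, lambda x: np.array([[2.0]]), [5.0])
```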
Numerical Example: Newton's Quadratic Convergence
Problem: \(f(x) = x^2 - 4x + 3\)
Derivatives: \(f'(x) = 2x - 4\), \(f''(x) = 2\)
| Iteration k | \(x_k\) | \(f(x_k)\) | \(\nabla f(x_k)\) | \(\nabla^2 f(x_k)\) | \(x_{k+1}\) |
|---|---|---|---|---|---|
| 0 | 5.000 | 8.000 | 6.000 | 2.000 | 2.000 |
| 1 | 2.000 | -1.000 | 0.000 | 2.000 | 2.000 |
Calculation Detail for Iteration 0:
\(x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)} = 5.0 - \frac{6.0}{2.0} = 5.0 - 3.0 = 2.0\)
Result: Newton's method converges in just 1 iteration for quadratic functions!
Advantages of Newton's Method
- Quadratic convergence: Error decreases as \(\|e_{k+1}\| \approx C\|e_k\|^2\)
- One-step convergence for quadratic functions
- No step size tuning: Uses full Newton step
- Scale invariant: Not affected by coordinate scaling
Computational Challenges
- O(n³) cost to compute and factor the Hessian
- O(n²) memory to store the Hessian
- Hessian may not be positive definite far from the minimum
- Requires exact second derivatives
Example: Newton vs Gradient Descent
Function: \(f(x) = 0.5x^2 - 2x\) | Derivative: \(f'(x) = x - 2\) | Second Derivative: \(f''(x) = 1\)
Starting from \(x = 0\), Newton's method reaches the minimizer \(x^* = 2\) in a single iteration. Gradient descent with \(\alpha = 0.2\) shrinks the error by a factor of \(1 - \alpha f'' = 0.8\) per step, so it needs roughly 8 iterations just to reduce the initial error by about 80%, and far more for high accuracy.
Newton's Method
Update Formula:
\(x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}\)
Convergence Rate:
Quadratic (very fast)
Cost per Iteration:
\(O(n^3)\) - Compute Hessian inverse
Requirements:
First AND second derivatives needed
Gradient Descent
Update Formula:
\(x_{k+1} = x_k - \alpha \cdot f'(x_k)\)
Convergence Rate:
Linear (slower)
Cost per Iteration:
\(O(n)\) - Only compute gradient
Requirements:
Only first derivative needed
4.2 Quasi-Newton Methods: BFGS
BFGS (Broyden-Fletcher-Goldfarb-Shanno) approximates the Hessian inverse using gradient information, avoiding expensive Hessian computation.
\[x_{k+1} = x_k - \alpha_k B_k^{-1} \nabla f(x_k)\]
where \(B_k\) is updated using the BFGS formula:
\[B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}\]
with \(s_k = x_{k+1} - x_k\) and \(y_k = \nabla f(x_{k+1}) - \nabla f(x_k)\).
Algorithm: BFGS
1. Initialize \(x_0\), \(B_0 = I\) (identity matrix)
2. for k = 0, 1, 2, ... do
3. Compute search direction: \(p_k = -B_k^{-1} \nabla f(x_k)\)
4. Line search: find \(\alpha_k\) satisfying Wolfe conditions
5. Update: \(x_{k+1} = x_k + \alpha_k p_k\)
6. Compute \(s_k = x_{k+1} - x_k\), \(y_k = \nabla f(x_{k+1}) - \nabla f(x_k)\)
7. Update BFGS approximation: \(B_{k+1}\)
8. end for
Why BFGS is Popular
- Superlinear convergence: Faster than gradient methods
- O(n²) cost per iteration: Much cheaper than Newton
- Automatically maintains positive definiteness
- Default choice for medium-scale unconstrained optimization
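A bare-bones BFGS sketch in NumPy. This is an illustrative simplification: an Armijo backtracking line search stands in for the full Wolfe line search, and the Hessian approximation \(B_k\) is updated directly with the formula above (production codes typically maintain the inverse approximation instead). The test problem is chosen here for illustration.

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iter=200):
    """BFGS sketch: Armijo backtracking stands in for a Wolfe line search."""
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)                       # Hessian approximation, B0 = I
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(B, -g)           # search direction from B p = -g
        a = 1.0
        while f(x + a * p) > f(x) + 1e-4 * a * (g @ p) and a > 1e-12:
            a *= 0.5                         # backtrack until Armijo holds
        s = a * p                            # s_k = x_{k+1} - x_k
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g                        # y_k = grad difference
        if y @ s > 1e-12:                    # curvature guard keeps B pos. def.
            Bs = B @ s
            B += np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)
        x, g = x_new, g_new
    return x

# Illustrative test problem: f(x1, x2) = (x1 - 2)^2 + 4*(x2 - 1)^2
f = lambda x: (x[0] - 2) ** 2 + 4 * (x[1] - 1) ** 2
grad_f = lambda x: np.array([2 * (x[0] - 2), 8 * (x[1] - 1)])
x_star = bfgs(f, grad_f, [0.0, 0.0])
```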
4.3 Conjugate Gradient Method
The Conjugate Gradient (CG) method generates search directions that are conjugate with respect to the Hessian, achieving optimal convergence for quadratic functions.
\[x_{k+1} = x_k + \alpha_k p_k\]
\[p_{k+1} = -\nabla f(x_{k+1}) + \beta_k p_k\]
Common choices for \(\beta_k\):
- Fletcher-Reeves: \(\beta_k^{FR} = \frac{\|\nabla f(x_{k+1})\|^2}{\|\nabla f(x_k)\|^2}\)
- Polak-Ribière: \(\beta_k^{PR} = \frac{\nabla f(x_{k+1})^T (\nabla f(x_{k+1}) - \nabla f(x_k))}{\|\nabla f(x_k)\|^2}\)
Conjugate Gradient Advantages
- Exact convergence in n steps for n-dimensional quadratic functions
- O(n) memory: Only stores vectors, not matrices
- Ideal for large-scale problems (n > 1000)
- No matrix operations: Only needs gradient evaluations