In this post, we extend the discussion in Optimization Learning Notes 6 on the Gradient Descent Method (GDM).
More on Descent Direction
We have already seen that the descent direction is an important concept in GDM. However, why do we choose the negative gradient as the direction in the first place?
Steepest Descent Directions
Let $f: \mathbb{R}^n \to \mathbb{R}$ be continuously differentiable and let $x \in \mathbb{R}^n$ be given with $\nabla f(x) \neq 0$. Let $\bar{d}$ denote the solution of the optimization problem:
$$\min_{d \in \mathbb{R}^n} \ \nabla f(x)^\top d \quad \text{s.t.} \quad \|d\| = 1.$$
Every vector of the form $\alpha \bar{d}$, $\alpha > 0$, is called a steepest descent direction of $f$ in $x$.
It turns out that
$$\bar{d} = -\frac{\nabla f(x)}{\|\nabla f(x)\|}$$
is the unique optimal solution.
By the Cauchy–Schwarz inequality, we have
$$|\nabla f(x)^\top d| \le \|\nabla f(x)\| \, \|d\|.$$
The equality is satisfied if and only if $d$ and $\nabla f(x)$ are linearly dependent. For every $d$ with $\|d\| = 1$, we then have
$$\nabla f(x)^\top d \ge -\|\nabla f(x)\| \, \|d\| = -\|\nabla f(x)\|.$$
This lower bound is attained iff
$$d = -\frac{\nabla f(x)}{\|\nabla f(x)\|}.$$
- The steepest descent direction varies with the choice of norm, and here we are only talking about the Euclidean norm. Other norms (e.g., the $\ell_1$-norm) induce different steepest descent directions.
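As a quick numerical sanity check, here is a small NumPy sketch (my own illustration; the gradient is just a random stand-in vector) comparing the directional derivative of the normalized negative gradient against random unit directions:

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal(5)           # stand-in for the gradient of f at x
d_star = -g / np.linalg.norm(g)      # candidate steepest descent direction

# By the Cauchy-Schwarz argument above, <g, d> >= -||g|| for every unit
# vector d, and d_star attains this lower bound.
for _ in range(1000):
    d = rng.standard_normal(5)
    d /= np.linalg.norm(d)
    assert g @ d_star <= g @ d + 1e-12

print(g @ d_star, -np.linalg.norm(g))   # both equal -||g||
```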
More on Step Sizes
In Notes 6, we listed 3 Step Size Strategies:
- Constant Step Size
- Exact Line Search
- Backtracking/Armijo Line Search
Another interesting strategy to consider is diminishing step sizes:
$$\alpha_k \to 0, \quad \sum_{k=0}^{\infty} \alpha_k = \infty.$$
Let $f$ be continuously differentiable and let $d$ be a descent direction of $f$ at $x$, i.e., $\nabla f(x)^\top d < 0$. Let $\gamma \in (0, 1)$ be given. Then there exists $\bar{\alpha} > 0$ such that the inequality
$$f(x + \alpha d) \le f(x) + \gamma \alpha \nabla f(x)^\top d$$
holds for all $\alpha \in (0, \bar{\alpha}]$.
Consider the Taylor expansion,
$$f(x + \alpha d) = f(x) + \alpha \nabla f(x)^\top d + o(\alpha)$$
for $\alpha \downarrow 0$ and hence,
$$\frac{f(x + \alpha d) - f(x)}{\alpha} - \gamma \nabla f(x)^\top d \to (1 - \gamma) \nabla f(x)^\top d \quad \text{as } \alpha \downarrow 0.$$
Since $d$ is a descent direction and $\gamma < 1$, we obtain
$$(1 - \gamma) \nabla f(x)^\top d < 0,$$
so the left-hand side is negative for all sufficiently small $\alpha > 0$. So we finish the proof.
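This lemma is exactly what makes backtracking well defined: the while loop below terminates after finitely many shrinkage steps whenever $d$ is a descent direction. A minimal sketch (the parameter names `s`, `sigma`, `gamma` are the conventional ones, chosen by me):

```python
import numpy as np

def armijo_backtracking(f, grad_f, x, d, s=1.0, sigma=0.5, gamma=1e-4):
    """Backtracking line search: starting from the trial step s, shrink by
    the factor sigma until the Armijo condition
        f(x + a d) <= f(x) + gamma * a * <grad f(x), d>
    holds; the lemma above guarantees termination."""
    fx, slope = f(x), grad_f(x) @ d
    assert slope < 0, "d must be a descent direction"
    a = s
    while f(x + a * d) > fx + gamma * a * slope:
        a *= sigma
    return a

# Example usage on f(x) = ||x||^2 with the gradient direction d = -grad f(x):
f = lambda x: x @ x
grad = lambda x: 2 * x
x0 = np.array([1.0, -2.0])
print(armijo_backtracking(f, grad, x0, -grad(x0)))
```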
More on Convergence
Now we have some tough work to do on the convergence of the gradient method. Let us look at more global convergence results and provide more rigorous proofs for our previous outcomes.
Basic Global Convergence
Let $f: \mathbb{R}^n \to \mathbb{R}$ be continuously differentiable and let $(x^k)_k$ be generated by the gradient method for solving:
$$\min_{x \in \mathbb{R}^n} \ f(x)$$
with one of the following step size strategies:
- exact line search
- Armijo line search with $s > 0$ and $\sigma, \gamma \in (0, 1)$
Then, $(f(x^k))_k$ is non-increasing and every accumulation point of $(x^k)_k$ is a stationary point of $f$.
This result was already shown in Notes 6, but let us take a look at how we can derive it from scratch.
Proof of ALS
We start with an arbitrary accumulation point $\bar{x}$ of $(x^k)_k$. With either ELS or ALS, we can see that
$$f(x^{k+1}) \le f(x^k) \quad \text{for all } k.$$
Since the sequence $(f(x^k))_k$ is monotonically non-increasing, it must converge to some limit $\bar{f}$. Let
$$(x^{k_j})_j$$
be a subsequence of $(x^k)_k$ that converges to $\bar{x}$.
Since $f$ is continuous,
$$f(x^{k_j}) \to f(\bar{x}) \quad \text{as } j \to \infty.$$
However, $f(x^{k_j}) \to \bar{f}$, and by the uniqueness of the limit:
$$f(\bar{x}) = \bar{f}.$$
We first consider Armijo line search. The Armijo condition gives
$$f(x^k) - f(x^{k+1}) \ge -\gamma \alpha_k \nabla f(x^k)^\top d^k = \gamma \alpha_k \|\nabla f(x^k)\|^2.$$
It is easy to spot that if we sum up these inequalities over $k$, we will get a telescoping sum on the left-hand side. Hence,
$$\sum_{k=0}^{\infty} \gamma \alpha_k \|\nabla f(x^k)\|^2 \le f(x^0) - \bar{f}$$
is bounded. Therefore, the sequence
$$\big(\alpha_k \|\nabla f(x^k)\|^2\big)_k$$
must converge to zero.
Assume now, for contradiction, that $\nabla f(\bar{x}) \neq 0$. More precisely, along the subsequence we have
$$\|\nabla f(x^{k_j})\| \to \|\nabla f(\bar{x})\| > 0.$$
Since $\alpha_{k_j} \|\nabla f(x^{k_j})\|^2 \to 0$, we then obtain
$$\alpha_{k_j} \to 0.$$
Since $\alpha_{k_j}$ is the Armijo step size, then by definition, its previous trial step $\alpha_{k_j}/\sigma$ does not satisfy the Armijo condition, i.e.
$$f\left(x^{k_j} + \tfrac{\alpha_{k_j}}{\sigma} d^{k_j}\right) - f(x^{k_j}) > -\gamma \tfrac{\alpha_{k_j}}{\sigma} \|\nabla f(x^{k_j})\|^2.$$
Setting $\tilde{\alpha}_j = \alpha_{k_j}/\sigma$ and using the mean value theorem, there exists $t_j \in (0, \tilde{\alpha}_j)$ such that
$$f(x^{k_j} + \tilde{\alpha}_j d^{k_j}) - f(x^{k_j}) = \tilde{\alpha}_j \nabla f(x^{k_j} + t_j d^{k_j})^\top d^{k_j}.$$
Combining the two formulas, we then get
$$\nabla f(x^{k_j} + t_j d^{k_j})^\top d^{k_j} > -\gamma \|\nabla f(x^{k_j})\|^2.$$
It is easy to see, letting $j \to \infty$ (note $\tilde{\alpha}_j \to 0$, $d^{k_j} = -\nabla f(x^{k_j}) \to -\nabla f(\bar{x})$, and $\nabla f$ is continuous),
$$-\|\nabla f(\bar{x})\|^2 \ge -\gamma \|\nabla f(\bar{x})\|^2, \quad \text{i.e.,} \quad (1 - \gamma) \|\nabla f(\bar{x})\|^2 \le 0.$$
This contradicts our assumption $\nabla f(\bar{x}) \neq 0$, since $\gamma < 1$. Consequently, we now know that
$$\nabla f(\bar{x}) = 0.$$
Proof of ELS
Denote the step sizes for Armijo line search as $\alpha_k^A$ and the step sizes for exact line search as $\alpha_k^E$.
By definition, ELS minimizes $\alpha \mapsto f(x^k + \alpha d^k)$, therefore
$$f(x^k + \alpha_k^E d^k) \le f(x^k + \alpha_k^A d^k) \le f(x^k) - \gamma \alpha_k^A \|\nabla f(x^k)\|^2.$$
Then we can finish the proof for ELS in the same way as for ALS.
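For strictly convex quadratics, exact line search even has a closed form, which makes the ELS variant easy to experiment with. A self-contained sketch (the quadratic data is my own example):

```python
import numpy as np

def gd_exact_line_search(A, b, x0, tol=1e-8, max_iter=1000):
    """Gradient method on f(x) = 0.5 x^T A x - b^T x with A symmetric
    positive definite. Along d = -grad f(x), the exact line search step
    has the closed form alpha = (d^T d) / (d^T A d)."""
    x = x0.astype(float)
    for k in range(max_iter):
        g = A @ x - b                      # gradient of the quadratic
        if np.linalg.norm(g) < tol:
            break
        d = -g
        alpha = (d @ d) / (d @ (A @ d))    # exact minimizer of f(x + alpha d)
        x = x + alpha * d
    return x, k

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_hat, iters = gd_exact_line_search(A, b, np.zeros(2))
print(iters, x_hat, np.linalg.solve(A, b))   # x_hat matches A^{-1} b
```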
So our basic global convergence result says that the set of accumulation points consists of stationary points only:
$$\mathcal{A}\big((x^k)_k\big) \subseteq \{x \in \mathbb{R}^n : \nabla f(x) = 0\}.$$
The current result is weak in several aspects:
- The set of accumulation points $\mathcal{A}\big((x^k)_k\big)$ can be an empty set.
- It does not guarantee the convergence of the whole sequence.
Global Convergence Under Convexity
Let $f: \mathbb{R}^n \to \mathbb{R}$ be convex and continuously differentiable and let $(x^k)_k$ be generated by the GDM utilizing the following step size strategy:
- Armijo line search (backtracking) with $s > 0$ and $\sigma, \gamma \in (0, 1)$
Suppose that the solution set $X^* := \operatorname{argmin}_{x \in \mathbb{R}^n} f(x)$ is nonempty. Then, $(x^k)_k$ converges to a global solution of the program $\min_{x \in \mathbb{R}^n} f(x)$.
Since $(f(x^k))_k$ is non-increasing and the solution set is nonempty, we can infer that
$$f(x^k) \downarrow \bar{f} \ge f^* := \min_{x \in \mathbb{R}^n} f(x) > -\infty.$$
For every $x^* \in X^*$, we have
$$\|x^{k+1} - x^*\|^2 = \|x^k - x^*\|^2 - 2 \alpha_k \nabla f(x^k)^\top (x^k - x^*) + \alpha_k^2 \|\nabla f(x^k)\|^2 \le \|x^k - x^*\|^2 + \alpha_k^2 \|\nabla f(x^k)\|^2,$$
where we used the convexity of $f$: $\nabla f(x^k)^\top (x^k - x^*) \ge f(x^k) - f(x^*) \ge 0$.
Again, we meet our old friend the telescoping sum. Summing up, we get
$$\|x^{k+1} - x^*\|^2 \le \|x^0 - x^*\|^2 + \sum_{i=0}^{k} \alpha_i^2 \|\nabla f(x^i)\|^2 \le \|x^0 - x^*\|^2 + s \sum_{i=0}^{\infty} \alpha_i \|\nabla f(x^i)\|^2 < \infty,$$
using $\alpha_i \le s$ and the telescoping bound from the ALS analysis above.
Therefore, $(x^k)_k$ must be bounded. We infer that there exists an accumulation point $\bar{x}$ of $(x^k)_k$. Since every accumulation point is a stationary point and $f$ is convex, $\bar{x}$ is a global solution, i.e., $\bar{x} \in X^*$. Let $(x^{k_j})_j$ be a subsequence converging to $\bar{x}$. Then, for every $\varepsilon > 0$ there exists $j_0$ such that
$$\|x^{k_{j_0}} - \bar{x}\|^2 \le \frac{\varepsilon}{2} \quad \text{and} \quad \sum_{i \ge k_{j_0}} \alpha_i^2 \|\nabla f(x^i)\|^2 \le \frac{\varepsilon}{2}.$$
So specializing the estimate above to $x^* = \bar{x}$, we obtain for all $k \ge k_{j_0}$:
$$\|x^k - \bar{x}\|^2 \le \|x^{k_{j_0}} - \bar{x}\|^2 + \sum_{i \ge k_{j_0}} \alpha_i^2 \|\nabla f(x^i)\|^2 \le \varepsilon.$$
Then, it is easy to see, $(x^k)_k$ converges to $\bar{x}$.
Quasi-Fejér Monotonicity
We have encountered the telescoping sum several times now. It turns out not to be a coincidence: the proof of global convergence under convexity is closely related to quasi-Fejér monotonicity, which gives us exactly this telescoping structure.
A sequence $(x^k)_k$ is called quasi-Fejér convergent to a set $U \neq \emptyset$ if for every $u \in U$ it holds that
$$\|x^{k+1} - u\|^2 \le \|x^k - u\|^2 + \varepsilon_k \quad \text{for all } k,$$
where $(\varepsilon_k)_k$ is a nonnegative sequence satisfying
$$\sum_{k=0}^{\infty} \varepsilon_k < \infty.$$
Convergence of Quasi-Fejér Sequences
Suppose a sequence $(x^k)_k$ is quasi-Fejér convergent to a set $U \neq \emptyset$. Then, $(x^k)_k$ is bounded. Moreover, if an accumulation point $\bar{x}$ of $(x^k)_k$ belongs to $U$, then we have
$$x^k \to \bar{x}.$$
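To see the definition in action, here is a small numerical check (my own illustration on a convex quadratic): gradient descent with a constant step $\alpha = 1/L$ satisfies the quasi-Fejér inequality with $U = \{x^*\}$ and $\varepsilon_k = \alpha_k^2 \|\nabla f(x^k)\|^2$, the same summable error term as in the convexity proof above:

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)              # f(x) = 0.5 x^T A x - b^T x, convex
b = rng.standard_normal(3)
x_star = np.linalg.solve(A, b)       # unique global minimizer, U = {x_star}

x = np.zeros(3)
alpha = 1.0 / np.linalg.norm(A, 2)   # constant step 1/L
for k in range(200):
    g = A @ x - b
    x_next = x - alpha * g
    eps_k = alpha**2 * (g @ g)       # summable error term from the proof
    lhs = (x_next - x_star) @ (x_next - x_star)
    rhs = (x - x_star) @ (x - x_star) + eps_k
    assert lhs <= rhs + 1e-12        # quasi-Fejér inequality holds
    x = x_next
print("final distance to x*:", np.linalg.norm(x - x_star))
```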
Convergence Analysis under Lipschitz Continuity
In Notes 6, we have shown the basic applications of Lipschitz continuity. If $\nabla f$ is Lipschitz continuous with constant $L$, then we can derive a quadratic upper bound on the function $f$:
$$f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \|y - x\|^2 \quad \text{for all } x, y \in \mathbb{R}^n.$$
This is a direct result of the Descent Lemma:
Suppose that $f \in C^{1,1}_L(\mathbb{R}^n)$, i.e., $\nabla f$ is Lipschitz continuous with constant $L$; then it follows:
$$|f(y) - f(x) - \nabla f(x)^\top (y - x)| \le \frac{L}{2} \|y - x\|^2 \quad \text{for all } x, y \in \mathbb{R}^n.$$
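As a numerical illustration (my own example): for a quadratic, $\nabla f$ is Lipschitz with $L = \|A\|_2$, and the quadratic upper bound can be verified on random pairs of points:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)            # symmetric positive definite Hessian
b = rng.standard_normal(4)
L = np.linalg.norm(A, 2)           # Lipschitz constant of grad f

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

# Check f(y) <= f(x) + <grad f(x), y - x> + (L/2) ||y - x||^2 on random pairs.
for _ in range(1000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    bound = f(x) + grad(x) @ (y - x) + 0.5 * L * (y - x) @ (y - x)
    assert f(y) <= bound + 1e-10
print("quadratic upper bound verified")
```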
Lipschitz Continuity and Boundedness
So now the task is to compute the parameter $L$ that defines the quadratic surrogate function. For a twice continuously differentiable function ($f \in C^2$), the Lipschitz constant of $\nabla f$ is given by an upper bound on the norm of the Hessian:
$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\| \quad \forall x, y \quad \iff \quad \|\nabla^2 f(x)\| \le L \quad \forall x.$$
The theorem is well stated in Notes 6, and now we give a proof of it.
We use a very similar trick to the one in the proof of the Descent Lemma.
Let us first suppose $\|\nabla^2 f(x)\| \le L$ for all $x$. By the fundamental theorem of calculus,
$$\|\nabla f(y) - \nabla f(x)\| = \left\| \int_0^1 \nabla^2 f(x + t(y - x)) (y - x) \, dt \right\| \le L \|y - x\|.$$
To prove the other direction, choose a direction $d \in \mathbb{R}^n$ arbitrarily. With the Taylor expansion:
$$\nabla f(x + t d) - \nabla f(x) = t \nabla^2 f(x) d + o(t), \quad \text{so} \quad \|t \nabla^2 f(x) d + o(t)\| \le L t \|d\|.$$
Dividing by $t$ and letting $t \downarrow 0$, we then know
$$\|\nabla^2 f(x) d\| \le L \|d\|.$$
Since the spectral norm is an induced norm,
$$\|\nabla^2 f(x)\| = \sup_{\|d\| = 1} \|\nabla^2 f(x) d\| \le L.$$
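To make the theorem concrete, take the one-dimensional example $f(x) = \log(1 + e^x)$ (my own illustration): its second derivative $f''(x) = s(x)(1 - s(x))$ with the sigmoid $s(x) = (1 + e^{-x})^{-1}$ is bounded by $1/4$, so $f'$ is Lipschitz with $L = 1/4$:

```python
import numpy as np

sig = lambda x: 1.0 / (1.0 + np.exp(-x))   # f'(x) for f(x) = log(1 + e^x)

# sup |f''| = sup sig(x)(1 - sig(x)) = 1/4, attained at x = 0.
xs = np.linspace(-10, 10, 10001)
print("max |f''| on grid:", (sig(xs) * (1 - sig(xs))).max())   # -> 0.25

# Empirical Lipschitz ratios |f'(x) - f'(y)| / |x - y| never exceed 1/4.
rng = np.random.default_rng(2)
x, y = rng.standard_normal(1000), rng.standard_normal(1000)
print("max empirical ratio:", (np.abs(sig(x) - sig(y)) / np.abs(x - y)).max())
```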
Global Convergence Under Lipschitz Continuity
Let $f \in C^{1,1}_L(\mathbb{R}^n)$ and let $(x^k)_k$ be the sequence generated by the gradient method with one of the following step size strategies:
- constant step size $\alpha_k \equiv \alpha \in (0, \tfrac{2}{L})$
- exact line search
- Armijo line search (backtracking) with $s > 0$ and $\sigma, \gamma \in (0, 1)$
- diminishing step sizes, i.e., $\alpha_k \to 0$ and $\sum_{k=0}^{\infty} \alpha_k = \infty$.
Then, we either have $f(x^k) \to -\infty$ or $(f(x^k))_k$ converges to a finite value and it holds that $\|\nabla f(x^k)\| \to 0$ as $k \to \infty$. (In particular, every accumulation point of $(x^k)_k$ is a stationary point in this case.)
Proof of Constant Step Size
With the Descent Lemma,
$$f(x^{k+1}) \le f(x^k) - \alpha \left(1 - \frac{L \alpha}{2}\right) \|\nabla f(x^k)\|^2.$$
As long as $\alpha \in (0, \tfrac{2}{L})$, $(f(x^k))_k$ is non-increasing and it must converge to some $\bar{f}$; we can ignore the unbounded case $f(x^k) \to -\infty$. Since we meet the telescoping sum again, why not sum it up again?
$$\alpha \left(1 - \frac{L \alpha}{2}\right) \sum_{k=0}^{\infty} \|\nabla f(x^k)\|^2 \le f(x^0) - \bar{f} < \infty.$$
Since the series is convergent, its terms vanish, so
$$\|\nabla f(x^k)\| \to 0 \quad \text{as } k \to \infty.$$
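A minimal sketch of the constant step size case (my own example on the same quadratic family), with $\alpha = 1/L \in (0, 2/L)$; the gradient norms decay to zero as the telescoping argument predicts:

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)            # f(x) = 0.5 x^T A x - b^T x
b = rng.standard_normal(4)
L = np.linalg.norm(A, 2)

x = np.zeros(4)
alpha = 1.0 / L                    # constant step size in (0, 2/L)
norms = []
for k in range(500):
    g = A @ x - b
    norms.append(np.linalg.norm(g))
    x -= alpha * g
print(norms[0], norms[-1])         # ||grad f(x^k)|| -> 0
```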
Proof of Armijo Line Search and Exact Line Search
Notice that when
$$\alpha \le \frac{2(1 - \gamma)}{L},$$
the constant $\alpha$ always satisfies the requirement of ALS!
Consequently, the backtracking procedure cannot return a step size below $\sigma$ times this threshold, and we always have
$$\alpha_k \ge \min\left\{ s, \frac{2 \sigma (1 - \gamma)}{L} \right\} =: \underline{\alpha} > 0.$$
Applying the telescoping sum trick with ALS:
$$\gamma \underline{\alpha} \sum_{k=0}^{\infty} \|\nabla f(x^k)\|^2 \le \sum_{k=0}^{\infty} \gamma \alpha_k \|\nabla f(x^k)\|^2 \le f(x^0) - \bar{f} < \infty.$$
This also gives the convergence $\|\nabla f(x^k)\| \to 0$.
As for ELS,
$$f(x^k + \alpha_k^E d^k) \le f(x^k + \alpha_k^A d^k) \le f(x^k) - \gamma \alpha_k^A \|\nabla f(x^k)\|^2,$$
so we can prove it by a simple substitution.
Proof of Diminishing Step Sizes
We still have
$$f(x^{k+1}) \le f(x^k) - \alpha_k \left(1 - \frac{L \alpha_k}{2}\right) \|\nabla f(x^k)\|^2.$$
Since $\alpha_k \to 0$, there exists $K$ such that
$$\alpha_k \left(1 - \frac{L \alpha_k}{2}\right) \ge \frac{\alpha_k}{2}$$
for all $k \ge K$ sufficiently large.
Hence, $(f(x^k))_k$ is monotonically non-increasing for sufficiently large $k$. We then know that $(f(x^k))_k$ must converge to some $\bar{f}$. Still, we only consider the bounded case, where by telescoping summation, we get
$$\sum_{k=K}^{\infty} \frac{\alpha_k}{2} \|\nabla f(x^k)\|^2 \le f(x^K) - \bar{f} < \infty.$$
If we had $\liminf_{k \to \infty} \|\nabla f(x^k)\| > 0$, then $\sum_k \alpha_k \|\nabla f(x^k)\|^2 = \infty$ would follow from $\sum_k \alpha_k = \infty$, a contradiction. Hence,
$$\liminf_{k \to \infty} \|\nabla f(x^k)\| = 0.$$
We still need to show that the full limit exists:
$$\lim_{k \to \infty} \|\nabla f(x^k)\| = 0.$$
Assume on the contrary that there is $\varepsilon > 0$ and infinite index sequences $(m_j)_j$, $(n_j)_j$ with $m_j < n_j$ such that
$$\|\nabla f(x^{m_j})\| \ge 2\varepsilon, \quad \|\nabla f(x^{n_j})\| < \varepsilon, \quad \text{and} \quad \|\nabla f(x^k)\| \ge \varepsilon \text{ for } m_j \le k < n_j.$$
Now, we sum block by block,
$$\sum_{k=m_j}^{n_j - 1} \alpha_k \|\nabla f(x^k)\|^2 \ge \varepsilon^2 \sum_{k=m_j}^{n_j - 1} \alpha_k.$$
With $\sum_k \alpha_k \|\nabla f(x^k)\|^2 < \infty$ we then know that
$$\sum_{k=m_j}^{n_j - 1} \alpha_k$$
converges to $0$ as $j \to \infty$.
Consider Hölder's inequality (here in its Cauchy–Schwarz form):
$$\sum_{k=m_j}^{n_j - 1} \alpha_k \|\nabla f(x^k)\| \le \left( \sum_{k=m_j}^{n_j - 1} \alpha_k \right)^{1/2} \left( \sum_{k=m_j}^{n_j - 1} \alpha_k \|\nabla f(x^k)\|^2 \right)^{1/2}.$$
We then obtain:
$$\|x^{n_j} - x^{m_j}\| \le \sum_{k=m_j}^{n_j - 1} \alpha_k \|\nabla f(x^k)\| \to 0 \quad \text{as } j \to \infty.$$
Since $\nabla f$ is Lipschitz continuous, we can see
$$\|\nabla f(x^{n_j}) - \nabla f(x^{m_j})\| \le L \|x^{n_j} - x^{m_j}\| \to 0.$$
Finally, by the inverse triangle inequality,
$$\|\nabla f(x^{n_j}) - \nabla f(x^{m_j})\| \ge \|\nabla f(x^{m_j})\| - \|\nabla f(x^{n_j})\| \ge 2\varepsilon - \varepsilon = \varepsilon > 0.$$
This leads to a contradiction.
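Finally, a sketch of diminishing step sizes $\alpha_k = 1/(k+1)$, which satisfy $\alpha_k \to 0$ and $\sum_k \alpha_k = \infty$ (my own example; the early iterations may overshoot, but the iterates still settle down and the gradient norm goes to zero, just more slowly than with the constant step $1/L$):

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)            # f(x) = 0.5 x^T A x - b^T x
b = rng.standard_normal(4)

x = np.zeros(4)
for k in range(20000):
    g = A @ x - b
    alpha = 1.0 / (k + 1)          # diminishing and non-summable
    x -= alpha * g
print(np.linalg.norm(A @ x - b))   # ||grad f(x^k)|| -> 0, slowly
```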
Isolated Accumulation Points
$\bar{x}$ is an isolated accumulation point of $(x^k)_k$ if there is a neighborhood of $\bar{x}$ that does not contain any other accumulation points of $(x^k)_k$.
[TO BE CONT.]