# Selected Topics on Gradient Descent Method

book Public Date: 2020-10-03 11:46:24.715138

update Update Date: 2020-10-03 11:55:58.532449

In this post, we basically extend the discussions in Optimization Learning Notes 6 on Gradient Descent Method (GDM).

## More on Descent Direction

We have already seen that Descent Direction is a important concept of GDM. However, why do we choose the negative gradient as the direction at the first place?

### Steepest Descent Directions

Let $f: \mathbb R^n \to \mathbb R$ be continuously differentiable and let $x \in \mathbb R^n$ be given with $\nabla f(x) \ne 0$. Let $d \in \mathbb R^n$ denote the solution of the optimization problem: $\min_{|d| = 1} \nabla f(x)^\top d$ Every vector of the form $s = \lambda d, \lambda > 0$, is called the steepest descent direction of $f$ in $x$.

It turns out that $d = -\frac{\nabla f(x)}{|\nabla f(x)|}$ is the unique optimal solution.

By the Cauchy-Schwarz inequality, we have

$| \nabla f(x)^\top d| \le | \nabla f(x)| | d|$ The equality is satisfied if and only if $\nabla f(x)$ and $d$ are linearly dependent.

For every $d \in \mathbb R^n, | d| = 1$, we then have $\nabla f(x)^\top d \ge - | \nabla f(x)| | d| = - | \nabla f(x)|$ This is true iff $d = -\frac{\nabla f(x)}{|\nabla f(x)|}$

• The Steepest Descent Directions vary with the definition of norm and here we are only taking about the Euclidean Norm. And for $A$-norm, $s = -\lambda A^{-1}\nabla f(x), \lambda \gt 0$

## More on Step Sizes

In Notes 6, we listed 3 Step Size Strategies:

• Constant Step Size
• Exact Line Search
• Backtracking/Armijo Line Search

Another interesting strategy to consider is the Diminishing Step Sizes: $(a_k)$k, s.t., a_k\to 0, \sum{k = 0}^\infty a_k = \infty

### Armijo Condition

Let $f$ be continuously differentiable and let $d \in \mathbb R^n \backslash {0}$ be a descent direction of $f$ at $x$. Let $\gamma \in (0, 1)$ be given. Then there exists $\varepsilon > 0$ such that the inequality $f(x + \alpha d) - f(x) \le \gamma \alpha \nabla f(x)^\top d$ holds for all $\alpha \in [0, \varepsilon]$.

#### Proof

Consider the Taylor's Expansion, $f(x + \alpha d) - f(x) \le \alpha \nabla f(x)^\top d + o(| \alpha d|)$ for $\alpha \to 0$ and hence, $f(x + \alpha d) - f(x) - \gamma \alpha\nabla f(x)^\top d = (1 - \gamma)\alpha \nabla f(x)^\top d + o(|\alpha d|)$ Since $d$ is a descent direction, we obtain $\lim_{\alpha \downarrow 0} \frac{(1-\gamma) \alpha \nabla f(x)^\top d + o(|\alpha d|)}{\alpha} = (1 - \gamma)\nabla f(x)^\top d + o(|\alpha d|) \lt 0$ So we finish the proof.

## More on Convergence

Now, we have some tough works to do with the Convergence of Gradient Method. Let us look at more results of global Convergence and provide more rigorous proofs for our previous outcomes.

### Basic Global Convergence

Let $f: \mathbb R^n \mapsto \mathbb R$ be continuous differentiable and let $(x_k)^k$ be generated by the gradient method for solving:

$\min_x f(x) ~s.t. ~x \in \mathbb R^n$

with one of the following step size strategies:

• exact line search
• Armijo line search with $\sigma, \gamma \in (0, 1)$

Then, $(f(x^k))_k$ is non-increasing and every accumulation point of $(x^k)_k$ is a stationary point of $f$.

This result in already shown in Notes 6, but let us take a look at how we can derive this from scratch.

#### Proof of ALS

We start with an arbitrary accumulation point $x^\ast$ of $(x^k)_k$. With either ELS or ALS, we can see that $f(x^{k + 1}) \le f(x^k), \forall k.$

Since the sequence $(f(x^k)_k)$ is monotonically non-increasing, it must converges to some limit $\xi \in \mathbb R \cup {-\infty}$. Let

$(x^{k_l})_{l \in \mathbb N}$ be a subsequence of $(x^k)_k$ that converges to $x^\ast$.

Since $f$ is continuous, $\lim_{l \to \infty} f(x^{k_l}) = f(x^\ast)$ However, $f(x^k) \to \xi$, and by the uniqueness of the limit: $\xi = f(x^\ast)$ We first consider Armijo Line Search: $f(x^i) - f(x^{i + 1}) \ge \gamma \alpha_i | \nabla f(x^i)|^2$ It is easy to spot that if we sum up the inequalities for each $i$, we will get a rolling sum on the left hand side. Hence, $f(x_0) - f(x^\ast) = \lim_{k \to \infty} f(x_0) - f(x^k) = \lim_{k\to \infty} \sum_{k = 0}^{k - 1}f(x^i) - f(x^{i + 1}) \ge \sum_{i = 0}^\infty \gamma \alpha_i |\nabla f(x^i)|^2$ Obviously, $f(x_0) - f(x^\ast)$ is bounded. Therefore, the sequence $\left(\alpha_i | \nabla f(x^i)|\right)_i$ must converge to zero.

Assuming that $| \nabla f(x^\ast)| \ne 0$. More precisely, $\exists \epsilon \gt 0, \mathcal L \subseteq \mathbb N, | \nabla f(x^{k_\ell})| \gt \epsilon, \forall \ell \in \mathcal L$ We then obtain $\alpha_{k_l} \to 0$ Since $\alpha_{k_l}$ is the Armijo step size, then by definition, its previous step does not conform the Armijo Step Size Condition, i.e. $f(x^{k_l} + \beta^{-1} \alpha_{k_l} d^{k_l}) - f(x^{k_l}) \ge -\gamma \beta^{-1}\alpha_{k_l} | \nabla f(x^{k_l})|^2$ Setting $h^{k_l} := -\beta^{-1} \alpha_{k_l} \nabla f(x^{k_l})$ and using the mean value theorem, there exists $t_{k_l} \in [0, 1]$ such that $f(x^{k_l} + h^{k_l}) = f(x^{k_l}) + \nabla f(x^{k_l} + t_{k_l}h^{k_l})^\top [-\beta^{-1}\alpha_{k_l} \nabla f(x^{k_l})]$ Combining the two formulas, we then get $\nabla f(x^{k_l} + t_{k_l}h^{k_l})^\top \nabla f(x^{k_l}) \le \gamma |\nabla f(x^{k_l})|^2$ It is easy to see, $\lim_{l \to \infty, l \in \mathcal L} \nabla f(x^{k_l} + t_{k_l}h^{k_l})^\top \nabla f(x^{k_l}) = |\nabla f(x^\ast)|^2$ which yields $(1 - \gamma) |\nabla f(x^{k_l})|^2 \le 0$ This is contradict to our assumption. Consequently, we now know that $\nabla f(x^\ast) = 0$

#### Proof of ELS

Denote the step sizes for Armijo line search as $\alpha_{k}^a$ and the step sizes for exact line search as $\alpha_{k}^e$.

By definition, ELS minimizes $f(x^{k} + \alpha_kd^k)$, therefore $f(x^k + \alpha_k^e d^k) - f(x^k) \le f(x^k + \alpha_k^a d^k) - f(x^k) \le - \gamma \alpha_k^a | \nabla f(x^k)|^2$ Then we can finish the proof of ELS with the same way of ALS.

### Convergence Analysis

We introduce

• $\mathcal X^* := {x \in \mathbb R^n : f(x) = \inf_{x \in \mathbb R^n} f(x)}$

• $\mathcal S := { x \in \mathbb R^n: \nabla f(x) = 0}$

• $\mathfrak A := {x \in \mathbb R^n : \exists (k_\ell)_\ell ~s.t.~x^{k_l} \to x, \ell \to \infty}$

So our Basic Global Convergence result describes $\mathfrak A \subseteq \mathcal S$ The current result is weak in several aspects:

• $\mathfrak A$ can be an empty set
• It does not guarantee the convergence of the whole sequence.

### Global Convergence Under Convexity

Let $f: \mathbb R^n \to \mathbb R$ be convex and continuously differentiable and let $(x^k)_k$ be generated by the GDM utilizing the following step size strategy:

• Armijo line search (backtracking) with $\sigma, \gamma \in (0, 1)$ and $s > 0$

Suppose that the solution set $\mathcal X^\ast$ is nonempty. Then, $(x_k)^k$ converges to a global solution of the program $\min_{x \in \mathbb R^n} f(x)$.

Since $(f(x^k))_k$ is non-increasing and infima set $\mathcal X^\ast$ is nonempty, we can infer that $\exist f^\ast \in \mathbb R, (f(x^k))_k \to f^\ast$ For every $x \in \mathcal X^\ast$, we have \begin{aligned} | x^{k + 1} - x|^2 & = | x^{k + 1} - x^k + x^k - x| \ & = | x^k - x - \alpha_k\nabla(x^k)| \ & = | x^k - x|^2 - 2\alpha_k \left< \nabla f(x^k), x^k - x\right> + \alpha^2_k | \nabla f(x^k)|^2 \ &\le | x^k - x|^2 + 2\alpha_k[f(x) - f(x^k)] + s\alpha_k |\nabla f(x^k) |^2 \ &\le | x^k - x|^2 + \frac{s}{\gamma}[f(x) - f(x^k)] \end{aligned} Again, we meet our old friend rolling sum. Sum them up, we get $| x^{\ell + 1} - x|^2 \le | x^0 - x|^2 + \frac{s}{\gamma} [f(x^0) - f(x^{\ell + 1})] \le |x^0 - x |^2 + \frac{s}{\gamma} [f(x^0) - f^\ast]$ Therefore, $(x^k)_k$ must be bounded. We infer that there exists an accumulation point $\overline x \in \mathfrak A$ of $(x^k)_k$. Since every accumulation point is a stationary point, $\overline x$ is stationary. Let $(x^{k_l})_l$ be a subsequence converging to $\overline x$. Then, for every $\epsilon > 0$ there exists $L \in \mathbb N$ such that $| x^{k_l} - \overline x| < \frac{\epsilon}{2} \text{ and } f(x^{k_l}) - f^\ast < \frac{\gamma \epsilon}{2s}$ So specializing $x$ to $\overline x$, we obtain: $| x^{k + 1} - \overline x|^2 \le | x^k - x|^2 + \frac{s}{\gamma}[f(x) - f(x^k)] < \epsilon$ Then, it is easy to see, $(x^k)_k$ converges to $\overline x$

### Quasi Fejér Monotonicity

We have encountered rolling sum for several times. It turns out not to be a coincidence. Global Convergence Under Convexity is actually closely related to Quasi Fejér Monotonicity, which gives us a rolling sum structure.

A sequence $(y^k)_k$ is called Quasi Fejér convergent to a set $\mathcal Y \subset \mathbb R^n$ if it holds that $| y^{k+1} - y|^2 \le (1 + \beta_k)| y^k - y|^2 + \gamma_k, \forall k \in \mathbb N, \forall y \in \mathcal Y$ where $(\beta_k), (\gamma_k)$ are nonnegative sequences satisfying $\sum^\infty \beta_k = 0 \ \sum^\infty \gamma_k = 0$

#### Convergence of Quasi Fejér Sequences

Suppose we have a sequence $(y^k)_k$ is called Quasi Fejér convergent to a set $\mathcal Y \subset \mathbb R^n$. Then, $(y^k)_k$ is bounded. Moreover, if an accumulation point $y$ of $(y^k)_k$ belongs to $\mathcal Y$, then we have

$\lim_{k\to \infty} y^k = y$

### Convergence Analysis under Lipschitz Continuity

In Notes 6, we have shown the basic applications of Lipschitz Continuity. If $\nabla f$ is Lipschitz Continuous, then we can derive a quadratic upper bound on the function $f$: $f(y) \le f(x) + \left<\nabla f(x), y - x\right> + \frac{1}{2}L | y - x|^2, \forall x, y \in \mathbb R^n$ This is a direct result from the Descent Lemma:

#### Descent Lemma

Suppose that $f \in C_L^{1,1}(\mathbb R^n)$, then it follows: $|f(y) - f(x) - \left<\nabla f(x), y - x\right>|\le \frac{L}{2} |y - x|^2, \forall x, y \in \mathbb R^n$

##### Proof

Define \begin{aligned} \phi &: &\mathbb R\mapsto &\mathbb R \ \phi &: &t \to &f(x + t(y - x)) \end{aligned} Rewrite \begin{aligned} f(y) - f(x) & = \phi(1) - \phi(0) \ & = \int^1_0 \phi'(t) ~\mathbb dt \ & = \int^1_0 \nabla f(x + t(y - x))^\top(y -x) ~ \mathbb dt \end{aligned} Therefore, \begin{aligned} | f(y) - f(x) - \left<\nabla f(x),x - y\right>| &= \left| \int^1_0 \left[\nabla f(x + t(y - x)) - \nabla f(x)\right]^\top(y -x) ~ \mathbb dt\right| \ &\le \int^1_0 \left|\nabla f(x + t(y - x)) - \nabla f(x)\right||y -x| ~ \mathbb dt\ &\le L | y - x|^2 \cdot \int^1_0 t~\mathbb dt \ &= \frac{L}{2}|y - x| \end{aligned}

#### Lipschitz Continuity and Boundedness

So now it is to calculate the quadratic surrogate function if we know the parameter $L$. For twice continuous differentiable Lipschitz continuous function ($C_L^{1,1}(\mathbb R^n)$), $L$ is given by the upperbound of the notm of Hessian. The theorem is well-stated in Notes 6, and now we give a proof to it.

##### Proof

We use a very similar trick as the proof of Descent Lemma.

Define \begin{aligned} \phi &: &\mathbb R\mapsto &\mathbb R \ \phi &: &t \to &\nabla(x + t(y - x)) \end{aligned} Rewrite \begin{aligned} \nabla f(y) - \nabla f(x) & = \phi(1) - \phi(0) \ & = \int^1_0 \phi'(t) ~\mathbb dt \ & = \int^1_0 \nabla^2 f(x + t(y - x))(y -x) ~ \mathbb dt \end{aligned} Let us first suppose $| \nabla^2 f(x) | \le L$ \begin{aligned} | \nabla f(y) - \nabla f(x)| &\le \left|\int^1_0 \nabla^2 f(x + t(y - x))(y -x) ~ \mathbb dt\right | \ &\le \int^1_0 \left|\nabla^2 f(x + t(y - x))(y -x)\right | ~ \mathbb dt \ &\le \int^1_0 L\left|y - x\right| ~ \mathbb dt \ &\le L| y - x| \end{aligned} To proof the other direction, choose $h \in \mathbb R^n$ in arbitrary direction. With Taylor Expansion: $Lt| h| \ge | \nabla f(x + th) - \nabla f(x)| = | t\nabla^2f(x) h + o(t|h|)|, t\to 0$ Dividing $t$, we then know $| \nabla^2 f(x) h| \le L | h|$ Since the spectral norm is an induced norm, $| \nabla^2f(x) | \le L$

#### Global Convergence Under Lipschitz Continuity

Let $f \in C_L^{1, 1}(\mathbb R^n)$ and let $(x^k)_k$ be the sequence generated by the gradient method with one of the following step size strategies:

• constant step size $\overline \alpha \in (0, \frac{2}{L})$
• exact line search
• Armijo line search (backtracking) with $\sigma, \gamma \in (0, 1)$ and $s > 0$
• diminishing step sizes, i.e., $\alpha_k \to 0$ and $\sum_{k = 0}^\infty \alpha_k = \infty$.

Then, we either have $f(x^k) \to -\infty$ or $(f(x^k))_k$ converges to a finite value and it holds that $\nabla f(x^k) \to 0$ as $k \to \infty$. (In particular, very accumulation point of $(x^k)_k$ is a stationary point in this case).

##### Proof of Constant Step Size

With Descent Lemma, $f(x^{k + 1}) - f(x^k) \le - \overline{\alpha} \left| \nabla f(x^k)\right|^2 + \frac{L\overline \alpha^2}{2} \left| \nabla f(x^k)\right|^2 = \left[\frac{L\overline \alpha}{2} - 1\right]\overline \alpha \left| \nabla f(x^k)\right|^2$ As long as $\overline \alpha \in (0, \frac{2}{L})$, $(f(x^k))_k$ is decreasing and it must converge to some $\xi \in \mathbb R \cup {-\infty}$. We can ignore the unbounded cases. Since we meet the rolling sum again, and why not sum them up again?

$\left[\frac{L\overline \alpha}{2} - 1\right]\overline \alpha \sum_{k = 0}^M\left| \nabla f(x^k)\right|^2 \le \sum_{k = 0}^M \left[f(x^{k + 1}) - f(x^k)\right] = f(x^0) - f(x^{M + 1}) \le f(x^0) - \xi$

As $M \to \infty$, we can see that $\sum_{k = 0}^M\left| \nabla f(x^k)\right|^2$ convergent, so $\nabla f(x^k) \to 0, k \to \infty$

##### Proof of Armijo Linear Search and Exact Line Search

Notice that when $\frac{L\alpha}{2} - 1 \le -\gamma \iff \alpha \le \frac{2(1 - \gamma)}{L}$ the constant $\alpha$ is always satisfies the requirement of ALS!

Therefore, setting $\tau := \min \left{s, \frac{2\sigma(1 - \gamma)}{L}\right}$ We always have $\alpha_k \ge \tau$.

Applying the the rolling sum trick with ALS: $f(x^0) - \xi = \lim_{k\to\infty}\sum_{i = 0}^{k - 1}\left[f(x^i) - f(x^{i + 1})\right] \ge \sum_{i = 0}^\infty \gamma \alpha_i \left| \nabla f(x^i) \right|^2 \ge \sum_{i = 0}^\infty \gamma \tau \left| \nabla f(x^i) \right|^2$ This also gives the convergence.

As for ELS,

Since, $\lim_{k\to\infty}\sum_{i = 0}^{k - 1}\left[f(x^i) - f(x^i + \alpha_k^e d)\right] \ge\lim_{k\to\infty}\sum_{i = 0}^{k - 1}\left[f(x^i) - f(x^i + \alpha_k^a d)\right]$ We can proof it by a simple subtitution.

##### Proof of Diminishing Step Sizes

We still have $f(x^{k + 1}) - f(x^k) \le \left[\frac{L\overline \alpha_k}{2} - 1\right] \alpha_k \left| \nabla f(x^k)\right|^2$ Since $\alpha_k \to 0$, there exists $c > 0$ such that $f(x^{k + 1}) - f(x^k) \le -c \alpha_k | \nabla f(x^k)|^2$ for all $k$ sufficiently large.

Hence, $f(x^{k + 1}) - f(x^k)$ should be monotonically decreasing for $k$ sufficiently large. We then know that $f(x^k)$ must

converge to some $\xi \in \mathbb R \cup {-\infty}$. Still, we only consider the bounded case, where by rolling summation, we get $\sum_{k = 0}^\infty \alpha_k | \nabla f(x^k)|^2 < \infty \Rightarrow \lim_{k\to\infty} \alpha_k | \nabla f(x^k)|^2 = 0$ This implies $\liminf_{k\to \infty} | \nabla f(x^k)| = 0.$ Otherwise if $| \nabla f(x^k)| \ge \epsilon \gt 0$, we have $\sum_{k = 0}^\infty \alpha_k | \nabla f(x^k)|^2 \ge \epsilon^2 \sum_{k = 0}^\infty \alpha_k = 0$ We still need to show $\lim_{k\to \infty} | \nabla f(x^k)| = 0.$ Assume on the contrary that we can find infinite sequences $(t_i)_i, (\ell_i)_i$ such that $\ell_i \gt t_i$ and $| \nabla f(x^{t_i}) \ge 2\varepsilon|, | \nabla f(x^{\ell_i})| \lt \epsilon$ Additionally, $\forall k = t_i + 1, \dots, \ell_i - 1, | \nabla f(x^k)| \ge \varepsilon$ Now, we sum block by block,

$\infty \gt \sum_{k = 0}^\infty \alpha_k | \nabla f(x^k)|^2 \ge \sum_{i = 1}^\infty\sum_{k = t_i}^{\ell_i - i} \alpha_k | \nabla f(x^k)|^2 \ge \epsilon^2 \sum_{i = 1}^\infty\sum_{k = t_i}^{\ell_i - i} \alpha_k$

With then know $\beta_i = \sum_{k = t_i}^{\ell_i - i} \alpha_k$ converges to $0$.

Consider the Hölder’s inequality: $\sum_{k = 1}^K |a_kb_k| \le \left[\sum_{k = 1}^Ka_k\right]^\frac{1}{2}\left[\sum_{k = 1}^Kb_k\right]^\frac{1}{2}$ We then obtain: \begin{aligned} | x^{\ell_i } - x^{t_i} | &\le \sum_{k = t_i}^{\ell_i - i} \left| x^{k + 1} - x^k\right| \ &= \sum_{k = t_i}^{\ell_i - i}\sqrt{\alpha_k}\sqrt{\alpha_k} \left| \nabla f(x^k)\right| \ &\le \sqrt{\beta_i} \cdot \sqrt{\sum_{k = t_i}^{\ell_i - i} \alpha_k \left| \nabla f(x^k)\right|^2} \ &\le \sqrt{\beta_i} \cdot \sqrt{\sum_{k = 0}^{\infty} \alpha_k \left| \nabla f(x^k)\right|^2} \ \end{aligned} Since $\beta_i \to 0$, we can see $| x^{\ell_i } - x^{t_i} | \to 0$ Finally, by the inverse triangle inequality $\epsilon \le \left| \left| \nabla f(x^{\ell_i})\right| - \left| \nabla f(x^{t_i})\right| \right| \le \left| \nabla f(x^{\ell_i}) - \nabla f(x^{t_i})\right|\le L | x^{\ell_i } - x^{t_i} | \to 0$ This leads to a contradiction.

## Isolated Accumulation Points

$\overline x$ is the Isolated Accumulation Point if there is a neighborhood of $\overline x$ that does not contain any other accumulation points of $(x^k)_k$.

[TO BE CONT.]