Selected Topics on Gradient Descent Method

book Public Date: 2020-10-03 11:46:24.715138

update Update Date: 2020-10-03 11:55:58.532449

In this post, we basically extend the discussions in Optimization Learning Notes 6 on Gradient Descent Method (GDM).

More on Descent Direction

We have already seen that Descent Direction is a important concept of GDM. However, why do we choose the negative gradient as the direction at the first place?

Steepest Descent Directions

Let f:RnRf: \mathbb R^n \to \mathbb R be continuously differentiable and let xRnx \in \mathbb R^n be given with f(x)0\nabla f(x) \ne 0. Let dRnd \in \mathbb R^n denote the solution of the optimization problem: mind=1f(x)d \min_{|d| = 1} \nabla f(x)^\top d Every vector of the form s=λd,λ>0s = \lambda d, \lambda > 0, is called the steepest descent direction of ff in xx.

It turns out that d=f(x)f(x) d = -\frac{\nabla f(x)}{|\nabla f(x)|} is the unique optimal solution.

By the Cauchy-Schwarz inequality, we have

f(x)df(x)d | \nabla f(x)^\top d| \le | \nabla f(x)| | d| The equality is satisfied if and only if f(x)\nabla f(x) and dd are linearly dependent.

For every dRn,d=1d \in \mathbb R^n, | d| = 1, we then have f(x)df(x)d=f(x) \nabla f(x)^\top d \ge - | \nabla f(x)| | d| = - | \nabla f(x)| This is true iff d=f(x)f(x) d = -\frac{\nabla f(x)}{|\nabla f(x)|}

  • The Steepest Descent Directions vary with the definition of norm and here we are only taking about the Euclidean Norm. And for AA-norm, s=λA1f(x),λ>0s = -\lambda A^{-1}\nabla f(x), \lambda \gt 0

More on Step Sizes

In Notes 6, we listed 3 Step Size Strategies:

  • Constant Step Size
  • Exact Line Search
  • Backtracking/Armijo Line Search

Another interesting strategy to consider is the Diminishing Step Sizes: (ak)k,s.t.,ak0,k=0ak= (a_k)k, s.t., a_k\to 0, \sum{k = 0}^\infty a_k = \infty

Armijo Condition

Let ff be continuously differentiable and let dRn</mi>{0}d \in \mathbb R^n \backslash {0} be a descent direction of ff at xx. Let γ(0,1)\gamma \in (0, 1) be given. Then there exists ε>0\varepsilon > 0 such that the inequality f(x+αd)f(x)γαf(x)d f(x + \alpha d) - f(x) \le \gamma \alpha \nabla f(x)^\top d holds for all α[0,ε]\alpha \in [0, \varepsilon].


Consider the Taylor's Expansion, f(x+αd)f(x)αf(x)d+o(αd) f(x + \alpha d) - f(x) \le \alpha \nabla f(x)^\top d + o(| \alpha d|) for α0\alpha \to 0 and hence, f(x+αd)f(x)γαf(x)d=(1γ)αf(x)d+o(αd) f(x + \alpha d) - f(x) - \gamma \alpha\nabla f(x)^\top d = (1 - \gamma)\alpha \nabla f(x)^\top d + o(|\alpha d|) Since dd is a descent direction, we obtain limα0(1γ)αf(x)d+o(αd)α=(1γ)f(x)d+o(αd)<0 \lim_{\alpha \downarrow 0} \frac{(1-\gamma) \alpha \nabla f(x)^\top d + o(|\alpha d|)}{\alpha} = (1 - \gamma)\nabla f(x)^\top d + o(|\alpha d|) \lt 0 So we finish the proof.

More on Convergence

Now, we have some tough works to do with the Convergence of Gradient Method. Let us look at more results of global Convergence and provide more rigorous proofs for our previous outcomes.

Basic Global Convergence

Let f:RnRf: \mathbb R^n \mapsto \mathbb R be continuous differentiable and let (xk)k(x_k)^k be generated by the gradient method for solving:

minxf(x) s.t. xRn \min_x f(x) ~s.t. ~x \in \mathbb R^n

with one of the following step size strategies:

  • exact line search
  • Armijo line search with σ,γ(0,1)\sigma, \gamma \in (0, 1)

Then, (f(xk))k(f(x^k))_k is non-increasing and every accumulation point of (xk)k(x^k)_k is a stationary point of ff.

This result in already shown in Notes 6, but let us take a look at how we can derive this from scratch.

Proof of ALS

We start with an arbitrary accumulation point xx^\ast of (xk)k(x^k)_k. With either ELS or ALS, we can see that f(xk+1)f(xk),k. f(x^{k + 1}) \le f(x^k), \forall k.

Since the sequence (f(xk)k)(f(x^k)_k) is monotonically non-increasing, it must converges to some limit ξR{}\xi \in \mathbb R \cup {-\infty}. Let

(xkl)lN(x^{k_l})_{l \in \mathbb N} be a subsequence of (xk)k(x^k)_k that converges to xx^\ast.

Since ff is continuous, limlf(xkl)=f(x) \lim_{l \to \infty} f(x^{k_l}) = f(x^\ast) However, f(xk)ξf(x^k) \to \xi, and by the uniqueness of the limit: ξ=f(x) \xi = f(x^\ast) We first consider Armijo Line Search: f(xi)f(xi+1)γαif(xi)2 f(x^i) - f(x^{i + 1}) \ge \gamma \alpha_i | \nabla f(x^i)|^2 It is easy to spot that if we sum up the inequalities for each ii, we will get a rolling sum on the left hand side. Hence, f(x0)f(x)=limkf(x0)f(xk)=limkk=0k1f(xi)f(xi+1)i=0γαif(xi)2 f(x_0) - f(x^\ast) = \lim_{k \to \infty} f(x_0) - f(x^k) = \lim_{k\to \infty} \sum_{k = 0}^{k - 1}f(x^i) - f(x^{i + 1}) \ge \sum_{i = 0}^\infty \gamma \alpha_i |\nabla f(x^i)|^2 Obviously, f(x0)f(x) f(x_0) - f(x^\ast) is bounded. Therefore, the sequence (αif(xi))i \left(\alpha_i | \nabla f(x^i)|\right)_i must converge to zero.

Assuming that f(x)0| \nabla f(x^\ast)| \ne 0. More precisely, ϵ>0,LN,f(xk)>ϵ,L \exists \epsilon \gt 0, \mathcal L \subseteq \mathbb N, | \nabla f(x^{k_\ell})| \gt \epsilon, \forall \ell \in \mathcal L We then obtain αkl0 \alpha_{k_l} \to 0 Since αkl\alpha_{k_l} is the Armijo step size, then by definition, its previous step does not conform the Armijo Step Size Condition, i.e. f(xkl+β1αkldkl)f(xkl)γβ1αklf(xkl)2 f(x^{k_l} + \beta^{-1} \alpha_{k_l} d^{k_l}) - f(x^{k_l}) \ge -\gamma \beta^{-1}\alpha_{k_l} | \nabla f(x^{k_l})|^2 Setting hkl:=β1αklf(xkl)h^{k_l} := -\beta^{-1} \alpha_{k_l} \nabla f(x^{k_l}) and using the mean value theorem, there exists tkl[0,1]t_{k_l} \in [0, 1] such that f(xkl+hkl)=f(xkl)+f(xkl+tklhkl)[β1αklf(xkl)] f(x^{k_l} + h^{k_l}) = f(x^{k_l}) + \nabla f(x^{k_l} + t_{k_l}h^{k_l})^\top [-\beta^{-1}\alpha_{k_l} \nabla f(x^{k_l})] Combining the two formulas, we then get f(xkl+tklhkl)f(xkl)γf(xkl)2 \nabla f(x^{k_l} + t_{k_l}h^{k_l})^\top \nabla f(x^{k_l}) \le \gamma |\nabla f(x^{k_l})|^2 It is easy to see, liml,lLf(xkl+tklhkl)f(xkl)=f(x)2 \lim_{l \to \infty, l \in \mathcal L} \nabla f(x^{k_l} + t_{k_l}h^{k_l})^\top \nabla f(x^{k_l}) = |\nabla f(x^\ast)|^2 which yields (1γ)f(xkl)20 (1 - \gamma) |\nabla f(x^{k_l})|^2 \le 0 This is contradict to our assumption. Consequently, we now know that f(x)=0 \nabla f(x^\ast) = 0

Proof of ELS

Denote the step sizes for Armijo line search as αka\alpha_{k}^a and the step sizes for exact line search as αke\alpha_{k}^e.

By definition, ELS minimizes f(xk+αkdk)f(x^{k} + \alpha_kd^k), therefore f(xk+αkedk)f(xk)f(xk+αkadk)f(xk)γαkaf(xk)2 f(x^k + \alpha_k^e d^k) - f(x^k) \le f(x^k + \alpha_k^a d^k) - f(x^k) \le - \gamma \alpha_k^a | \nabla f(x^k)|^2 Then we can finish the proof of ELS with the same way of ALS.

Convergence Analysis

We introduce

  • X:={xRn:f(x)=infxRnf(x)} \mathcal X^* := {x \in \mathbb R^n : f(x) = \inf_{x \in \mathbb R^n} f(x)}

  • S:={xRn:f(x)=0} \mathcal S := { x \in \mathbb R^n: \nabla f(x) = 0}

  • A:={xRn:(k) s.t. xklx,} \mathfrak A := {x \in \mathbb R^n : \exists (k_\ell)_\ell ~s.t.~x^{k_l} \to x, \ell \to \infty}

So our Basic Global Convergence result describes AS \mathfrak A \subseteq \mathcal S The current result is weak in several aspects:

  • A\mathfrak A can be an empty set
  • It does not guarantee the convergence of the whole sequence.

Global Convergence Under Convexity

Let f:RnRf: \mathbb R^n \to \mathbb R be convex and continuously differentiable and let (xk)k(x^k)_k be generated by the GDM utilizing the following step size strategy:

  • Armijo line search (backtracking) with σ,γ(0,1)\sigma, \gamma \in (0, 1) and s>0s > 0

Suppose that the solution set X\mathcal X^\ast is nonempty. Then, (xk)k(x_k)^k converges to a global solution of the program minxRnf(x)\min_{x \in \mathbb R^n} f(x).

Since (f(xk))k(f(x^k))_k is non-increasing and infima set X\mathcal X^\ast is nonempty, we can infer that fR,(f(xk))kf \exist f^\ast \in \mathbb R, (f(x^k))_k \to f^\ast For every xXx \in \mathcal X^\ast, we have xk+1x2=xk+1xk+xkx=xkxαk(xk)=xkx22αk<f(xk),xkx>+αk2f(xk)2xkx2+2αk[f(x)f(xk)]+sαkf(xk)2xkx2+sγ[f(x)f(xk)] \begin{aligned} | x^{k + 1} - x|^2 & = | x^{k + 1} - x^k + x^k - x| \ & = | x^k - x - \alpha_k\nabla(x^k)| \ & = | x^k - x|^2 - 2\alpha_k \left< \nabla f(x^k), x^k - x\right> + \alpha^2_k | \nabla f(x^k)|^2 \ &\le | x^k - x|^2 + 2\alpha_k[f(x) - f(x^k)] + s\alpha_k |\nabla f(x^k) |^2 \ &\le | x^k - x|^2 + \frac{s}{\gamma}[f(x) - f(x^k)] \end{aligned} Again, we meet our old friend rolling sum. Sum them up, we get x+1x2x0x2+sγ[f(x0)f(x+1)]x0x2+sγ[f(x0)f] | x^{\ell + 1} - x|^2 \le | x^0 - x|^2 + \frac{s}{\gamma} [f(x^0) - f(x^{\ell + 1})] \le |x^0 - x |^2 + \frac{s}{\gamma} [f(x^0) - f^\ast] Therefore, (xk)k(x^k)_k must be bounded. We infer that there exists an accumulation point xA\overline x \in \mathfrak A of (xk)k(x^k)_k. Since every accumulation point is a stationary point, x\overline x is stationary. Let (xkl)l(x^{k_l})_l be a subsequence converging to x\overline x. Then, for every ϵ>0\epsilon > 0 there exists LNL \in \mathbb N such that xklx<ϵ2 and f(xkl)f<γϵ2s | x^{k_l} - \overline x| < \frac{\epsilon}{2} \text{ and } f(x^{k_l}) - f^\ast < \frac{\gamma \epsilon}{2s} So specializing xx to x\overline x, we obtain: xk+1x2xkx2+sγ[f(x)f(xk)]<ϵ | x^{k + 1} - \overline x|^2 \le | x^k - x|^2 + \frac{s}{\gamma}[f(x) - f(x^k)] < \epsilon Then, it is easy to see, (xk)k(x^k)_k converges to x\overline x

Quasi Fejér Monotonicity

We have encountered rolling sum for several times. It turns out not to be a coincidence. Global Convergence Under Convexity is actually closely related to Quasi Fejér Monotonicity, which gives us a rolling sum structure.

A sequence (yk)k(y^k)_k is called Quasi Fejér convergent to a set YRn\mathcal Y \subset \mathbb R^n if it holds that yk+1y2(1+βk)yky2+γk,kN,yY | y^{k+1} - y|^2 \le (1 + \beta_k)| y^k - y|^2 + \gamma_k, \forall k \in \mathbb N, \forall y \in \mathcal Y where (βk),(γk)(\beta_k), (\gamma_k) are nonnegative sequences satisfying βk=0γk=0 \sum^\infty \beta_k = 0 \ \sum^\infty \gamma_k = 0

Convergence of Quasi Fejér Sequences

Suppose we have a sequence (yk)k(y^k)_k is called Quasi Fejér convergent to a set YRn\mathcal Y \subset \mathbb R^n. Then, (yk)k(y^k)_k is bounded. Moreover, if an accumulation point yy of (yk)k(y^k)_k belongs to Y\mathcal Y, then we have

limkyk=y \lim_{k\to \infty} y^k = y

Convergence Analysis under Lipschitz Continuity

In Notes 6, we have shown the basic applications of Lipschitz Continuity. If f\nabla f is Lipschitz Continuous, then we can derive a quadratic upper bound on the function ff: f(y)f(x)+<f(x),yx>+12Lyx2,x,yRn f(y) \le f(x) + \left<\nabla f(x), y - x\right> + \frac{1}{2}L | y - x|^2, \forall x, y \in \mathbb R^n This is a direct result from the Descent Lemma:

Descent Lemma

Suppose that fCL1,1(Rn)f \in C_L^{1,1}(\mathbb R^n), then it follows: f(y)f(x)<f(x),yx>L2yx2,x,yRn |f(y) - f(x) - \left<\nabla f(x), y - x\right>|\le \frac{L}{2} |y - x|^2, \forall x, y \in \mathbb R^n


Define ϕ:RRϕ:tf(x+t(yx)) \begin{aligned} \phi &: &\mathbb R\mapsto &\mathbb R \ \phi &: &t \to &f(x + t(y - x)) \end{aligned} Rewrite f(y)f(x)=ϕ(1)ϕ(0)=01ϕ(t) dt=01f(x+t(yx))(yx) dt \begin{aligned} f(y) - f(x) & = \phi(1) - \phi(0) \ & = \int^1_0 \phi'(t) ~\mathbb dt \ & = \int^1_0 \nabla f(x + t(y - x))^\top(y -x) ~ \mathbb dt \end{aligned} Therefore, f(y)f(x)<f(x),xy>=01[f(x+t(yx))f(x)](yx) dt01f(x+t(yx))f(x)yx dtLyx201t dt=L2yx \begin{aligned} | f(y) - f(x) - \left<\nabla f(x),x - y\right>| &= \left| \int^1_0 \left[\nabla f(x + t(y - x)) - \nabla f(x)\right]^\top(y -x) ~ \mathbb dt\right| \ &\le \int^1_0 \left|\nabla f(x + t(y - x)) - \nabla f(x)\right||y -x| ~ \mathbb dt\ &\le L | y - x|^2 \cdot \int^1_0 t~\mathbb dt \ &= \frac{L}{2}|y - x| \end{aligned}

Lipschitz Continuity and Boundedness

So now it is to calculate the quadratic surrogate function if we know the parameter LL. For twice continuous differentiable Lipschitz continuous function (CL1,1(Rn)C_L^{1,1}(\mathbb R^n)), LL is given by the upperbound of the notm of Hessian. The theorem is well-stated in Notes 6, and now we give a proof to it.


We use a very similar trick as the proof of Descent Lemma.

Define ϕ:RRϕ:t(x+t(yx)) \begin{aligned} \phi &: &\mathbb R\mapsto &\mathbb R \ \phi &: &t \to &\nabla(x + t(y - x)) \end{aligned} Rewrite f(y)f(x)=ϕ(1)ϕ(0)=01ϕ(t) dt=012f(x+t(yx))(yx) dt \begin{aligned} \nabla f(y) - \nabla f(x) & = \phi(1) - \phi(0) \ & = \int^1_0 \phi'(t) ~\mathbb dt \ & = \int^1_0 \nabla^2 f(x + t(y - x))(y -x) ~ \mathbb dt \end{aligned} Let us first suppose 2f(x)L| \nabla^2 f(x) | \le L f(y)f(x)012f(x+t(yx))(yx) dt012f(x+t(yx))(yx) dt01Lyx dtLyx \begin{aligned} | \nabla f(y) - \nabla f(x)| &\le \left|\int^1_0 \nabla^2 f(x + t(y - x))(y -x) ~ \mathbb dt\right | \ &\le \int^1_0 \left|\nabla^2 f(x + t(y - x))(y -x)\right | ~ \mathbb dt \ &\le \int^1_0 L\left|y - x\right| ~ \mathbb dt \ &\le L| y - x| \end{aligned} To proof the other direction, choose hRnh \in \mathbb R^n in arbitrary direction. With Taylor Expansion: Lthf(x+th)f(x)=t2f(x)h+o(th),t0 Lt| h| \ge | \nabla f(x + th) - \nabla f(x)| = | t\nabla^2f(x) h + o(t|h|)|, t\to 0 Dividing tt, we then know 2f(x)hLh | \nabla^2 f(x) h| \le L | h| Since the spectral norm is an induced norm, 2f(x)L | \nabla^2f(x) | \le L

Global Convergence Under Lipschitz Continuity

Let fCL1,1(Rn)f \in C_L^{1, 1}(\mathbb R^n) and let (xk)k(x^k)_k be the sequence generated by the gradient method with one of the following step size strategies:

  • constant step size α(0,2L)\overline \alpha \in (0, \frac{2}{L})
  • exact line search
  • Armijo line search (backtracking) with σ,γ(0,1)\sigma, \gamma \in (0, 1) and s>0s > 0
  • diminishing step sizes, i.e., αk0\alpha_k \to 0 and k=0αk=\sum_{k = 0}^\infty \alpha_k = \infty.

Then, we either have f(xk)f(x^k) \to -\infty or (f(xk))k(f(x^k))_k converges to a finite value and it holds that f(xk)0\nabla f(x^k) \to 0 as kk \to \infty. (In particular, very accumulation point of (xk)k(x^k)_k is a stationary point in this case).

Proof of Constant Step Size

With Descent Lemma, f(xk+1)f(xk)αf(xk)2+Lα22f(xk)2=[Lα21]αf(xk)2 f(x^{k + 1}) - f(x^k) \le - \overline{\alpha} \left| \nabla f(x^k)\right|^2 + \frac{L\overline \alpha^2}{2} \left| \nabla f(x^k)\right|^2 = \left[\frac{L\overline \alpha}{2} - 1\right]\overline \alpha \left| \nabla f(x^k)\right|^2 As long as α(0,2L)\overline \alpha \in (0, \frac{2}{L}), (f(xk))k(f(x^k))_k is decreasing and it must converge to some ξR{}\xi \in \mathbb R \cup {-\infty}. We can ignore the unbounded cases. Since we meet the rolling sum again, and why not sum them up again?

[Lα21]αk=0Mf(xk)2k=0M[f(xk+1)f(xk)]=f(x0)f(xM+1)f(x0)ξ \left[\frac{L\overline \alpha}{2} - 1\right]\overline \alpha \sum_{k = 0}^M\left| \nabla f(x^k)\right|^2 \le \sum_{k = 0}^M \left[f(x^{k + 1}) - f(x^k)\right] = f(x^0) - f(x^{M + 1}) \le f(x^0) - \xi

As MM \to \infty, we can see that k=0Mf(xk)2\sum_{k = 0}^M\left| \nabla f(x^k)\right|^2 convergent, so f(xk)0,k \nabla f(x^k) \to 0, k \to \infty

Proof of Armijo Linear Search and Exact Line Search

Notice that when Lα21γ    α2(1γ)L \frac{L\alpha}{2} - 1 \le -\gamma \iff \alpha \le \frac{2(1 - \gamma)}{L} the constant α\alpha is always satisfies the requirement of ALS!

Therefore, setting τ:=min{s,2σ(1γ)L} \tau := \min \left{s, \frac{2\sigma(1 - \gamma)}{L}\right} We always have αkτ\alpha_k \ge \tau.

Applying the the rolling sum trick with ALS: f(x0)ξ=limki=0k1[f(xi)f(xi+1)]i=0γαif(xi)2i=0γτf(xi)2 f(x^0) - \xi = \lim_{k\to\infty}\sum_{i = 0}^{k - 1}\left[f(x^i) - f(x^{i + 1})\right] \ge \sum_{i = 0}^\infty \gamma \alpha_i \left| \nabla f(x^i) \right|^2 \ge \sum_{i = 0}^\infty \gamma \tau \left| \nabla f(x^i) \right|^2 This also gives the convergence.

As for ELS,

Since, limki=0k1[f(xi)f(xi+αked)]limki=0k1[f(xi)f(xi+αkad)] \lim_{k\to\infty}\sum_{i = 0}^{k - 1}\left[f(x^i) - f(x^i + \alpha_k^e d)\right] \ge\lim_{k\to\infty}\sum_{i = 0}^{k - 1}\left[f(x^i) - f(x^i + \alpha_k^a d)\right] We can proof it by a simple subtitution.

Proof of Diminishing Step Sizes

We still have f(xk+1)f(xk)[Lαk21]αkf(xk)2 f(x^{k + 1}) - f(x^k) \le \left[\frac{L\overline \alpha_k}{2} - 1\right] \alpha_k \left| \nabla f(x^k)\right|^2 Since αk0\alpha_k \to 0, there exists c>0c > 0 such that f(xk+1)f(xk)cαkf(xk)2 f(x^{k + 1}) - f(x^k) \le -c \alpha_k | \nabla f(x^k)|^2 for all kk sufficiently large.

Hence, f(xk+1)f(xk)f(x^{k + 1}) - f(x^k) should be monotonically decreasing for kk sufficiently large. We then know that f(xk)f(x^k) must

converge to some ξR{}\xi \in \mathbb R \cup {-\infty}. Still, we only consider the bounded case, where by rolling summation, we get k=0αkf(xk)2<limkαkf(xk)2=0 \sum_{k = 0}^\infty \alpha_k | \nabla f(x^k)|^2 < \infty \Rightarrow \lim_{k\to\infty} \alpha_k | \nabla f(x^k)|^2 = 0 This implies lim infkf(xk)=0. \liminf_{k\to \infty} | \nabla f(x^k)| = 0. Otherwise if f(xk)ϵ>0| \nabla f(x^k)| \ge \epsilon \gt 0, we have k=0αkf(xk)2ϵ2k=0αk=0 \sum_{k = 0}^\infty \alpha_k | \nabla f(x^k)|^2 \ge \epsilon^2 \sum_{k = 0}^\infty \alpha_k = 0 We still need to show limkf(xk)=0. \lim_{k\to \infty} | \nabla f(x^k)| = 0. Assume on the contrary that we can find infinite sequences (ti)i,(i)i(t_i)_i, (\ell_i)_i such that i>ti\ell_i \gt t_i and f(xti)2ε,f(xi)<ϵ | \nabla f(x^{t_i}) \ge 2\varepsilon|, | \nabla f(x^{\ell_i})| \lt \epsilon Additionally, k=ti+1,,i1,f(xk)ε \forall k = t_i + 1, \dots, \ell_i - 1, | \nabla f(x^k)| \ge \varepsilon Now, we sum block by block,

>k=0αkf(xk)2i=1k=tiiiαkf(xk)2ϵ2i=1k=tiiiαk \infty \gt \sum_{k = 0}^\infty \alpha_k | \nabla f(x^k)|^2 \ge \sum_{i = 1}^\infty\sum_{k = t_i}^{\ell_i - i} \alpha_k | \nabla f(x^k)|^2 \ge \epsilon^2 \sum_{i = 1}^\infty\sum_{k = t_i}^{\ell_i - i} \alpha_k

With then know βi=k=tiiiαk \beta_i = \sum_{k = t_i}^{\ell_i - i} \alpha_k converges to 00.

Consider the Hölder’s inequality: k=1Kakbk[k=1Kak]12[k=1Kbk]12 \sum_{k = 1}^K |a_kb_k| \le \left[\sum_{k = 1}^Ka_k\right]^\frac{1}{2}\left[\sum_{k = 1}^Kb_k\right]^\frac{1}{2} We then obtain: xixtik=tiiixk+1xk=k=tiiiαkαkf(xk)βik=tiiiαkf(xk)2βik=0αkf(xk)2 \begin{aligned} | x^{\ell_i } - x^{t_i} | &\le \sum_{k = t_i}^{\ell_i - i} \left| x^{k + 1} - x^k\right| \ &= \sum_{k = t_i}^{\ell_i - i}\sqrt{\alpha_k}\sqrt{\alpha_k} \left| \nabla f(x^k)\right| \ &\le \sqrt{\beta_i} \cdot \sqrt{\sum_{k = t_i}^{\ell_i - i} \alpha_k \left| \nabla f(x^k)\right|^2} \ &\le \sqrt{\beta_i} \cdot \sqrt{\sum_{k = 0}^{\infty} \alpha_k \left| \nabla f(x^k)\right|^2} \ \end{aligned} Since βi0\beta_i \to 0, we can see xixti0 | x^{\ell_i } - x^{t_i} | \to 0 Finally, by the inverse triangle inequality ϵf(xi)f(xti)f(xi)f(xti)Lxixti0 \epsilon \le \left| \left| \nabla f(x^{\ell_i})\right| - \left| \nabla f(x^{t_i})\right| \right| \le \left| \nabla f(x^{\ell_i}) - \nabla f(x^{t_i})\right|\le L | x^{\ell_i } - x^{t_i} | \to 0 This leads to a contradiction.

Isolated Accumulation Points

x\overline x is the Isolated Accumulation Point if there is a neighborhood of x\overline x that does not contain any other accumulation points of (xk)k(x^k)_k.


Raw Content

Submit New Comment

question_answer Comments

No Comment Yet~