We take a look at some more advanced topics on optimality conditions for unconstrained optimization problems, convexity, gradient descent, and related material.

## Additional Mathematics Background

### Fréchet Differentiability

Fréchet differentiation provides a way to extend differentiability to general vector spaces. A mapping $F : \mathbb R^n \to \mathbb R^m$ is said to be Fréchet differentiable at $x$ if there is a linear map $\mathcal L: \mathbb R^n \to \mathbb R^m$ such that $$\lim_{h \to 0} \frac{\left\|F(x + h) - F(x) - \mathcal L(h)\right\|}{\| h \|} = 0$$ The map $\mathcal L$ is the derivative of $F$ at $x$ and it holds that $$\mathcal L(h) = DF(x) \cdot h$$ where $DF(x) \in \mathbb R^{m \times n}$ is the Jacobian matrix.

## Optimality Conditions for Unconstrained Optimization Problems

Basic knowledge of FONC, SONC, and SOSC has already been covered in [Optimization Learning Notes 5](https://zhuyi.fan/post/optimization-learning-notes-5.html); here I only provide further proofs and remarks.

### Quadratic Growth Condition

SOSC turns out to have an equivalent form, the quadratic growth condition (QGC). Let $f : X \to \mathbb R$ be twice continuously differentiable on the open set $X \subset \mathbb R^n$. Then, the SOSC conditions for the local optimum $x^\ast$ are equivalent to $$\exists \alpha, \epsilon \gt 0 \text{ s.t. } f(x) - f(x^\ast) \ge \frac{\alpha}{2} \|x - x^\ast\|^2, \forall x \in B_\epsilon(x^\ast) \cap X.$$ Notice that in the proof of SOSC we have already shown $$f(x) - f(x^\ast) \ge \frac{\lambda_\text{min} (\nabla^2 f(x^\ast))}{4} \| x - x^\ast\|^2$$ so SOSC implies QGC. Now we prove the other direction. Since QGC makes $x^\ast$ a local minimizer, FONC gives $\nabla f(x^\ast) = 0$.
By Taylor expansion, \begin{aligned} f(x) - f(x^\ast) = \nabla f(x^\ast)^\top (x - x^\ast) + \frac{1}{2} (x - x^\ast)^\top\nabla^2f(x^\ast) (x - x^\ast) + o(\| x - x^\ast\|^2) \end{aligned} Hence, using QGC, \begin{aligned} (x - x^\ast)^\top\nabla^2f(x^\ast) (x - x^\ast) &= 2[f(x) - f(x^\ast)] - 2o(\| x - x^\ast\|^2)\\ & \ge \| x - x^\ast\|^2\left(\alpha - \frac{2o(\| x - x^\ast\|^2)}{\| x - x^\ast\|^2} \right) \end{aligned} Writing $x = x^\ast + \tau d$ for a unit vector $d$, dividing both sides by $\tau^2$, and letting $\tau \downarrow 0$, we obtain $$d^\top\nabla^2f(x^\ast)d \ge \alpha \gt 0$$ Since $X$ is open, $d$ can be an arbitrary direction, so $\nabla^2 f(x^\ast)$ must be positive definite and SOSC holds.

## Convexity

### Extended Definitions for Convexity

Besides the usual definition of convex functions in [Optimization Learning Notes 6](https://zhuyi.fan/post/optimization-learning-notes-6.html), we now introduce two stronger notions of convexity. Let $f: X \to \mathbb R$ be given and let $X \subset \mathbb R^n$ be a convex set. The function $f$ is

- **strictly convex** if for all $x, y \in X$ with $x \ne y$ and $\lambda \in (0, 1)$ we have $$f(\lambda x + (1 - \lambda)y) \lt \lambda f(x) + (1 - \lambda) f(y)$$
- **strongly convex (with parameter $\mu$)** if there exists $\mu \gt 0$ such that for all $x, y \in X$ and all $\lambda \in [0, 1]$, we have $$f(\lambda x + (1 - \lambda)y) + \frac{\mu}{2}\lambda(1-\lambda)\|y - x\|^2 \le \lambda f(x) + (1 - \lambda) f(y)$$

### First Order Characterization

Let $f : X \to \mathbb R$ be continuously differentiable on an open set that contains the convex set $X \subset \mathbb R^n$. Then, it follows

1. The function $f$ is convex if and only if $$f(y) - f(x) \ge \nabla f(x)^\top (y - x), \forall x, y \in X.$$
2. The mapping $f$ is strictly convex if and only if $$f(y) - f(x) \gt \nabla f(x)^\top (y - x), \forall x, y \in X , x \ne y$$
3.
The function is strongly convex (with parameter $\mu$) if and only if there exists $\mu \gt 0$ such that $$f(y) - f(x) \ge \nabla f(x)^\top (y - x) + \frac{\mu}{2} \| y - x\|^2, \forall x, y \in X$$

#### Proof of 1

Suppose $f$ is convex. By definition, $f(x + \lambda(y - x)) \le (1 - \lambda)f(x) + \lambda f(y)$, which rearranges to $$f(y) - f(x) \ge \frac{f(x + \lambda(y - x)) - f(x)}{\lambda}$$ Taking $\lambda \downarrow 0$, the right-hand side tends to the directional derivative $\nabla f(x)^\top (y - x)$, so $$f(y) - f(x) \ge \nabla f(x)^\top (y - x), \forall x, y \in X.$$ Conversely, denoting $x_\lambda = (1 - \lambda)x + \lambda y$ for $x, y \in X, \lambda \in [0, 1]$, $$f(x) - f(x_\lambda) \ge \nabla f(x_\lambda)^\top (x - x_\lambda) = \lambda\nabla f(x_\lambda)^\top (x - y) ~~~(1) \\ f(y) - f(x_\lambda) \ge \nabla f(x_\lambda)^\top (y - x_\lambda) = (1 - \lambda)\nabla f(x_\lambda)^\top (y - x) ~~~(2)$$ Forming $(1 - \lambda)\cdot(1) + \lambda\cdot(2)$, the gradient terms cancel and we get $$0 \le (1 - \lambda)(f(x) - f(x_\lambda)) + \lambda(f(y) - f(x_\lambda))$$ This implies $$(1-\lambda)f(x) + \lambda f(y) \ge f(x_\lambda)$$

#### Proof of 2

First, suppose $f$ is strictly convex. Letting $z = \frac{1}{2}(x + y)$ for $x, y \in X, x \ne y$, we have $$f(z) \lt \frac{1}{2} f(x) + \frac{1}{2}f(y) \Rightarrow f(z) - f(x) \lt \frac{1}{2}(f(y) - f(x))$$ Using 1, it holds that $$f(z) - f(x) \ge \nabla f(x)^\top(z - x) = \frac{1}{2}\nabla f(x)^\top (y - x)$$ This yields $$\nabla f(x)^\top (y - x) \le 2(f(z) - f(x)) \lt f(y) - f(x)$$ The other direction can be proved in the same way as in 1.
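The gradient inequality in 1 is easy to sanity-check numerically. Below is a minimal sketch (the quadratic test function, dimension, and sample count are my own choices, not from the notes): for the convex function $f(x) = x^\top A x$ with $A$ positive semidefinite, the inequality $f(y) - f(x) \ge \nabla f(x)^\top (y - x)$ should hold at every sampled pair.

```python
import numpy as np

# Sanity check of the first-order characterization: for a convex f,
# f(y) - f(x) >= grad_f(x)^T (y - x) at every pair of points.
# Example choice: f(x) = x^T A x with A = M^T M positive semidefinite.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M.T @ M                      # PSD by construction, so f is convex

f = lambda x: x @ A @ x
grad = lambda x: 2 * A @ x       # gradient of x^T A x for symmetric A

for _ in range(1000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    gap = f(y) - f(x) - grad(x) @ (y - x)
    assert gap >= -1e-9, "first-order characterization violated"
print("f(y) - f(x) >= grad f(x)^T (y - x) held on all sampled pairs")
```

Here the gap is exactly $(y - x)^\top A (y - x) \ge 0$, which is why the check can never fail for this $f$; the tolerance only guards against floating-point round-off.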
#### Proof of 3

Proceeding as in the proof of 1, \begin{aligned} \nabla f(x)^\top(y - x) &= \lim_{\lambda \downarrow 0} \frac{f(x_\lambda) - f(x)}{\lambda} \\ &\le \lim_{\lambda \downarrow 0} \frac{(1 - \lambda)f(x) + \lambda f(y) - \frac{\mu}{2}\lambda(1 - \lambda)\| y - x\|^2 - f(x)}{\lambda} \\ &= f(y) - f(x) - \frac{\mu}{2} \| y - x\|^2 \end{aligned} For the other direction, note that $$\| x - x_\lambda\| = \lambda \| y - x\| \\ \| y - x_\lambda\| = (1 - \lambda) \| y - x\|$$ Applying the first-order characterization at $x_\lambda$, we then obtain \begin{aligned} (1 - \lambda)f(x) + \lambda f(y) - f(x_\lambda) &= (1 - \lambda)(f(x) - f(x_\lambda)) + \lambda(f(y) - f(x_\lambda))\\ &\ge (1 - \lambda)\left[\nabla f(x_\lambda)^\top(x - x_\lambda) + \frac{\mu}{2}\| x - x_\lambda\|^2\right] + \lambda \left[\nabla f(x_\lambda)^\top(y - x_\lambda) + \frac{\mu}{2}\|y - x_\lambda\|^2\right]\\ & = \nabla f(x_\lambda)^\top [(1 - \lambda) x + \lambda y - x_\lambda] + \frac{\mu}{2}[\lambda^2(1 - \lambda) + \lambda(1 - \lambda)^2]\| y - x\|^2\\ & = \frac{\mu}{2}\lambda(1-\lambda)\| y - x\|^2 \end{aligned}

### Second Order Characterization

Let $f : X \to \mathbb R$ be twice continuously differentiable on an open, convex set $X \subset \mathbb R^n$.

1. The function $f$ is convex if and only if the Hessian $\nabla^2f(x)$ is positive semidefinite for all $x \in X$: $$h^\top \nabla^2f(x)h \ge 0, \forall h \in \mathbb R^n, \forall x \in X.$$
2. If the Hessian $\nabla^2f(x)$ is positive definite for all $x \in X$, i.e., if we have $$h^\top \nabla^2f(x) h \gt 0, \forall h \in \mathbb R^n \backslash \{0\}, \forall x \in X$$ then $f$ is strictly convex.
3. The mapping $f$ is strongly convex (with parameter $\mu$) iff $\nabla^2 f(x)$ is uniformly positive definite (with parameter $\mu$), i.e., there is $\mu \gt 0$ s.t. $$h^\top \nabla^2f(x)h \ge \mu \| h\|^2, \forall h \in \mathbb R^n, \forall x \in X.$$

#### Proof of 1

Suppose the Hessian is PSD. By the second-order mean value theorem, we can find $t \in [0, 1]$ s.t.
$$f(y) - f(x) = \nabla f(x)^\top (y - x) + \frac{1}{2} (y - x)^\top \nabla^2f(x + t(y - x))(y - x) \ge \nabla f(x)^\top (y - x)$$ This gives us the first-order characterization, hence $f$ is convex. To prove the other direction, consider $x + th$ for arbitrary $h \in \mathbb R^n$. Since $X$ is open, we can find an $\epsilon \gt 0$ s.t. $$x + th \in X, \forall t \in [0, \epsilon]$$ Now consider the following Taylor expansion for $t$ sufficiently small: $$f(x + th) = f(x) + t\nabla f(x)^\top h + \frac{t^2}{2}h^\top \nabla^2f(x)h + o(t^2)$$ Since $f$ is convex, we can use the first-order characterization: $$\frac{t^2}{2}h^\top \nabla^2f(x) h + o(t^2) \ge 0$$ Hence, $$h^\top \nabla^2f(x) h \ge - 2\frac{o(t^2)}{t^2} \to 0 \quad (t \downarrow 0)$$ We can infer that $\nabla^2f(x)$ is PSD.

#### Proof of 2

The same approach as the first part of the proof of 1: for $y \ne x$, the mean value term $\frac{1}{2}(y - x)^\top \nabla^2f(x + t(y - x))(y - x)$ is now strictly positive, which gives the strict first-order characterization. Note that the converse of 2 is false: $f(x) = x^4$ is strictly convex, yet $f''(0) = 0$.

#### Proof of 3

Similarly to the proof of 1, we use the MVT to get \begin{aligned} f(y) - f(x) &= \nabla f(x)^\top (y - x) + \frac{1}{2} (y - x)^\top \nabla^2f(x + t(y - x))(y - x) \\ &\ge \nabla f(x)^\top (y - x) + \frac{\mu}{2}\| y - x \|^2 \end{aligned} This already gives us the first-order characterization. Conversely, we start with the first-order characterization: $$f(x + th) - f(x) \ge t \nabla f(x)^\top h + \frac{\mu t^2}{2} \|h\|^2$$ Reusing the Taylor expansion $$f(x + th) - f(x) = t\nabla f(x)^\top h + \frac{t^2}{2}h^\top \nabla^2f(x)h + o(t^2)$$ we now get $$h^\top \nabla^2f(x) h \ge \mu \| h\|^2 - 2\frac{o(t^2)}{t^2} \quad (t \downarrow 0)$$ This finishes the proof.

### Convex Optimization Problems

In [Optimization Learning Notes 6](https://zhuyi.fan/post/optimization-learning-notes-6.html), we already showed that for a typical convex minimization problem, any local minimum must be a global minimum. Two things remain here:

- If $f$ is strictly convex, then the minimization problem possesses at most one local minimizer which (if it exists) is also the strict global minimum of $f$ on $X$.
- If the feasible set $X$ is open, $f$ is continuously differentiable, and $x^\ast$ is a stationary point of $f$, then $x^\ast$ is a global minimizer of $f$ on $X$. (We provide proofs here.)

#### Proof of the first part

Suppose $f$ possesses two local minimizers $x^\ast \ne y^\ast$. We then know that they are both global solutions: $$f(x^\ast) = f(y^\ast)$$ Consider $$z^\ast = \frac{1}{2}x^\ast + \frac{1}{2}y^\ast$$ By strict convexity, $$f(z^\ast) \lt \frac{1}{2}f(x^\ast) + \frac{1}{2}f(y^\ast) = f(x^\ast)$$ This is a contradiction.

#### Proof of the second part

Just use the first-order characterization: $$f(y) \ge f(x^\ast) + \nabla f(x^\ast)^\top(y - x^\ast) = f(x^\ast), \forall y \in X$$ Hence, $x^\ast$ is a global solution.

### Some Additional Properties

#### Convexity of $f(x) - \frac{\mu}{2}\| x\|^2$

Consider $$g(x) = f(x) - \frac{\mu}{2}\| x\|^2$$ We show that $f(x)$ is strongly convex with parameter $\mu$ iff $g(x)$ is convex. First suppose $f(x)$ is strongly convex. Expanding $\|\lambda a + (1 - \lambda) b\|^2 = \lambda^2\|a\|^2 + (1 - \lambda)^2\|b\|^2 + 2\lambda(1 - \lambda)a^\top b$, we have \begin{aligned} g(\lambda a + (1 - \lambda)b) &= f(\lambda a + (1 - \lambda)b) - \frac{\mu}{2} \|\lambda a + (1 - \lambda) b\|^2 \\ &\le \lambda f(a) + (1 - \lambda )f(b) - \frac{\lambda (1 - \lambda)}{2} \mu \|a - b \|^2 - \frac{\mu}{2}\lambda^2\| a \|^2 - \frac{\mu}{2}(1 - \lambda)^2\| b \|^2 - \mu\lambda(1-\lambda)a^\top b\\ &= \lambda f(a) + (1 - \lambda )f(b) - \frac{\mu}{2} \left[\lambda(1 - \lambda) (\| a\|^2 - 2a^\top b + \| b\|^2) + \lambda^2\| a \|^2 + (1 - \lambda)^2\| b \|^2 + 2\lambda(1-\lambda)a^\top b\right] \\ &=\lambda f(a) + (1 - \lambda )f(b) - \frac{\mu}{2} \left[\lambda(1 - \lambda) (\| a\|^2 + \| b\|^2) + \lambda^2\| a \|^2 + (1 - \lambda)^2\| b \|^2\right] \\ &=\lambda f(a) + (1 - \lambda )f(b) - \frac{\mu}{2} \left[\lambda \| a\|^2 + (1 - \lambda) \| b\|^2\right] \\ &= \lambda g(a) + (1 - \lambda)g(b) \end{aligned} Therefore, $g(x)$ is convex.
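Before turning to the converse, this direction can be sanity-checked numerically. The sketch below is my own illustration (the test function, $\mu$, and sampling scheme are not from the notes): $f(x) = x^\top (A + I) x$ with $A$ PSD has Hessian $2(A + I) \succeq 2I$, so $f$ is strongly convex with $\mu = 2$, and $g(x) = f(x) - \frac{\mu}{2}\|x\|^2$ should pass the convexity inequality on random segments.

```python
import numpy as np

# If f is strongly convex with parameter mu, then g(x) = f(x) - (mu/2)||x||^2
# should satisfy g(lam*a + (1-lam)*b) <= lam*g(a) + (1-lam)*g(b).
# Example choice: f(x) = x^T (A + I) x with A PSD, so mu = 2 works.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M.T @ M                               # PSD by construction
mu = 2.0

f = lambda x: x @ (A + np.eye(4)) @ x
g = lambda x: f(x) - 0.5 * mu * (x @ x)   # here g(x) reduces to x^T A x

for _ in range(1000):
    a, b = rng.standard_normal(4), rng.standard_normal(4)
    lam = rng.uniform()
    lhs = g(lam * a + (1 - lam) * b)
    rhs = lam * g(a) + (1 - lam) * g(b)
    assert lhs <= rhs + 1e-9, "convexity of g violated"
print("g(x) = f(x) - (mu/2)||x||^2 passed the convexity check")
```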
Conversely, if $g(x)$ is convex, \begin{aligned} \lambda f(a) + (1 - \lambda) f(b)& = \lambda \left[g(a) + \frac{\mu}{2} \| a\|^2 \right] + (1 - \lambda)\left[g(b) + \frac{\mu}{2}\| b\|^2\right] \\ & = \lambda g(a) + (1 - \lambda) g(b) + \frac{\mu}{2} [\lambda \|a \|^2 + (1 - \lambda) \| b\|^2] \\ & \ge g(\lambda a +(1 - \lambda)b) + \frac{\mu}{2} [\lambda \|a \|^2 + (1 - \lambda) \| b\|^2] \\ & = f(\lambda a + (1 - \lambda)b) - \frac{\mu}{2} \| \lambda a + (1 - \lambda)b\|^2 + \frac{\mu}{2} [\lambda \|a \|^2 + (1 - \lambda) \| b\|^2] \end{aligned} By the identity $$\lambda \|a \|^2 + (1 - \lambda) \| b\|^2 - \| \lambda a + (1 - \lambda)b\|^2 = \lambda(1-\lambda)\| a - b\|^2$$ this yields $$\lambda f(a) + (1 - \lambda) f(b) \ge f(\lambda a + (1 - \lambda)b) + \frac{\mu}{2}\lambda(1-\lambda)\| a - b\|^2$$ which is exactly strong convexity with parameter $\mu$. This finishes the proof.

#### Convexity of Epigraph

The epigraph of a function $f : \mathbb R^n \to \mathbb R$ is defined as $$\text{epi } f := \{(x, t) : x \in \mathbb R^n, t \in \mathbb R, t \ge f(x)\} \subseteq \mathbb R^{n + 1}$$ $f$ is convex iff its epigraph is convex. First suppose $f$ is convex and take $$(x_1, t_1), (x_2, t_2) \in \text{epi } f$$ Define $$(\tilde x, \tilde t) = \lambda (x_1, t_1) + (1 - \lambda)(x_2, t_2)$$ We then have \begin{aligned} \tilde t &= \lambda t_1 + (1 - \lambda)t_2 \\ &\ge \lambda f(x_1) + (1 - \lambda)f(x_2) \\ &\ge f(\lambda x_1 + (1 - \lambda)x_2) \\ & = f(\tilde x) \end{aligned} Therefore, $(\tilde x, \tilde t)$ is in the epigraph and the epigraph is convex. To prove the other direction, we know that $$\lambda (x_1, f(x_1)) + (1 - \lambda)(x_2, f(x_2))$$ is in the epigraph. By the convexity of the epigraph, $$f(\lambda x_1 + (1 - \lambda)x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$$ Therefore, $f$ is also convex.

#### Existence of Directional Derivative

Sometimes the objective function is not differentiable. However, if it is convex, the directional derivative always exists. Suppose $f : \mathbb R^n \to \mathbb R$ is convex. We then have 1.
The difference quotient $$\frac{f(x + td) - f(x)}{t}$$ is a non-decreasing function of $t$ on $(0, +\infty)$. 2. For every $x, d \in \mathbb R^n$ the directional derivative $f'(x; d)$ always exists and is given by $$f'(x; d) = \inf_{t \gt 0} \frac{f(x + td) - f(x)}{t}$$

##### Proof of 1

Let $x, d \in \mathbb R^n$ and let $0 \lt t_1 \lt t_2$. Then \begin{aligned} f(x + t_1 d) &= f\left(x + \left(\frac{t_1}{t_2}\right)t_2d\right) \\ &= f\left[\left(1 - \frac{t_1}{t_2}\right)x + \frac{t_1}{t_2} (x + t_2 d)\right] \\ &\le \left(1 - \frac{t_1}{t_2}\right) f(x) + \frac{t_1}{t_2}f(x + t_2 d) \end{aligned} Hence, $$\frac{f(x + t_1 d) - f(x)}{t_1} \le \frac{f(x + t_2 d) - f(x)}{t_2}$$

##### Proof of 2

The definition of the directional derivative is $$f'(x; d) := \lim_{t \downarrow 0} \frac{f(x + td) - f(x)}{t}$$ Since the difference quotient $$\frac{f(x + td) - f(x)}{t}$$ is non-decreasing in $t$, the limit as $t \downarrow 0$ exists and is necessarily given by the infimum of these ratios.

#### Local Lipschitz Continuity

Let $f: \mathbb R^n \to (-\infty, \infty]$ be a convex function such that the interior of its domain is not empty. Then, $f$ is locally Lipschitz continuous on the interior of its domain: for every compact set $K \subseteq \text{int}(\text{dom } f)$ there exists $L = L(K)$ such that $$| f(x) - f(y)| \le L \|x - y\|, \forall x, y \in K.$$ [Proof for Local Lipschitz Continuity may be provided later]
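Returning to the directional-derivative result above, the monotonicity of the difference quotient can be watched directly on a nondifferentiable convex function. In this sketch (the choice $f(x) = |x|$, the base point, and the step sizes are my own), the exact directional derivative at $x = -1$ in direction $d = 1$ is $f'(-1; 1) = -1$, and the quotients decrease toward that infimum as $t \downarrow 0$.

```python
# For convex f, (f(x + t*d) - f(x)) / t is non-decreasing in t, so the
# directional derivative f'(x; d) is the infimum over t > 0.
# Example: f(x) = |x| at x = -1, d = 1, where f'(-1; 1) = -1.
f = abs
x, d = -1.0, 1.0

ts = [4.0, 2.0, 1.0, 0.5, 0.25, 0.125]          # decreasing step sizes
quotients = [(f(x + t * d) - f(x)) / t for t in ts]

# As t decreases the quotients are non-increasing ...
assert all(q1 >= q2 for q1, q2 in zip(quotients, quotients[1:]))
# ... and they settle at the infimum f'(-1; 1) = -1.
assert quotients[-1] == -1.0
print("difference quotients:", quotients)
```

Once $t \lt 1$, the segment $[-1, -1 + t]$ stays inside the linear piece of $|x|$, so the quotient is exactly $-1$ there; the first two, larger steps cross the kink at $0$ and give strictly larger values.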