## More on Newton's Method ### Convergence Rate Let $f$ be twice continuous differentiable and let $x^\ast$ be a local minimizer of $f$. For some given $\epsilon \gt 0$ assume that: - there exists $\mu \gt 0$ with $\nabla^2 f(x) \succcurlyeq \mu I$ for any $x \in B_\epsilon(x^\ast)$. - there exists $L \gt \mu$ with $\|\nabla^2 f(x) - \nabla^2f(y)\| \le L \|x - y \|$ for all $x, y \in B_\epsilon (x^\ast)$. Let $(x_k)^k$ be generated by Newton's method. Then for $k = 0, 1, ...$ $$\|x_{k + 1} - x^\ast \| \le \frac{L}{2\mu} \|x_k - x^\ast\|^2$$ and in addition, if $\|x^0 - x^\ast \| \le \frac{\mu \min \{1, \epsilon\}}{L}$, then $$\|x_k - x^\ast\| \le \frac{2\mu }{L} ( \frac{1}{2})^{2^k}, k = 1, 2, 3, \cdots$$ In a word, Hessian locally uniformly positive definite with Lipschitz continuous and $x_0$ close to $x^\ast$ implies $$(x_k)^k \text{ converges q-quadratically to } x^\ast$$ ### On the way to Global Convergence Unfortunately, Newton's Method does not have the Global Convergence property. But one can combine Newton's Method with Gradient Descent Method to gain the speed up while not losing the convergence. More precisely, at each iteration, when the Newton's direction $d_k$ is calculated, one can check: $$- \nabla f(x_k) ^\top d_k \ge \gamma_1 \min \{1, \|d_k\|^{\gamma_2}\}\| d_k\|^2, ~~\gamma_1, \gamma_2 \in (0 ,1)$$ Otherwise, we use the gradient direction instead. ## Algorithms for Constrained Problems ### Projected gradient method This method is kind of geometrical and analytical. For constrained problem, we calculate $x_k$ iteratively. However, if we limit $x \in \Omega$, $x_k$ may become infeasible for some $k$. Whenever we runs into the feasibility problem, we project the point back to the feasible region. #### Definition: Euclidean Projections Let $\Omega \in \mathbb R^n$ be a nonempty, closed, convex set. The **(Euclidean) projection** of $x$ onto $\Omega$ is defined as the unique minimizer $y^*$ of the constrained optimization problem: $$\min_y \frac{1}{2} \|x - y \|^2 ~ s.t. y \in \Omega$$ and we write $y^* = \mathcal P_\Omega(x).$ The projection $y^\ast = \mathcal P_\Omega(x)$ is the point in $\Omega$ that has the minimum distance to $x$. #### Linear Constraints Suppose we have $$\Omega = \{x : Ax = b\}$$ where $A \in \mathbb R^{m\times n}$ and $b \in \mathbb R^m$ are given. If $A$ has full row rank ($m \le n$), then it holds that $$\mathcal P_\Omega(x) = x - A^\top(AA^\top)^{-1}[Ax - b]$$ ##### Derivation Consider the problem: $$\min_y \frac{1}{2} \|x - y \|^2 ~ s.t. Ay = b$$ One can show that the KKT conditions are $$\begin{cases} y - x + A^\top \mu &= 0 \\ Ay &= b \end{cases}$$ Therefore, \begin{aligned} 0 = y - x + A^\top \mu = Ay - Ax - AA^\top\mu = b - Ax - AA^\top \mu \end{aligned} Since $A$ have full row rank, $AA^\top$ is invertible. $$\mu = (AA^\top)^{-1}[Ax - b]$$ Therefore, $$\mathcal P_\Omega(x) = x - A^\top(AA^\top)^{-1}[Ax - b]$$ #### Box Constraints Suppose that $\Omega$ is given by box constraints: $$\Omega = \{x \in \mathbb R^n: x_i \in [a_i, b_i], \forall i\} = [a, b]$$ where $a, b \in \mathbb R^n, a \le b$ are given. Then, the projection can be computed as the following: $$\mathcal P_\Omega (x)_i = \max \{ \min \{x_i, b_i\}, a_i\}, \forall i$$ #### Ball Constraints For a Euclidean Ball: $$\Omega = \{ x \in \mathbb R^n : \|x - m\| \le r \}$$ The projection is $$\mathcal P_\Omega = \begin{cases} x & \|x - m\| \le r, \\ m + \frac{r}{\|x - m\|} (x - m) & \| x - m\| \ge r \\ \end{cases}$$ ##### Derivation We write the problem as $$\min_y \frac{1}{2} \|z - \overline x\|^2 s.t. \frac{1}{2}\|z\|^2 - \frac{1}{2}r^2 \le 0$$ where $\overline x = x -m, z = y - m$. One can show that the KKT conditions are $$\begin{cases} z - \overline x + \lambda z &= 0 \\ \frac{\lambda}{2} (\|z\|^2 - r^2) &= 0 \\ \|z \| &\le r \\ \lambda &\ge 0 \end{cases}$$ Solving the system, we get the projection. ### Properties of Projections and Descent Directions #### FONC for Convex-Constrained Problems Consider a minimization problem on a convex set: $$\min_y f(y)~s.t.~ y \in \Omega$$ Let $f$ be $\mathcal C^1$ on an open set that contains the convex, closed set $\Omega \subset \mathbb R^n$. Let $y^\ast \in \Omega$ be a local minimizer of the above program. Then, $$\nabla f(y^\ast)^\top (y - y ^\ast) \ge 0, \forall y \in \Omega$$ Moreover, - If $f$ is convex, then $y^\ast \in \Omega$ is global minimizer of the above problem iff the FONC is satisfied. - A point satisfying FONC is called a **stationary point**. ##### Proof If $y^\ast \in \Omega$ is a local minimizer, then it has to satisfy the first-order optimization conditions: $$\nabla f(y^\ast) ^\top d \ge 0, \forall d \in S_\Omega(y^\ast)$$ where $S_\Omega(y^\ast)$ is the set of feasible directions. We have: $$d \in S_\Omega(y^\ast) \iff \exists \overline t \gt 0, y^\ast + td \in \Omega, \forall t \in [0, \overline t]$$ Let $y \in \Omega$ be arbitrary, then by the convexity of $\Omega$. $$t y^\ast + (1 - t)y = y^\ast + (1 - t)(y - y^\ast) \in \Omega, \forall t \in [0, 1]$$ Hence, $y - y^\ast \in S_\Omega(y^\ast), \forall y \in \Omega$. This implies $$\nabla f(y^\ast)^\top (y - y ^\ast) \ge 0, \forall y \in \Omega.$$ ##### Remark If $f$ is convex, we have $f(y) - f(y^\ast) \ge \nabla f(y^\ast)^\top (y - y^\ast), \forall y \in \Omega$. With FONC, we can see that $y^\ast$ is a global solution. #### Projection Theorem Let $\Omega$ be a nonempty, closed and convex set. Then: - A point $y^\ast$ is the projection of $x$ onto $\Omega$, i.e., $y^\ast = \mathcal P_\Omega(x)$, iff $$(y^* - x)^\top(y - y^\ast) \ge 0, \forall y \in \Omega$$ - The mapping $\mathcal P_\Omega: \mathbb R^n \to \mathbb R^n$ is Lipschitz continuous with constant $L = 1$. - The vector $x^\ast$ is a stationary point iff $$x^\ast - \mathcal P_\Omega(x^\ast - \lambda \nabla f(x^\ast)) = 0, \text{for any \lambda \gt 0}.$$ ##### Proof - We first note that the problem $$\min_y \frac{1}{2}\|y - x\|^2, s.t. y \in \Omega$$ has a unique solution. Suppose there are two solutions $y^\ast$ and $\overline y$ with $\frac{1}{2}\|y^\ast - x\|^2 = \frac{1}{2}\| \overline y - x\|^2 = f^\ast$. Then, \begin{aligned} f(\frac{1}{2}y^\ast + \frac{1}{2}\overline y) &= \frac{1}{2} \| \frac{1}{2} (y^\ast - x) + \frac{1}{2}(\overline y - x) \|^2 \\ &= \frac{1}{2} \left[\frac{1}{4}\|y^\ast - x\|^2 + \frac{1}{2}(y^\ast - x)^\top (\overline y - x) + \frac{1}{4} \|\overline y - x \|^2\right] \\ &= \frac{1}{2} f(y^\ast) + \frac{1}{2} f(\overline y) + \\ &~~~~~\frac{1}{2} \left[\left(\frac{1}{4} - \frac{1}{2}\right)\left\| y^\ast - x\right\|^2 + \frac{1}{2} (y^\ast - x)^\top(\overline y - x) + \left(\frac{1}{4} - \frac{1}{2}\right)\|\overline y - x\|\right]\\ &= f^\ast - \frac{1}{2}\left\| \frac{1}{2} (y^\ast - x) - \frac{1}{2}(\overline y - x) \right\|^2\\ &= f^\ast - \frac{1}{8}\|y^\ast - \overline y\|^2 \\ &\lt f^\ast \end{aligned} This is a contradiction to the optimality of $y^\ast$ and $\overline y$ and hence, $y^\ast = \overline y$ and there is only one unique solution. - As a consequence, since the problem is convex, the FONC uniquely characterize the projection $y^\ast = \mathcal P_\Omega(x)$, i.e. setting $f(y) = \frac{1}{2}\|y - x\|^2$, we have: \begin{aligned} y^\ast = \mathcal P_\Omega(x) &\iff \nabla f(y^\ast)^\top (y - y^\ast) \ge 0, \forall y \in \Omega \\ &\iff (y^\ast - x)^\top (y - y^\ast)\ge 0, \forall y \in \Omega \end{aligned} - Let us consider the projections $\mathcal P_\Omega(x)$ and $\mathcal P_\Omega(y)$; by the property above, $$(\mathcal P_\Omega(x) - x)^\top (z - \mathcal P_\Omega(x)) \ge 0, \forall z \in \Omega, ~~~(\text{choose}~z = \mathcal P_\Omega(y)) \\ (\mathcal P_\Omega(y) - y)^\top (z - \mathcal P_\Omega(y)) \ge 0, \forall z \in \Omega, ~~~(\text{choose}~z = \mathcal P_\Omega(x))$$ Add up these inequalities: \begin{aligned} &0 \le (\mathcal P_\Omega(x) - x)^\top (\mathcal P_\Omega(y) - \mathcal P_\Omega (x)) + (\mathcal P_\Omega(y) - y)^\top(\mathcal P_\Omega(x) - \mathcal P_\Omega (y)) \\ \Rightarrow ~& \|\mathcal P_\Omega(x) - \mathcal P_\Omega(y)\|^2 \le (x - y)^\top (\mathcal P_\Omega (x) - \mathcal P_\Omega(y)) \le \|x- y\| \cdot \| \mathcal P_\Omega(x) - \mathcal P_\Omega (y)\| \\ \Rightarrow~& \| \mathcal P_\Omega(x) - \mathcal P_\Omega (y)\| \le \| x - y \| \end{aligned} Therefore, $\mathcal P_\Omega$ is Lipschitz continuous with constant $L = 1$. - Using the first property, we have $x^\ast = \mathcal P_\Omega (x^\ast - \lambda \nabla f(x^\ast))$ \begin{aligned} &(\mathcal P_\Omega(x^\ast - \lambda \nabla f(x^\ast)) - \left[x^\ast - \lambda \nabla f(x^\ast)\right])^\top (x - \mathcal P_\Omega(x^\ast - \lambda \nabla f(x^\ast))) \ge 0\\ \iff & \lambda \nabla f(x^\ast)^\top (x - x^\ast) \ge 0, \forall x \in \Omega \\ \iff & x^\ast \text{is a stationary point of the problem} \end{aligned} ### Algorithm Projected Gradient Method, we project the step $x_k - \lambda_k \nabla f(x_k)$ back onto $\Omega$: $$x^{k + 1} = \mathcal P_\Omega(x^k - \lambda_k \nabla f(x_k))$$ - How to choose $\lambda_k \gt 0$. - How to guarantee decent and convergence. We change the formula into $$x_{k + 1} = x_k + \left[\mathcal P_\Omega(x_k - \lambda_k \nabla f(x_k)) - x_k\right]$$ We can hen treat the second part as the decent direction. Consider: $$x_{k+1} = x_k + \alpha_k d_k = (1 - \alpha_k)x_k + \alpha_k\mathcal P_\Omega(x_k - \lambda_k\nabla f(x_k))$$ If $\alpha_k \in [0, 1]$, then the convexity of $\Omega$ implies that $x_{k + 1}$ will be feasible if $x_k \in \Omega$. #### Descent Directions Let $x \in \Omega$ and $\lambda > 0$ be given. If $x$ is not a stationary point, then the direction $d = \mathcal P_\Omega (x - \lambda \nabla f(x)) - x$ is a descent direction and it holds that $$\nabla f(x)^\top d \le - \frac{1}{\lambda} \| d\|^2 < 0$$ ##### Proof \begin{aligned} \nabla f(x)^\top d &= \nabla f(x)^\top [\mathcal P_\Omega(x - \lambda \nabla f(x)) - x] \\ &= \frac{1}{\lambda} (x - [x - \lambda \nabla f(x)])^\top [\mathcal P_\Omega(x - \lambda \nabla f(x)) - \mathcal P_\Omega(x)] ~~(\mathcal P_\Omega(x) = x \text{ since } x \in \Omega) \\ &\le - \frac{1}{\lambda} \| \mathcal P_\Omega(x - \lambda \nabla f(x)) - \mathcal P_\Omega (x)\|^2 \\ &= -\frac{1}{\lambda} \|d\|^2 \end{aligned} #### Steps Since the divergence is guaranteed in this way, we can reuse similar techniques like backtracking to generate step sizes We cam use $\|\mathcal P_\Omega(x_k - \lambda_k\nabla f(x_k)) - x_k\| \le \varepsilon$ as stopping criterion. At each step" - calculate $d_k = \mathcal P_\Omega(x_k - \lambda_k \nabla f(x_k)) - x_k$ - if $\|d_k\| \le \lambda_k \varepsilon$, then STOP and $x_k$ is the output. - Armijo Line Search - Update to $x_{k + 1}$ ### Convergence $f$ is cont. diff., $\Omega$ is nonempty, convex, and closed and the step sizes $(\lambda_k)_k$ are bounded: $$0 > \underline \lambda \le \lambda_k \le \overline \lambda, \forall k$$ Every accumulation point of $(x_k)$ is a stationary point. If $\nabla f$ is additionally Lipschitz continuous, we obtain: - If $f$ is convex, we converge to global solutions of the problem. - If $\lambda_k \le \frac{2(1 - \gamma)}{L}$, then $\alpha_k = 1$ will be accepted as step size! - If $f$ is (strongly) convex, then $(x^k)_k$ converges linearly o a global solution $x^\ast \in \Omega$ ## Integer Programming ### Theorem: LP Relaxation as a Bound for IP 1. For a maximization problem, the optimal value of the LP relaxation provides an upper bound for the optimal value of the IP. 2. For a minimization problem, the optimal value of the LP relaxation provides a lower bound for the optimal value of the IP. The difference between the optimal value of the LP and the IP is called the integrability gap. - For maximization problems, the integrality gap is $v^\text{LP} - v^\text{IP}$ - For maximization problems, the integrality gap is $v^\text{IP} - v^\text{LP}$ **Maximization Problem:** Suppose we solved the LP relaxation and obtain the optimal value is $v^\text{LP}$ (with some fractional solution) Then we round the solution to a feasible integer point and find the objective value is $v^\text{Rounding}$. Then the difference between the rounded solution and the optimal integer solution (whose objective value is $v^{IP}$) satisfies: $$0 \le v^\text{IP } - v^\text{Rounding} \le v^{LP} -v^\text{Rounding}$$ That is, one can use the LP relaxation solution to construct abound on how good a certain feasible integer point is. ### Theorem: The Ideal Case If the optimal solution to the LP relaxation is integer, then the solution must be optimal to the IP problem. ### Total Unimodularity #### Definition A matrix $A$ is said to be totally unimodular if the determinant of each submatrix of $A$ is either $0$, $1$, or $-1$. #### Theorem: Total Unimodularity and Integer Solutions If the constraint matrix $A$ is totally unimodular and $b$ is an integer vector, then all the BFS are integers and the LP relaxation must have an optimal solution that is an integer solution #### One Sufficient Condition for TU Let $A$ be an $m \times n$ matrix. Then the following conditions together are sufficient for $A$ to be totally unimodular: 1. Every column of $A$ contains at most two non-zero entries; 2. Every entry in $A$ is 0,+1, or 1; 3. The rows of $A$ can be partitioned into two disjoint sets $B$ and $C$ such that - If two non-zero entries in a column of $A$ have the same sign, then the row of one entry is in $B$, and the row of the other in $C$; - If two non-zero entries in a column of $A$ have opposite signs, then the rows are both in $B$, or both in $C$. #### Example The matching problem: \begin{aligned} \max & \sum_{i = 1}^n\sum_{j = 1}^nv_{ij}x_{ij}\\ s.t. & \sum_{i = 1}^n x_{ij} = 1, \forall j \\ & \sum_{j = 1}^n x_{ij} = 1, \forall i \\ & x_{ij} \in \{0, 1\} \end{aligned} The constraint matrix is TU. ### Branch and Bound Algorithm We consider solving a maximization problem. The whole procedure is 1. Solve the LP relaxation. - If the optimal solution is integral, then it is optimal to IP. - Otherwise go to step 2. 2. If the optimal solution to the LP relaxation is $x^\ast$ and $x^\ast$ is fractional, then branch the problem into the following two: - One with an added constraint that $x_i \le \lfloor x_i^\ast \rfloor$ - One with an added constraint that $x_i \ge \lceil x_i^\ast \rceil$ 3. For each of the two problems, use the same method to solve them, and get optimal sol $y_1^\ast$ and $y_2^\ast$ For each of the two problems, use the same method to solvethem, and get optimal sol.y⇤1andy⇤2with optimal $v_1^\ast$ and $v_2^\ast$. Compare to obtain the optimal solution. With bounding: - Any LP relaxation solution can provide an upper bound for each node in the branching process. - Any feasible point to the IP can provide a lower bound for the entire problem. When the optimal value of the LP relaxation of this branch is less than the current lower bound, then we can discard this branch. - No better solution can be obtained from further exploring this branch. #### Remark We usually prefer to use DFS since it will discover more integer solution at the early stage. #### Program python import cvxpy as cp def branch_and_bound(obj, cons, vars, eps=1e-4): current_opt = None current_sol = None visited = set() def _branch_and_bound(_obj, _cons): prob = cp.Problem(_obj, _cons) opt = prob.solve() if prob.status != 'infeasible' and (current_opt is None or opt >= current_opt): new_cons = [] for i in range(len(_vars)): v = vars[i].value ceil = math.ceil(v) floor = math.floor(v) if ceil - v > eps and v - floor > eps: new_cons.append(((vars[i] <= floor, (i, floor, False)), (vars[i] >= ceil, (i, ceil, True)))) sol = [i.value[0] for i in vars] if not new_cons: if current_opt is None: current_opt = opt current_sol = sol elif opt > current_opt: current_opt = max(opt, current_opt) current_sol = sol; else: for i in new_cons: if i[0][1] not in visited: visited.add(i[0][1]) _branch_and_bound(obj, [*cons, i[0][0]], vars) visited.remove(i[0][1]) if i[1][1] not in visited: visited.add(i[1][1]) _branch_and_bound(obj, [*cons, i[1][0]], vars) visited.remove(i[1][1]) _branch_and_bound(obj, cons) return current_opt, current_sol  #### Complexity There is no polynomial algorithm for IP, but branch and bounding is good enough to solve a large range of IP problems.