# Optimization Learning Notes 7

Public Date: 2020-09-13 16:40:43.046931

Update Date: 2020-09-13 16:40:43.046931

## More on Newton's Method

### Convergence Rate

Let $f$ be twice continuously differentiable and let $x^\ast$ be a local minimizer of $f$. For some given $\epsilon \gt 0$, assume that:

• there exists $\mu \gt 0$ with $\nabla^2 f(x) \succcurlyeq \mu I$ for all $x \in B_\epsilon(x^\ast)$;
• there exists $L \gt \mu$ with $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L \|x - y\|$ for all $x, y \in B_\epsilon (x^\ast)$.

Let $(x_k)_k$ be generated by Newton's method. Then for $k = 0, 1, \dots$ $\|x_{k + 1} - x^\ast\| \le \frac{L}{2\mu} \|x_k - x^\ast\|^2$ and in addition, if $\|x_0 - x^\ast\| \le \frac{\mu \min\{1, \epsilon\}}{L}$, then $\|x_k - x^\ast\| \le \frac{2\mu}{L} \left(\frac{1}{2}\right)^{2^k}, k = 1, 2, 3, \cdots$ In a word,

Hessian locally uniformly positive definite and Lipschitz continuous, plus $x_0$ close to $x^\ast$, implies that $(x_k)_k$ converges q-quadratically to $x^\ast$.
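To see the quadratic convergence numerically, here is a sketch on a made-up one-dimensional example, $f(x) = e^x - 2x$ with minimizer $x^\ast = \ln 2$ (the function and the iteration count are illustrative, not from the notes):

```python
import math

def newton_1d(grad, hess, x0, x_star, iters=8):
    """Plain Newton iteration x_{k+1} = x_k - f'(x_k) / f''(x_k),
    recording the distance to the known minimizer x_star."""
    x = x0
    errors = []
    for _ in range(iters):
        x = x - grad(x) / hess(x)
        errors.append(abs(x - x_star))
    return x, errors

# f(x) = exp(x) - 2x:  f'(x) = exp(x) - 2,  f''(x) = exp(x) > 0.
x, errors = newton_1d(lambda t: math.exp(t) - 2.0, math.exp,
                      x0=0.0, x_star=math.log(2))
```

Printing `errors` shows each error being roughly squared: about $3\cdot10^{-1}$, $4\cdot10^{-2}$, $9\cdot10^{-4}$, $4\cdot10^{-7}$, ... — the q-quadratic behavior above.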

### On the way to Global Convergence

Unfortunately, Newton's method is not globally convergent. However, one can combine Newton's method with gradient descent to keep the fast local rate without losing global convergence.

More precisely, at each iteration, once the Newton direction $d_k$ is computed, one can check:

• $-\nabla f(x_k)^\top d_k \ge \gamma_1 \min\{1, \|d_k\|^{\gamma_2}\}\|d_k\|^2, ~~\gamma_1, \gamma_2 \in (0, 1)$

If the check fails, we use the negative gradient direction instead.
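A minimal sketch of this safeguard, using the sufficient-descent test $-\nabla f(x_k)^\top d_k \ge \gamma_1 \min\{1, \|d_k\|^{\gamma_2}\}\|d_k\|^2$ (the parameter values and the fixed step size are illustrative; in practice a line search would choose the step):

```python
import numpy as np

def globalized_newton(grad, hess, x0, gamma1=1e-6, gamma2=0.5,
                      alpha=1.0, tol=1e-8, max_iter=100):
    """Newton's method with a gradient-descent fallback."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess(x), -g)            # Newton direction
        nd = np.linalg.norm(d)
        # Keep d only if it passes the sufficient-descent test.
        if -(g @ d) < gamma1 * min(1.0, nd ** gamma2) * nd ** 2:
            d = -g                                   # gradient direction instead
        x = x + alpha * d
    return x

# Strongly convex quadratic f(x) = ||x||^2: minimizer at the origin.
x_min = globalized_newton(lambda x: 2 * x, lambda x: 2 * np.eye(2),
                          x0=[3.0, -4.0])
```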

## Algorithms for Constrained Problems

The idea here is geometric: for a constrained problem, we still compute the iterates $x_k$ iteratively, but if we require $x \in \Omega$, $x_k$ may become infeasible for some $k$. Whenever we run into this feasibility problem, we project the point back onto the feasible region.

#### Definition: Euclidean Projections

Let $\Omega \subset \mathbb R^n$ be a nonempty, closed, convex set. The (Euclidean) projection of $x$ onto $\Omega$ is defined as the unique minimizer $y^\ast$ of the constrained optimization problem $\min_y \frac{1}{2} \|x - y\|^2 ~\text{s.t.}~ y \in \Omega$, and we write $y^\ast = \mathcal P_\Omega(x)$.

The projection $y^\ast = \mathcal P_\Omega(x)$ is the point in $\Omega$ that has the minimum distance to $x$.

#### Linear Constraints

Suppose we have $\Omega = \{x : Ax = b\}$, where $A \in \mathbb R^{m\times n}$ and $b \in \mathbb R^m$ are given.

If $A$ has full row rank ($m \le n$), then it holds that $\mathcal P_\Omega(x) = x - A^\top(AA^\top)^{-1}[Ax - b]$

##### Derivation

Consider the problem $\min_y \frac{1}{2} \|x - y\|^2 ~\text{s.t.}~ Ay = b$. One can show that the KKT conditions are $\begin{cases} y - x + A^\top \mu = 0 \\ Ay = b \end{cases}$ Multiplying the first equation by $A$ gives $0 = Ay - Ax + AA^\top\mu = b - Ax + AA^\top \mu$. Since $A$ has full row rank, $AA^\top$ is invertible, so $\mu = (AA^\top)^{-1}[Ax - b]$ and therefore $\mathcal P_\Omega(x) = x - A^\top(AA^\top)^{-1}[Ax - b]$.
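A quick numerical check of this formula (a NumPy sketch; the line and the point being projected are made up):

```python
import numpy as np

def project_affine(x, A, b):
    """P(x) = x - A^T (A A^T)^{-1} (A x - b), for A with full row rank."""
    mu = np.linalg.solve(A @ A.T, A @ x - b)
    return x - A.T @ mu

# Example: project (1, 1) onto the line y1 + y2 = 1.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
y = project_affine(np.array([1.0, 1.0]), A, b)
print(y)  # [0.5 0.5]
```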

#### Box Constraints

Suppose that $\Omega$ is given by box constraints: $\Omega = \{x \in \mathbb R^n : x_i \in [a_i, b_i], \forall i\} = [a, b]$, where $a, b \in \mathbb R^n$, $a \le b$, are given.

Then the projection can be computed componentwise: $\mathcal P_\Omega (x)_i = \max\{\min\{x_i, b_i\}, a_i\}, \forall i$
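In NumPy this componentwise formula is exactly a clip (the test vector is made up):

```python
import numpy as np

def project_box(x, a, b):
    """Componentwise clipping implements max{min{x_i, b_i}, a_i}."""
    return np.clip(x, a, b)

y = project_box(np.array([-2.0, 0.5, 3.0]), a=0.0, b=1.0)
print(y)  # [0.  0.5 1. ]
```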

#### Ball Constraints

For a Euclidean ball $\Omega = \{x \in \mathbb R^n : \|x - m\| \le r\}$, the projection is $\mathcal P_\Omega(x) = \begin{cases} x, & \|x - m\| \le r \\ m + \frac{r}{\|x - m\|}(x - m), & \|x - m\| \gt r \end{cases}$

##### Derivation

We write the problem as $\min_z \frac{1}{2} \|z - \overline x\|^2 ~\text{s.t.}~ \frac{1}{2}\|z\|^2 - \frac{1}{2}r^2 \le 0$, where $\overline x = x - m$ and $z = y - m$.

One can show that the KKT conditions are $\begin{cases} z - \overline x + \lambda z = 0 \\ \frac{\lambda}{2} (\|z\|^2 - r^2) = 0 \\ \|z\| \le r \\ \lambda \ge 0 \end{cases}$ Solving the system: if $\lambda = 0$, then $z = \overline x$, which is valid when $\|\overline x\| \le r$; otherwise $\|z\| = r$ and $z = \frac{\overline x}{1 + \lambda}$ with $1 + \lambda = \frac{\|\overline x\|}{r}$, which yields the projection formula above.
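A sketch of the ball projection in NumPy (the test point is made up):

```python
import numpy as np

def project_ball(x, m, r):
    """Projection onto {y : ||y - m|| <= r}: rescale x - m when it is too long."""
    d = x - m
    n = np.linalg.norm(d)
    return x.copy() if n <= r else m + (r / n) * d

# Example: project (3, 4) onto the unit ball centered at the origin.
y = project_ball(np.array([3.0, 4.0]), m=np.zeros(2), r=1.0)
print(y)  # [0.6 0.8]
```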

### Properties of Projections and Descent Directions

#### FONC for Convex-Constrained Problems

Consider a minimization problem on a convex set: $\min_y f(y)~s.t.~ y \in \Omega$

Let $f$ be $\mathcal C^1$ on an open set that contains the convex, closed set $\Omega \subset \mathbb R^n$. Let $y^\ast \in \Omega$ be a local minimizer of the above program. Then, $\nabla f(y^\ast)^\top (y - y ^\ast) \ge 0, \forall y \in \Omega$ Moreover,

• If $f$ is convex, then $y^\ast \in \Omega$ is a global minimizer of the above problem iff the FONC is satisfied.
• A point satisfying the FONC is called a stationary point.

##### Proof

If $y^\ast \in \Omega$ is a local minimizer, then it has to satisfy the first-order optimality conditions: $\nabla f(y^\ast)^\top d \ge 0, \forall d \in S_\Omega(y^\ast)$, where $S_\Omega(y^\ast)$ is the set of feasible directions: $d \in S_\Omega(y^\ast) \iff \exists \overline t \gt 0 : y^\ast + td \in \Omega, \forall t \in [0, \overline t]$. Let $y \in \Omega$ be arbitrary. By the convexity of $\Omega$, $t y^\ast + (1 - t)y = y^\ast + (1 - t)(y - y^\ast) \in \Omega, \forall t \in [0, 1]$. Hence, $y - y^\ast \in S_\Omega(y^\ast), \forall y \in \Omega$.

This implies $\nabla f(y^\ast)^\top (y - y ^\ast) \ge 0, \forall y \in \Omega.$

##### Remark

If $f$ is convex, we have $f(y) - f(y^\ast) \ge \nabla f(y^\ast)^\top (y - y^\ast), \forall y \in \Omega$. Combined with the FONC, this shows that $y^\ast$ is a global solution.

#### Projection Theorem

Let $\Omega$ be a nonempty, closed and convex set. Then:

• A point $y^\ast$ is the projection of $x$ onto $\Omega$, i.e., $y^\ast = \mathcal P_\Omega(x)$, iff $(y^\ast - x)^\top(y - y^\ast) \ge 0, \forall y \in \Omega$

• The mapping $\mathcal P_\Omega: \mathbb R^n \to \mathbb R^n$ is Lipschitz continuous with constant $L = 1$.

• The vector $x^\ast$ is a stationary point iff $x^\ast - \mathcal P_\Omega(x^\ast - \lambda \nabla f(x^\ast)) = 0$ for any $\lambda \gt 0$.

##### Proof

• We first note that the problem $\min_y \frac{1}{2}\|y - x\|^2 ~\text{s.t.}~ y \in \Omega$ has a unique solution. Suppose there are two solutions $y^\ast$ and $\overline y$ with $\frac{1}{2}\|y^\ast - x\|^2 = \frac{1}{2}\|\overline y - x\|^2 = f^\ast$.

Then, with $f(y) = \frac{1}{2}\|y - x\|^2$, \begin{aligned} f\left(\frac{1}{2}y^\ast + \frac{1}{2}\overline y\right) &= \frac{1}{2} \left\|\frac{1}{2}(y^\ast - x) + \frac{1}{2}(\overline y - x)\right\|^2 \\ &= \frac{1}{2} \left[\frac{1}{4}\|y^\ast - x\|^2 + \frac{1}{2}(y^\ast - x)^\top (\overline y - x) + \frac{1}{4} \|\overline y - x\|^2\right] \\ &= \frac{1}{2} f(y^\ast) + \frac{1}{2} f(\overline y) + \frac{1}{2} \left[\left(\frac{1}{4} - \frac{1}{2}\right)\|y^\ast - x\|^2 + \frac{1}{2} (y^\ast - x)^\top(\overline y - x) + \left(\frac{1}{4} - \frac{1}{2}\right)\|\overline y - x\|^2\right] \\ &= f^\ast - \frac{1}{2}\left\|\frac{1}{2}(y^\ast - x) - \frac{1}{2}(\overline y - x)\right\|^2 \\ &= f^\ast - \frac{1}{8}\|y^\ast - \overline y\|^2 \\ &\lt f^\ast \quad (\text{if } y^\ast \neq \overline y) \end{aligned}

This contradicts the optimality of $y^\ast$ and $\overline y$; hence $y^\ast = \overline y$ and the solution is unique.

• As a consequence, since the problem is convex, the FONC uniquely characterizes the projection $y^\ast = \mathcal P_\Omega(x)$, i.e., setting $f(y) = \frac{1}{2}\|y - x\|^2$, we have: \begin{aligned} y^\ast = \mathcal P_\Omega(x) &\iff \nabla f(y^\ast)^\top (y - y^\ast) \ge 0, \forall y \in \Omega \\ &\iff (y^\ast - x)^\top (y - y^\ast) \ge 0, \forall y \in \Omega \end{aligned}

• Let us consider the projections $\mathcal P_\Omega(x)$ and $\mathcal P_\Omega(y)$; by the property above, $(\mathcal P_\Omega(x) - x)^\top (z - \mathcal P_\Omega(x)) \ge 0, \forall z \in \Omega ~~(\text{choose}~z = \mathcal P_\Omega(y))$ and $(\mathcal P_\Omega(y) - y)^\top (z - \mathcal P_\Omega(y)) \ge 0, \forall z \in \Omega ~~(\text{choose}~z = \mathcal P_\Omega(x))$. Adding up these inequalities: \begin{aligned} 0 &\le (\mathcal P_\Omega(x) - x)^\top (\mathcal P_\Omega(y) - \mathcal P_\Omega (x)) + (\mathcal P_\Omega(y) - y)^\top(\mathcal P_\Omega(x) - \mathcal P_\Omega (y)) \\ \Rightarrow ~& \|\mathcal P_\Omega(x) - \mathcal P_\Omega(y)\|^2 \le (x - y)^\top (\mathcal P_\Omega (x) - \mathcal P_\Omega(y)) \le \|x - y\| \cdot \|\mathcal P_\Omega(x) - \mathcal P_\Omega (y)\| \\ \Rightarrow~& \|\mathcal P_\Omega(x) - \mathcal P_\Omega (y)\| \le \|x - y\| \end{aligned} Therefore, $\mathcal P_\Omega$ is Lipschitz continuous with constant $L = 1$.

• Using the first property with $x^\ast - \lambda \nabla f(x^\ast)$ in place of $x$: \begin{aligned} x^\ast = \mathcal P_\Omega (x^\ast - \lambda \nabla f(x^\ast)) &\iff (x^\ast - \left[x^\ast - \lambda \nabla f(x^\ast)\right])^\top (x - x^\ast) \ge 0, \forall x \in \Omega \\ &\iff \lambda \nabla f(x^\ast)^\top (x - x^\ast) \ge 0, \forall x \in \Omega \\ &\iff x^\ast \text{ is a stationary point of the problem} \end{aligned}

### Algorithm

In the Projected Gradient Method, we project the step $x_k - \lambda_k \nabla f(x_k)$ back onto $\Omega$: $x_{k + 1} = \mathcal P_\Omega(x_k - \lambda_k \nabla f(x_k))$ Two questions remain:

• How to choose $\lambda_k \gt 0$?
• How to guarantee descent and convergence?

We rewrite the formula as $x_{k + 1} = x_k + \left[\mathcal P_\Omega(x_k - \lambda_k \nabla f(x_k)) - x_k\right]$ and can then treat the bracketed term as the descent direction.

Consider: $x_{k+1} = x_k + \alpha_k d_k = (1 - \alpha_k)x_k + \alpha_k\mathcal P_\Omega(x_k - \lambda_k\nabla f(x_k))$ If $\alpha_k \in [0, 1]$, then the convexity of $\Omega$ implies that $x_{k + 1}$ will be feasible if $x_k \in \Omega$.

#### Descent Directions

Let $x \in \Omega$ and $\lambda \gt 0$ be given. If $x$ is not a stationary point, then the direction $d = \mathcal P_\Omega (x - \lambda \nabla f(x)) - x$ is a descent direction and it holds that $\nabla f(x)^\top d \le - \frac{1}{\lambda} \|d\|^2 \lt 0$

##### Proof

\begin{aligned} \nabla f(x)^\top d &= \nabla f(x)^\top [\mathcal P_\Omega(x - \lambda \nabla f(x)) - x] \\ &= \frac{1}{\lambda} (x - [x - \lambda \nabla f(x)])^\top [\mathcal P_\Omega(x - \lambda \nabla f(x)) - \mathcal P_\Omega(x)] ~~(\mathcal P_\Omega(x) = x \text{ since } x \in \Omega) \\ &\le - \frac{1}{\lambda} \| \mathcal P_\Omega(x - \lambda \nabla f(x)) - \mathcal P_\Omega (x)\|^2 ~~(\text{by the inequality in the Lipschitz proof above}) \\ &= -\frac{1}{\lambda} \|d\|^2 \end{aligned}

#### Steps

Since descent is guaranteed in this way, we can reuse techniques like backtracking line search to generate step sizes.

We can use $\|\mathcal P_\Omega(x_k - \lambda_k\nabla f(x_k)) - x_k\| \le \varepsilon$ as a stopping criterion. At each step:

• calculate $d_k = \mathcal P_\Omega(x_k - \lambda_k \nabla f(x_k)) - x_k$
• if $\|d_k\| \le \lambda_k \varepsilon$, then STOP and $x_k$ is the output.
• Armijo Line Search
• Update to $x_{k + 1}$
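The steps above can be sketched as follows (a minimal NumPy implementation; the Armijo parameters and the test problem are illustrative, not from the notes):

```python
import numpy as np

def projected_gradient(f, grad, proj, x0, lam=1.0, gamma=1e-4, beta=0.5,
                       eps=1e-8, max_iter=500):
    """Projected gradient method with Armijo backtracking on alpha in (0, 1]."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = proj(x - lam * grad(x)) - x         # d_k, a descent direction
        if np.linalg.norm(d) <= lam * eps:      # stopping criterion
            break
        alpha = 1.0                             # Armijo: backtrack until
        while f(x + alpha * d) > f(x) + gamma * alpha * (grad(x) @ d):
            alpha *= beta                       # sufficient decrease holds
        x = x + alpha * d                       # feasible since alpha <= 1
    return x

# Example: minimize ||x - (2, 2)||^2 over the unit ball; solution (1, 1)/sqrt(2).
proj_ball = lambda z: z if np.linalg.norm(z) <= 1 else z / np.linalg.norm(z)
x = projected_gradient(lambda z: float(np.sum((z - 2.0) ** 2)),
                       lambda z: 2.0 * (z - 2.0), proj_ball, x0=np.zeros(2))
```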

### Convergence

Suppose $f$ is continuously differentiable, $\Omega$ is nonempty, convex, and closed, and the step sizes $(\lambda_k)_k$ are bounded: $0 \lt \underline \lambda \le \lambda_k \le \overline \lambda, \forall k$. Then every accumulation point of $(x_k)_k$ is a stationary point.

If $\nabla f$ is additionally Lipschitz continuous, we obtain:

• If $f$ is convex, we converge to global solutions of the problem.
• If $\lambda_k \le \frac{2(1 - \gamma)}{L}$, then $\alpha_k = 1$ will be accepted as step size!
• If $f$ is (strongly) convex, then $(x_k)_k$ converges linearly to a global solution $x^\ast \in \Omega$.

## Integer Programming

### Theorem: LP Relaxation as a Bound for IP

1. For a maximization problem, the optimal value of the LP relaxation provides an upper bound for the optimal value of the IP.
2. For a minimization problem, the optimal value of the LP relaxation provides a lower bound for the optimal value of the IP.

The difference between the optimal values of the LP relaxation and the IP is called the integrality gap.

• For maximization problems, the integrality gap is $v^\text{LP} - v^\text{IP}$.
• For minimization problems, the integrality gap is $v^\text{IP} - v^\text{LP}$.

Maximization problem: suppose we solve the LP relaxation and obtain the optimal value $v^\text{LP}$ (with a possibly fractional solution).

Then we round the solution to a feasible integer point and find the objective value is $v^\text{Rounding}$.

Then the gap between the rounded solution and the optimal integer solution (whose objective value is $v^\text{IP}$) satisfies: $0 \le v^\text{IP} - v^\text{Rounding} \le v^\text{LP} - v^\text{Rounding}$ That is, one can use the LP relaxation to bound how good a given feasible integer point is.
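As a toy illustration (a made-up 0-1 knapsack instance, solved with `scipy.optimize.linprog`):

```python
from scipy.optimize import linprog

# Hypothetical knapsack:  max 5*x1 + 4*x2  s.t.  6*x1 + 4*x2 <= 9,  x in {0, 1}^2.
# linprog minimizes, so we negate the objective for the LP relaxation.
res = linprog(c=[-5.0, -4.0], A_ub=[[6.0, 4.0]], b_ub=[9.0],
              bounds=[(0, 1), (0, 1)])
v_lp = -res.fun    # LP value 49/6 ~ 8.17, attained at the fractional (5/6, 1)
v_rounding = 4.0   # round x1 down: (0, 1) is feasible with value 4
v_ip = 5.0         # true integer optimum: (1, 0)
# 0 <= v_ip - v_rounding <= v_lp - v_rounding, exactly the bound above
```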

### Theorem: The Ideal Case

If the optimal solution to the LP relaxation is integer, then the solution must be optimal to the IP problem.

### Total Unimodularity

#### Definition

A matrix $A$ is said to be totally unimodular if the determinant of each square submatrix of $A$ is either $0$, $1$, or $-1$.

#### Theorem: Total Unimodularity and Integer Solutions

If the constraint matrix $A$ is totally unimodular and $b$ is an integer vector, then all the basic feasible solutions (BFS) are integral, and the LP relaxation must have an optimal solution that is an integer solution.

#### One Sufficient Condition for TU

Let $A$ be an $m \times n$ matrix. Then the following conditions together are sufficient for $A$ to be totally unimodular:

1. Every column of $A$ contains at most two non-zero entries;
2. Every entry in $A$ is $0$, $+1$, or $-1$;
3. The rows of $A$ can be partitioned into two disjoint sets $B$ and $C$ such that
• If two non-zero entries in a column of $A$ have the same sign, then the row of one entry is in $B$, and the row of the other in $C$;
• If two non-zero entries in a column of $A$ have opposite signs, then the rows are both in $B$, or both in $C$.

#### Example

The matching problem: \begin{aligned} \max ~& \sum_{i = 1}^n\sum_{j = 1}^n v_{ij}x_{ij} \\ \text{s.t.}~ & \sum_{i = 1}^n x_{ij} = 1, \forall j \\ & \sum_{j = 1}^n x_{ij} = 1, \forall i \\ & x_{ij} \in \{0, 1\} \end{aligned} The constraint matrix is TU.
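One can verify this claim by brute force for a small instance (an exponential-time check straight from the definition, fine only for tiny matrices):

```python
import itertools
import numpy as np

def is_totally_unimodular(A):
    """Check that every square submatrix has determinant in {-1, 0, 1}."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    for k in range(1, min(m, n) + 1):
        for rows in itertools.combinations(range(m), k):
            for cols in itertools.combinations(range(n), k):
                d = round(np.linalg.det(A[np.ix_(rows, cols)]))
                if d not in (-1, 0, 1):
                    return False
    return True

# Constraint matrix of the 2x2 matching problem (columns x11, x12, x21, x22):
A = [[1, 1, 0, 0],   # sum_j x_1j = 1
     [0, 0, 1, 1],   # sum_j x_2j = 1
     [1, 0, 1, 0],   # sum_i x_i1 = 1
     [0, 1, 0, 1]]   # sum_i x_i2 = 1
print(is_totally_unimodular(A))  # True
```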

### Branch and Bound Algorithm

We consider solving a maximization problem. The whole procedure is

1. Solve the LP relaxation.

• If the optimal solution is integral, then it is optimal to IP.
• Otherwise go to step 2.
2. Otherwise the optimal solution $x^\ast$ of the LP relaxation has some fractional component $x_i^\ast$; branch the problem into the following two:

• One with an added constraint that $x_i \le \lfloor x_i^\ast \rfloor$
• One with an added constraint that $x_i \ge \lceil x_i^\ast \rceil$
3. For each of the two subproblems, apply the same method recursively to obtain optimal solutions $y_1^\ast$ and $y_2^\ast$ with optimal values $v_1^\ast$ and $v_2^\ast$. Compare them to obtain the optimal solution.

With bounding:

• The LP relaxation at each node provides an upper bound for that node's subtree in the branching process.

• Any feasible point to the IP can provide a lower bound for the entire problem.

When the optimal value of the LP relaxation of a branch is no greater than the current lower bound, we can discard (prune) this branch.

• No better solution can be obtained from further exploring this branch.

#### Remark

We usually prefer DFS, since it discovers integer solutions earlier, which provides lower bounds for pruning sooner.

#### Program

A recursive implementation with cvxpy (assuming `obj` is a maximization objective and `vars` is a list of scalar cvxpy variables):

```python
import math

import cvxpy as cp

def branch_and_bound(obj, cons, vars, eps=1e-4):
    """Maximize `obj` over `cons` with integrality required on the scalar
    variables in `vars`, by LP-relaxation branch and bound."""
    best_val, best_sol = None, None

    def _branch_and_bound(extra_cons):
        nonlocal best_val, best_sol
        prob = cp.Problem(obj, cons + extra_cons)
        val = prob.solve()
        if prob.status not in ('optimal', 'optimal_inaccurate'):
            return                        # infeasible node: prune
        if best_val is not None and val <= best_val + eps:
            return                        # bound: this subtree cannot improve
        for v in vars:                    # branch on the first fractional variable
            x = float(v.value)
            if min(x - math.floor(x), math.ceil(x) - x) > eps:
                _branch_and_bound(extra_cons + [v <= math.floor(x)])
                _branch_and_bound(extra_cons + [v >= math.ceil(x)])
                return
        # All variables are (nearly) integral: update the incumbent.
        best_val = val
        best_sol = [round(float(v.value)) for v in vars]

    _branch_and_bound([])
    return best_val, best_sol
```


#### Complexity

There is no known polynomial-time algorithm for IP (the problem is NP-hard), but branch and bound is good enough to solve a large range of IP problems.