More on Newton's Method
Convergence Rate
Let $f$ be twice continuously differentiable and let $x^\ast$ be a local minimizer of $f$. For some given $\delta > 0$ assume that:
- there exists $\mu > 0$ with $d^\top \nabla^2 f(x)\, d \ge \mu \|d\|^2$ for any $x \in B_\delta(x^\ast)$ and $d \in \mathbb{R}^n$;
- there exists $L > 0$ with $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L \|x - y\|$ for all $x, y \in B_\delta(x^\ast)$.
Let $(x_k)_k$ be generated by Newton's method. Then $x_k \to x^\ast$ for all $x_0$ sufficiently close to $x^\ast$, and in addition,
$$\|x_{k+1} - x^\ast\| \le \frac{L}{2\mu}\|x_k - x^\ast\|^2,$$
i.e., the convergence is q-quadratic. In a word,
Hessian locally uniformly positive definite and Lipschitz continuous, with $x_0$ close to $x^\ast$, implies q-quadratic convergence.
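To see the q-quadratic rate numerically, here is a minimal sketch (the example function is my own choice, not from the notes): Newton's method on $f(x) = e^x - 2x$, whose unique minimizer is $x^\ast = \ln 2$ and whose second derivative $e^x$ is positive and locally Lipschitz.

```python
import math

def newton(x0, iters=6):
    """Newton's method on f(x) = exp(x) - 2x; returns the error after each step."""
    x = x0
    errors = []
    for _ in range(iters):
        grad = math.exp(x) - 2      # f'(x)
        hess = math.exp(x)          # f''(x) > 0 everywhere
        x = x - grad / hess         # Newton step: x_{k+1} = x_k - f'(x_k)/f''(x_k)
        errors.append(abs(x - math.log(2)))
    return errors

errors = newton(1.0)
# Each error is roughly the square of the previous one, as the theorem predicts.
```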
On the way to Global Convergence
Unfortunately, Newton's method does not have the global convergence property. However, one can combine Newton's method with the gradient descent method to gain the speed-up without losing global convergence.
More precisely, at each iteration, once the Newton direction $d_k$ is calculated, one can check:
- $-\nabla f(x_k)^\top d_k \ge \gamma_1 \min\{1, \|d_k\|^{\gamma_2}\}\, \|d_k\|^2, \quad \gamma_1, \gamma_2 \in (0, 1)$.
If this condition fails, we use the gradient direction instead.
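The safeguard can be sketched as follows (the function name and the parameter values $\gamma_1, \gamma_2$ are assumptions for illustration):

```python
import numpy as np

def choose_direction(grad, hess, gamma1=1e-6, gamma2=0.1):
    """Return the Newton direction if it is a sufficiently good descent direction,
    otherwise fall back to the negative gradient."""
    try:
        d = np.linalg.solve(hess, -grad)    # Newton direction
    except np.linalg.LinAlgError:
        return -grad                        # singular Hessian: use the gradient
    nd = np.linalg.norm(d)
    # Check -grad^T d >= gamma1 * min{1, ||d||^gamma2} * ||d||^2.
    if -grad @ d >= gamma1 * min(1.0, nd**gamma2) * nd**2:
        return d
    return -grad                            # fall back to steepest descent

grad = np.array([1.0, -2.0])
hess = np.array([[2.0, 0.0], [0.0, 1.0]])   # positive definite: Newton step accepted
d = choose_direction(grad, hess)            # -> [-0.5, 2.0]
```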
Algorithms for Constrained Problems
Projected gradient method
This method is both geometrical and analytical. For the constrained problem $\min_x f(x)$ s.t. $x \in \Omega$, we calculate $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$ iteratively. However, since we require $x \in \Omega$, $x_{k+1}$ may become infeasible for some $k$. Whenever we run into this feasibility problem, we project the point back onto the feasible region.
Definition: Euclidean Projections
Let $\Omega \subseteq \mathbb{R}^n$ be a nonempty, closed, convex set. The (Euclidean) projection of $x$ onto $\Omega$ is defined as the unique minimizer of the constrained optimization problem
$$\min_y \frac{1}{2}\|y - x\|^2 \quad \text{s.t.} \quad y \in \Omega,$$
and we write $\mathcal P_\Omega(x)$ for it.
The projection is the point in $\Omega$ that has the minimum distance to $x$.
Linear Constraints
Suppose we have $\Omega = \{x \in \mathbb{R}^n : Ax = b\}$, where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$ are given.
If $A$ has full row rank ($\operatorname{rank}(A) = m$), then it holds that
$$\mathcal P_\Omega(x) = x - A^\top (AA^\top)^{-1}(Ax - b).$$
Derivation
Consider the problem
$$\min_y \frac{1}{2}\|y - x\|^2 \quad \text{s.t.} \quad Ay = b.$$
One can show that the KKT conditions are
$$y - x + A^\top \lambda = 0, \quad Ay = b.$$
Therefore, $b = Ay = Ax - AA^\top \lambda$. Since $A$ has full row rank, $AA^\top$ is invertible. Therefore,
$$\lambda = (AA^\top)^{-1}(Ax - b), \quad \mathcal P_\Omega(x) = y = x - A^\top(AA^\top)^{-1}(Ax - b).$$
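The closed-form projection $x - A^\top(AA^\top)^{-1}(Ax - b)$ can be checked numerically (the matrix and point below are made-up example data):

```python
import numpy as np

def project_affine(x, A, b):
    """Projection onto {x : Ax = b}, assuming A has full row rank."""
    lam = np.linalg.solve(A @ A.T, A @ x - b)   # multiplier from the KKT system
    return x - A.T @ lam

A = np.array([[1.0, 1.0, 1.0]])                 # hyperplane x1 + x2 + x3 = 1
b = np.array([1.0])
p = project_affine(np.array([1.0, 2.0, 3.0]), A, b)
# p satisfies A @ p == b, and p - x is normal to the hyperplane.
```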
Box Constraints
Suppose that $\Omega$ is given by box constraints: $\Omega = \{x \in \mathbb{R}^n : l \le x \le u\}$, where $l, u \in \mathbb{R}^n$ are given.
Then, the projection can be computed componentwise:
$$\mathcal P_\Omega(x)_i = \max\{l_i, \min\{x_i, u_i\}\}, \quad i = 1, \dots, n.$$
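In NumPy the componentwise formula is exactly clipping (the bounds and point below are made-up examples):

```python
import numpy as np

l = np.array([0.0, 0.0, -1.0])    # lower bounds
u = np.array([1.0, 2.0, 1.0])     # upper bounds
x = np.array([-0.5, 3.0, 0.2])

# P(x)_i = max(l_i, min(x_i, u_i)) in one call:
p = np.clip(x, l, u)              # -> [0.0, 2.0, 0.2]
```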
Ball Constraints
For a Euclidean ball $\Omega = \{x : \|x - c\| \le r\}$, the projection is
$$\mathcal P_\Omega(x) = \begin{cases} x, & \|x - c\| \le r, \\ c + \dfrac{r}{\|x - c\|}(x - c), & \text{otherwise.} \end{cases}$$
Derivation
We write the problem as
$$\min_y \frac{1}{2}\|y - x\|^2 \quad \text{s.t.} \quad g(y) \le 0,$$
where $g(y) = \frac{1}{2}(\|y - c\|^2 - r^2)$.
One can show that the KKT conditions are
$$y - x + \mu(y - c) = 0, \quad \mu \ge 0, \quad g(y) \le 0, \quad \mu\, g(y) = 0.$$
Solving this system yields the projection.
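The case distinction (fix interior points, rescale exterior points onto the boundary) is short in code; the center, radius, and test point are made-up examples:

```python
import numpy as np

def project_ball(x, c, r):
    """Projection onto the Euclidean ball {y : ||y - c|| <= r}."""
    d = x - c
    n = np.linalg.norm(d)
    if n <= r:
        return x                   # already feasible
    return c + (r / n) * d         # rescale onto the boundary

p = project_ball(np.array([3.0, 4.0]), np.zeros(2), 1.0)   # -> [0.6, 0.8]
```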
Properties of Projections and Descent Directions
FONC for Convex-Constrained Problems
Consider a minimization problem on a convex set:
$$\min_x f(x) \quad \text{s.t.} \quad x \in \Omega.$$
Let $f$ be $C^1$ on an open set that contains the convex, closed set $\Omega$. Let $x^\ast$ be a local minimizer of the above program. Then,
$$\nabla f(x^\ast)^\top (x - x^\ast) \ge 0, \quad \forall x \in \Omega.$$
Moreover,
- If $f$ is convex, then $x^\ast$ is a global minimizer of the above problem iff the FONC is satisfied.
- A point satisfying the FONC is called a stationary point.
Proof
If $x^\ast$ is a local minimizer, then it has to satisfy the first-order optimality conditions:
$$\nabla f(x^\ast)^\top d \ge 0, \quad \forall d \in \mathcal F(x^\ast),$$
where $\mathcal F(x^\ast)$ is the set of feasible directions at $x^\ast$. Let $y \in \Omega$ be arbitrary; then $x^\ast + t(y - x^\ast) \in \Omega$ for all $t \in [0, 1]$ by the convexity of $\Omega$. Hence, $d = y - x^\ast \in \mathcal F(x^\ast)$.
This implies $\nabla f(x^\ast)^\top (y - x^\ast) \ge 0$ for all $y \in \Omega$.
Remark
If $f$ is convex, we have $f(y) \ge f(x^\ast) + \nabla f(x^\ast)^\top (y - x^\ast)$ for all $y \in \Omega$. With the FONC, we see that $f(y) \ge f(x^\ast)$, so $x^\ast$ is a global solution.
Projection Theorem
Let $\Omega \subseteq \mathbb{R}^n$ be a nonempty, closed and convex set. Then:
- A point $y^\ast$ is the projection of $x$ onto $\Omega$, i.e., $y^\ast = \mathcal P_\Omega(x)$, iff
  $$(y^\ast - x)^\top (y - y^\ast) \ge 0, \quad \forall y \in \Omega.$$
- The mapping $x \mapsto \mathcal P_\Omega(x)$ is Lipschitz continuous with constant $L = 1$.
- The vector $x^\ast$ is a stationary point iff
  $$x^\ast = \mathcal P_\Omega(x^\ast - \lambda \nabla f(x^\ast)), \quad \lambda > 0.$$
Proof
- We first note that the problem
  $$\min_y \frac{1}{2}\|y - x\|^2 \quad \text{s.t.} \quad y \in \Omega$$
  has a unique solution. Suppose there are two solutions $y^\ast$ and $\overline y$ with
  $$\frac{1}{2}\|y^\ast - x\|^2 = \frac{1}{2}\|\overline y - x\|^2 = f^\ast.$$
  Then,
  $$\begin{aligned} f\!\left(\tfrac{1}{2}y^\ast + \tfrac{1}{2}\overline y\right) &= \tfrac{1}{2}\left\|\tfrac{1}{2}(y^\ast - x) + \tfrac{1}{2}(\overline y - x)\right\|^2 \\ &= \tfrac{1}{2}\left[\tfrac{1}{4}\|y^\ast - x\|^2 + \tfrac{1}{2}(y^\ast - x)^\top(\overline y - x) + \tfrac{1}{4}\|\overline y - x\|^2\right] \\ &= \tfrac{1}{2}f(y^\ast) + \tfrac{1}{2}f(\overline y) + \tfrac{1}{2}\left[\left(\tfrac{1}{4} - \tfrac{1}{2}\right)\|y^\ast - x\|^2 + \tfrac{1}{2}(y^\ast - x)^\top(\overline y - x) + \left(\tfrac{1}{4} - \tfrac{1}{2}\right)\|\overline y - x\|^2\right] \\ &= f^\ast - \tfrac{1}{2}\left\|\tfrac{1}{2}(y^\ast - x) - \tfrac{1}{2}(\overline y - x)\right\|^2 \\ &= f^\ast - \tfrac{1}{8}\|y^\ast - \overline y\|^2 \\ &< f^\ast. \end{aligned}$$
  Since $\tfrac{1}{2}y^\ast + \tfrac{1}{2}\overline y \in \Omega$ by convexity, this is a contradiction to the optimality of $y^\ast$ and $\overline y$.
- As a consequence, since the problem is convex, the FONC uniquely characterizes the projection, i.e., setting $f(y) = \frac{1}{2}\|y - x\|^2$ and $y^\ast = \mathcal P_\Omega(x)$, we have:
  $$\begin{aligned} y^\ast = \mathcal P_\Omega(x) &\iff \nabla f(y^\ast)^\top (y - y^\ast) \ge 0, \quad \forall y \in \Omega \\ &\iff (y^\ast - x)^\top (y - y^\ast) \ge 0, \quad \forall y \in \Omega. \end{aligned}$$
- Let us consider the projections $\mathcal P_\Omega(x)$ and $\mathcal P_\Omega(y)$; by the property above,
  $$(\mathcal P_\Omega(x) - x)^\top (z - \mathcal P_\Omega(x)) \ge 0, \quad \forall z \in \Omega \quad (\text{choose}~z = \mathcal P_\Omega(y)),$$
  $$(\mathcal P_\Omega(y) - y)^\top (z - \mathcal P_\Omega(y)) \ge 0, \quad \forall z \in \Omega \quad (\text{choose}~z = \mathcal P_\Omega(x)).$$
  Add up these inequalities:
  $$\begin{aligned} 0 &\le (\mathcal P_\Omega(x) - x)^\top (\mathcal P_\Omega(y) - \mathcal P_\Omega(x)) + (\mathcal P_\Omega(y) - y)^\top (\mathcal P_\Omega(x) - \mathcal P_\Omega(y)) \\ \Rightarrow~ &\|\mathcal P_\Omega(x) - \mathcal P_\Omega(y)\|^2 \le (x - y)^\top (\mathcal P_\Omega(x) - \mathcal P_\Omega(y)) \le \|x - y\| \cdot \|\mathcal P_\Omega(x) - \mathcal P_\Omega(y)\| \\ \Rightarrow~ &\|\mathcal P_\Omega(x) - \mathcal P_\Omega(y)\| \le \|x - y\|. \end{aligned}$$
  Hence $\mathcal P_\Omega$ is Lipschitz continuous with constant $L = 1$.
- Using the first property with $x^\ast = \mathcal P_\Omega(x^\ast - \lambda \nabla f(x^\ast))$, we have
  $$\begin{aligned} &(\mathcal P_\Omega(x^\ast - \lambda \nabla f(x^\ast)) - [x^\ast - \lambda \nabla f(x^\ast)])^\top (x - \mathcal P_\Omega(x^\ast - \lambda \nabla f(x^\ast))) \ge 0, \quad \forall x \in \Omega \\ \iff~ &\lambda \nabla f(x^\ast)^\top (x - x^\ast) \ge 0, \quad \forall x \in \Omega \\ \iff~ &x^\ast~\text{is a stationary point of the problem.} \end{aligned}$$
Algorithm
In the projected gradient method, we project the gradient step back onto $\Omega$:
$$x_{k+1} = \mathcal P_\Omega(x_k - \lambda_k \nabla f(x_k)).$$
Two questions remain:
- How to choose $\lambda_k > 0$.
- How to guarantee descent and convergence.
To incorporate a line search, we change the formula into
$$x_{k+1} = x_k + \alpha_k d_k, \quad d_k = \mathcal P_\Omega(x_k - \lambda_k \nabla f(x_k)) - x_k.$$
Descent Directions
Let $d_k = \mathcal P_\Omega(x_k - \lambda_k \nabla f(x_k)) - x_k$. If $d_k \ne 0$, then $d_k$ is a descent direction for $f$ at $x_k$; in particular,
$$\nabla f(x_k)^\top d_k \le -\frac{1}{\lambda_k}\|d_k\|^2 < 0.$$
Proof
Apply the projection characterization $(\mathcal P_\Omega(z) - z)^\top (y - \mathcal P_\Omega(z)) \ge 0$ with $z = x_k - \lambda_k \nabla f(x_k)$ and $y = x_k \in \Omega$. Since $\mathcal P_\Omega(z) = x_k + d_k$, this reads
$$(d_k + \lambda_k \nabla f(x_k))^\top (-d_k) \ge 0 \implies \nabla f(x_k)^\top d_k \le -\frac{1}{\lambda_k}\|d_k\|^2.$$
Steps
Since descent is guaranteed in this way, we can reuse techniques like backtracking to generate the step sizes $\alpha_k$.
We can use the following scheme:
- Calculate $d_k = \mathcal P_\Omega(x_k - \lambda_k \nabla f(x_k)) - x_k$.
- If $\|d_k\| \le \lambda_k \varepsilon$, then STOP and output $x_k$.
- Run an Armijo line search to obtain $\alpha_k$.
- Update $x_{k+1} = x_k + \alpha_k d_k$.
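The scheme above can be sketched as follows; the test problem (a box-constrained quadratic) and the parameter values are my own illustration, not from the notes:

```python
import numpy as np

def projected_gradient(grad_f, f, project, x0, lam=1.0, gamma=1e-4,
                       tol=1e-8, max_iter=500):
    """Projected gradient method with backtracking (Armijo) line search."""
    x = x0
    for _ in range(max_iter):
        d = project(x - lam * grad_f(x)) - x          # projected gradient direction
        if np.linalg.norm(d) <= lam * tol:            # stationarity test
            break
        alpha = 1.0
        # Armijo condition: f(x + a d) <= f(x) + gamma * a * grad^T d
        while f(x + alpha * d) > f(x) + gamma * alpha * (grad_f(x) @ d):
            alpha *= 0.5                              # backtracking
        x = x + alpha * d
    return x

c = np.array([2.0, -1.0])
f = lambda x: 0.5 * x @ x - x @ c                     # unconstrained minimizer is c
grad_f = lambda x: x - c
project = lambda x: np.clip(x, 0.0, 1.0)              # box [0, 1]^2
x_star = projected_gradient(grad_f, f, project, np.array([0.5, 0.5]))
# constrained minimizer is (1, 0)
```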
Convergence
If $\nabla f$ is Lipschitz continuous with constant $L$, then every accumulation point of $(x_k)_k$ is a stationary point. Moreover:
- If $f$ is convex, we converge to global solutions of the problem.
- If $\lambda_k \le \frac{2(1 - \gamma)}{L}$, then $\alpha_k = 1$ will be accepted as the step size!
- If $f$ is (strongly) convex, then $(x_k)_k$ converges linearly to a global solution $x^\ast \in \Omega$.
Integer Programming
Theorem: LP Relaxation as a Bound for IP
- For a maximization problem, the optimal value of the LP relaxation provides an upper bound for the optimal value of the IP.
- For a minimization problem, the optimal value of the LP relaxation provides a lower bound for the optimal value of the IP.
The difference between the optimal values of the LP and the IP is called the integrality gap.
- For maximization problems, the integrality gap is $v^\text{LP} - v^\text{IP}$.
- For minimization problems, the integrality gap is $v^\text{IP} - v^\text{LP}$.
Maximization problem: suppose we solve the LP relaxation and obtain the optimal value $v^\text{LP}$. Then we round the solution to a feasible integer point and find its objective value $v^\text{R}$. Since $v^\text{R} \le v^\text{IP} \le v^\text{LP}$, the gap between the rounded solution and the optimal integer solution (whose objective value is $v^\text{IP}$) is at most $v^\text{LP} - v^\text{R}$.
Theorem: The Ideal Case
If the optimal solution to the LP relaxation is integer, then the solution must be optimal to the IP problem.
Total Unimodularity
Definition
A matrix $A$ is totally unimodular (TU) if every square submatrix of $A$ has determinant $0$, $+1$, or $-1$.
Theorem: Total Unimodularity and Integer Solutions
If the constraint matrix $A$ is totally unimodular and the right-hand side $b$ is integral, then every vertex (basic feasible solution) of the LP relaxation is integral, so the LP relaxation has an integral optimal solution.
One Sufficient Condition for TU
Let $A$ be an $m \times n$ matrix. Then $A$ is totally unimodular if:
- Every column of $A$ contains at most two non-zero entries;
- Every entry in $A$ is $0$, $+1$, or $-1$;
- The rows of $A$ can be partitioned into two disjoint sets $B$ and $C$ such that
    - If two non-zero entries in a column of $A$ have the same sign, then the row of one entry is in $B$, and the row of the other is in $C$;
    - If two non-zero entries in a column of $A$ have opposite signs, then the rows are both in $B$, or both in $C$.
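For small matrices, total unimodularity can be checked directly from the definition by enumerating all square submatrices (this brute force is exponential, so it is only a sanity-check sketch; the example matrix is made up):

```python
import itertools

import numpy as np

def is_totally_unimodular(A):
    """Brute-force TU check: every square submatrix has det 0, +1, or -1."""
    m, n = A.shape
    for k in range(1, min(m, n) + 1):
        for rows in itertools.combinations(range(m), k):
            for cols in itertools.combinations(range(n), k):
                det = int(round(np.linalg.det(A[np.ix_(rows, cols)])))
                if det not in (-1, 0, 1):
                    return False
    return True

# Interval-style 0/1 matrix satisfying the sufficient condition above.
A = np.array([[1, 1, 0],
              [0, 1, 1]])
ok = is_totally_unimodular(A)   # -> True
```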
Example
The matching problem: the constraint matrix is the incidence matrix of a bipartite graph. Each column (edge) has exactly two $+1$ entries, one on each side of the bipartition, so taking $B$ and $C$ to be the two sides satisfies the sufficient condition, and the matrix is totally unimodular.
Branch and Bound Algorithm
We consider solving a maximization problem. The whole procedure is:
1. Solve the LP relaxation.
    - If the optimal solution is integral, then it is optimal to the IP.
    - Otherwise, go to step 2.
2. If the optimal solution to the LP relaxation is $x^\ast$ and $x_i^\ast$ is fractional, then branch the problem into the following two:
    - one with an added constraint $x_i \le \lfloor x_i^\ast \rfloor$;
    - one with an added constraint $x_i \ge \lceil x_i^\ast \rceil$.
3. For each of the two subproblems, use the same method to solve them, and get optimal solutions $y_1^\ast$ and $y_2^\ast$. The better of the two is the optimal solution of the current problem.
With bounding:
- Any LP relaxation solution provides an upper bound for its node in the branching process.
- Any feasible point of the IP provides a lower bound for the entire problem.
When the optimal value of the LP relaxation of a branch is less than the current lower bound, we can discard that branch:
- No better solution can be obtained from further exploring this branch.
Remark
We usually prefer to use DFS, since it discovers integer solutions at an early stage, which provides lower bounds for pruning.
Program
```python
import math

import cvxpy as cp

def branch_and_bound(obj, cons, vars, eps=1e-4):
    """Branch and bound for a maximization problem with integer variables."""
    current_opt = None
    current_sol = None

    def _branch_and_bound(_cons):
        nonlocal current_opt, current_sol
        prob = cp.Problem(obj, _cons)
        opt = prob.solve()
        # Prune infeasible nodes and nodes whose LP upper bound
        # cannot beat the incumbent.
        if prob.status != 'optimal':
            return
        if current_opt is not None and opt <= current_opt:
            return
        # Branch on the first fractional component.
        for var in vars:
            v = float(var.value)
            floor, ceil = math.floor(v), math.ceil(v)
            if v - floor > eps and ceil - v > eps:
                _branch_and_bound([*_cons, var <= floor])
                _branch_and_bound([*_cons, var >= ceil])
                return
        # All components are integral: update the incumbent.
        current_opt = opt
        current_sol = [float(var.value) for var in vars]

    _branch_and_bound(cons)
    return current_opt, current_sol
```
Complexity
There is no known polynomial-time algorithm for general IP, but branch and bound is good enough to solve a wide range of IP problems.