Optimization Learning Notes 7

Public Date: 2020-09-13 16:40:43.046931

Update Date: 2020-09-13 16:40:43.046931

More on Newton's Method

Convergence Rate

Let $f$ be twice continuously differentiable and let $x^\ast$ be a local minimizer of $f$. For some given $\epsilon > 0$ assume that:

  • there exists $\mu > 0$ with $\nabla^2 f(x) \succcurlyeq \mu I$ for all $x \in B_\epsilon(x^\ast)$;
  • there exists $L > \mu$ with $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L \|x - y\|$ for all $x, y \in B_\epsilon(x^\ast)$.

Let $(x_k)_k$ be generated by Newton's method. Then for $k = 0, 1, \dots$ $$\|x_{k+1} - x^\ast\| \le \frac{L}{2\mu} \|x_k - x^\ast\|^2$$ and in addition, if $\|x_0 - x^\ast\| \le \frac{\mu \min\{1, \epsilon\}}{L}$, then $$\|x_k - x^\ast\| \le \frac{2\mu}{L} \left(\frac{1}{2}\right)^{2^k}, \quad k = 1, 2, 3, \dots$$ In a word:

Hessian locally uniformly positive definite and Lipschitz continuous, with $x_0$ close to $x^\ast$, implies that $(x_k)_k$ converges q-quadratically to $x^\ast$.
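The rate is easy to observe numerically. A minimal sketch (the test function $f(x) = \cosh x$ and the helper `newton` are my own choices, not from the notes): near $x^\ast = 0$ the error is roughly squared at every step.

```python
import math

# Newton's method on f(x) = cosh(x), whose unique minimizer is x* = 0.
# Here f''(x) = cosh(x) >= 1 > 0 and f'' is locally Lipschitz, so the
# theorem predicts q-quadratic convergence for x_0 close to 0.
def newton(x0, steps):
    errors = []
    x = x0
    for _ in range(steps):
        x = x - math.sinh(x) / math.cosh(x)  # x_{k+1} = x_k - f'(x_k)/f''(x_k)
        errors.append(abs(x))                # |x_{k+1} - x*|
    return errors

errs = newton(0.5, 4)
# Each error is roughly the square of the previous one; the iterates hit
# machine precision within a handful of steps.
```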

On the way to Global Convergence

Unfortunately, Newton's method is not globally convergent. However, one can combine Newton's method with gradient descent to keep the fast local rate without losing global convergence.

More precisely, at each iteration, once the Newton direction $d_k$ is calculated, one can check whether it is a sufficiently good descent direction: $$-\nabla f(x_k)^\top d_k \ge \gamma_1 \min\{1, \|d_k\|^{\gamma_2}\}\|d_k\|^2, \quad \gamma_1, \gamma_2 \in (0, 1).$$ If the check fails, we use the negative gradient direction instead.
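A sketch of this safeguard (the helper `choose_direction` and the default parameter values are my own illustration, not from the notes):

```python
import numpy as np

# Accept the Newton direction only if it is a sufficiently good descent
# direction; otherwise fall back to the negative gradient, which keeps
# the global convergence of gradient descent.
def choose_direction(grad, hess, gamma1=1e-6, gamma2=0.1):
    try:
        d = np.linalg.solve(hess, -grad)  # Newton direction
    except np.linalg.LinAlgError:
        return -grad
    nd = np.linalg.norm(d)
    if -grad @ d >= gamma1 * min(1.0, nd**gamma2) * nd**2:
        return d          # Newton step passes the descent check
    return -grad          # fall back to the gradient direction
```

For a positive definite Hessian the Newton step passes the check; for an indefinite one (where the Newton step may point uphill) the gradient direction is used instead.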

Algorithms for Constrained Problems

Projected gradient method

This method is both geometric and analytic. For a constrained problem we still compute $x_k$ iteratively; however, once we require $x \in \Omega$, the iterate $x_k$ may become infeasible for some $k$. Whenever we run into this feasibility problem, we project the point back onto the feasible region.

Definition: Euclidean Projections

Let $\Omega \subseteq \mathbb R^n$ be a nonempty, closed, convex set. The (Euclidean) projection of $x$ onto $\Omega$ is defined as the unique minimizer $y^\ast$ of the constrained optimization problem $$\min_y \frac{1}{2} \|x - y\|^2 \quad \text{s.t.} \quad y \in \Omega$$ and we write $y^\ast = \mathcal P_\Omega(x)$.

The projection $y^\ast = \mathcal P_\Omega(x)$ is the point in $\Omega$ with minimum distance to $x$.

Linear Constraints

Suppose we have $$\Omega = \{x : Ax = b\}$$ where $A \in \mathbb R^{m \times n}$ and $b \in \mathbb R^m$ are given.

If $A$ has full row rank ($m \le n$), then it holds that $$\mathcal P_\Omega(x) = x - A^\top (A A^\top)^{-1} [Ax - b]$$


Consider the problem $$\min_y \frac{1}{2} \|x - y\|^2 \quad \text{s.t.} \quad Ay = b$$ One can show that the KKT conditions are $$\begin{cases} y - x + A^\top \mu = 0 \\ Ay = b \end{cases}$$ Multiplying the first equation by $A$ gives $$0 = A(y - x + A^\top \mu) = Ay - Ax + AA^\top \mu = b - Ax + AA^\top \mu$$ Since $A$ has full row rank, $AA^\top$ is invertible, so $$\mu = (AA^\top)^{-1}[Ax - b]$$ Substituting back into $y = x - A^\top \mu$ yields $$\mathcal P_\Omega(x) = x - A^\top (AA^\top)^{-1}[Ax - b]$$
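The closed-form formula is a one-liner in NumPy. A small sketch (the helper name `project_affine` and the example hyperplane are my own):

```python
import numpy as np

# Projection onto {x : Ax = b} via P(x) = x - A^T (A A^T)^{-1} (Ax - b),
# assuming A has full row rank.
def project_affine(x, A, b):
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)

A = np.array([[1.0, 1.0, 1.0]])   # hyperplane x1 + x2 + x3 = 1
b = np.array([1.0])
p = project_affine(np.array([1.0, 1.0, 1.0]), A, b)
# p lies on the hyperplane and is the closest such point to x.
```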

Box Constraints

Suppose that $\Omega$ is given by box constraints: $$\Omega = \{x \in \mathbb R^n : x_i \in [a_i, b_i]\ \forall i\} = [a, b]$$ where $a, b \in \mathbb R^n$, $a \le b$ are given.

Then the projection can be computed componentwise: $$\mathcal P_\Omega(x)_i = \max\{\min\{x_i, b_i\}, a_i\}, \quad \forall i$$
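Componentwise clipping is exactly what this formula computes; in NumPy it coincides with `np.clip` (the example vectors are my own):

```python
import numpy as np

# Projection onto the box [a, b]: clip each coordinate into [a_i, b_i].
a = np.array([0.0, 0.0, 0.0])
b = np.array([1.0, 1.0, 1.0])
x = np.array([-0.5, 0.3, 2.0])
proj = np.maximum(np.minimum(x, b), a)  # max{min{x_i, b_i}, a_i}
```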

Ball Constraints

For a Euclidean ball $$\Omega = \{x \in \mathbb R^n : \|x - m\| \le r\}$$ the projection is $$\mathcal P_\Omega(x) = \begin{cases} x & \|x - m\| \le r, \\ m + \frac{r}{\|x - m\|}(x - m) & \|x - m\| > r. \end{cases}$$


We write the problem as $$\min_z \frac{1}{2} \|z - \overline x\|^2 \quad \text{s.t.} \quad \frac{1}{2}\|z\|^2 - \frac{1}{2}r^2 \le 0$$ where $\overline x = x - m$ and $z = y - m$.

One can show that the KKT conditions are $$\begin{cases} z - \overline x + \lambda z = 0 \\ \frac{\lambda}{2}(\|z\|^2 - r^2) = 0 \\ \|z\| \le r \\ \lambda \ge 0 \end{cases}$$ Solving the system (either $\lambda = 0$ with $z = \overline x$ and $\|\overline x\| \le r$, or $\|z\| = r$ with $z = \frac{r}{\|\overline x\|}\overline x$), we get the projection.
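The case distinction translates directly into code. A minimal sketch (the helper name `project_ball` is my own):

```python
import numpy as np

# Projection onto the ball {x : ||x - m|| <= r}, following the two cases
# above: keep x if it is already inside, otherwise rescale onto the sphere.
def project_ball(x, m, r):
    d = x - m
    n = np.linalg.norm(d)
    return x if n <= r else m + (r / n) * d

p = project_ball(np.array([3.0, 4.0]), np.zeros(2), 1.0)
# ||x - m|| = 5 > 1, so the point is rescaled onto the unit sphere.
```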

Properties of Projections and Descent Directions

FONC for Convex-Constrained Problems

Consider a minimization problem on a convex set: $$\min_y f(y) \quad \text{s.t.} \quad y \in \Omega$$

Let $f$ be $\mathcal C^1$ on an open set containing the convex, closed set $\Omega \subseteq \mathbb R^n$, and let $y^\ast \in \Omega$ be a local minimizer of the above program. Then $$\nabla f(y^\ast)^\top (y - y^\ast) \ge 0, \quad \forall y \in \Omega$$ Moreover,

  • If $f$ is convex, then $y^\ast \in \Omega$ is a global minimizer of the above problem iff the FONC is satisfied.
  • A point satisfying the FONC is called a stationary point.

If $y^\ast \in \Omega$ is a local minimizer, then it has to satisfy the first-order optimality conditions: $$\nabla f(y^\ast)^\top d \ge 0, \quad \forall d \in S_\Omega(y^\ast)$$ where $S_\Omega(y^\ast)$ is the set of feasible directions: $$d \in S_\Omega(y^\ast) \iff \exists\, \overline t > 0 : y^\ast + t d \in \Omega\ \forall t \in [0, \overline t]$$ Let $y \in \Omega$ be arbitrary. By the convexity of $\Omega$, $$t y^\ast + (1 - t)(y) = y^\ast + (1 - t)(y - y^\ast) \in \Omega, \quad \forall t \in [0, 1]$$ Hence $y - y^\ast \in S_\Omega(y^\ast)$ for all $y \in \Omega$.

This implies $$\nabla f(y^\ast)^\top (y - y^\ast) \ge 0, \quad \forall y \in \Omega.$$


If $f$ is convex, we have $f(y) - f(y^\ast) \ge \nabla f(y^\ast)^\top (y - y^\ast)$ for all $y \in \Omega$. Combined with the FONC, this shows that $y^\ast$ is a global solution.

Projection Theorem

Let $\Omega$ be a nonempty, closed, and convex set. Then:

  • A point $y^\ast$ is the projection of $x$ onto $\Omega$, i.e., $y^\ast = \mathcal P_\Omega(x)$, iff $$(y^\ast - x)^\top (y - y^\ast) \ge 0, \quad \forall y \in \Omega$$

  • The mapping $\mathcal P_\Omega : \mathbb R^n \to \mathbb R^n$ is Lipschitz continuous with constant $L = 1$.

  • A vector $x^\ast$ is a stationary point iff $$x^\ast - \mathcal P_\Omega(x^\ast - \lambda \nabla f(x^\ast)) = 0 \quad \text{for any } \lambda > 0.$$

  • We first note that the problem $$\min_y \frac{1}{2}\|y - x\|^2 \quad \text{s.t.} \quad y \in \Omega$$ has a unique solution. Suppose there are two solutions $y^\ast$ and $\overline y$ with $\frac{1}{2}\|y^\ast - x\|^2 = \frac{1}{2}\|\overline y - x\|^2 = f^\ast$.

    Then, writing $f(y) = \frac{1}{2}\|y - x\|^2$ and using the parallelogram identity, $$\begin{aligned} f\left(\tfrac{1}{2}y^\ast + \tfrac{1}{2}\overline y\right) &= \frac{1}{2}\left\|\tfrac{1}{2}(y^\ast - x) + \tfrac{1}{2}(\overline y - x)\right\|^2 \\ &= \frac{1}{2}f(y^\ast) + \frac{1}{2}f(\overline y) - \frac{1}{2}\left\|\tfrac{1}{2}(y^\ast - x) - \tfrac{1}{2}(\overline y - x)\right\|^2 \\ &= f^\ast - \frac{1}{8}\|y^\ast - \overline y\|^2 \\ &< f^\ast \quad \text{whenever } y^\ast \ne \overline y \end{aligned}$$

This contradicts the optimality of $y^\ast$ and $\overline y$; hence $y^\ast = \overline y$ and the solution is unique.

  • As a consequence, since the problem is convex, the FONC uniquely characterizes the projection $y^\ast = \mathcal P_\Omega(x)$: setting $f(y) = \frac{1}{2}\|y - x\|^2$, we have $$\begin{aligned} y^\ast = \mathcal P_\Omega(x) &\iff \nabla f(y^\ast)^\top (y - y^\ast) \ge 0, \quad \forall y \in \Omega \\ &\iff (y^\ast - x)^\top (y - y^\ast) \ge 0, \quad \forall y \in \Omega \end{aligned}$$

  • Consider the projections $\mathcal P_\Omega(x)$ and $\mathcal P_\Omega(y)$. By the property above, $$(\mathcal P_\Omega(x) - x)^\top (z - \mathcal P_\Omega(x)) \ge 0, \quad \forall z \in \Omega \quad (\text{choose } z = \mathcal P_\Omega(y))$$ $$(\mathcal P_\Omega(y) - y)^\top (z - \mathcal P_\Omega(y)) \ge 0, \quad \forall z \in \Omega \quad (\text{choose } z = \mathcal P_\Omega(x))$$ Adding up these inequalities: $$\begin{aligned} & 0 \le (\mathcal P_\Omega(x) - x)^\top (\mathcal P_\Omega(y) - \mathcal P_\Omega(x)) + (\mathcal P_\Omega(y) - y)^\top (\mathcal P_\Omega(x) - \mathcal P_\Omega(y)) \\ \Rightarrow\ & \|\mathcal P_\Omega(x) - \mathcal P_\Omega(y)\|^2 \le (x - y)^\top (\mathcal P_\Omega(x) - \mathcal P_\Omega(y)) \le \|x - y\| \cdot \|\mathcal P_\Omega(x) - \mathcal P_\Omega(y)\| \\ \Rightarrow\ & \|\mathcal P_\Omega(x) - \mathcal P_\Omega(y)\| \le \|x - y\| \end{aligned}$$ Therefore, $\mathcal P_\Omega$ is Lipschitz continuous with constant $L = 1$.

  • Using the first property with $x^\ast = \mathcal P_\Omega(x^\ast - \lambda \nabla f(x^\ast))$: $$\begin{aligned} & (\mathcal P_\Omega(x^\ast - \lambda \nabla f(x^\ast)) - [x^\ast - \lambda \nabla f(x^\ast)])^\top (x - \mathcal P_\Omega(x^\ast - \lambda \nabla f(x^\ast))) \ge 0, \quad \forall x \in \Omega \\ \iff\ & \lambda \nabla f(x^\ast)^\top (x - x^\ast) \ge 0, \quad \forall x \in \Omega \\ \iff\ & x^\ast \text{ is a stationary point of the problem} \end{aligned}$$


In the projected gradient method, we project the step $x_k - \lambda_k \nabla f(x_k)$ back onto $\Omega$: $$x_{k+1} = \mathcal P_\Omega(x_k - \lambda_k \nabla f(x_k))$$ Two questions remain:

  • How to choose $\lambda_k > 0$?
  • How to guarantee descent and convergence?

We rewrite the update as $$x_{k+1} = x_k + \left[\mathcal P_\Omega(x_k - \lambda_k \nabla f(x_k)) - x_k\right]$$ We can then treat the bracketed term as the descent direction.

Consider: $$x_{k+1} = x_k + \alpha_k d_k = (1 - \alpha_k) x_k + \alpha_k \mathcal P_\Omega(x_k - \lambda_k \nabla f(x_k))$$ If $\alpha_k \in [0, 1]$, then the convexity of $\Omega$ implies that $x_{k+1}$ is feasible whenever $x_k \in \Omega$.

Descent Directions

Let $x \in \Omega$ and $\lambda > 0$ be given. If $x$ is not a stationary point, then the direction $d = \mathcal P_\Omega(x - \lambda \nabla f(x)) - x$ is a descent direction and it holds that $$\nabla f(x)^\top d \le -\frac{1}{\lambda}\|d\|^2 < 0$$


$$\begin{aligned} \nabla f(x)^\top d &= \nabla f(x)^\top [\mathcal P_\Omega(x - \lambda \nabla f(x)) - x] \\ &= \frac{1}{\lambda}(x - [x - \lambda \nabla f(x)])^\top [\mathcal P_\Omega(x - \lambda \nabla f(x)) - \mathcal P_\Omega(x)] \quad (\mathcal P_\Omega(x) = x \text{ since } x \in \Omega) \\ &\le -\frac{1}{\lambda}\|\mathcal P_\Omega(x - \lambda \nabla f(x)) - \mathcal P_\Omega(x)\|^2 \\ &= -\frac{1}{\lambda}\|d\|^2 \end{aligned}$$


Since descent is guaranteed in this way, we can reuse techniques like backtracking line search to generate step sizes.

We can use $\|\mathcal P_\Omega(x_k - \lambda_k \nabla f(x_k)) - x_k\| \le \varepsilon$ as a stopping criterion. At each step:

  • calculate $d_k = \mathcal P_\Omega(x_k - \lambda_k \nabla f(x_k)) - x_k$
  • if $\|d_k\| \le \lambda_k \varepsilon$, then STOP and output $x_k$
  • perform an Armijo line search to find $\alpha_k$
  • update to $x_{k+1} = x_k + \alpha_k d_k$
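The steps above can be sketched in a few lines. A minimal illustration (the helper `projected_gradient`, its parameter defaults, and the box-constrained test problem are my own choices, not from the notes):

```python
import numpy as np

# Projected gradient method with Armijo backtracking on alpha in (0, 1],
# shown on min f(x) = ||x - c||^2 over the box [0, 1]^2.
def projected_gradient(f, grad, project, x, lam=1.0, gamma=1e-4,
                       eps=1e-8, max_iter=200):
    for _ in range(max_iter):
        d = project(x - lam * grad(x)) - x       # descent direction
        if np.linalg.norm(d) <= eps:             # stopping criterion
            break
        alpha = 1.0
        while f(x + alpha * d) > f(x) + gamma * alpha * grad(x) @ d:
            alpha *= 0.5                         # Armijo backtracking
        x = x + alpha * d                        # feasible since alpha <= 1
    return x

c = np.array([2.0, -1.0])
f = lambda x: np.sum((x - c) ** 2)
grad = lambda x: 2 * (x - c)
project = lambda x: np.clip(x, 0.0, 1.0)
x_opt = projected_gradient(f, grad, project, np.array([0.5, 0.5]))
# The minimizer is the projection of c onto the box, i.e. (1, 0).
```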


Suppose $f$ is continuously differentiable, $\Omega$ is nonempty, convex, and closed, and the step sizes $(\lambda_k)_k$ are bounded: $$0 < \underline\lambda \le \lambda_k \le \overline\lambda, \quad \forall k$$ Then every accumulation point of $(x_k)_k$ is a stationary point.

If f\nabla f is additionally Lipschitz continuous, we obtain:

  • If $f$ is convex, we converge to global solutions of the problem.
  • If $\lambda_k \le \frac{2(1 - \gamma)}{L}$, then $\alpha_k = 1$ will be accepted as the step size!
  • If $f$ is (strongly) convex, then $(x_k)_k$ converges linearly to a global solution $x^\ast \in \Omega$.

Integer Programming

Theorem: LP Relaxation as a Bound for IP

  1. For a maximization problem, the optimal value of the LP relaxation provides an upper bound for the optimal value of the IP.
  2. For a minimization problem, the optimal value of the LP relaxation provides a lower bound for the optimal value of the IP.

The difference between the optimal values of the LP and the IP is called the integrality gap.

  • For maximization problems, the integrality gap is $v^\text{LP} - v^\text{IP}$
  • For minimization problems, the integrality gap is $v^\text{IP} - v^\text{LP}$

Maximization problem: suppose we solve the LP relaxation and obtain the optimal value $v^\text{LP}$ (with some fractional solution).

Then we round the solution to a feasible integer point and find that its objective value is $v^\text{Rounding}$.

Then the gap between the rounded solution and the optimal integer solution (whose objective value is $v^\text{IP}$) satisfies $$0 \le v^\text{IP} - v^\text{Rounding} \le v^\text{LP} - v^\text{Rounding}$$ That is, one can use the LP relaxation to bound how good a given feasible integer point is.

Theorem: The Ideal Case

If the optimal solution to the LP relaxation is integer, then the solution must be optimal to the IP problem.

Total Unimodularity


A matrix $A$ is said to be totally unimodular (TU) if the determinant of each square submatrix of $A$ is either $0$, $1$, or $-1$.

Theorem: Total Unimodularity and Integer Solutions

If the constraint matrix $A$ is totally unimodular and $b$ is an integer vector, then all basic feasible solutions are integral, and hence the LP relaxation has an optimal solution that is an integer solution.

One Sufficient Condition for TU

Let $A$ be an $m \times n$ matrix. Then the following conditions together are sufficient for $A$ to be totally unimodular:

  1. Every column of $A$ contains at most two non-zero entries;
  2. Every entry in $A$ is $0$, $+1$, or $-1$;
  3. The rows of $A$ can be partitioned into two disjoint sets $B$ and $C$ such that
    • If two non-zero entries in a column of $A$ have the same sign, then the row of one entry is in $B$, and the row of the other is in $C$;
    • If two non-zero entries in a column of $A$ have opposite signs, then the rows are both in $B$, or both in $C$.
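For small matrices, total unimodularity can be verified directly from the definition. A brute-force sketch (the helper `is_totally_unimodular` and the example matrices are my own; this enumeration is exponential and only feasible for tiny matrices):

```python
import itertools

import numpy as np

# Check the definition directly: every square submatrix of A must have
# determinant 0, 1, or -1.
def is_totally_unimodular(A):
    m, n = A.shape
    for k in range(1, min(m, n) + 1):
        for rows in itertools.combinations(range(m), k):
            for cols in itertools.combinations(range(n), k):
                det = int(round(np.linalg.det(A[np.ix_(rows, cols)])))
                if det not in (-1, 0, 1):
                    return False
    return True

# Constraint matrix of a 2x2 matching problem: every column has two 1s,
# one in the "row constraints" block and one in the "column constraints"
# block, so the sufficient condition above applies.
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]])
```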


The matching problem: $$\begin{aligned} \max \quad & \sum_{i=1}^n \sum_{j=1}^n v_{ij} x_{ij} \\ \text{s.t.} \quad & \sum_{i=1}^n x_{ij} = 1, \ \forall j \\ & \sum_{j=1}^n x_{ij} = 1, \ \forall i \\ & x_{ij} \in \{0, 1\} \end{aligned}$$ The constraint matrix is TU.

Branch and Bound Algorithm

We consider solving a maximization problem. The whole procedure is

  1. Solve the LP relaxation.

    • If the optimal solution is integral, then it is optimal to IP.
    • Otherwise go to step 2.
  2. If the optimal solution $x^\ast$ to the LP relaxation has a fractional component $x_i^\ast$, then branch the problem into the following two:

    • One with the added constraint $x_i \le \lfloor x_i^\ast \rfloor$
    • One with the added constraint $x_i \ge \lceil x_i^\ast \rceil$
  3. For each of the two subproblems, apply the same method recursively to get optimal solutions $y_1^\ast$ and $y_2^\ast$ with optimal values $v_1^\ast$ and $v_2^\ast$. Compare them to obtain the optimal solution.

With bounding:

  • Any LP relaxation solution can provide an upper bound for each node in the branching process.

  • Any feasible point to the IP can provide a lower bound for the entire problem.

When the optimal value of the LP relaxation of this branch is less than the current lower bound, then we can discard this branch.

  • No better solution can be obtained from further exploring this branch.


We usually prefer DFS, since it discovers more integer solutions at an early stage.


```python
import math

import cvxpy as cp

def branch_and_bound(obj, cons, vars, eps=1e-4):
    # Branch and bound for a maximization IP via its LP relaxations.
    current_opt = None
    current_sol = None

    def _branch_and_bound(_cons):
        nonlocal current_opt, current_sol
        prob = cp.Problem(obj, _cons)
        opt = prob.solve()
        # Prune infeasible nodes and nodes whose LP upper bound cannot
        # beat the best integer solution found so far.
        if prob.status in ('infeasible', 'infeasible_inaccurate'):
            return
        if current_opt is not None and opt <= current_opt:
            return
        # Look for a fractional coordinate to branch on.
        for i in range(len(vars)):
            v = float(vars[i].value)
            floor, ceil = math.floor(v), math.ceil(v)
            if ceil - v > eps and v - floor > eps:
                # Branch: either x_i <= floor(v) or x_i >= ceil(v).
                _branch_and_bound([*_cons, vars[i] <= floor])
                _branch_and_bound([*_cons, vars[i] >= ceil])
                return
        # All coordinates are integral: update the incumbent.
        current_opt = opt
        current_sol = [float(v.value) for v in vars]

    _branch_and_bound(cons)
    return current_opt, current_sol
```


There is no known polynomial-time algorithm for IP, but branch and bound is good enough to solve a wide range of IP problems.
