Optimization Learning Notes 5

book Public Date: 2020-07-08 05:11:21.020777

update Update Date: 2020-08-08 01:55:03.360573

From this note, we will start talk about non-linear programs. To get a first glance of NLPs, let us first check out how we can issue the Interior Point Method.

Interior Point Method

Interior Point Method takes care of the primal and the dual problem at the same time. Recall that we have the following constraints for LPs:

  • Primal Feasibility:

Ax=b,x0 Ax = b, x \ge 0

  • Dual Feasibility:

Ayc A^\top y \le c

  • Complementarity: xisi=xi(xiAy)=0 x_is_i = x_i(x_i-A^\top y) = 0

In the simplex method, we search along basic feasible solutions; which means we always maintain the primal feasibility. Since we are checking among the combinations of basis, at each iteration, we can simply define:

y=(AB1)cN y = (A_B^{-1})^\top c_N

such that the reduced costs are represented by:

cyA c^\top - y^\top A

Therefore, for each non-basic variable, the corresponding xix_i is zero; while for basic variable, the reduced costs are zero; this naturally yields the Complementarity Conditions.

While Complementarity Conditions hold during the whole process, there are cases when the dual feasibility is violated. However, notice that when we finally reach a stop condition, we will have the reduced costs greater or equal than zero:

cyA0 c^\top - y^\top A \ge 0

which means the dual feasibility is then achieved.

Proposition: Optimality Properties of the Simplex Method

During each iteration, the simplex method maintains primal feasibility and the complementarity conditions. It seeks a solution that is dual feasible.

It then strikes on us that what if we maintain the other two conditions during the process? In fact, this will result in the dual-simplex method and the interior point method.

Dual-Simplex Method

  • It maintains dual feasibility.
  • It maintains the complementarity conditions.
  • However, primal feasibility does not need to be satisfied during the iterations. It seeks for primal feasible BFS.
  • The tableau can be seen as a rotated variant of the primal one.

There are cases where using the dual simplex method can be more convenient:

  • If a dual BFS is available (but we don’t have a primal BFS).
  • We have mentioned a scenario in the discussion of the sensitivity analysis (when bb is changed by a large amount or a constraint is added)

The Interior Point Method

The interior point method maintains both primal feasibility and dual feasibility during its iterations and seeks for a pair of solutions that satisfy the complementarity conditions.

Looking at all constraints, a optimal pair satisfies:

  • Ax=0,x0Ax = 0, x \ge 0,
  • Ay+s=c,s0A ^\top y + s = c, s \ge 0
  • xisi=0,ix_i \cdot s_i = 0, \forall i

There is a set of nonlinear equations here, which makes it to solve. As we have mentioned, IPM do not maintain complementarity conditions during the process, this means we can relax the problem to

Looking at all constraints, a optimal pair satisfies:

  • Ax=0,x0Ax = 0, x \ge 0,
  • Ay+s=c,s0A ^\top y + s = c, s \ge 0
  • xisiμ,ix_i \cdot s_i \le \mu, \forall i

We can μ>0\mu \gt 0 the complementarity gap.

So the IPM goes like the following: If we have found a solution for a certain μ\mu, then it might be possible to find a solution for a smaller μ\mu. Then we keep decreasing μ\mu until we can find a solution of the LP. The essential step in the interior point method is to show that this is indeed doable – the approach is similar to Newton’s method.

Again, we need to resolve the issue of finding an initial basic solution in the interior point method. This can be resolved by solving an auxiliary problem (called the homogeneous self-dual problem).

Overall, the complexity of IPM is O(n3.5)O(n^{3.5})


Theorem: Quality of Solutions

The interior point method will always find the optimal solution with the maximum possible number of non-zeros.

  • If we want a high-rank solution (with the maximum possible non-zeros), then choose the interior point method.
  • If we want a low-rank solution (with small number of non-zeros), then choose the simplex method.
  • The optimal solution output of the simplex method may depend on the initial solution (as well as on the pivoting rules)

Non-Linear Programming

Mathematical Preparations

In Optimization Learning Notes 1, we have talked about the definition of optimizers (local/global minimizer/maximizer). So in non-linear programming, we are still going discuss about how we can find these optimizers for our problems.

In order to do so, we need to first introduce some math tools.

  • Gradient and First-Order Taylor Expansion: Suppose f(x)=f(x1,x2,...,xn)f(x) = f(x_1, x_2, ..., x_n) is continuous differentiable. Then we denote the gradient of ff by:

f(x)=(fxi)n \nabla f(x) = \left(\frac{\partial f}{\partial x_i}\right)_n And the First-Order Taylor Expansion yields: f(x+td)=f(x)+tf(x)d+o(t),t0 f(x + td) = f(x) + t \nabla f(x)^\top d + o(t), t \to 0

  • Hessian and Second-Order Taylor Expansion: If ff is twice continuously differentiable, then the Hessian of ff is given by:

2f(x)=(2fxixj)n×n \nabla^2f(x) = \left(\frac{\partial^2 f}{\partial x_i\partial x_j}\right)_{n \times n}

And the Second-Order Taylor Expansion yields:

f(x+td)=f(x)+tf(x)d+12t2d2f(x)d+o(t2),t0 f(x + td) = f(x) + t\nabla f(x)^\top d + \frac{1}{2}t^2 d^\top \nabla^2f(x)d + o(t^2), t \to 0

By Schwarz's theorem, we know that for twice continuously differentiable functions has the following property:

2fxixj=2fxjxi \frac{\partial^2 f}{\partial x_i\partial x_j} = \frac{\partial^2 f}{\partial x_j\partial x_i}

So the Hessian matrix is symmetric.

Optimality Conditions for Unconstrained Problems

W.L.O.G, we focus our discussion on minimizers.

To obtain a local minimizer, it is necessary to have:

f(x)=0 \nabla f(x) = 0

Otherwise, by First-Order Taylor Expansion,

f(x+td)=f(x)+tf(x)d+o(t),t0 f(x + td) = f(x) + t \nabla f(x)^\top d + o(t), t \to 0

We can find dd such that f(x)d<0\nabla f(x)^\top d \lt 0 and when tt is small enough, we can then make

f(x+td)<f(x) f(x + td) \lt f(x)

First-Order Necessary Conditions (FONC)

If xx^\ast is a local minimizer of the unconstrained problem minxRnf(x)\min{}_{x\in \mathbb R^n} f(x),then we must have f(x)=0\nabla f(x) = 0.

Example: Least Square Problem

In least square problem, we want to use a linear combination of xix_i to approximate a target value yy such that

yi=1nβixi y \approx \sum_{i = 1}^n \beta _i x_i

Given mm observations of (x,y)(x, y) (which are denoted by (xi,yi)(x_i, y_i)), we now want to find a suitable β\beta such that it solves that following problem:

minβi=1m(yij=1nβjxij)2 \min{}\beta \sum{i = 1}^m \left(y_i - \sum_{j = 1}^n \beta_j x_{ij}\right)^2

In the matrix form, we have:

minβXβy2=βXXβ2βXy+yy \begin{aligned} &\min{}_\beta |X\beta - y|^2 \ =&\beta^\top X^\top X \beta - 2 \beta^\top X^\top y + y^\top y \end{aligned}

So, the FONC for the problem is

XXβXy=0 X^\top X\beta - X^\top y = 0

FONC is not sufficient, so we need to add more constraints for make sure a point is our local minimizer. Another important condition is SONC.

By Taylor Expansion, we already have:

f(x+td)=f(x)+tf(x)d+12t2d2f(x)d+o(t2),t0 f(x + td) = f(x) + t\nabla f(x)^\top d + \frac{1}{2}t^2 d^\top \nabla^2f(x)d + o(t^2), t \to 0

Suppose FONC holds, then,

f(x+td)=f(x)+12t2d2f(x)d+o(t2),t0 f(x + td) = f(x) + \frac{1}{2}t^2 d^\top \nabla^2f(x)d + o(t^2), t \to 0

If xx is a local minimizer, then we have,

d2f(x)d0,d d^\top \nabla^2f(x)d \ge 0, \forall d

In other words, 2f(x)\nabla^2f(x) is positive semidefinite.

Second-Order Necessary Conditions (SONC)

If xx^\ast is a local minimizer of ff, then it holds that:

  1. f(x)=0\nabla f(x^\ast) = 0;
  2. For all dRnd \in \mathbb R^n, d2f(x)d0d^\top \nabla^2f(x)d \ge 0.


We call a (symmetric) matrix AA positive (negative) semidefinite (PSD/NSD) if and only if for all xx we have

xAx0(0) x^\top A x \ge 0 (\le 0)

  • we only talk about PSD for symmetric matrices.
  • For non-symmetric ones, we use 12(A+A)\frac{1}{2}(A^\top + A) to define PSD. (We always have xAx=12x(A+A)xx^\top A x = \frac{1}{2}x^\top(A^\top + A)x)
  • A symmetric matrix is PSD, it all of its eigenvalues are nonnegative
  • A symmetric matrix is PSD, it all of its leading principles are nonnegative.
  • For any matrix AA, AAA^\top A is always symmetric and PSD.

Definition: Stationary Points and Saddle Points

  • A point xx satisfying f(x)=0\nabla f(x) = 0 is called critical point or stationary point.
  • A stationary point is called saddle point if it is neither a local minimizer nor a local maximizer.

Corollary: Saddle Points

Suppose that xx^\ast is a stationary point (f(x)=0\nabla f(x^\ast) = 0) and that the Hessian 2f(x)\nabla^2 f(x^\ast) is indefinite, then xx^\ast is a saddle point.

SONC is still not sufficient, but it can be modified into a sufficient condition

Second-Order Sufficient Conditions (SOSC)

Let ff be twice continuously differentiable. If xx^\ast satisfies:

  1. f(x)=0\nabla f(x^\ast) = 0;
  2. For all dRn{0}d \in \mathbb R^n - { 0} , d2f(x)d>0d^\top \nabla^2f(x)d \gt 0.

Then xx^\ast is a strict local minimizer.


We call a (symmetric) matrix AA positive (negative) definite (PD/ND) if and only if for all xx we have

xAx>0(<0) x^\top A x \gt 0 (\lt 0)

PD implies PSD and correspondingly, PD also has similar properties as PSD, but in a strict version.

Remark: the above conditions can be transformed into maximization conditions if we change PSD to NSD and PD to ND.


To proof SOSC, we first need to proof a lemma:

Lemma: Bounds and Eigenvalues

Let ARn×nA \in \mathbb R^{n\times n} be a symmetric matrix. Then λmin(A)x2xAxλmax(A)x2,xRn \lambda_{\min_{}}(A)|x|^2 \le x^\top A x \le \lambda_{\max_{}}(A)|x|^2, \forall x \in \mathbb R^n where λmin(A),λmax(A)\lambda_{\min_{}}(A), \lambda_{\max_{}}(A) are the smallest and the largest one among all eigenvalues of AA.

Proof of the Lemma

Remember the eigenvalue-decomposition (EVD) of the symmetric matrix AA is written as: A=QΛQ=i=1nλqiqi A = Q\Lambda Q^\top = \sum_{i = 1}^n \lambda q_iq_i^\top where Q=[qi]nRn×n Q = [q_i]_n \in \mathbb R^{n \times n} are nn orthonormal eigenvectors qiq_i of AA, and Λ=diag(λ1,...,λn) \Lambda = \text{diag}\left(\lambda_1, ..., \lambda_n\right) is a diagonal matrix with these eigenvalues.

Then, we can proof the lemma. xRn\forall x \in \mathbb R^n, we have xAx=xQΛQx=x~Λx~   (x~=Qx)=i=1nλix~i2 \begin{aligned} x^\top A x &= x^\top Q \Lambda Q^\top x \ &= \tilde x^\top \Lambda \tilde x ~~~(\tilde x = Q^\top x) \ &= \sum_{i = 1}^n \lambda_i \tilde x^2_i \ \end{aligned} It is easy to see: i=1nλminx~i2i=1nλix~i2i=1nλmaxx~i2 \sum_{i = 1}^n \lambda_{\min_{}} \tilde x^2_i \le \sum_{i = 1}^n \lambda_i \tilde x^2_i \le \sum_{i = 1}^n \lambda_{\max_{}} \tilde x^2_i Since QQ is orthonormal, Qx2=x2 |Q^\top x| ^2 = |x|^2 Hence, we get λmin(A)x2xAxλmax(A)x2,xRn \lambda_{\min_{}}(A)|x|^2 \le x^\top A x \le \lambda_{\max_{}}(A)|x|^2, \forall x \in \mathbb R^n

Proof of SOSC

By Taylor Expansion, f(x+h)=f(x)+f(x)h+12h2f(x)h+o(h2),h0 f(x + h) = f(x) + \nabla f(x) ^\top h + \frac{1}{2}h^\top \nabla^2f(x) h + o(|h|^2), h \to 0 With μ=λmin(2f(x))\mu = \lambda_{\min_{}}(\nabla^2 f(x^\ast)) and f(x)=0\nabla f(x^\ast) = 0 f(x+d)=f(x)+f(x)d+12d2f(x)d+o(d)2f(x)+d2[μ2+o(d2)d2] \begin{aligned} f(x^\ast + d) &= f(x^\ast) + \nabla f(x^\ast)^\top d + \frac{1}{2}d^\top \nabla^2f(x^\ast)d + o(|d|)^2 \ & \ge f(x^\ast) + |d|^2 \left[\frac{\mu}{2} + \frac{o(|d|^2)}{|d|^2}\right]
when d2|d|^2 is small enough.

By the definition by little oo notation, limd0o(d2)d2=0 \lim_{d \to 0}\frac{o(|d|^2)}{|d|^2} = 0 Therefore, for all dd close enough to 00, we have o(d2)d2μ4 \frac{o(|d|^2)}{|d|^2} \ge -\frac{\mu}{4} Moreover, since SOSC requires the Hessian to be PD, then μ0\mu \ge 0, which yields f(x+d)f(x)+μ4d2>f(x) f(x^\ast + d) \ge f(x^\ast) + \frac{\mu}{4} |d|^2 \gt f(x^\ast) for dd sufficiently small. This indicates that xx^\ast is truly a local minimum.

Existence of Solutions

Weierstraβ\beta Theorem

When we talked about Linear Programs, we never consider the cases where feasible region of the problem is bounded but not finite optimal. Actually, there is a theorem protecting us from being bothered by such situations:

Let f:RnRf: \mathbb R^n \to \mathbb R be a continuous function and let ΩRn\Omega \subset \mathbb R^n be a bounded, closed and nonempty set. Then ff attains a global maximum and global minimum on Ω\Omega,

Here, we also clarify some definitions:


The set Ω\Omega is closed if for every convergent sequence {xk}{x_k} with xkΩx^k \in \Omega for all kk and limkxk=x\lim_{k \to \infty} x^k = x, it holds that xΩx \in \Omega.

Closeness is a topological property, that is if you have a continuous injective function mapping a closed set Ω\Omega to another set Ψ\Psi, then Ψ\Psi must be also closed.


Ω\Omega is bounded if there is B>0B \gt 0 such that xb,xΩ|x| \le b, \forall x \in \Omega.


A closed and bounded set is also called compact.


For unconstrained problems, however, we cannot directly use Weierstraβ\beta Theorem because the feasible region may not be bounded. Fortunately, there is another tool, Coercivity that can help us in unconstrained cases.


A continuous function f:RnRf: \mathbb R^n \to \mathbb R is said to be coercive if

limxf(x)=+ \lim_{|x| \to \infty} f(x) = +\infty

  • Geometrically, coercivity means that f(x)f(x) increases as xx moves away from the origin in any possible direction.

  • Mathematically, coercivity means

B>0,r>0,s.t.x>rf(x)>B \forall B \gt 0, \exists r \gt 0, s.t.|x| \gt r \Longrightarrow f(x) \gt B

It turns out that coercivity guarantees the existence of solutions:

Theorem: Coercivity and Existence of Solutions

Let f:RnRf : \mathbb R^n \to \mathbb R be a continuous and coercive function. Then for all α>0\alpha \gt 0, the level set

Lα:={xRn:f(x)α} L_{\le \alpha} := { x \in \mathbb R^n : f(x) \le \alpha}

is compact and ff has at least one global minimizer.


We already know ff is continuous, LαL_{\le \alpha} is closed (closeness preserves under continuous mapping).

Assume that the level set is not bounded. Then,

(xk)α,xkLα,kN,s.t.xk(k) \exists (x_k)\alpha, x_k \in L{\le \alpha}, k \in \mathbb N, s.t. |x_k| \to \infty (k \to \infty)

Since ff is coercive, we know

limkf(xk)=+ \lim_{k \to \infty} f(x_k) = +\infty

However, this is a contradiction to xkLαx_k \in L_{\le \alpha}. So the level set of a coercive function is compact.

Optimality Conditions for Constrained Problems

Feasible Directions

The key difference between Unconstrained Problems and Constrained Problems is that the possible moving directions can be limited for the points of the latter one. This brings more difficulties for finding the solution.

Given xΩx \in \Omega, we call vector dd a feasible direction at xx if there exists t>0\overline t \gt 0 such that x+tdΩx + td \in \Omega for all 0tt0 \le t \le \overline t.


Let xx^* be a local minimum of minxΩf(x)\min{}_{x \in \Omega} f(x). Then for any feasible direction dd at xx^\ast, we must have f(x)d0\nabla f(x^\ast)^\top d \ge 0.

Descent Directions

Let ff be continuous differentiable. Then dd is called a descent direction at xx if and only if f(x)d<0\nabla f(x)^\top d\lt0.

If we denote the set of feasible directions at xx by SΩ(x)S_\Omega(x) and the set of descent directions at xx by SD(x)S_D(x), then the first order necessary condition can be written as

SΩ(x)SD(x)= S_\Omega(x^\ast) \cap S_D(x^\ast) = \emptyset

General Optimality Conditions

From now on, let us consider the general form of the nonlinear programs:

minxRn  f(x)s.t.  gi(x)0,i=1,,m,hi(x)=0,j=1,,p \begin{aligned} \min{}_{x \in \mathbb R^n}~~& f(x) \ s.t.~~&g_i(x) \le 0, \forall i = 1,\dots,m, \ &h_i(x) = 0, \forall j = 1,\dots,p \end{aligned}

The feasible set is Ω={xRn:g(x)0,h(x)=0}\Omega = {x \in \mathbb R^n : g(x) \le 0, h(x) = 0}.

Definition: Active and Inactive Set

At a point xΩx \in \Omega, the set A(x):={gi(x)=0}\mathcal A(x) := {g_i(x) = 0} denotes the set of active constraints. The set of inactive constraints is given by I(x):={i:gi(x)<0}\mathcal I(x) := {i : g_i(x) < 0}.

Linear Constraints

Let us start our journey with the linearly constrained problems.

For the model:

minxf(x),s.t.Axb                     () \min{}_x f(x), s.t. Ax \ge b ~~~~~~~~~~~~~~~~~~~~~(\star)

The FONC can be expressed as

If xx^\ast is a local minimum of ()(\star), then

y0,s.t.f(x)Ay=0yi(aixbi)=0,i \begin{aligned} &\exists y \ge 0, s.t. \ & \nabla f(x^\ast) - A^\top y = 0\ &y_i(a_i^\top x^\ast - b_i) = 0, \forall i \end{aligned}


Now we show this actually guarantees that f(x)d0\nabla f(x^\ast)^\top d \ge 0.

This is equivalent to

[mindRnf(x)d,s.t.dSΩ(x)]0 [\min{}{d \in \mathbb R^n} \nabla f(x^\ast) ^\top d, s.t. d \in S{\Omega}(x^\ast)] \ge 0

Since inactive indices provide no limit to the moving direction, we get

[mindRnf(x)d,s.t.aid0,iA(x)]0 [\min{}_{d \in \mathbb R^n} \nabla f(x^\ast) ^\top d, s.t. a_i^\top d \ge 0 , \forall i \in \mathcal A(x^\ast)] \ge 0


C=[cicm],ci={ai,iA(x)0,iI(x) C = \begin{bmatrix} c_i^\top \ \vdots \ c_m^\top \end{bmatrix}, c_i^\top = \begin{cases}a_i^\top, i \in \mathcal A(x^\ast) \ 0, i \in \mathcal I(x^\ast) \end{cases}

Rewrite the condition as

[mindRnf(x)d,s.t.Cd0]0 [\min{}_{d \in \mathbb R^n} \nabla f(x^\ast) ^\top d, s.t. Cd \ge 0] \ge 0 This is a linear program and the dual is maxy0,s.t.y0,Cy=f(x) \max{}_y 0, s.t. y \ge 0, C^\top y = \nabla f(x^\ast)

The primal problem is bounded and feasible, so according to the strong duality, the dual is also feasible.

Consequently, we can find

y0,f(x)=Cy=iA(x)aiyi+iI(x)ciyi y \ge 0, \nabla f(x^\ast) = C^\top y = \sum_{i \in \mathcal A(x^\ast)}a_iy_i + \sum_{ i \in \mathcal I(x^\ast)} c_iy_i

This is exactly:

y0,s.t.f(x)Ay=0yi(aixbi)=0,i \begin{aligned} &\exists y \ge 0, s.t. \ &\nabla f(x^\ast) - A^\top y = 0\ &y_i(a_i^\top x^\ast - b_i) = 0, \forall i \end{aligned}

The last constraints enforces yi=0,iI(x)y_i = 0, \forall i \in \mathcal I(x^\ast)

So if the optimization model changes to:

minxf(x),s.t.Ax=b \min{}_x f(x), s.t. Ax = b

Then there are no inactive constraints anymore, so the FONC changes to

yRm,f(x)=Ay \exists y \in \mathbb R^m, \nabla f(x^\ast) = A^\top y

General Inequality Constraints

For general cases, consider the following nonlinear program:

minxRn  f(x)s.t.  gi(x)0,i=1,,m \begin{aligned} \min{}_{x \in \mathbb R^n} &~~ f(x) \ s.t. &~~ g_i(x) \le 0, \forall i = 1,\dots, m \end{aligned}

Assumption: ff, gig_i are continuous differentiable.

When f(x)f(x) reaches a local minimum at xx^\ast, we have the following lemma:

Lemma: No Feasible Descent for Inequality Constraints

Let xx^\ast be a local minimum of the above problem. Then, there does not exist a vector dRnd \in \mathbb R^n s.t.

f(x)d<0gi(x)d<0,iA(x) \nabla f(x^\ast) d < 0 \wedge \nabla g_i(x^\ast)^\top d \lt 0, \forall i \in \mathcal A(x^\ast)


Suppose we have found a dd such that

f(x)d<0gi(x)d<0,iA(x) \nabla f(x^\ast) d < 0 \wedge \nabla g_i(x^\ast)^\top d \lt 0, \forall i \in \mathcal A(x^\ast)

Then, by Taylor Expansion, we can find ϵ1,ϵ2\epsilon_1, \epsilon_2 s.t.

f(x+td)<f(x)gi(x+td)<gi(x)=0,iA(x) f(x^\ast + td) < f(x^\ast) \wedge g_i(x^\ast + td) \lt g_i(x^\ast) = 0, \forall i \in \mathcal A(x^\ast)

t(0,ϵ1)\forall t \in (0, \epsilon_1) and mean while,

iI(x),gi(x)<0,t[0,ϵ2) \forall i \in \mathcal I(x^\ast), g_i(x^\ast) < 0, \forall t \in [0, \epsilon_2)

So we can see that,

x+td,t[0,min(ϵ1,ϵ2)) x^\ast + td, \forall t\in [0, \min{}(\epsilon_1, \epsilon_2))

is even optimal, which is a contradiction.

Theorem: Fritz-John Conditions

Further extend the above idea, we then come across Fritz-John Conditions:

Let xx^\ast be a local minimum of the above problem. Then there exists λ0,λ1,...,λm0\lambda_0, \lambda_1, ..., \lambda _m \ge 0, which are not all zeros, such that

λ0f(x)+i=1mλigi(x)=0λigi(x)=0,i=1,,m. \begin{aligned} \lambda_0\nabla f(x^\ast) + \sum_{i = 1}^m \lambda_i\nabla g_i(x^\ast) &= 0 \ \lambda_ig_i(x^\ast) &= 0, \forall i = 1, \dots, m. \end{aligned}


We prove this theorem by the last lemma, we cannot find such dd that:

f(x)d<0gi(x)d<0,iA(x) \nabla f(x^\ast) d < 0 \wedge \nabla g_i(x^\ast)^\top d \lt 0, \forall i \in \mathcal A(x^\ast)

Consider the problem

mind,tt  s.t.  f(x)dtgi(x)dt,iA(x) \begin{aligned} \min{}_{d, t} t ~~s.t. ~~& \nabla f(x^\ast)^\top d \le t \ & \nabla g_i(x^\ast)^\top d \le t, \forall i \in \mathcal A(x^\ast) \end{aligned}

the optimal solution must be non-negative.

Again, we define

C=[c1cm] C = \begin{bmatrix}c_1^\top \ \vdots \ c_m^\top \end{bmatrix}


ci={gi(x),iA(x)0,iI(x) c_i^\top = \begin{cases} \nabla g_i(x^\ast)^\top, &\forall i \in \mathcal A(x^\ast) \ 0, &\forall i \in \mathcal I(x^\ast) \end{cases}

Rewrite the problem with CC, we get

mind,tt  s.t.  f(x)dt0Cdt10 \begin{aligned} \min{}_{d, t} t ~~s.t. ~~& \nabla f(x^\ast)^\top d - t \le 0 \ & Cd - t\mathbf 1 \le 0 \end{aligned}

The dual problem is a feasibility problem:

maxλ~0,λ~0  s.t.  λ~i0λ~0f(x)+Cλ~=0λ~01λ~=1 \begin{aligned} \max{}_{\tilde\lambda_0, \tilde \lambda} 0 ~~s.t.~~& \tilde \lambda_i \le 0 \ & \tilde \lambda_0 \nabla f(x^\ast) + C^\top \tilde \lambda = 0 \ & -\tilde \lambda_0 - \mathbf 1^\top \tilde \lambda = 1 \end{aligned}

Since strong duality holds, the problem is feasible.

So we can find λ0,λ1,...,λm0\lambda_0, \lambda_1, ..., \lambda _m \ge 0,

λ0f(x)+i=1mλigi(x)=0λigi(x)=0,i=1,,m. \begin{aligned} \lambda_0\nabla f(x^\ast) + \sum_{i = 1}^m \lambda_i\nabla g_i(x^\ast) &= 0 \ \lambda_ig_i(x^\ast) &= 0, \forall i = 1, \dots, m. \end{aligned}

(rescale λ~i-\tilde \lambda_i to obtain λigi(x)=0,i\lambda_ig_i(x^\ast) = 0, \forall i; and λ~0+1λ~>1\tilde \lambda_0 + \mathbf 1^\top \tilde \lambda \gt 1 guarantees that not all λ0,λ1,,λm\lambda_0, \lambda_1, \dots, \lambda_m are zero.)

However, Fritz-John Conditions are not very useful. λ0=0\lambda_0 = 0 can happen and in that case, we need to check all possible combinations that makes the set of

gi(x),iA(x) \nabla g_i(x^\ast), \forall i \in \mathcal A(x^\ast)

linearly dependent.

And, without the constraints from ff, many of the combinations may not be minimizer.

KKT-Conditions for Inequality Constraints

KKT-Conditions use an assumption to overcome the major drawbacks of Fritz-John Conditions.

Let xx^\ast be a local minimum and suppose the set of

gi(x),iA(x) \nabla g_i(x^\ast), \forall i \in \mathcal A(x^\ast)

are linearly independent. Then,

Then there exists λ1,...,λm0\lambda_1, ..., \lambda _m \ge 0, which are not all zeros, such that

f(x)+i=1mλigi(x)=0λigi(x)=0,i=1,,m. \begin{aligned} \nabla f(x^\ast) + \sum_{i = 1}^m \lambda_i\nabla g_i(x^\ast) &= 0 \ \lambda_ig_i(x^\ast) &= 0, \forall i = 1, \dots, m. \end{aligned}

The proof is trivial. We can find required by Fritz-John Conditions,

λ0,λ1,...,λm0 \lambda_0, \lambda_1, ..., \lambda _m \ge 0


gi(x),iA(x) \nabla g_i(x^\ast), \forall i \in \mathcal A(x^\ast)

are linearly independent, λ00\lambda_0 \ne 0. Then we can divide all λi\lambda_i by λ0\lambda_0 and obtain the required lambdas for KKT Conditions.

KKT Conditions: The General Setting

Now, it is the high time for General Nonlinear Optimization Problem

minxRn  f(x)s.t.  gi(x)0,i=1,,m  hj(x)=0,j=1,,p \begin{aligned} \min_{x \in \mathbb R^n} &~~ f(x) \ s.t. &~~ g_i(x) \le 0, \forall i = 1,\dots, m \ &~~ h_j(x) = 0, \forall j = 1, \dots, p \end{aligned} To begin with, we associate each constraint with a Lagrangian multiplier: gi(x)0λihi(x)0μi g_i(x) \le 0 \leadsto \lambda_i \ h_i(x) \le 0 \leadsto \mu_i We define the Lagrangian of this problem by: L(x,λ,μ)=f(x)+i=1mλigi(x)+j=1pμjhj(x) L(x, \lambda, \mu) = f(x) + \sum_{i = 1}^m \lambda_ig_i(x) + \sum_{j = 1}^p\mu_jh_j(x) Until now you may still feel confused about what can we do with the Lagrangian. However, if you look at the form carefully, you may have grabbed a similar feeling as we are deriving a dual form. Indeed, here we require λi0,μj free \lambda_i \ge 0, \mu_j \text{ free} These conditions are exactly called dual feasibility.

And we also have the complementarity conditions: λigi(x)=0,iμjhi(x)=0,j  (omitted) \begin{aligned} \lambda_ig_i(x) = 0,& \forall i \ \mu_jh_i(x) = 0,& \forall j ~~(\text{omitted}) \end{aligned} Our main condition comes from the Lagrangian (We move all the constraints to the objective and interpret them as reward/penalty functions, so the following condition does make sense): xL=0 \nabla_x L = 0 To sum up, the formal definition of KKT conditions is:

Theorem: KKT Condition

If xx is a local minimizer and if a regularity condition holds, then there exist λ\lambda and μ such that:

  1. Main Condition f(x)+i=1mλigi(x)+j=1pμjhj(x)=0 \nabla f(x) + \sum_{i = 1}^m \lambda_i\nabla g_i(x) + \sum_{j = 1}^p\mu_j\nabla h_j(x) = 0

  2. Dual Feasibility λi0,i{1,,m} \lambda_i \ge 0, \forall i \in {1, \dots, m}

  3. Complementarity λigi(x)=0,i{1,,m} \lambda_ig_i(x) = 0, \forall i\in {1, \dots, m}

  4. Primal Feasibility (extended) gi(x)0,hj(x)=0,i,j g_i(x) \le 0, h_j(x) = 0, \forall i, j

Constraint Qualifications

Again, we require the collection of gradients {gi(x):iA(x)}{hj(x):j=1,,p} {\nabla g_i(x^\ast) : i \in \mathcal A(x^\ast)} \cup{\nabla h_j(x^\ast): j = 1, \dots, p} to be linearly independent (fully-ranked).

  • This condition is a constraint qualification (CQ) and is called Linear Independence Constraint Qualification (LICQ).
  • A feasible point xx^\ast satisfying the LICQ is called regular.
  • There are more CQs: ACQ, GCQ, MFCQ, PLICQ, Slater’s condition. See wikipedia.


  • A (feasible) point satisfying the KKT conditions is called a KKT point
  • KKT points are candidates for local optimal solutions – just like stationary points.

Second-Order Optimality Conditions for Constrained Problems

KKT conditions are necessary first-order optimality conditions. KKT-points are potential candidates for local minimizer. We still need to investigate Second-Order Optimality Conditions to check whether a point is truly a local minimizer or not.

We assume that ff, gig_i,and hjh_j are twice continuous differentiable. Therefore, we get calculate the Hessian of our Lagrangian: xx2L(x,λ,μ)=2f(x)+i=1mλi2gi(x)+j=1pμj2hj(x) \nabla_{xx}^2L(x, \lambda, \mu) = \nabla^2 f(x) + \sum_{i = 1}^m \lambda_i\nabla^2 g_i(x) + \sum_{j = 1}^p\mu_j\nabla^2 h_j(x) We define the so-called critical cone: C(x):={dRn:f(x)d=0,gid0,iA(x),hj(x)d=0,j} \begin{aligned} \mathcal C(x) := { d \in \mathbb R^ n :& \nabla f(x)^\top d = 0, \nabla g_i^\top d \le 0, \forall i \in \mathcal A(x), \ & \nabla h_j(x)^\top d = 0, \forall j} \end{aligned} Then, the second-order necessary conditions take the following form:

Theorem: SONC for Constrained Problems

Let xx^\ast be a regular point and local min. Then, the KKT-conditions hold and there are unique multiplier λRm\lambda \in \mathbb R^m and μRp\mu \in \mathbb R^p such that f(x)+g(x)λ+h(x)μ=0g(x)0,h(x)=0,λ0,λigi(x)=0,i \nabla f(x^\ast) + \nabla g(x^\ast) \cdot \lambda + \nabla h(x^\ast) \cdot \mu = 0 \ g(x^\ast) \le 0, h(x^\ast) = 0, \lambda \ge 0, \lambda_ig_i(x^\ast) = 0 , \forall i and we have dxx2L(x,λ,μ)d0,dC(x) d^\top \nabla^2_{xx}L(x^\ast, \lambda, \mu) d \ge 0, \forall d \in \mathcal C(x^\ast)


The uniqueness of λ\lambda and μ\mu in the SONC follows from the LICQ. (This can be helpful in calculations). There do exist variants of SONC withouf LICQ, but normally they are hard to use and form is unfriendly.

Theorem: SOSC for Constrained Problems

Let xx^\ast be a KKT-point with multiplier λRm\lambda \in \mathbb R^m and μRp\mu \in \mathbb R^p such that f(x)+g(x)λ+h(x)μ=0g(x)0,h(x)=0,λ0,λigi(x)=0,i \nabla f(x^\ast) + \nabla g(x^\ast) \cdot \lambda + \nabla h(x^\ast) \cdot \mu = 0 \ g(x^\ast) \le 0, h(x^\ast) = 0, \lambda \ge 0, \lambda_ig_i(x^\ast) = 0 , \forall i and suppose that the condition dxx2L(x,λ,μ)d>0,dC(x){0} d^\top \nabla^2_{xx}L(x^\ast, \lambda, \mu) d \gt 0, \forall d \in \mathcal C(x^\ast) - {0} is satisfied. Then,xx^\ast is a strict local minimizer.


Unlike our SONC, SOSC does not require the uniqueness of multipliers here. If there are multiple multipliers then as long as there exists such multiplier that fulfills SOSC,

then, xx^\ast is already a strict local minimizer.


Suppose that xx^\ast is not a strict local minimizer. Then, there exists a sequence (xk)(x_k) within Ω\Omega, such that limkxk=x\lim_{k \to \infty}x_k = x^\ast and f(xk)f(x),kN     () f(x^k) \le f(x^\ast), \forall k \in \mathbb N ~~~~~(\star) Define tk:=xkxt_k := | x^k - x^\ast| and dk:=1tk(xkx)d_k := \frac{1}{t_k}(x_k - x^\ast). Then, we have dk=1|d_k| = 1. Therefore, (dk)(d_k) is bounded. Every bounded sequence has a convergent subsequence, so we can find KN\mathcal K \subseteq \mathbb N with dRnd \in \mathbb R^n such that limk,kKdk=d \lim_{k \to \infty, k \in \mathcal K} d_k = d Obviously, d=1|d| = 1. Then by Taylor Expansion, we get f(xk)=f(x+xkx)=f(x+tkdk)=f(x)+tkf(x)dk+o(tk) \begin{aligned} f(x_k) &= f(x^\ast + x_k - x^\ast)\ &= f(x^\ast + t_k d_k)\ &= f(x^\ast) + t_k\nabla f(x^\ast)^\top d_k + o(t_k) \end{aligned} when k,kKk \to \infty, k \in \mathcal K, by ()(\star) we know  tkf(x)dk+o(tk)0f(x)dk+o(tk)tk0f(x)d0            () \begin{aligned} ~&t_k\nabla f(x^\ast)^\top d_k + o(t_k) \le 0 \ \Rightarrow & \nabla f(x^\ast)^\top d_k + \frac{o(t_k)}{t_k} \le 0 \ \Rightarrow & \nabla f(x^\ast)^\top d \le 0 ~~~~~~~~~~~~(\diamondsuit) \end{aligned} Similarly, iA(x)\forall i \in \mathcal A(x^\ast), we have gi(xk)=gi(x)+tkgi(x)dk+o(tk) g_i(x_k) = g_i(x^\ast) + t_k\nabla g_i(x^\ast)^\top d_k + o(t_k) Since gi(x)0g_i(x) \le 0, we then get gi(x)dk=1tkgi(xk)+o(tk)tko(tk)tk \nabla g_i(x^\ast)^\top d_k = \frac{1}{t_k}g_i(x_k) + \frac{o(t_k)}{t_k} \le \frac{o(t_k)}{t_k} As k,kKk \to \infty, k \in \mathcal K, gi(x)dk0,iA(x) \nabla g_i(x^\ast)^\top d_k \le 0, \forall i \in \mathcal A(x^\ast) With similar tricks, for those equality constraints: hj(xk)=hj(x)+tkhj(x)dk+o(tk) h_j(x_k) = h_j(x^\ast) + t_k\nabla h_j(x^\ast)^\top d_k + o(t_k) So, as k,kKk \to \infty, k \in \mathcal K, hj(x)dk=0,j{1,,p} \nabla h_j(x^\ast)^\top d_k = 0, \forall j \in {1,\dots,p} Since xx^\ast is a KKT point, we have f(x)d=iA(x)λigi(x)d+j=1pμjhj(x)d0 -\nabla f(x^\ast)^\top d = \sum_{i \in \mathcal A(x^\ast)} \lambda_i\nabla g_i(x^\ast)^\top d + \sum_{j = 1}^p \mu_j\nabla h_j(x^\ast) ^\top d \le 0 With ()(\diamondsuit), we know f(x)d=0 \nabla f(x^\ast)^\top d = 0 which means dC(x){0} d \in \mathcal C(x^\ast) - {0} By ()(\star), L(xk,λ,μ)=f(xk)+i=1mλigi(xk)+j=1pμjhj(xk)f(xk)f(x)=f(x)+i=1mλigi(x)+j=1pμjhj(x)=L(x,λ,μ)        () \begin{aligned} L(x_k,\lambda, \mu) &= f(x_k) + \sum_{i = 1}^m \lambda_ig_i(x_k) + \sum_{j = 1}^p \mu_jh_j(x_k) \ & \le f(x_k) \ & \le f(x^\ast) \ & = f(x^\ast) + \sum_{i = 1}^m \lambda_ig_i(x^\ast) + \sum_{j = 1}^p\mu_jh_j(x^\ast) \ & = L(x^\ast, \lambda, \mu) \end{aligned} ~~~~~~~~(\circledast) By Second Order Taylor Expansion, L(xk,λ,μ)=L(x+tkdk,λ,μ)=L(x,λ,μ)+tkxL(x,λ,μ)dk+tk22dkxx2L(x,λ,μ)dk+o(tk2) \begin{aligned} L(x_k, \lambda, \mu) &= L(x^\ast + t_kd_k, \lambda, \mu) \ &= L(x^\ast, \lambda, \mu) + t_k\nabla_xL(x^\ast, \lambda, \mu)^\top d_k + \frac{t_k^2}{2} d_k^\top \nabla_{xx}^2L(x^\ast, \lambda, \mu )d^k + o(t_k^2) \end{aligned} Notice that xL(x,λ,μ)=0\nabla_xL(x^\ast, \lambda, \mu)^\top = 0. With ()(\circledast), tk22dkxx2L(x,λ,μ)dk+o(tk2)0 \frac{t_k^2}{2} d_k^\top \nabla_{xx}^2L(x^\ast, \lambda, \mu )d^k + o(t_k^2) \le 0 As k,kKk \to \infty, k \in \mathcal K, dxx2L(x,λ,μ)d0 d^\top \nabla_{xx}^2L(x^\ast, \lambda, \mu)d \le 0 which leads to a contradiction. \blacksquare

Solving Nonlinear Programs: Strategy

General Strategy

  • Check LICQ (if required).
  • Derive KKT-conditions.
  • Discuss different easy cases via the complementarity conditions (set multiplier or constraints to 00) to find all KKT-points.
  • Calculate C(x)\mathcal C(x) and xx2L(x,λ,μ)\nabla_{xx}^2L(x^\ast, \lambda, \mu) at KKT-points.
  • Check second-order conditions

Additional Information

  • Check if ff is coercive or if Ω\Omega is bounded. If so, the problem has global solutions (which must be KKT-points)!
  • If the LICQ holds, then λ\lambda and μ\mu are always unique!
  • Finding maximizer: apply all steps to f-f.

Visualization and Interpretation of the KKT Conditions


At the KKT-point, we have f(x)=iA(x)λigi(x)+j=1pμjhj(x) -\nabla f(x^\ast) = \sum_{i \in \mathcal A(x^\ast)} \lambda_i\nabla g_i(x^\ast) + \sum_{j = 1}^p \mu_j\nabla h_j(x^\ast) This means f(x)-\nabla f(x^\ast) is the linear combination of gi(x)\nabla g_i(x^\ast) and hj(x)\nabla h_j(x^\ast). One can visualize the gradient using this property.

The Dual Perspective

The KKT-conditions imply that the LP maxλ0,μ 0s.t. f(x)+i=1mλigi(x)+j=1pμjhj(x)=0 \begin{aligned} \max_{\lambda \ge 0, \mu} ~&0\ s.t.~& \nabla f(x) + \sum_{i = 1}^m \lambda_i\nabla g_i(x) + \sum_{j = 1}^p\mu_j\nabla h_j(x) = 0 \end{aligned} is feasible.

The dual problem is mind  f(x)ds.t.  gi(x)d0,iA(x),h(x)d=0 \begin{aligned} \min_d ~~& -\nabla f(x^\ast)^\top d \ s.t. ~~& \nabla g_i(x^\ast) d \ge 0, \forall \mathcal i \in \mathcal A(x^\ast), \nabla h(x^\ast)^\top d = 0 \end{aligned} By strong duality, the optimal value is 00.

Define, T(x):={d:gi(x)d0,iA(x),h(x)d=0} T_{\ell}(x^\ast) := {d: \nabla g_i(x^\ast) d \le 0, \forall \mathcal i \in \mathcal A(x^\ast), \nabla h(x^\ast)^\top d = 0 } So the KKT-conditions is equivalent to f(x)d0,dT(x) \nabla f(x^\ast) d \ge 0, \forall d \in T_{\ell}(x^\ast) The set T(x)T_{\ell}(x^\ast) is called linearized tangent set.


  • Lecture Notes 11-14 for MAT3007 (Optimization) by Dr. Andre Milzarek. Institute for Data and Decision Analytics, The Chinese University of Hong Kong, Shenzhen.
  • Introduction to Nonlinear Optimization. Theory, Algorithms, and Applications with MATLAB, MOS-SIAM Series Optimization (2014), by Beck.
Raw Content

Submit New Comment

question_answer Comments

No Comment Yet~