## Lec #24: SDPs I

• Nicholas asked how, in the SoS example where we wanted to write
the polynomial

$\displaystyle P(x) = 2x_1^4 + 2x_1^3x_2 - x_1^2x_2^2 + 5x_2^4$

as a sum of square polynomials, we could quickly restrict ourselves to
${(x_1^2, x_2^2, x_1x_2)}$.

Here’s one way. Of course, in general we have to consider the vector ${\bar{X} = (1, x_1, x_2, x_1^2, x_2^2, x_1x_2)^\intercal}$, since the squares of these monomials span ${P(x)}$. This means we must consider a much larger (i.e., ${6 \times 6}$) matrix ${Q}$, and try to write ${P(x) = \bar{X}^\intercal Q \bar{X}}$ with ${Q \succeq 0}$. But observe two simple facts:

• The first three diagonal terms of ${Q}$ are zero, by inspection.
• The bottom ${3 \times 3}$ principal minor is just our old matrix.

Now we claim that if some diagonal term in some PSD matrix is zero, then all entries in that row and column are zero too. Indeed, if ${Q_{ii} = 0}$, then for any ${j}$, the non-negativity of the determinants of the principal minors (see below) says: ${0 = Q_{ii}Q_{jj} \geq Q_{ij}^2}$, so ${Q_{ij} = 0}$. Hence, the problem just collapses to finding ${Q \succeq 0}$ such that

$\displaystyle P(x) = \begin{pmatrix} x_1^2 \\ x_2^2 \\ x_1x_2 \end{pmatrix}^\intercal \begin{pmatrix} & & \\ & Q & \\ & & \end{pmatrix} \begin{pmatrix} x_1^2 \\ x_2^2 \\ x_1x_2 \end{pmatrix} \ \ \ \ \ (1)$

which we saw a solution for. Indeed, if we set

$\displaystyle Q = \begin{pmatrix} 2 & -3 & 1\\ -2 & 5 & 0\\ 1 & 0 & 5 \end{pmatrix} \ \ \ \ \ (2)$

then the eigenvalues are ${\lambda_1 = 0, \lambda_2 = 5, \lambda_3 = 7}$, with eigenvectors

$\displaystyle v_1 = \frac{1}{\sqrt{35}}(-5, -3, 1)^\intercal, v_2 = \frac{1}{\sqrt{10}}(0,1,3)^\intercal, v_3 = \frac{1}{\sqrt{14}}(2, -3, 1)^\intercal.$

Since ${Q = \sum_i \lambda_i v_i v_i^\intercal}$, you get

$\displaystyle \begin{array}{rl} P(x) &= X^\intercal Q X = \sum_{i = 1}^3 \lambda_i X^\intercal (v_i v_i^\intercal) X = \sum_{i = 1}^3 \lambda_i \langle X, v_i \rangle^2\\ &= \frac12 (x_2^2 + 3x_1x_2)^2 + \frac12 (2x_1^2 - 3x_2^2 + x_1x_2)^2. \end{array}$

Which is the SoS representation we wanted.

• Thanks, C.J.: I’d clean forgotten about the all-important equivalent definition of PSD-ness: ${A}$ is PSD if and only if all principal minors have non-negative determinants. Recall that a principal minor is obtained by restricting to some set ${S \subseteq [n]}$ of rows and columns. One direction of the proof is easy (and thankfully it’s the one we need): first, a matrix is PSD then all its principal minors are PSD. (Use the ${x^\intercal A x \geq 0}$ definition.) But the determinant of a principal minor is just the product of its eigenvalues, which is non-negative for PSD matrices. The other direction is a little more involved, I don’t know a one-line proof, please consult your favorite algebra text. (Or if you know one, catch me and tell me about it.)

• Finally, the point that Goran asked a clarification for: if ${P(x) = \sum_i h_i(x)^2}$ has max-degree ${2d}$, then the ${h_i(x)}$ need only be polynomials of degree at most ${d}$. Suppose not. Let ${D > d}$ be the highest degree monomial in some ${h_i(x)}$, and let ${\widehat{h}_i(x)}$ contain all the monomials in ${h_i(x)}$ of degree ${D}$. Now ${H(x) = \sum_i \widehat{h}_i(x)^2}$ is a sum of square polynomials, and all its terms (of degree ${2D > 2d}$) also appear in ${\sum_i h_i(x)^2}$.

Goran had asked: Can ${H(x)}$ be identically zero, with all its terms cancelling out? No, and here’s one easy way to see it: consider some ${x^*}$ for which some ${\widehat{h}_i(x^*) \neq 0}$. Then ${H(x) \geq \widehat{h}_i(x^*)^2 > 0}$. (I am sure there are another easy arguments? Please tell me about them.)

## Lec #19: JL etc.

So two things that we wanted to check during lecture today, and some notes:

• If ${X_i}$ is sub-Gaussian with parameter ${c_i}$, and they are independent, then ${\sum a_i X_i}$ is indeed sub-Gaussian with parameter ${\sqrt{ \sum_i a_i^2 c_i^2 }}$, as claimed. Basically, just use the MGF definition to show this. And note the analogy to Gaussians, where ${c_i}$ should be the standard deviation, and not the variance.
• I got the sub-exponential equivalence slightly wrong. The MGF-based definition: there exist parameters ${(\nu, b)}$ such that

$\displaystyle \mathbb{E}[ e^{t(W - \mathbb{E}[W])} ] \leq exp( \nu^2 t^2/2) \qquad \forall |t| \le 1/b.$

This is equivalent to there existing some other constants ${c_1, c_2}$ such that

$\displaystyle \Pr[ |W - \mathbb{E}[W]| \geq \lambda ] \leq c_1 e^{- c_2 t}.$

See Theorem~2.2 here for a proof. (Thanks for pointing out that mistake, Tom!)

• For compressive sensing, there are a million resources online. E.g., here’s Terry Tao on the topic.
• I should have emphasized: the blood testing example we talked about was the simple case where you were guaranteed that the ${x}$ vector was ${1}$-sparse. I.e., there is exactly one person with the disease among a population of ${D}$ people. Then the binary search strategy we did used ${\log_2 D}$ linear tests, which is optimal. (To clarify, for each ${i}$, the ${i^{th}}$ test combines together the samples for all people whose index has a ${1}$ in the ${i^{th}}$ place and tests this combination. Now construct a ${\log_2 D}$-bit vector with a ${1}$ in the ${i^{th}}$ position precisely if the ${i^{th}}$ test came up positive: this is the index of the person with the infection.

Note this is a non-adaptive strategy: we can write down all the tests up-front, and can read off the answer from their results. This is opposed to an adaptive strategy that would look at the answers to previous tests to decide which sets to test next.

What about the case where there are exactly two people infected? The same strategy does not work. In fact, since there are ${\binom{n}{2}}$ answers and each answer gives us ${1}$ bit each time, we must perform at least ${\log_2 \binom{n}{2}}$ tests. Any thoughts on how to do this? An adaptive strategy that uses ${2 \log_2 n}$ tests is easy: can you get a non-adaptive strategy? Using randomness?

## Matroids

Some of you asked for intuition about matroids: they seem so abstract, how to think about them? Whenever I have to prove a fact about matroids, I think about how I’d prove the fact for graphic matroids. And often (though alas, not always) that proof translates seamlessly to general matroids.

Recall that a graphic matroid is constructed as follows: take a graph. Its edges are the elements of the matroid. A set of elements (edges) is independent if they do not induce a cycle (in the graph-theory sense), i.e., if they form a forest. Now you can check that this definition satisfies the conditions for being a matroid. The bases of the matroid are spanning trees of G.

So on HW3, the first part asked: suppose I have two spanning trees on a graph G, a red one and a blue one. Show that there exists a bijection F between the red and blue edges so that replacing any red edge e by its matched blue edge F(e) gives a spanning tree. The proof now seems well within reach. And the arguments to prove it (adding an element/edge forms a cycle, etc) all generalize to matroids.

Why are we studying matroids? They form a convenient abstraction for many combinatorial structures you often encounter. E.g., spanning trees, linearly-independent sets of vectors, matchable vertices in a bipartite graph, sets of vertices that can routed to a sink in some vertex-disjoint way, all are matroids. E.g., perfect matchings in bipartite graphs can be phrased as finding a common base of two matroids defined on the same ground set. A famous result of Edmonds shows that this matroid intersection problem can be solved for any two matroids. Also, since matroids characterize the greedy algorithm (in a sense you saw in HW1), the fact that the greedy algorithm works for some problem often indicates there may be some deeper structure lurking within, that we can exploit.

I was hoping to cover matroids in a lecture later in the course (let’s see how things go), but here is a course by Jan Vondrak that covers many polyhedral aspects of matroids.

## A comment about HW3 #5

Pranav and Sidhanth asked me for the motivation behind that problem: why did we just not find a perfect matching in the graph (say by the blossom algorithm), put weight 0 on the matching edges, and 1s elsewhere?

The reason is that we want to find perfect matchings in parallel. If we had a single perfect matching in the graph, we could check (in parallel) for each edge if it was in this perfect matching. (How? Use Lovasz’s algorithm via Tutte’s theorem on G and then on G-e. This requires computing determinants, which is doable in parallel.) And then (in parallel) output all edges that belong to the matching.

But if there are many perfect matchings, maybe all edges may find that G-e still has a perfect matching. Which of these should we output? We don’t want to do things sequentially, so we seem stuck.

This is why Ketan Mulmuley, Umesh Vazirani, and Vijay Vazirani came up with the isolation lemma. Suppose we choose random weights for all edges. Whp there is a unique min-weight matching. So now each edge e’s subproblem is: does e belong to the unique min-weight matching? This can also be done in parallel, using a slight extension of the Lovasz idea. See the MVV paper for details.

The next question arises: do we really need randomization? And this has been open for some time, and people have tried to reduce the number of random bits you need. The naive approach uses O(log m) bits per edge, so O(m log m) random bits over all. In this question you proved a bound of O(log^2 m) bits, which is much better. It still remains an open problem how to remove the randomness altogether.

## Lec #17: Ellipsoid and Interior-Point

Hi all: since we covered most of the short-step method today, it may be easiest if you looked over the proof yourself. We can discuss it in office hours on Tuesday, if you’d like. I’d like to start off talking about concentration bounds on Monday.

Some other notes about Ellipsoid and interior-point methods.

• We talked about how a separation oracle for a convex ${K}$ gives (via Ellipsoid) an algorithm for optimization over ${K}$. In fact, the two problems of separation and optimization are equivalent! So, given an algorithm for max-weight perfect matching in general graphs, you also get an algorithm for finding minimum odd-cuts. You will give a direct algorithm for the problem in the upcoming HW.
• In fact, something even more surprising is known. Suppose we are given a point ${x_0}$ in the body ${K}$, and values ${r, R}$ such that ${B(x_0,r) \subseteq K \subseteq B(x_0, R)}$. And we are also given a membership oracle, which given a point ${x}$, just outputs whether or not ${x \in K}$. Then we can still optimize over ${K}$. The idea is to use the membership oracle to sample points from the polytope and generate a (weak) separation oracle.

(It is important that you be given a point inside ${K}$, else you have no chance, even if you know a bounding ball ${B(0,R)}$, and that ${K}$ has some non-trivial ball within it. There is too much volume in ${B(0,R)}$ for you to go searching with just a membership oracle: you’ll pretty much need ${(R/r)^n}$ time to find a point inside ${K}$.)

Of course, this three-way equivalence between membership, separation, and optimization requires care to make precise and to prove. See the Grotschel Lovasz Schrijver book for all the details.

• There are many different ways to implement interior-point methods. We saw a primal-dual analysis that was pretty much from first principles (with a little bit swept under the rug, and even those details can be found in the Matousek-Gaertner book). Matousek and Gaertner also give a different way to find a starting point ${x_0}$, different from the one we outlined in lecture.

Sadly, we did not talk about Newton’s method, or self-concordance, or local norms and Dikin ellipsoids, which form the basis for the “modern” treatment of interior-point methods. If your interest is piqued, you should check out a dedicated optimization course (Tepper and MLD both offer one), or have a look at one of the books listed on the course webpage.

Also, another interior-point algorithm that is very approachable (and has a short self-contained exposition) is Renegar’s algorithm: here are notes by Ryan from our LP/SDP course.

## Lec #16: Ellipsoid and Center-of-Gravity

We will talk a bit more about the Ellipsoid algorithm (and separation vs optimization) on Friday, maybe start talking about the Newton method, and then move on to more details on interior point techniques on Monday.

• One point to emphasize: the ${t^{th}}$ step of Ellipsoid cuts ${\mathcal{E}_t}$ into two, and builds ${\mathcal{E}_{t+1}}$ around the relavant half. Suppose ${\mathcal{E}_t = B}$, where ${B}$ is the unit ball. Then ${\mathcal{E}_{t+1} = \mathcal{E}'}$ is some ellipsoid with volume at most ${e^{1/(2(n+1))}}$ times the volume of the unit ball. Now since any ellipsoid is a linear transformation of the ball, if ${\mathcal{E}_t = L_t B}$, where ${L_t}$ is the associated linear transformation, then ${\mathcal{E}_{t+1} = L_t \mathcal{E}'}$. But for any body ${K}$, the volume of ${L_t K}$ is ${|det(L_t)| \cdot volume(K)}$. So

$\displaystyle \frac{volume(\mathcal{E}_{t+1})}{volume(\mathcal{E}_{t})} = \frac{|det(L_t)| \cdot volume(\mathcal{E}')}{|det(L_t)| \cdot volume(B)} \leq e^{1/(2(n+1))}.$

In fact, you can imagine that each step we are just transforming the current elipsoid back to a ball, making a cut, and then transforming things back. One problem, of course, is that at each step we make the transformation more complicated. Indeed, in the notation above, if ${E' = L' B}$, then ${L_{t+1} := L_t L'}$. So the numbers involved can get larger at each step: we may need to round the numbers to control how big they get. These and other numerical issues are at the heart of Khachiyan’s proof that the algorithm runs in polynomial time.

• My apologies: the alternate version of Ellipsoid I thought used boxes, in fact, uses simplices. The analysis that it runs in polynomial time is due to Boris Yamnitsky and (Leonid) Levin. The main idea is that half of some simplex ${S}$ can be contained in another simplex of volume at most ${e^{1/(2(n+1)^2)} \approx 1 - \Theta(1/n^2)}$ times the volume of ${S}$. (A proof appears in Vasek Chvatal’s book.) Note this factor is worse than the Ellipsoid factor of ${e^{1/(2(n+1))}}$ by another factor of ${n}$, but the numerical calculations in the algorithm definition do not require the use of square roots. The original paper (also here) is not quite kosher, since it can lead to the size of the numbers blowing up: this report by Bartels gives an example, and also suggests rounding approaches to control the problem. Finally, some notes on the simplices algorithm by Yossi Azar (parts 1, 2) which I have not had a chance to go over in detail.

And a couple words about the center-of-gravity algorithm:

• The center-of-gravity definition. It’s the natural extension of the discrete case. Indeed, if we have ${N}$ objects in ${{\mathbb R}^n}$, the ${i^{th}}$ one having location ${x_i \in {\mathbb R}^n}$ and mass ${\mu_i \geq 0}$, the center of gravity (or the center of mass, or centroid) is defined as

$\displaystyle c = \frac{\sum_{i \in [N]} x_i \mu_i }{\sum_{i \in [N]} \mu_i}.$

The continuous analog of this where we have a general measure ${\mu(x)}$ over ${{\mathbb R}^n}$ (basically replacing sums by integrals), is

$\displaystyle c = \frac{\int_{x \in R^n} x \; d\mu(x) }{\int_{x \in R^n} d\mu(x)}.$

The numerator is the total measure over ${{\mathbb R}^n}$. (In class I was implcitly assuming the uniform measure over ${K \subseteq {\mathbb R}^n}$, which is given by ${d\mu(x) = \mathbf{1}_{x \in K} dx}$.

The ${1/e}$ in Grunbaum’s theorem (that each hyperplane through the centroid of a convex body contains at least ${1/e}$ fraction of the mass on either side) is best possible for convex bodies. And the proof is clever but not difficult See Grunbaum’s (very short) paper for examples and proof. Or these notes by Jon Kelner or Santosh Vempala.

What happens if we don’t have the uniform measure over a convex body, but a more general distribution? Then things change quite a bit. E.g., consider ${n+1}$ equal point masses at the vertices of an ${n}$-dimensional simplex. No matter which point you choose, you can find a hyperplane through it that contains only a single point (which is ${1/n+1}$ of the mass) on one side. Grunbaum actually shows (in the same paper) that you can find a point that ensures at least ${1/(n+1)}$ fraction of the mass on either side.

## Lec #12: Notes on solving LPs

Notes for today’s lecture:

• Tightness of the Hedge guarantee. Consider the basic Hedge analysis, which says that for any ${\varepsilon \leq 1}$, we have regret at most ${\frac{\ln N}{\varepsilon} + \varepsilon T}$. Now if we were to set ${\varepsilon = \sqrt{\ln N}{T}}$ by balancing those terms, the regret bound would be ${2\sqrt{T \ln N}}$. This is tight, upto the constant term.

Let’s see why in the case ${N = 2}$. Suppose we again have two experts ${H}$ and ${T}$, and are trying to predict a fair coin toss. I.e., every time the loss vector is either ${(1,-1)}$ or ${(-1, 1)}$ with equal probability. So our expected gain is at least ${0}$. But after ${T}$ coin tosses, with constant probability we have ${\Omega(\sqrt{T})}$ more flips of one type than of another, and indeed, the expected gain of one of the static experts is ${\Omega(\sqrt{T})}$. So our regret cannot be less than ${\Omega(\sqrt{T})}$ even for two experts. Similarly for ${N}$ experts one can show that ${\Omega(\sqrt{T \log N})}$ is necessary.

• Larger Range of Loss Vectors. For the setting where loss/gain functions could be in ${[-\rho, \rho]^N}$, we claimed an algorithm with average regret less than ${\epsilon}$ as long as ${T \geq \frac{4 \rho^2 \ln N}{\varepsilon}}$. We left it as an exercise in HW3.

In fact, you can prove something slightly weaker for the asymmetric setting where losses are in ${[-\gamma, \rho]^N}$, where ${1 \leq \gamma \leq \rho}$. In handwritten notes on the webpage, I show how to use a guarantee for Hedge to get

$\displaystyle \frac{1}{T} \sum_t \langle p^t, \ell^t \rangle \leq (1 - \varepsilon) \frac{1}{T} \sum_t \langle e_i, \ell^t \rangle + \varepsilon \quad \forall i \in [N]$

as long as ${T = \Omega( \frac{\rho \gamma \ln N}{\varepsilon^2})}$. The constants are worse, and there’s a ${(1-\varepsilon)}$ term hitting the “best expert” term, but the analysis is mechanical.

You can use this gurantee along with the shortest-path oracle to get an ${O(\frac{F^* \log m}{\varepsilon^{O(1)}})}$-iteration algorithm for ${(1-\varepsilon)}$-approximate maximum flow algorithm, since the gains will be in the range ${[-1, F^*]}$. More details below.

• The max-flow part was fast, here are some more details. We wrote the LP, and plugged it into the multiplicative weights framework. Since we had a constraint for each edge ${e}$, the “average” constraint looked like:

$\displaystyle \sum_e p_e \sum_{P: e \in P} f_P \leq \sum_e p_e \cdot 1 = 1.$

Flipping the summations, we get

$\displaystyle \sum_{P} f_P \sum_{e \in P} p_e \leq 1.$

If we denote ${len(P) = \sum_{e \in P} p_e}$ the optimal solution is to send flow along a shortest path, where the edge lengths are ${p_e}$. We can find this using Dijkstra even though we cannot write the massive LP down. Since the “easy” constraints were ${K = \{ f_P \geq 0, \sum_P f_P = F^*\}}$, we send ${F^*}$ flow along this shortest path. Now we update the probabilities (edge lengths), find another shortest path, push flow, and repeat. At each step the gains will be in the range ${[-1, F^*]}$.

So we can use the asymmetric losses analysis above. After ${T = \Theta(\frac{F^* \log m}{\varepsilon^{2}})}$-iterations, taking the “average” flow ${\widehat{x}_e := \frac{1}{T} \sum_{t = 1}^T \sum_{P: e \in P} f^t_P}$, we have that for each edge ${e}$,

$\displaystyle (1-\varepsilon)\cdot\left(\widehat{x}_e - 1\right) \leq \varepsilon \quad \Rightarrow \quad \widehat{x}_e \leq 1 + \frac{\varepsilon}{1-\varepsilon} = \frac{1}{1-\varepsilon}.$

(How? chase through the LP-solving analysis we did in lecture, but use the above asymmetric analysis, instead of the standard symmetric one we used.)

Finally, the flow is not feasible, since it may violate edge capacities. So scale down. I.e., define the flow ${(1-\varepsilon)\widehat{x}}$: it has value ${(1-\varepsilon)F^*}$ and satisfies all the edge constraints. Viola.

• Next lecture we will do the improvement using electrical flows.