15-850: CMU Advanced Algorithms

Lec #24: SDPs I

Posted on March 30, 2017 by cmurandomized

Nicholas asked how, in the SoS example where we wanted to write
the polynomial

$\displaystyle P(x) = 2x_1^4 + 2x_1^3x_2 - x_1^2x_2^2 + 5x_2^4$

as a sum of square polynomials, we could quickly restrict ourselves to
${(x_1^2, x_2^2, x_1x_2)}$ .

Here’s one way. Of course, in general we have to consider the vector ${\bar{X} = (1, x_1, x_2, x_1^2, x_2^2, x_1x_2)^\intercal}$ , since the squares of these monomials span ${P(x)}$ . This means we must consider a much larger (i.e., ${6 \times 6}$ ) matrix ${Q}$ , and try to write ${P(x) = \bar{X}^\intercal Q \bar{X}}$ with ${Q \succeq 0}$ . But observe two simple facts:
- The first three diagonal terms of ${Q}$ are zero, by inspection.
- The bottom ${3 \times 3}$ principal minor is just our old matrix.
Now we claim that if some diagonal term in some PSD matrix is zero, then all entries in that row and column are zero too. Indeed, if ${Q_{ii} = 0}$ , then for any ${j}$ , the non-negativity of the determinants of the principal minors (see below) says: ${0 = Q_{ii}Q_{jj} \geq Q_{ij}^2}$ , so ${Q_{ij} = 0}$ . Hence, the problem just collapses to finding ${Q \succeq 0}$ such that

$\displaystyle P(x) = \begin{pmatrix} x_1^2 \\ x_2^2 \\ x_1x_2 \end{pmatrix}^\intercal \begin{pmatrix} & & \\ & Q & \\ & & \end{pmatrix} \begin{pmatrix} x_1^2 \\ x_2^2 \\ x_1x_2 \end{pmatrix} \ \ \ \ \ (1)$

which we saw a solution for. Indeed, if we set

$\displaystyle Q = \begin{pmatrix} 2 & -3 & 1\\ -2 & 5 & 0\\ 1 & 0 & 5 \end{pmatrix} \ \ \ \ \ (2)$

then the eigenvalues are ${\lambda_1 = 0, \lambda_2 = 5, \lambda_3 = 7}$ , with eigenvectors

$\displaystyle v_1 = \frac{1}{\sqrt{35}}(-5, -3, 1)^\intercal, v_2 = \frac{1}{\sqrt{10}}(0,1,3)^\intercal, v_3 = \frac{1}{\sqrt{14}}(2, -3, 1)^\intercal.$

Since ${Q = \sum_i \lambda_i v_i v_i^\intercal}$ , you get

$\displaystyle \begin{array}{rl} P(x) &= X^\intercal Q X = \sum_{i = 1}^3 \lambda_i X^\intercal (v_i v_i^\intercal) X = \sum_{i = 1}^3 \lambda_i \langle X, v_i \rangle^2\\ &= \frac12 (x_2^2 + 3x_1x_2)^2 + \frac12 (2x_1^2 - 3x_2^2 + x_1x_2)^2. \end{array}$

Which is the SoS representation we wanted.
Thanks, C.J.: I’d clean forgotten about the all-important equivalent definition of PSD-ness: ${A}$ is PSD if and only if all principal minors have non-negative determinants. Recall that a principal minor is obtained by restricting to some set ${S \subseteq [n]}$ of rows and columns. One direction of the proof is easy (and thankfully it’s the one we need): first, a matrix is PSD then all its principal minors are PSD. (Use the ${x^\intercal A x \geq 0}$ definition.) But the determinant of a principal minor is just the product of its eigenvalues, which is non-negative for PSD matrices. The other direction is a little more involved, I don’t know a one-line proof, please consult your favorite algebra text. (Or if you know one, catch me and tell me about it.)
Finally, the point that Goran asked a clarification for: if ${P(x) = \sum_i h_i(x)^2}$ has max-degree ${2d}$ , then the ${h_i(x)}$ need only be polynomials of degree at most ${d}$ . Suppose not. Let ${D > d}$ be the highest degree monomial in some ${h_i(x)}$ , and let ${\widehat{h}_i(x)}$ contain all the monomials in ${h_i(x)}$ of degree ${D}$ . Now ${H(x) = \sum_i \widehat{h}_i(x)^2}$ is a sum of square polynomials, and all its terms (of degree ${2D > 2d}$ ) also appear in ${\sum_i h_i(x)^2}$ .
Goran had asked: Can ${H(x)}$ be identically zero, with all its terms cancelling out? No, and here’s one easy way to see it: consider some ${x^*}$ for which some ${\widehat{h}_i(x^*) \neq 0}$ . Then ${H(x) \geq \widehat{h}_i(x^*)^2 > 0}$ . (I am sure there are another easy arguments? Please tell me about them.)

Posted in Uncategorized | Leave a comment

Lec #19: JL etc.

Posted on March 9, 2017 by cmurandomized

So two things that we wanted to check during lecture today, and some notes:

If ${X_i}$ is sub-Gaussian with parameter ${c_i}$ , and they are independent, then ${\sum a_i X_i}$ is indeed sub-Gaussian with parameter ${\sqrt{ \sum_i a_i^2 c_i^2 }}$ , as claimed. Basically, just use the MGF definition to show this. And note the analogy to Gaussians, where ${c_i}$ should be the standard deviation, and not the variance.
I got the sub-exponential equivalence slightly wrong. The MGF-based definition: there exist parameters ${(\nu, b)}$ such that
$\displaystyle \mathbb{E}[ e^{t(W - \mathbb{E}[W])} ] \leq exp( \nu^2 t^2/2) \qquad \forall |t| \le 1/b.$

This is equivalent to there existing some other constants ${c_1, c_2}$ such that

$\displaystyle \Pr[ |W - \mathbb{E}[W]| \geq \lambda ] \leq c_1 e^{- c_2 t}.$

See Theorem~2.2 here for a proof. (Thanks for pointing out that mistake, Tom!)
For compressive sensing, there are a million resources online. E.g., here’s Terry Tao on the topic.
I should have emphasized: the blood testing example we talked about was the simple case where you were guaranteed that the ${x}$ vector was ${1}$ -sparse. I.e., there is exactly one person with the disease among a population of ${D}$ people. Then the binary search strategy we did used ${\log_2 D}$ linear tests, which is optimal. (To clarify, for each ${i}$ , the ${i^{th}}$ test combines together the samples for all people whose index has a ${1}$ in the ${i^{th}}$ place and tests this combination. Now construct a ${\log_2 D}$ -bit vector with a ${1}$ in the ${i^{th}}$ position precisely if the ${i^{th}}$ test came up positive: this is the index of the person with the infection.
Note this is a non-adaptive strategy: we can write down all the tests up-front, and can read off the answer from their results. This is opposed to an adaptive strategy that would look at the answers to previous tests to decide which sets to test next.

What about the case where there are exactly two people infected? The same strategy does not work. In fact, since there are ${\binom{n}{2}}$ answers and each answer gives us ${1}$ bit each time, we must perform at least ${\log_2 \binom{n}{2}}$ tests. Any thoughts on how to do this? An adaptive strategy that uses ${2 \log_2 n}$ tests is easy: can you get a non-adaptive strategy? Using randomness?

Posted in Uncategorized | Leave a comment

Matroids

Posted on March 8, 2017 by cmurandomized

Some of you asked for intuition about matroids: they seem so abstract, how to think about them? Whenever I have to prove a fact about matroids, I think about how I’d prove the fact for graphic matroids. And often (though alas, not always) that proof translates seamlessly to general matroids.

Recall that a graphic matroid is constructed as follows: take a graph. Its edges are the elements of the matroid. A set of elements (edges) is independent if they do not induce a cycle (in the graph-theory sense), i.e., if they form a forest. Now you can check that this definition satisfies the conditions for being a matroid. The bases of the matroid are spanning trees of G.

So on HW3, the first part asked: suppose I have two spanning trees on a graph G, a red one and a blue one. Show that there exists a bijection F between the red and blue edges so that replacing any red edge e by its matched blue edge F(e) gives a spanning tree. The proof now seems well within reach. And the arguments to prove it (adding an element/edge forms a cycle, etc) all generalize to matroids.

Why are we studying matroids? They form a convenient abstraction for many combinatorial structures you often encounter. E.g., spanning trees, linearly-independent sets of vectors, matchable vertices in a bipartite graph, sets of vertices that can routed to a sink in some vertex-disjoint way, all are matroids. E.g., perfect matchings in bipartite graphs can be phrased as finding a common base of two matroids defined on the same ground set. A famous result of Edmonds shows that this matroid intersection problem can be solved for any two matroids. Also, since matroids characterize the greedy algorithm (in a sense you saw in HW1), the fact that the greedy algorithm works for some problem often indicates there may be some deeper structure lurking within, that we can exploit.

I was hoping to cover matroids in a lecture later in the course (let’s see how things go), but here is a course by Jan Vondrak that covers many polyhedral aspects of matroids.

Posted in Uncategorized | Leave a comment

A comment about HW3 #5

Posted on March 4, 2017 by cmurandomized

Pranav and Sidhanth asked me for the motivation behind that problem: why did we just not find a perfect matching in the graph (say by the blossom algorithm), put weight 0 on the matching edges, and 1s elsewhere?

The reason is that we want to find perfect matchings in parallel. If we had a single perfect matching in the graph, we could check (in parallel) for each edge if it was in this perfect matching. (How? Use Lovasz’s algorithm via Tutte’s theorem on G and then on G-e. This requires computing determinants, which is doable in parallel.) And then (in parallel) output all edges that belong to the matching.

But if there are many perfect matchings, maybe all edges may find that G-e still has a perfect matching. Which of these should we output? We don’t want to do things sequentially, so we seem stuck.

This is why Ketan Mulmuley, Umesh Vazirani, and Vijay Vazirani came up with the isolation lemma. Suppose we choose random weights for all edges. Whp there is a unique min-weight matching. So now each edge e’s subproblem is: does e belong to the unique min-weight matching? This can also be done in parallel, using a slight extension of the Lovasz idea. See the MVV paper for details.

The next question arises: do we really need randomization? And this has been open for some time, and people have tried to reduce the number of random bits you need. The naive approach uses O(log m) bits per edge, so O(m log m) random bits over all. In this question you proved a bound of O(log^2 m) bits, which is much better. It still remains an open problem how to remove the randomness altogether.

Posted in Uncategorized | Leave a comment

Lec #17: Ellipsoid and Interior-Point

Posted on March 4, 2017 by cmurandomized

Hi all: since we covered most of the short-step method today, it may be easiest if you looked over the proof yourself. We can discuss it in office hours on Tuesday, if you’d like. I’d like to start off talking about concentration bounds on Monday.

Some other notes about Ellipsoid and interior-point methods.

We talked about how a separation oracle for a convex ${K}$ gives (via Ellipsoid) an algorithm for optimization over ${K}$ . In fact, the two problems of separation and optimization are equivalent! So, given an algorithm for max-weight perfect matching in general graphs, you also get an algorithm for finding minimum odd-cuts. You will give a direct algorithm for the problem in the upcoming HW.
In fact, something even more surprising is known. Suppose we are given a point ${x_0}$ in the body ${K}$ , and values ${r, R}$ such that ${B(x_0,r) \subseteq K \subseteq B(x_0, R)}$ . And we are also given a membership oracle, which given a point ${x}$ , just outputs whether or not ${x \in K}$ . Then we can still optimize over ${K}$ . The idea is to use the membership oracle to sample points from the polytope and generate a (weak) separation oracle.
(It is important that you be given a point inside ${K}$ , else you have no chance, even if you know a bounding ball ${B(0,R)}$ , and that ${K}$ has some non-trivial ball within it. There is too much volume in ${B(0,R)}$ for you to go searching with just a membership oracle: you’ll pretty much need ${(R/r)^n}$ time to find a point inside ${K}$ .)

Of course, this three-way equivalence between membership, separation, and optimization requires care to make precise and to prove. See the Grotschel Lovasz Schrijver book for all the details.
There are many different ways to implement interior-point methods. We saw a primal-dual analysis that was pretty much from first principles (with a little bit swept under the rug, and even those details can be found in the Matousek-Gaertner book). Matousek and Gaertner also give a different way to find a starting point ${x_0}$ , different from the one we outlined in lecture.
Sadly, we did not talk about Newton’s method, or self-concordance, or local norms and Dikin ellipsoids, which form the basis for the “modern” treatment of interior-point methods. If your interest is piqued, you should check out a dedicated optimization course (Tepper and MLD both offer one), or have a look at one of the books listed on the course webpage.

Also, another interior-point algorithm that is very approachable (and has a short self-contained exposition) is Renegar’s algorithm: here are notes by Ryan from our LP/SDP course.

Posted in Uncategorized | Leave a comment

Lec #16: Ellipsoid and Center-of-Gravity

Posted on March 2, 2017 by cmurandomized

We will talk a bit more about the Ellipsoid algorithm (and separation vs optimization) on Friday, maybe start talking about the Newton method, and then move on to more details on interior point techniques on Monday.

A couple things about Ellipsoid:

One point to emphasize: the ${t^{th}}$ step of Ellipsoid cuts ${\mathcal{E}_t}$ into two, and builds ${\mathcal{E}_{t+1}}$ around the relavant half. Suppose ${\mathcal{E}_t = B}$ , where ${B}$ is the unit ball. Then ${\mathcal{E}_{t+1} = \mathcal{E}'}$ is some ellipsoid with volume at most ${e^{1/(2(n+1))}}$ times the volume of the unit ball. Now since any ellipsoid is a linear transformation of the ball, if ${\mathcal{E}_t = L_t B}$ , where ${L_t}$ is the associated linear transformation, then ${\mathcal{E}_{t+1} = L_t \mathcal{E}'}$ . But for any body ${K}$ , the volume of ${L_t K}$ is ${|det(L_t)| \cdot volume(K)}$ . So
$\displaystyle \frac{volume(\mathcal{E}_{t+1})}{volume(\mathcal{E}_{t})} = \frac{|det(L_t)| \cdot volume(\mathcal{E}')}{|det(L_t)| \cdot volume(B)} \leq e^{1/(2(n+1))}.$

In fact, you can imagine that each step we are just transforming the current elipsoid back to a ball, making a cut, and then transforming things back. One problem, of course, is that at each step we make the transformation more complicated. Indeed, in the notation above, if ${E' = L' B}$ , then ${L_{t+1} := L_t L'}$ . So the numbers involved can get larger at each step: we may need to round the numbers to control how big they get. These and other numerical issues are at the heart of Khachiyan’s proof that the algorithm runs in polynomial time.
My apologies: the alternate version of Ellipsoid I thought used boxes, in fact, uses simplices. The analysis that it runs in polynomial time is due to Boris Yamnitsky and (Leonid) Levin. The main idea is that half of some simplex ${S}$ can be contained in another simplex of volume at most ${e^{1/(2(n+1)^2)} \approx 1 - \Theta(1/n^2)}$ times the volume of ${S}$ . (A proof appears in Vasek Chvatal’s book.) Note this factor is worse than the Ellipsoid factor of ${e^{1/(2(n+1))}}$ by another factor of ${n}$ , but the numerical calculations in the algorithm definition do not require the use of square roots. The original paper (also here) is not quite kosher, since it can lead to the size of the numbers blowing up: this report by Bartels gives an example, and also suggests rounding approaches to control the problem. Finally, some notes on the simplices algorithm by Yossi Azar (parts 1, 2) which I have not had a chance to go over in detail.

And a couple words about the center-of-gravity algorithm:

The center-of-gravity definition. It’s the natural extension of the discrete case. Indeed, if we have ${N}$ objects in ${{\mathbb R}^n}$ , the ${i^{th}}$ one having location ${x_i \in {\mathbb R}^n}$ and mass ${\mu_i \geq 0}$ , the center of gravity (or the center of mass, or centroid) is defined as

$\displaystyle c = \frac{\sum_{i \in [N]} x_i \mu_i }{\sum_{i \in [N]} \mu_i}.$

The continuous analog of this where we have a general measure ${\mu(x)}$ over ${{\mathbb R}^n}$ (basically replacing sums by integrals), is

$\displaystyle c = \frac{\int_{x \in R^n} x \; d\mu(x) }{\int_{x \in R^n} d\mu(x)}.$

The numerator is the total measure over ${{\mathbb R}^n}$ . (In class I was implcitly assuming the uniform measure over ${K \subseteq {\mathbb R}^n}$ , which is given by ${d\mu(x) = \mathbf{1}_{x \in K} dx}$ .

The ${1/e}$ in Grunbaum’s theorem (that each hyperplane through the centroid of a convex body contains at least ${1/e}$ fraction of the mass on either side) is best possible for convex bodies. And the proof is clever but not difficult See Grunbaum’s (very short) paper for examples and proof. Or these notes by Jon Kelner or Santosh Vempala.

What happens if we don’t have the uniform measure over a convex body, but a more general distribution? Then things change quite a bit. E.g., consider ${n+1}$ equal point masses at the vertices of an ${n}$ -dimensional simplex. No matter which point you choose, you can find a hyperplane through it that contains only a single point (which is ${1/n+1}$ of the mass) on one side. Grunbaum actually shows (in the same paper) that you can find a point that ensures at least ${1/(n+1)}$ fraction of the mass on either side.

Posted in Uncategorized | Leave a comment

Lec #12: Notes on solving LPs

Posted on February 20, 2017 by cmurandomized

Notes for today’s lecture:

Tightness of the Hedge guarantee. Consider the basic Hedge analysis, which says that for any ${\varepsilon \leq 1}$ , we have regret at most ${\frac{\ln N}{\varepsilon} + \varepsilon T}$ . Now if we were to set ${\varepsilon = \sqrt{\ln N}{T}}$ by balancing those terms, the regret bound would be ${2\sqrt{T \ln N}}$ . This is tight, upto the constant term.
Let’s see why in the case ${N = 2}$ . Suppose we again have two experts ${H}$ and ${T}$ , and are trying to predict a fair coin toss. I.e., every time the loss vector is either ${(1,-1)}$ or ${(-1, 1)}$ with equal probability. So our expected gain is at least ${0}$ . But after ${T}$ coin tosses, with constant probability we have ${\Omega(\sqrt{T})}$ more flips of one type than of another, and indeed, the expected gain of one of the static experts is ${\Omega(\sqrt{T})}$ . So our regret cannot be less than ${\Omega(\sqrt{T})}$ even for two experts. Similarly for ${N}$ experts one can show that ${\Omega(\sqrt{T \log N})}$ is necessary.
Larger Range of Loss Vectors. For the setting where loss/gain functions could be in ${[-\rho, \rho]^N}$ , we claimed an algorithm with average regret less than ${\epsilon}$ as long as ${T \geq \frac{4 \rho^2 \ln N}{\varepsilon}}$ . We left it as an exercise in HW3.
In fact, you can prove something slightly weaker for the asymmetric setting where losses are in ${[-\gamma, \rho]^N}$ , where ${1 \leq \gamma \leq \rho}$ . In handwritten notes on the webpage, I show how to use a guarantee for Hedge to get

$\displaystyle \frac{1}{T} \sum_t \langle p^t, \ell^t \rangle \leq (1 - \varepsilon) \frac{1}{T} \sum_t \langle e_i, \ell^t \rangle + \varepsilon \quad \forall i \in [N]$

as long as ${T = \Omega( \frac{\rho \gamma \ln N}{\varepsilon^2})}$ . The constants are worse, and there’s a ${(1-\varepsilon)}$ term hitting the “best expert” term, but the analysis is mechanical.

You can use this gurantee along with the shortest-path oracle to get an ${O(\frac{F^* \log m}{\varepsilon^{O(1)}})}$ -iteration algorithm for ${(1-\varepsilon)}$ -approximate maximum flow algorithm, since the gains will be in the range ${[-1, F^*]}$ . More details below.
The max-flow part was fast, here are some more details. We wrote the LP, and plugged it into the multiplicative weights framework. Since we had a constraint for each edge ${e}$ , the “average” constraint looked like:
$\displaystyle \sum_e p_e \sum_{P: e \in P} f_P \leq \sum_e p_e \cdot 1 = 1.$

Flipping the summations, we get

$\displaystyle \sum_{P} f_P \sum_{e \in P} p_e \leq 1.$

If we denote ${len(P) = \sum_{e \in P} p_e}$ the optimal solution is to send flow along a shortest path, where the edge lengths are ${p_e}$ . We can find this using Dijkstra even though we cannot write the massive LP down. Since the “easy” constraints were ${K = \{ f_P \geq 0, \sum_P f_P = F^*\}}$ , we send ${F^*}$ flow along this shortest path. Now we update the probabilities (edge lengths), find another shortest path, push flow, and repeat. At each step the gains will be in the range ${[-1, F^*]}$ .

So we can use the asymmetric losses analysis above. After ${T = \Theta(\frac{F^* \log m}{\varepsilon^{2}})}$ -iterations, taking the “average” flow ${\widehat{x}_e := \frac{1}{T} \sum_{t = 1}^T \sum_{P: e \in P} f^t_P}$ , we have that for each edge ${e}$ ,

$\displaystyle (1-\varepsilon)\cdot\left(\widehat{x}_e - 1\right) \leq \varepsilon \quad \Rightarrow \quad \widehat{x}_e \leq 1 + \frac{\varepsilon}{1-\varepsilon} = \frac{1}{1-\varepsilon}.$

(How? chase through the LP-solving analysis we did in lecture, but use the above asymmetric analysis, instead of the standard symmetric one we used.)

Finally, the flow is not feasible, since it may violate edge capacities. So scale down. I.e., define the flow ${(1-\varepsilon)\widehat{x}}$ : it has value ${(1-\varepsilon)F^*}$ and satisfies all the edge constraints. Viola.
Next lecture we will do the improvement using electrical flows.

Posted in Uncategorized | Leave a comment

Lec #9:Notes

Posted on February 15, 2017 by cmurandomized

A few loose ends from yesterday’s lecture:

The formula for computing inverses of a submatrix from the inverse of the whole thing is given by this:

(from this survey by Pawel Parys). A good explanation of the Mucha and Sankowski ${O(n^\omega)}$ -time algorithm is in Marcin Mucha’s thesis.
I looked at the Tutte paper again. The reason he was introducing the Tutte matrix (not his terminology) was to prove Tutte’s theorem (again, not his terminology). Namely, that a graph has a perfect matching if and only if, for every subset of vertices ${U \subseteq V}$ , the number of odd components in ${G \setminus V}$ is at most ${|U|}$ . (Note that this is a special case of the Tutte-Berge formula.) Pretty cool that this 5-page paper introduced two important concepts.
The proof of the isolation lemma went a little fast towards the end, so let me say it again. We wanted to bound the probability that an edge is confused, i.e., that the min-weight perfect matching containing ${e}$ has the same as that of the min-weight perfect matching not containing ${e}$ , by ${1/2m}$ .So let’s give weights to all edges apart from ${e}$ . We can now read off the weight of the min-weight perfect matching not containing ${e}$ , say this is some constant ${W_0}$ . Moreover, if the weight of ${e}$ is eventually set to ${w_e}$ , then since the weights of all other elements is known, the min-weight perfect matching will have weight ${W_1 + w_e}$ , where ${W_1}$ is another constant depending on the weights of other edges. So, for ${e}$ to be confused, we must have ${w_e = W_0 - W_1}$ . And the chance this happens is at most ${1/2m}$ , since we are choosing the weight for each edge independently from ${\{1,2,\ldots, 2m\}}$ . Done.We will see how to use the isolation lemma in tomorrow’s lecture on smoothed analysis. (Slight change in lecture order, by the way.)
By the way, the really useful part of the isolation lemma is that the weights are small, linear in the number of edges even though the number of matchings is exponential. E.g., if you assigned weight of edge ${i}$ to be ${2^i}$ , you would again get that the min-weight matching would be unique. But the weights would be exponentially large, which is undesirable in many contexts.

Posted in Uncategorized | Leave a comment

Lec #7: Notes

Posted on February 8, 2017 by cmurandomized

Today we saw several definitions and facts about polytopes, including proofs of the integrality of the perfect matching polytope for bipartite graphs. Some comments.

We saw the definition of the perfect matching polytope ${K_{PM-gen}}$ for non-bipartite graphs (with exponentially many constraints), but did not prove it correct. Do give it a try yourself (or read over a proof via the basic-feasible solution approach which appears in my hand-written notes, or in these notes).
We saw a definition of the ${r}$ -arborescence polytope ${K_{arb}}$ (again with exponentially many constraints). As an exercise, show that ${K_{arb}}$ is also an integral polytope via the vertex approach.
Here’s how to go about it. In lecture #2, we saw an algorithm that finds the min-cost arborescence for non-negative edge costs. First, change it slightly to find min-cost arborescence ${T}$ for general edge costs ${c_e}$ . Naturally, ${T}$ gives you an (integer) solution to the LP ${\min \{ c^\intercal x \mid x \in K_{arb}\}}$ . No reason to believe it is optimal for the LP, since this is the best integer solution and there may exist cheaper fractional solutions.

Next, take the dual of this LP. If you show a set of values for the dual such that the dual value equals the primal value, then the integer solution ${T}$ you found is also an optimal solution to the LP. (How would you do it? Think of the prices we were defining at the end of Lecture 2.) Now the optimal solution being an integer solution is true for every set of edge costs. So the vertices of the polytope ${K_{arb}}$ must be all integral.

If this is a little vague, no worries: we will see the corresponding proofs for bipartite perfect matchings in the next lecture.
Both the above LPs have exponential size. But ${K_{arb}}$ has an “equivalent” poly-sized representation using some extra variables. Here’s a sketch. Indeed, the bad constraints are the ones saying that for every set ${S}$ not containing the root, ${\sum_{e \in \partial^+ S} x_e \geq 1}$ . This is the same as saying that the min-cut between any vertex and the root is at least ${1}$ , when viewing ${x_e}$ as arc capacities. By maxflow-mincut, the maxflow from each vertex to the root (using arc-capacities ${x_e}$ ) is at least ${1}$ . So replace the exponentially many constraints by other constraints that enforce this max-flow condition. This may requires polynomially many extra variables. But “projected down” onto the variables ${\{ x_e \}_{e \in E}}$ the polytope is the same. This is another instance of the “projection example” I was attempting to draw in class. These things are called “extended formulations”.
Does ${K_{PM-gen}}$ have a similarly small extended formulation? Recently Thomas Rothvoss showed that there are no tricks like the ones above, settling a decades-old conjecture of Yannakakis. In particular, every polytope that projects down to ${K_{PM-gen}}$ (for the complete graph ${K_n}$ ) must have ${2^{\Omega(n)}}$ constraints. See his slides/video for more details. As we remarked, this is an unconditional lower bound.
For an extreme case when extended formulations help, Goemans shows a polytope with ${2^n - 2}$ facets and ${n!}$ extreme points that arises from projecting down a polytope in ${O(n \log n)}$ dimensions using ${O(n \log n)}$ facets.
Finally, the use of total unimodularity for showing integrality of polytopes is indeed due to Alan Hoffman and Joe Kruskal. Here is their paper titled Integral Boundary Points of Convex Polyhedra, along with historical comments by the authors. Dabeen reminded me at the end of lecture that Edmonds and Giles actually developed a closely related concept, the theory of total dual integrality.

Posted in Uncategorized | Leave a comment

Lec #4: Notes

Posted on February 1, 2017 by cmurandomized

Some more notes about today’s lecture:

We know that ${APSP(n) \leq O(\log n) \cdot MSP(n)}$ , and in a forthcoming homework we will remove this log factor. The reduction in the other direction is not that mysterious: given two matrices ${A}$ , ${B}$ , can you write down an ${O(n)}$ vertex graph ${G}$ such that the ${APSP}$ in this graph gives you the MSP? Please post answers in the comments below.
If you’ve not seen Strassen’s algorithm before, it is an algorithm for computing ${n \times n}$ matrix multiplication in time ${n^{\log_2 7} \approx n^{2.81}}$ . It’s quite simple to state, and one can think of it as a 2-dimensional version of Karatsuba’s algorithm for multiplying two numbers. Mike Paterson has a very beautiful geometric interpretation of the sub-problems Strassen comes up with, and how they relate to Karatsuba.
The time for matrix multiplication was later improved by the Coppersmith-Winograd algorithm to get ${\omega = 2.376}$ . Small improvements in the lower-order digits of ${\omega}$ were given by Andrew Strothers and (our own) Virginia Vassilevska-Williams.

To answer Goran’s question, I looked over the original CW paper: it is presented as being similar to Strassen’s algorithm in that it breaks the matrices into smaller blocks and recurses on them. But the recursion is quite mysterious, at least to me. Recently, Cohn, Kleinberg, Szegedy, and Umans gave some group-theoretic approaches that use some of the CW ideas, but other seemingly orthogonal ideas, to match the CW-bound. They also give conjectures that would lead to ${\omega = 2}$ .
I was mistaken in claiming that for general directed graphs, the ${(1+\epsilon)}$ -approximate APSP is as hard as computing the exact APSP even for small values of ${\epsilon > 0}$ . One advantage is that one can round all edge weights to powers of ${(1+\epsilon)}$ , and that gives a lot of room to play with. There are a lot of positive results in this setting, e.g., this paper of Uri Zwick, and a talk by him.