Euiwoong asked what happens with constrained optimization of linear functions (which are ${0}$-smooth), since this would let us solve LPs. One minor issue is that the proofs require ${\beta > 0}$: they begin by setting ${\eta = O(1/\beta)}$, which is disastrous when ${\beta = 0}$.

A potentially bigger question, of course, is this: how would we perform the projection operation? Given ${y_{t+1} \not\in K}$, what is the closest point ${x_{t+1} \in K}$ to ${y_{t+1}}$? For a polytope ${K}$ given as ${\{x \mid Ax \leq b\}}$, this may not be an easy problem to solve: at least I don't see how to do it any faster than solving a linear program. Any ideas?
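That said, some special polytopes do admit fast projections. As an illustration (my addition, not part of the original discussion), here is the classic sort-and-threshold method for Euclidean projection onto the probability simplex ${\{x \mid x \geq 0, \sum_i x_i = 1\}}$, which runs in ${O(n \log n)}$ time:

```python
def project_to_simplex(y):
    """Euclidean projection of y onto the probability simplex
    {x : x >= 0, sum(x) = 1}, via the classic sort-and-threshold
    method: find a threshold theta so that max(y_i - theta, 0)
    sums to 1."""
    u = sorted(y, reverse=True)          # sort coordinates descending
    cumsum, css = 0.0, []
    for v in u:
        cumsum += v
        css.append(cumsum)
    # largest index rho with u[rho] * (rho + 1) > css[rho] - 1
    rho = max(i for i in range(len(u)) if u[i] * (i + 1) > css[i] - 1.0)
    theta = (css[rho] - 1.0) / (rho + 1)
    return [max(v - theta, 0.0) for v in y]

print(project_to_simplex([0.9, 0.8, -0.5]))  # approximately [0.55, 0.45, 0.0]
```

For a general polytope ${\{x \mid Ax \leq b\}}$, though, projection is a quadratic program, so it is at least as hard as the LP-like problems above.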

In fact, the cost of projection is one reason why people sometimes use conditional gradient descent (a.k.a. the Frank-Wolfe algorithm). The update rule is now: take ${x_t}$ and find

$\displaystyle y_{t+1} \gets \arg\min \{ \langle \nabla f(x_t), x \rangle \mid x\in K \}.$

I.e., use an LP solver to find the best point in ${K}$ in the direction of the negative gradient. Now take a small step from ${x_t}$ toward ${y_{t+1}}$:

$\displaystyle x_{t+1} \gets (1 - \eta) x_t + \eta y_{t+1}.$
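To make the update concrete, here is a minimal sketch of the Frank-Wolfe loop (my own toy example, not from the lecture). Over the probability simplex the LP oracle is trivial: the minimizer of a linear function over the simplex is a vertex, namely the coordinate where the gradient is smallest. The schedule ${\eta_t = 2/(t+2)}$ is the standard choice.

```python
def frank_wolfe(grad_f, lp_oracle, x0, steps):
    """Conditional gradient descent: no projections, just one
    linear optimization over K per iteration."""
    x = list(x0)
    for t in range(steps):
        g = grad_f(x)
        y = lp_oracle(g)                  # argmin_{x in K} <g, x>
        eta = 2.0 / (t + 2)               # standard step-size schedule
        x = [(1 - eta) * xi + eta * yi for xi, yi in zip(x, y)]
    return x

# K = probability simplex: the LP minimizer is the vertex e_i
# with i = argmin_j g_j.
def simplex_oracle(g):
    i = min(range(len(g)), key=lambda j: g[j])
    return [1.0 if j == i else 0.0 for j in range(len(g))]

# Toy smooth objective: f(x) = 0.5 * ||x - c||^2 with c inside the
# simplex, so the constrained minimizer is c itself.
c = [0.5, 0.3, 0.2]
grad = lambda x: [xi - ci for xi, ci in zip(x, c)]
x = frank_wolfe(grad, simplex_oracle, [1.0, 0.0, 0.0], 2000)
```

Note that each iterate is a convex combination of vertices of ${K}$, so it stays feasible without any projection.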

As the Bubeck book says, this approach leverages the fact that linear programming is in some cases a simpler problem than projection. Also, one can prove guarantees for the Frank-Wolfe process for ${\beta}$-smooth functions that are similar to those for projected gradient descent.
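For the record, the guarantee for ${\beta}$-smooth convex ${f}$ (as I recall it from Bubeck's book; the constants may be slightly off) has the form

$\displaystyle f(x_t) - f(x^*) \leq \frac{2\beta R^2}{t+1},$

where ${R}$ is the diameter of ${K}$ and the step sizes are ${\eta_t = 2/(t+1)}$; this is the same ${O(\beta R^2/t)}$ rate that projected gradient descent achieves on smooth functions.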

I will write more later with (pointers to) bad examples for the basic gradient descent algorithm, which should give us more intuition about what is happening here.