

What is a simplified explanation and proof of the Johnson-Lindenstrauss lemma?

Answer by Alex Clemmer:

Why are random projections probably nearly as effective as carefully designed ones?

This question, which you asked in the “details” section, is the key question behind the JL lemma. And the unfortunate truth is that random projections aren’t really as effective as carefully designed ones. Or rather, they are, but only when your data are spread all over the place.

To see why, consider this toy example I drew. (Note that this example was picked because we can visualize 3 dimensions, and not because the math of the JL lemma is useful here! It is for intuition only.)
[Figure: two 3-dimensional pointsets; on the left the points are scattered throughout space, on the right they lie close to a plane.]

This brings us to our first point: in order to understand a “simplified” version of the JL lemma, we must understand:

What is the JL lemma really saying about projections?

Intuitively, the JL lemma says this: if you pick a random subspace and project onto it, the scaled pairwise distances between points will probably be preserved.

This will be true regardless of the pointset you have. But you’ll notice that in the example on the right, some planes seem to be “better” than others. In the pointset on the left, you could probably project onto any plane, and it would almost certainly be equally bad. But the data on the right seem to lie close to a plane, so intuitively, the planes “close” to the data seem to be “less bad.”

So on the one hand, the JL lemma is telling us that pairwise distances are probably not distorted. But on the other hand, geometry is telling us that some projections are “better” than others. And this mismatch tells us something interesting about random projections:
  • Pairwise distance does not tell us everything there is to know about dimensionality reduction. The JL lemma by itself cannot tell us why some projections are worse than others for the dataset on the right. All it tells us is that the scaled pairwise distances are not distorted too much.
  • But it’s still pretty useful. For example, if you were running approximate nearest neighbors or something, then you could pick a random projection and reduce the dimension significantly, but still be mostly correct (see the sketch below).

So in some sense, the JL lemma seems to work because pairwise distances are not quite as important for dimensionality reduction as we might have hoped. Still, it is interesting that they would be this resilient to random projection, and it seems worth wondering why.
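
To make the nearest-neighbor point concrete, here is a rough sketch in Python (my own illustration, not part of the original answer): it builds a clustered point set, applies a random Gaussian projection, and checks how often a nearest-neighbor query in the projected space still lands in the right cluster. All sizes, the 1/sqrt(k) scaling, and the nn_index helper are arbitrary choices of mine.

```python
# A rough sketch (my own, not from the answer): after a random projection,
# nearest-neighbor queries still mostly land in the "right" neighborhood.
# The clustered test data and every size here are made up.
import numpy as np

rng = np.random.default_rng(0)
n_clusters, per_cluster, d, k = 20, 25, 1000, 40

centers = rng.standard_normal((n_clusters, d))
labels = np.repeat(np.arange(n_clusters), per_cluster)
X = centers[labels] + 0.05 * rng.standard_normal((len(labels), d))

# One way to realize a random projection: a Gaussian matrix scaled by 1/sqrt(k).
R = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ R

def nn_index(points, i):
    """Index of the nearest neighbor of points[i] under Euclidean distance."""
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf
    return int(np.argmin(dists))

# Does the projected nearest neighbor come from the same cluster as the
# original nearest neighbor?
agree = sum(labels[nn_index(X, i)] == labels[nn_index(Y, i)]
            for i in range(len(labels)))
print(f"{agree}/{len(labels)} queries agree on the neighbor's cluster")
```

For a point set with this much structure, the count should come out at or very near 500 out of 500, even though we dropped from 1000 dimensions to 40.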

A more formal definition of the JL lemma


First I’ll state the JL lemma, and then demonstrate the intuition using a caricature of a proof, which should give you a good idea why it is true.

Proposition 1 (the JL lemma): For k \ge O(\log m / \varepsilon^2) (where m is the number of points and \varepsilon is our chosen error tolerance), with high probability a random projection map f : \mathbb{R}^d \rightarrow \mathbb{R}^k does not change the pairwise distance between any two points by more than a factor of (1 \pm \varepsilon), after scaling by \sqrt{d/k}.

There are a couple of things worth noticing about the JL lemma that might help you know when it’s applicable to your problems.
  • It makes statements about reduction from high-dimensional space to “medium”-dimensional space. It does not really work the same way for extremely low-dimensional space, like \mathbb{R}^1.
  • According to the JL lemma, our choice of k (recall we’re mapping from \mathbb{R}^d to \mathbb{R}^k) should depend only on the number of points m and our error tolerance \varepsilon, not on the original dimension d. (The sketch below checks the proposition numerically.)
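
Here is a hedged, numerical reading of Proposition 1 (again mine, not the answer’s): project m points from \mathbb{R}^d onto a random k-dimensional subspace, rescale by \sqrt{d/k}, and look at how much any pairwise distance moves. The constant 8 in the choice of k and the QR-based construction of the random subspace are just one reasonable realization, not the only one.

```python
# Empirical check of the (1 +/- eps) distortion claim. The constant 8 in the
# choice of k is a guess; any sufficiently large constant behaves similarly.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
m, d, eps = 200, 2000, 0.25
k = int(np.ceil(8 * np.log(m) / eps**2))        # k = O(log m / eps^2)

X = rng.standard_normal((m, d))                 # m arbitrary points in R^d

# Orthonormal basis for a random k-dimensional subspace of R^d.
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
Y = (X @ Q) * np.sqrt(d / k)                    # project, then rescale by sqrt(d/k)

ratios = pdist(Y) / pdist(X)                    # projected / original pairwise distances
print(f"k = {k}; distance ratios lie in [{ratios.min():.3f}, {ratios.max():.3f}]")
# With high probability every ratio lands inside (1 - eps, 1 + eps).
```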

A Caricature of a Proof

[Note: I believe this “caricature” is due to Avrim Blum, but I can’t find it, and so can’t be sure.]

The gist of the proof is this: the JL lemma follows from the fact that the squared length of a vector is sharply concentrated around its mean when projected onto a random k-dimensional subspace. What we aim to show in this section is why this would even be true or useful for proving the JL lemma.

We will begin with something that doesn’t really resemble the JL lemma, but by the end it will hopefully become clear what the relationship is.

First, let’s say we randomly sample m points that lie on the surface of a d-dimensional sphere. These points can also be seen as random unit-length vectors.
[Figure: m random unit-length vectors drawn from the surface of a d-dimensional sphere.]

We would like to look at how individual coordinates behave. So we will simplify the picture somewhat.

Since they lie on a d-dimensional sphere, they are points in \mathbb{R}^d. But notice that, regardless of the size of d, these points lie in an m-dimensional subspace, because there are only m of them. So we’ll say that they lie in \mathbb{R}^m.

Let’s start by looking at the first coordinate. Since the squared coordinates of a unit vector sum to 1 and, by symmetry, each contributes equally in expectation, we know that \mathbb{E}[x^2_1] = 1/m. Note that as m gets bigger, the concentration rises precipitously, meaning that the actual values of the squared coordinates will be sharply concentrated around this value.
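
As a quick numerical sanity check (not part of the original argument, and with arbitrary sample sizes), we can sample random unit vectors and watch the squared first coordinate average out to 1/m, with a spread that shrinks as m grows.

```python
# Check: for random unit vectors in R^m, the squared first coordinate averages
# to 1/m, and its spread shrinks as m grows. Sample counts are arbitrary.
import numpy as np

rng = np.random.default_rng(2)
for m in (10, 100, 1000):
    g = rng.standard_normal((20000, m))
    units = g / np.linalg.norm(g, axis=1, keepdims=True)   # random points on the sphere
    x1_sq = units[:, 0] ** 2
    print(f"m = {m:4d}   mean of x_1^2 = {x1_sq.mean():.5f}   1/m = {1/m:.5f}   "
          f"std = {x1_sq.std():.5f}")
```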

Now we’d like to extend this intuition to all of the coordinates. Since the squared coordinates are sharply concentrated around 1/m for large m, the coordinates kind of sort of look iid. They’re not really, because if one coordinate is large the others are necessarily small, but this “sharpness” means that they are “almost” iid. This is not a real proof, right? So let’s say they’re iid for illustrative purposes.

If they’re iid, then we can apply our favorite Chernoff/Hoeffding bound to say that, with really high probability, all the coordinates will be really really close to 1/m in size. More precisely, \Pr[ |(x_1^2 + \ldots + x_k^2) - k/m| \ge \varepsilon k/m ] \le e^{-O(k \varepsilon^2)}. Remember, this is if they’re iid, which of course they aren’t, but they kind of look like they are. This is intuition.
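
The same kind of informal check works for the sum of the first k squared coordinates (again my own sketch, with made-up constants): empirically, the deviation event in the bound above is very rare.

```python
# Check: |(x_1^2 + ... + x_k^2) - k/m| >= eps * k/m almost never happens.
# All constants here are illustrative.
import numpy as np

rng = np.random.default_rng(3)
m, k, eps, trials = 1000, 300, 0.25, 20000

g = rng.standard_normal((trials, m))
units = g / np.linalg.norm(g, axis=1, keepdims=True)
head = (units[:, :k] ** 2).sum(axis=1)              # x_1^2 + ... + x_k^2 per vector

fail_rate = np.mean(np.abs(head - k / m) >= eps * k / m)
print(f"empirical P[|sum - k/m| >= eps*k/m] = {fail_rate:.5f}")   # should be tiny
```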

At this point we’re ready to look at random projections. The basic idea is that we’re going to take a random plane and project our unit vectors onto it. Here is a series of “random” examples that I made up (note that the plane of projection is translated to the origin).
[Figure: unit vectors projected onto several “random” planes through the origin, with dotted lines marking the distances between the projected vectors.]
But it turns out that we can look at vector projections in a more interesting way. Projecting onto a random k-dimensional subspace of \mathbb{R}^m is basically the same thing as randomly rotating the vector and then reading off the first k coordinates.
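
This equivalence is easy to check numerically; the small sketch below (my own framing, with arbitrary sizes) compares “rotate, then keep the first k coordinates” with an explicit orthogonal projection onto a random k-dimensional subspace and confirms the lengths match.

```python
# Sanity check: rotating at random and keeping the first k coordinates gives the
# same length as orthogonally projecting onto a random k-dimensional subspace.
import numpy as np

rng = np.random.default_rng(4)
m, k = 50, 10

# A random rotation: an orthogonal matrix from the QR decomposition of a Gaussian.
U, _ = np.linalg.qr(rng.standard_normal((m, m)))
v = rng.standard_normal(m)

first_k = (U.T @ v)[:k]            # (a) rotate v into U's coordinate frame, keep k coords
B = U[:, :k]                       # orthonormal basis of a random k-dim subspace
projection = B @ (B.T @ v)         # (b) orthogonal projection of v onto that subspace

print(np.allclose(np.linalg.norm(first_k), np.linalg.norm(projection)))   # True
```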

Now we get to the JL lemma: for the JL lemma to be basically true, the distance between two projected vectors must be almost the same as the original distance, once we scale it accordingly. In the picture above, the projected vectors each have a dotted line going between them, depicting the distance. In other words, that vector needs to have almost the same scaled length as it would in the original example.

Unsurprisingly, it turns out that this is true. Here’s the math that justifies it.

Using the above Chernoff/Hoeffding bound, we see that for k = O(\frac{1}{\varepsilon^2} \log m), with probability at least 1 - 1/m^p (for essentially whatever positive constant p you like), the projection onto the first k coordinates has length \sqrt{k/m} \cdot (1 \pm \varepsilon).

Now, let’s look at the distance between some pair of vectors. Let’s say our regular non-projected vectors are called \vec{v}_1 and \vec{v}_2. Then the vector that represents the dotted line between the original vectors would be \vec{d} = \vec{v}_2 - \vec{v}_1.

And here’s the punch line. We can take that “distance vector” that goes between our original vectors and apply the same projection argument as above. Thus, with high probability (by a union bound over all pairs), every one of these “distance vectors” \vec{d} projects to length \sqrt{\frac{k}{m}} \cdot (1 \pm \varepsilon) \, ||\vec{d}||_2.
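
To close the loop, here is one last sketch (mine, with a guessed constant, and with the number of points kept separate from the ambient dimension m rather than both being m as in the simplification above): rotate the point set at random, keep the first k coordinates, and check that every difference vector \vec{d} comes out with length close to \sqrt{k/m} \cdot ||\vec{d}||_2.

```python
# Check the punch line: after a random rotation and truncation to k coordinates,
# every pairwise difference vector has length close to sqrt(k/m) * ||d||_2.
# The constant 8 in the choice of k is a guess, as before.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
m, n_points, eps = 2000, 60, 0.3
k = int(np.ceil(8 * np.log(n_points) / eps**2))

V = rng.standard_normal((n_points, m))              # arbitrary points in R^m
U, _ = np.linalg.qr(rng.standard_normal((m, m)))    # random rotation
P = (V @ U)[:, :k]                                  # rotate, keep the first k coordinates

worst = 0.0
for i, j in combinations(range(n_points), 2):
    d_orig = np.linalg.norm(V[j] - V[i])
    d_proj = np.linalg.norm(P[j] - P[i])
    worst = max(worst, abs(d_proj / (np.sqrt(k / m) * d_orig) - 1))
print(f"k = {k}; worst relative distortion = {worst:.3f} (want <= eps = {eps})")
```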

Toward more general arguments

So that’s the intuition for why the JL lemma is true. As we said before, the JL lemma follows from the fact that the squared length of a vector is sharply concentrated around its mean when projected onto a random k-dimensional subspace. This last section, I hope, explains roughly why that fact helps us prove the JL lemma, and begins to suggest why it holds more generally.

If you want to know more, I hope this arms you nicely for working through the proofs floating around out there. I would say that Alexandre Passos’s answer is a reasonable start. An excellent treatment exists in Foundations of Machine Learning, by Mohri et al. It’s an excellent book anyway, and you should read it just because.