
Have they figured out what causes double descent yet?


I don't know if it's a generalized result, but the Circuits team at Anthropic has a very compelling thesis: the first phase of descent corresponds to the model memorizing data points, the second phase corresponds to it shifting geometrically toward learning "features".

Here a "feature" might be seen as a direction in an abstract, very high-dimensional vector space. The team is pretty deep in investigating the idea of superposition, where individual neurons encode for multiple concepts. They experiment with a toy model and toy data set where the latent features are represented explicitly and then compressed into a small set of data dimensions. This forces superposition. Then they show how that superposition looks under varying sizes of training data.
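A minimal numpy sketch of what superposition looks like (not the paper's actual trained model — the sizes and the ReLU readout are hypothetical, loosely following the toy setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 6, 2  # hypothetical sizes: more latent features than dimensions

# Random unit-length feature directions crammed into the small hidden space.
W = rng.normal(size=(n_dims, n_features))
W /= np.linalg.norm(W, axis=0)

# A sparse input with a single active latent feature.
x = np.zeros(n_features)
x[3] = 1.0

h = W @ x                         # compress 6 features into 2 dimensions
x_hat = np.maximum(W.T @ h, 0.0)  # ReLU readout, loosely following the toy model

# Feature 3 is recovered at full strength; the other features pick up
# interference wherever their directions overlap with feature 3's.
```

With 6 directions squeezed into 2 dimensions they can't be orthogonal, so recovering one feature necessarily leaks into the others — that interference is the cost of superposition.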

It's obviously a toy model, but it's a compelling idea. At least for any model which might suffer from superposition.

https://transformer-circuits.pub/2023/toy-double-descent/ind...


> The team is pretty deep in investigating the idea of superposition, where individual neurons encode for multiple concepts.

Wonder if it's a matter of perspective - that is, of transform. Consider an image. Most real-world images have pixels with high locality - distant pixels are less correlated than immediate neighbours.

Now take an FFT of that. You get an equivalent 2D image containing the same information, but suddenly each pixel contains information about every pixel of the original image! You can do some interesting things there, like erasing the centre of the spectrum (the higher frequencies, in the unshifted layout), which will give you a blurred version of the original image when you run the inverse FFT to get proper pixels again.
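A quick numpy sketch of that low-pass trick (the image and cutoff are made up for illustration):

```python
import numpy as np

# A smooth toy "image": a 2D Gaussian bump, so neighbouring pixels are
# highly correlated.
x = np.linspace(-1, 1, 64)
xx, yy = np.meshgrid(x, x)
img = np.exp(-(xx**2 + yy**2) / 0.1)

# FFT: every frequency coefficient mixes information from all pixels.
# The DC coefficient F[0, 0], for instance, is the sum of the whole image.
F = np.fft.fft2(img)

# Erase the centre of the unshifted spectrum (the high frequencies),
# keeping only a low-frequency band near the corners...
k = 8  # hypothetical cutoff
F_filtered = F.copy()
F_filtered[k:-k, k:-k] = 0.0

# ...and inverse-FFT back to pixels: a blurred version of the original.
blurred = np.real(np.fft.ifft2(F_filtered))
```

Note that in the unshifted `fft2` layout the low frequencies sit at the corners, so "erasing the centre" really does remove the high frequencies.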


I think that’s basically correct: the FFT representation is a better feature representation.


There was actually a very recent blog post claiming that statistical mechanics can explain double descent: https://calculatedcontent.com/2024/03/01/describing-double-d...

Some more detail here: https://calculatedcontent.com/2019/12/03/towards-a-new-theor...


Not an expert, but this paper explores double descent with simple models. The interpretation there: extending into the overparameterized regime permits optimization toward small-norm weights, which generalize well again. Does that explain DD generally? Does it apply to other models (e.g. DNNs)?

https://arxiv.org/pdf/2303.14151.pdf
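The small-norm intuition can be sketched with overparameterized linear regression (a hypothetical toy, not the paper's experiments): with more parameters than data points there are infinitely many interpolating solutions, and the pseudoinverse picks the minimum-norm one.

```python
import numpy as np

rng = np.random.default_rng(1)

# Overparameterized linear regression: 20 parameters, only 5 samples.
X = rng.normal(size=(5, 20))
y = rng.normal(size=5)

# Infinitely many weight vectors fit the data exactly; the pseudoinverse
# returns the minimum-norm interpolating solution.
w_min = np.linalg.pinv(X) @ y

# Any other interpolant differs by a null-space component of X,
# which is orthogonal to w_min and so can only increase the norm.
e0 = np.eye(20)[0]
null_dir = e0 - np.linalg.pinv(X) @ (X @ e0)  # projection of e0 onto null(X)
w_other = w_min + null_dir
```

Both `w_min` and `w_other` fit the training data perfectly, but `w_min` has the smaller norm — the implicit bias the paper's interpretation relies on.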


No. We don't know. My favorite hypothesis: SGD is...well, stochastic. Meaning you're not optimizing w.r.t. the training corpus, but a tiny subset, so your gradient isn't quite right. Over-training allows you to bulldoze over local optima and converge toward the true distribution rather than drive around in a local over-fitting basin.


You can get it with full gradient descent though... https://www.nature.com/articles/s41467-020-14663-9

Honestly the fact that there doesn't seem to be a good explanation for this makes me think that we just fundamentally don't understand learning.



