
The article contains no proof of Theorem 3.1, and finding counterexamples seems trivial. Adult male weight can be modeled by N(85, 20). You can recursively "train" that model on data it generates without having it collapse: it will stay stationary as long as the samples are large enough.
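A quick sketch of the claim above (my own illustration, not anything from the article): recursively refit a Gaussian to samples drawn from the previous generation's fit, starting from the hypothetical N(85, 20) model, with a large sample per generation.

```python
import random
import statistics

random.seed(0)
mu, sigma = 85.0, 20.0
n = 100_000  # "large enough" sample size per generation

for generation in range(50):
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    mu = statistics.fmean(samples)     # refit the mean
    sigma = statistics.stdev(samples)  # refit the s.d.

print(mu, sigma)  # stays close to 85 and 20
```

With n this large the per-generation steps are tiny (on the order of σ/√n), so after 50 generations the fit is still essentially N(85, 20).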


I believe that counterexample only works in the limit where the sample size goes to infinity. Every finite sample will have a fitted μ that differs from the true mean almost surely. (Of course μ will still tend to be very close to the true mean for large samples, but still slightly off.)

So the sequence of μₙ will perform a kind of random walk that can stray arbitrarily far from the starting value, and it is almost sure to eventually do so.


Fair point about the mean, but I don't see how the random walk causes the standard deviation to shrink towards zero.


I agree. The authors generate a dataset of a similar size to the original and then train on that repeatedly (e.g. for multiple epochs). That's not what you need to do to get a new model trained on the teacher's knowledge. You need to ask the teacher to generate fresh samples every time; otherwise your generated dataset is not very representative of the totality of the teacher's knowledge. Generating fresh samples every time would (in the infinite limit) solve the collapse problem.


Agreed, that's what I struggle to see as well. It's not really clear why the variance couldn't stay the same or go to infinity instead. Perhaps it does follow from some property of the underlying Gamma/Wishart distributions.
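One way to build intuition (a sketch reusing the toy N(85, 20) Gaussian example upthread, not the paper's setup): with a small sample per generation, the fitted s.d. drifts downward, because E[s²] = σ² but E[log s²] < log σ² by Jensen's inequality, so log σ² is a random walk with negative drift.

```python
import random
import statistics

random.seed(0)
mu, sigma = 85.0, 20.0
n = 10  # deliberately small sample per generation

for generation in range(400):
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)

print(sigma)  # collapses far below the original 20
```

This suggests the variance shrinking isn't an accident of the walk wandering to zero; for small n there is a systematic multiplicative drift toward it, while a large n makes the drift per generation negligible.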


Does the Supplementary Information (starting on p. 4, for example) help?

https://static-content.springer.com/esm/art%3A10.1038%2Fs415...

In your counterexample, can you quantify "as long as the samples are large enough"? How many samples do you need to keep the s.d. from shrinking?


Maybe. "Overall, this only shows us how far on average we go from the original distribution, but the process can only ’terminate’ if the estimated variance at a certain generation becomes small enough, i.e. we effectively turn into a delta function." If I understand correctly, the variance is modeled as a random walk that will sooner or later reach zero. I'm not sure I buy that, because the variance "walks" orders of magnitude slower than the mean and is much more robust for large sample sizes.
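For what it's worth, the relative "walk speeds" can be checked with a quick Monte Carlo (a sketch reusing the hypothetical N(85, 20) example from upthread, not anything from the paper): estimate the typical one-generation change in the fitted mean versus the fitted s.d. when refitting to n samples.

```python
import random
import statistics

random.seed(1)
mu, sigma, n, trials = 85.0, 20.0, 1000, 2000

mean_steps, sd_steps = [], []
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    mean_steps.append(statistics.fmean(xs) - mu)   # one-generation step of the fitted mean
    sd_steps.append(statistics.stdev(xs) - sigma)  # one-generation step of the fitted s.d.

mean_step_sd = statistics.stdev(mean_steps)  # ≈ σ/√n
sd_step_sd = statistics.stdev(sd_steps)      # ≈ σ/√(2n)
print(mean_step_sd, sd_step_sd)
```

(For Gaussian samples the two step sizes scale as σ/√n and roughly σ/√(2n) respectively, so readers can judge for themselves how the walk speeds compare.)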





