
The article contains no proof of Theorem 3.1, and finding counterexamples seems trivial. Adult male weight can be modeled by N(85, 20). You can recursively "train" that model on data it generates without having it collapse: it will stay stationary as long as the samples are large enough.
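A quick sketch of the claim above (my own illustration, not anything from the article): recursively refit a Gaussian to samples drawn from the previous generation's fit, starting from the hypothetical N(85, 20) model, with a large sample per generation.

```python
import random
import statistics

random.seed(0)
mu, sigma = 85.0, 20.0
n = 100_000  # "large enough" sample size per generation

for generation in range(50):
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    mu = statistics.fmean(samples)     # refit the mean
    sigma = statistics.stdev(samples)  # refit the s.d.

print(mu, sigma)  # stays close to 85 and 20
```

With n this large the per-generation steps are tiny (on the order of σ/√n), so after 50 generations the fit is still essentially N(85, 20).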


I believe that counterexample only works in the limit where the sample size goes to infinity. Every finite sample will have a fitted μ that differs from the true mean almost surely. (Of course μ will still tend to be very close to the true mean for large samples, but still slightly off.)

So the sequence of μₙ will perform a kind of random walk that can stray arbitrarily far from the starting value, and it is almost sure to eventually do so.


Fair point about the mean, but I don't see how the random walk causes the standard deviation to shrink towards zero.


I agree. The authors generate a dataset of a similar size to the original and then train on that repeatedly (e.g. for multiple epochs). That's not what you need to do to get a new model trained on the teacher's knowledge. You need to ask the teacher to generate fresh samples every time; otherwise your generated dataset is not very representative of the totality of the teacher's knowledge. Generating fresh samples every time would (in the infinite limit) solve the collapse problem.


Agreed, that's what I struggle to see as well. It's not really clear why the variance couldn't stay the same or go to infinity instead. Perhaps it does follow from some property of the underlying Gamma/Wishart distributions.
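One way to build intuition (a sketch reusing the toy N(85, 20) Gaussian example upthread, not the paper's setup): with a small sample per generation, the fitted s.d. drifts downward, because E[s²] = σ² but E[log s²] < log σ² by Jensen's inequality, so log σ² is a random walk with negative drift.

```python
import random
import statistics

random.seed(0)
mu, sigma = 85.0, 20.0
n = 10  # deliberately small sample per generation

for generation in range(400):
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)

print(sigma)  # collapses far below the original 20
```

This suggests the variance shrinking isn't an accident of the walk wandering to zero; for small n there is a systematic multiplicative drift toward it, while a large n makes the drift per generation negligible.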


Does the Supplementary Information (starting on p. 4, for example) help?

https://static-content.springer.com/esm/art%3A10.1038%2Fs415...

In your counterexample, can you quantify "as long as the samples are large enough"? How many samples do you need to keep the s.d. from shrinking?


Maybe. "Overall, this only shows us how far on average we go from the original distribution, but the process can only ’terminate’ if the estimated variance at a certain generation becomes small enough, i.e. we effectively turn into a delta function." If I understand correctly, the variance is modeled as a random walk that will sooner or later reach zero. I'm not sure I buy that, because the variance "walks" orders of magnitude slower than the mean and is much more robust for large sample sizes.
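For what it's worth, the relative "walk speeds" can be checked with a quick Monte Carlo (a sketch reusing the hypothetical N(85, 20) example from upthread, not anything from the paper): estimate the typical one-generation change in the fitted mean versus the fitted s.d. when refitting to n samples.

```python
import random
import statistics

random.seed(1)
mu, sigma, n, trials = 85.0, 20.0, 1000, 2000

mean_steps, sd_steps = [], []
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    mean_steps.append(statistics.fmean(xs) - mu)   # one-generation step of the fitted mean
    sd_steps.append(statistics.stdev(xs) - sigma)  # one-generation step of the fitted s.d.

mean_step_sd = statistics.stdev(mean_steps)  # ≈ σ/√n
sd_step_sd = statistics.stdev(sd_steps)      # ≈ σ/√(2n)
print(mean_step_sd, sd_step_sd)
```

(For Gaussian samples the two step sizes scale as σ/√n and roughly σ/√(2n) respectively, so readers can judge for themselves how the walk speeds compare.)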





