
In the text they say you need to cram all information needed to predict the next token into a single 6KB word embedding, but isn’t that wrong?

Rather, isn’t the autoregressively predicted next token a combination (weighted by attention) of all the 6KB word embeddings in the attention window?

So the size of memory where all information for next-token prediction needs to be "crammed into" is more like window_size * 6KB, right?
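To make the point concrete, here is a minimal single-head attention sketch in NumPy (sizes are illustrative assumptions, not from any particular model: 1536 floats * 4 bytes ≈ 6KB per embedding). It shows that the output at the final position is a weighted mix of value vectors from every token in the window, so the information available for predicting the next token scales with window_size * 6KB rather than a single 6KB vector.

```python
import numpy as np

# Illustrative sizes: 1536 floats * 4 bytes ~= 6 KB per token embedding.
window_size, d_model = 8, 1536
rng = np.random.default_rng(0)
X = rng.standard_normal((window_size, d_model))  # embeddings in the window

# Random projection matrices standing in for learned weights.
Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
Wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
Wv = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the window
out = weights @ V  # each output row mixes ALL window_size value rows

# The vector used to predict the next token draws on every position:
print(out[-1].shape)
print(f"context bytes available: {window_size * d_model * 4}")
```

The last row of `out` depends on all `window_size` embeddings through the softmax weights, which is the sense in which the usable context is window_size * 6KB.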



This is probably what the author meant but glossed over. I can see why the original wording looks off, though.



