
In the text they say you need to cram all information needed to predict the next token into a single 6KB word embedding, but isn’t that wrong?

Rather, isn’t the autoregressively predicted next token a combination (weighted by attention) of all the 6KB word embeddings in the attention window?

So the size of memory where all information for next-token prediction needs to be "crammed into" is more like window_size * 6KB, right?
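To make the point concrete, here is a minimal single-head attention sketch in NumPy (sizes are illustrative assumptions, not from any particular model: 1536 floats * 4 bytes ≈ 6KB per embedding). It shows that the output at the final position is a weighted mix of value vectors from every token in the window, so the information available for predicting the next token scales with window_size * 6KB rather than a single 6KB vector.

```python
import numpy as np

# Illustrative sizes: 1536 floats * 4 bytes ~= 6 KB per token embedding.
window_size, d_model = 8, 1536
rng = np.random.default_rng(0)
X = rng.standard_normal((window_size, d_model))  # embeddings in the window

# Random projection matrices standing in for learned weights.
Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
Wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
Wv = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the window
out = weights @ V  # each output row mixes ALL window_size value rows

# The vector used to predict the next token draws on every position:
print(out[-1].shape)
print(f"context bytes available: {window_size * d_model * 4}")
```

The last row of `out` depends on all `window_size` embeddings through the softmax weights, which is the sense in which the usable context is window_size * 6KB.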



This is probably what the author meant but glossed over. I can see why the original wording looks off, though.



