A better metaphor would be to say it compresses the internet and creates a Markov chain based on that compression. Then, to generate text, it compresses your prompt so it can find it in the Markov chain, moves to the next step, lossily decompresses that step into a text token, and appends it. The lossy decompression here is the temperature: higher temperature means lossier decompression and more random words, but since the loss happens in "meaning" space, the random words still carry very similar meaning to before.
That isn't a perfect metaphor, but it explains very well how it can do most of the things it can do. The lossy compression means it can take large prompts and capture their essence instead of trying to look them up literally, and the lossy decompression lets it vary its output, so the text moves in slightly different directions instead of just repeating text it has seen. The magical bit is that this compression and decompression are much smarter than anything before: the model parses text into a format much closer to its meaning, and that lets it do the above much more intelligently.
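As a toy illustration of the temperature-as-lossy-decompression idea (not how any real sampler is implemented, just the standard temperature-scaled softmax over made-up logits):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Pick a token index from raw logits after temperature scaling.

    Dividing logits by the temperature before the softmax flattens the
    distribution when temperature > 1 (lossier, more random picks) and
    sharpens it when temperature < 1 (almost always the top token).
    """
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    r = rng.random()
    cumulative = 0.0
    for i, e in enumerate(exps):
        cumulative += e / total
        if r <= cumulative:
            return i
    return len(exps) - 1

# At temperature 0.01 the top logit wins essentially every draw;
# at temperature 10 all three tokens get comparable probability.
rng = random.Random(0)
cold = [sample_with_temperature([2.0, 1.0, 0.5], 0.01, rng) for _ in range(100)]
hot = [sample_with_temperature([2.0, 1.0, 0.5], 10.0, rng) for _ in range(100)]
```

The point of the sketch is just that "lossiness" is a dial: the same model state decompresses to different nearby words depending on the temperature.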
Edit: Thinking about it a bit, maybe you could make these models way cheaper to run by making them work as a compression to meaning, rather than as the huge models they are now? They do have an internal understanding/meaning of the tokens they receive, so it should be possible to create a compression/decompression function based on these models that transforms text into its world-model state, and once we start working with world-model states, things should be super cheap relative to what we have now.
Also, maybe it doesn't do lossy decompression into words with similar meaning; that is just another way I could see the models becoming smaller and cheaper while keeping their essence. The Markov chain step could be all it uses currently. But it definitely creates that space and that Markov chain, because it parses the previous thousand or so tokens and uses those to guess the next token, and that is a Markov chain. It just has a very sophisticated way of parsing those thousand tokens into a logical format.
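For what it's worth, the bare Markov-chain step described above, minus all the learned compression, is tiny to sketch. The corpus and context length here are made up for illustration:

```python
from collections import Counter, defaultdict

def train_markov(tokens, order=2):
    """Count next-token frequencies for each context of `order` tokens."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        model[context][tokens[i + order]] += 1
    return model

def most_likely_next(model, context):
    """Return the most frequent continuation of `context`, or None if unseen."""
    counts = model.get(tuple(context))
    if not counts:
        return None
    return counts.most_common(1)[0][0]

tokens = "the cat sat on the mat the cat ran on the mat".split()
model = train_markov(tokens, order=2)
# The context ("on", "the") was followed by "mat" both times it appeared,
# so "mat" is the chain's prediction for that context.
```

The difference the comment is pointing at is that a real LLM replaces the literal `tuple(tokens...)` lookup with a learned, lossy representation of the context.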
> creates a Markov chain based on that compression
I dislike that interpretation. It suggests the model builds a very basic statistical model, but a very basic statistical model simply wouldn't be able to do what these models can do.
Or alternatively, if you want to consider the model as a Markov chain mapping the probability from the previous four thousand tokens to the next token, then the space is astronomically large. Beyond astronomically large, beyond even economically large: there are ~50,000^4096 possible input states.
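Python's arbitrary-precision integers make it easy to sanity-check that figure; 50,000^4096 works out to a number with about 19,000 decimal digits, dwarfing the often-quoted ~10^80 atoms in the observable universe:

```python
# One state per possible 4096-token window over a ~50,000-token vocabulary.
states = 50_000 ** 4096

# 50,000^4096 = 5^4096 * 10^16384, which has 19,247 decimal digits.
digits = len(str(states))  # 19247
```

So a literal transition table over raw contexts is not just impractical; almost every state it contains could never be visited even once.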
> but a very basic statistical model simply wouldn't be able to do what these models can do.
Why do you think that? Why wouldn't a basic statistical continuation of a text's logic do what the current model does? There are trillions of conversations out there for it to draw on when continuing the text: people playing theatre, people roleplaying, tutorials, people playing opposite games, people brainstorming, etc. Create a parser that can reduce those down to logic, then make a Markov chain based on that, and I have no problem seeing the current ChatGPT skills manifesting from that.
> Or alternatively, if you want to consider the model as a Markov chain mapping the probability from the previous four thousand tokens to the next token, then the space is astronomically large. Beyond astronomically large, beyond even economically large: there are ~50,000^4096 possible input states.
Yes, that is the novel thing: it compresses the states down to something manageable without losing the essence of the text, and then builds a model of the likely next token over that compressed space.
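A toy sketch of that idea, with a deliberately crude stand-in for the learned compression (a bag of words instead of a dense meaning vector — purely illustrative, nothing like what a real model learns):

```python
from collections import Counter, defaultdict

def compress(context):
    """Toy lossy 'compression': reduce a context to its bag of words.

    A real model maps the context to a dense vector that preserves meaning;
    this stand-in only shows the mechanism: many distinct literal contexts
    collapse into one manageable state.
    """
    return frozenset(context)

def train_compressed(tokens, order=3):
    """A Markov chain over compressed states instead of literal token windows."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - order):
        state = compress(tokens[i:i + order])
        model[state][tokens[i + order]] += 1
    return model

model = train_compressed("the cat sat on the mat".split(), order=3)

# A reordered context like "sat cat the" compresses to the same state as
# the seen context "the cat sat", so the chain can still predict "on" for
# a literal window it has never encountered.
state = compress(["sat", "cat", "the"])
```

The lossy step is what buys generalization: contexts that were never seen verbatim still land on a state the chain knows about, at the cost of throwing some information away.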