> . But the weights are trained by the game transcripts to produce activation patterns that could represent the board state
A slight phrasing thing here just to be clear - the model is not trained to produce a representation of the board state explicitly. It is never given [moves] = [board state] and it is not trained on correctly predicting the board state by passing it in like [state] + move. The only thing that is trained on that is the probes, which is done after the training of OthelloGPT and does not impact what the model does.
Their argument is that the state is represented in the activation patterns and that this is then used to determine the next move, are you countering that to suggest it instead may be "local position patterns learnt during training. Positional representation (attention) of the N-1 tokens in the autoregressive task"?
If the pattern of activations did not correspond to the current board state, modifying those activations to produce a different internal model of the board wouldn't work. I also don't follow how the activations would mirror the expected board state.
What I am trying to say is that the game state is encoded as patterns in the attention matrices of the N-1 tokens. So yes, not explicitly trained to represent the game state but that game state is encoded in the tokens and their positions.
A slight phrasing thing here just to be clear - the model is not trained to produce a representation of the board state explicitly. It is never given [moves] = [board state] and it is not trained on correctly predicting the board state by passing it in like [state] + move. The only thing that is trained on that is the probes, which is done after the training of OthelloGPT and does not impact what the model does.
Their argument is that the state is represented in the activation patterns and that this is then used to determine the next move, are you countering that to suggest it instead may be "local position patterns learnt during training. Positional representation (attention) of the N-1 tokens in the autoregressive task"?
If the pattern of activations did not correspond to the current board state, modifying those activations to produce a different internal model of the board wouldn't work. I also don't follow how the activations would mirror the expected board state.