(Probably bc I'm dumb) I'm very confused by this paper. The dimensions are all over the place: first they say M is an N x d x d matrix, then it becomes N x d. And then they try to scale M with g_out and add it to E_attn, which is a T x d matrix??? Are the gates scalars, vectors, or matrices? If they're matrices, the dimensions don't line up with M either.
Do they believe it's likely to scale as well as "standard" LLMs? Or is this another state space model moment, where it turns out to be equivalent to them?
Table 2's results are interesting. If the paper is to be believed, just adding the memory model seems to improve reasoning tasks across the board.
That said, I do wonder if this is a bit of a mirage. At 1.7B parameters, they are 3 orders of magnitude down from 4o (well, that isn't completely fair; I don't know what the average 'expert' size is in 4o, but I doubt the authors are doing mixture of experts at only 1.7B). A model can 'memorize' way more shit with that many parameters.
There are many papers that use a recurrence across sub-sequences and attention within sub-sequences. Google did this with Infini-Attention and one of the variants from the Titans paper. However, I think the earliest example of this is Transformer-XL.
There are some interesting connections between the two. If you remove the softmax from the attention formula, you end up with (unnormalized) linear attention, which has an equivalent recurrent form.
I haven't read it, but the Mamba 2 paper claims to establish a stronger connection.
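A minimal numerical sketch of the linear attention point above (ignoring the feature map and normalizer that linear attention papers usually add): with the softmax removed, causal attention computed in parallel matches a recurrence that carries a running d x d state. Shapes and names here are my own, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

# Parallel form: raw dot-product scores with a causal (lower-triangular)
# mask, i.e. attention with the softmax removed.
scores = np.tril(Q @ K.T)          # (T, T)
out_parallel = scores @ V          # (T, d)

# Recurrent form: carry a running state S = sum_{s<=t} k_s v_s^T,
# then read it out with the query at each step.
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    S += np.outer(K[t], V[t])      # constant-size state update
    out_recurrent[t] = Q[t] @ S    # readout, O(d^2) per token

assert np.allclose(out_parallel, out_recurrent)
```

The recurrent form is what makes it "RNN-like": inference is O(1) state per token instead of a growing KV cache.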