(Probably bc I'm dumb) I'm very confused by this paper. The dimensions are all over the place: first they say M is an N x d x d matrix, then it becomes N x d. And then they try to scale M with g_out and add it to E_attn, which is a T x d matrix??? Are the gates scalars, vectors, or matrices? If they're matrices, the dimensions don't line up with M either.
Do they believe it's likely to scale as well as "standard" LLMs? Or is this another state space model moment, where it turns out to be equivalent to them?
Table 2's results are interesting. If the paper is to be believed, just adding the memory model seems to improve reasoning tasks across the board.
That said, I do wonder if this is a bit of a mirage. At 1.7B parameters, they are 3 orders of magnitude down from 4o (well, that isn't completely fair; I don't know what the average 'expert' size is in 4o, but I doubt the authors are doing mixture of experts at only 1.7B). A model can 'memorize' way more shit with that many parameters.
There are many papers that use a recurrence across sub-sequences and attention within sub-sequences. Google did this with Infini-Attention and one of the variants from the Titans paper. However, I think the earliest example of this is Transformer-XL.
There are some interesting connections between the two. If you remove the softmax from the attention formula, you end up with (unnormalized) linear attention, which has an equivalent recurrent form.
I haven't read it, but the Mamba 2 paper claims to establish a stronger connection.
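A minimal numerical sketch of the linear attention point above (ignoring the feature map and normalizer that linear attention papers usually add): with the softmax removed, causal attention computed in parallel matches a recurrence that carries a running d x d state. Shapes and names here are my own, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

# Parallel form: raw dot-product scores with a causal (lower-triangular)
# mask, i.e. attention with the softmax removed.
scores = np.tril(Q @ K.T)          # (T, T)
out_parallel = scores @ V          # (T, d)

# Recurrent form: carry a running state S = sum_{s<=t} k_s v_s^T,
# then read it out with the query at each step.
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    S += np.outer(K[t], V[t])      # constant-size state update
    out_recurrent[t] = Q[t] @ S    # readout, O(d^2) per token

assert np.allclose(out_parallel, out_recurrent)
```

The recurrent form is what makes it "RNN-like": inference is O(1) state per token instead of a growing KV cache.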