Do large language models need all those layers?

reqo · on Dec 15, 2023

>Finding that 70% of attention heads and 20% of feed-forward networks can be excised with minimal effect on in-context learning suggests that large language models are undertrained.

I thought that it was common knowledge that LLMs are undertrained, none of the publicly available loss graphs show any sign of convergence!

novaRom · on Dec 15, 2023

It's a disadvantage of current SOTA models: they are easy to train, but they must be large, wasting lots of weights, in order to generalize well. Maybe another architecture, a successor to the transformer, will be more economical: fewer weights with more skills and knowledge.

two_in_one · on Dec 16, 2023

I think I saw an article somewhere years ago claiming that size is needed for training. Which probably means that after training the model can be optimized to minimize its size. Pruning? However, a smaller model cannot be trained that well. From my experience with image processing, the bigger the better, and they all have their limits. Nothing new here.

versteegen · on Dec 15, 2023

Maybe that's why the Phi models do so well for their size. I'm guessing they may have been trained close to convergence, but the loss graphs aren't published. Phi-1.5 (1.3B parameters) was trained on 150B tokens (5 epochs), yet phi-1.5-web was trained on 300B so they didn't stop for lack of compute. Phi-2 (2.7B params) was trained on 1.4T tokens, epochs unknown.

littlestymaar · on Dec 15, 2023

I don't think that's the case. TinyLlama has been trained on much more data already (2.5T for 1.1B params, and they aim for 3T) and while it didn't show signs of convergence, it also has much worse results than the Phi models. The main takeaway from the Microsoft models seems to be the one they got in "Textbooks Are All You Need", that is: data quality gets you very far, even if you use artificial data.
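For a rough sense of scale, the parameter and token counts quoted in this subthread can be turned into tokens-per-parameter ratios. This is only a sketch: the model figures are the ones given in the comments above, and the comparison point of roughly 20 tokens per parameter is an assumption (the commonly cited compute-optimal rule of thumb), not something stated here.

```python
# Parameter and token counts as quoted in the comments above.
MODELS = {
    "phi-1.5": (1.3e9, 150e9),
    "phi-2": (2.7e9, 1.4e12),
    "tinyllama": (1.1e9, 2.5e12),
}

# Assumed rule of thumb: ~20 training tokens per parameter is compute-optimal.
COMPUTE_OPTIMAL_RATIO = 20.0


def tokens_per_param(params: float, tokens: float) -> float:
    return tokens / params


ratios = {name: tokens_per_param(p, t) for name, (p, t) in MODELS.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:,.0f} tokens/param "
          f"({ratio / COMPUTE_OPTIMAL_RATIO:.0f}x the rule of thumb)")
```

All three sit far above the assumed ratio, yet their benchmark results differ a lot, which fits the point that token count alone does not explain the Phi results.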

Legend2440 · on Dec 15, 2023

Especially the OPT models they study here were known to be undertrained. It would be interesting to compare to a more modern model like llama 2.

jjallen · on Dec 16, 2023

So why not train them more? Is it to save costs?

k__ · on Dec 15, 2023

Can we use unoptimized to train optimized ones?

duchenne · on Dec 15, 2023

The most important paper to understand this issue is "Scaling Laws for Neural Language Models" by OpenAI in 2020 [1]. Many consider it the most important paper that predicted the high performance of modern LLMs.

This paper shows how the loss decreases when you increase the model size, compute, or training dataset size.

From the article:

> Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence.

It clearly states that when you are limited by your training time compute, you should under-train your model.

[1] https://arxiv.org/abs/2001.08361

swyx · on Dec 16, 2023

that paper is now considered to be a psyop fwiw - but in the direction of too little data, not too many layers

highfrequency · on Dec 16, 2023

Can you clarify what you mean?

versteegen · on Dec 17, 2023

Because the training data/model size/compute tradeoff derived from that paper is highly suboptimal (too many parameters) compared to the ones from the later Deepmind scaling laws [1]. And then Meta researchers recommended using even smaller models, to trade-off training- and inference-time compute [2] (which I thought was pretty obvious if you care about more than just benchmarks).

[1] https://arxiv.org/abs/2203.15556 Training Compute-Optimal Large Language Models

[2] https://arxiv.org/abs/2302.13971 LLaMA: Open and Efficient Foundation Language Models

kristianp · on Dec 16, 2023

He seems to be implying that openai released that paper to throw others off the scent of the direction they were taking.

WhitneyLand · on Dec 15, 2023

There are lots of ways to make LLMs more efficient:

- Pruning

- Distillation

- Sparse transformers

- Mixture of experts

- Quantization

My understanding is that none of these are free and they all come with various trade offs.

For example MOE lets Mistral beat similarly sized models, and the inference performance stays close (only an incremental increase). But the training time is way more than a typical 7B model.

But which of these approaches gives the most bang for the buck?

Also consider that it’s not either/or: many of these techniques can be combined.

And maybe worst of all, some of the testing that can be done to find out doesn’t give the same answer with smaller/toy models.

rolisz · on Dec 15, 2023

Mistral's MoE model is 8*7B parameters, so you should compare training time to similarly sized models, not to 7B ones.

versteegen · on Dec 15, 2023

Mixtral 8x7B actually has 46.7B total parameters, not 8*7B = 56B. The reason being that not all parameters are multiplied 8x.

Also it uses 12.9B parameters per token, not quite comparable to 7B models.
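A back-of-the-envelope count shows where those two numbers come from. The hyperparameters below are assumptions taken from the published Mixtral 8x7B configuration as I understand it (width 4096, 32 layers, feed-forward width 14336, grouped-query attention with 8 KV heads, 8 experts with 2 routed per token); the point of the sketch is only that attention, embeddings and norms are shared, and just the feed-forward experts are replicated.

```python
# Assumed Mixtral 8x7B hyperparameters (treat these as assumptions).
DIM = 4096          # model width
N_LAYERS = 32
HIDDEN = 14336      # feed-forward inner width
N_HEADS = 32
N_KV_HEADS = 8      # grouped-query attention
HEAD_DIM = 128
VOCAB = 32000
N_EXPERTS = 8
TOP_K = 2           # experts used per token


def layer_params(experts_counted: int) -> int:
    """Parameters in one transformer layer, counting `experts_counted` experts."""
    attn = DIM * N_HEADS * HEAD_DIM * 2        # Wq and Wo
    attn += DIM * N_KV_HEADS * HEAD_DIM * 2    # Wk and Wv (shared across query groups)
    expert = 3 * DIM * HIDDEN                  # gate, up and down projections
    router = DIM * N_EXPERTS
    norms = 2 * DIM
    return attn + experts_counted * expert + router + norms


def model_params(experts_counted: int) -> int:
    embeddings = 2 * VOCAB * DIM               # input embedding + output head
    final_norm = DIM
    return N_LAYERS * layer_params(experts_counted) + embeddings + final_norm


total = model_params(N_EXPERTS)   # every expert stored
active = model_params(TOP_K)      # only the routed experts touched per token
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B per token")
```

Under these assumptions the totals land on the figures quoted above, which is why a naive 8 * 7B overestimates the size.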

29athrowaway · on Dec 16, 2023

- Fast Feedforward Networks

- Lookahead decoding

changoplatanero · on Dec 15, 2023

I can believe that you can get good performance on 14 NLP tasks after pruning 70% of the model. But for things like chatgpt the use cases are way more diverse and you can’t keep high performance for everything when you prune that many of the weights.

sigmoid10 · on Dec 15, 2023

>But for things like chatgpt the use cases are way more diverse and you can’t keep high performance for everything when you prune that many of the weights.

This also has become increasingly obvious from recent developments in the field. Today, we regularly see new models that come at a fraction of the size of GPT-3 and yet easily outperform it, especially on certain downstream tasks when fine-tuned correctly. These small models also retain some generality, but not as much as the really high-end, really large models like GPT-4. I'd say a sub-10B parameter model equal to or better than GPT-3 overall is achievable, but not for GPT-4. At least not with current technology. However, that would still imply that it's possible to reduce parameter counts in common approaches by 95%. I'm pretty sure in a few years people will look back and smirk at the crude methods we used to train LLMs today.

k__ · on Dec 15, 2023

How big are GPT3 and GPT4 models?

pizza · on Dec 15, 2023

175B and rumored 1.76T params

k__ · on Dec 15, 2023

How much on disk?

pizza · on Dec 15, 2023

if fp32, 4 bytes per param. but the weights may have been stored at lower precision (e.g. fp16, so 2 bytes per param)

ponyous · on Dec 16, 2023

GPT3: 175B * 4 = 700GB

GPT4: 4 * 1.76T = 7TB

That feels too much, but I have no idea tbh.
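The arithmetic is just parameters times bytes per parameter, so it is easy to tabulate for a few precisions. A small sketch, with the caveats that the GPT-4 figure is the rumored one from above, that `int8` is added here purely for illustration, and that this counts raw weights only (no file-format overhead or optimizer state):

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_size_bytes(n_params: float, dtype: str) -> float:
    """Raw size of the weights only (no optimizer state, no file-format overhead)."""
    return n_params * BYTES_PER_PARAM[dtype]


GPT3_PARAMS = 175e9      # published figure
GPT4_PARAMS = 1.76e12    # rumored figure quoted above, not confirmed

for name, n in [("GPT-3", GPT3_PARAMS), ("GPT-4 (rumored)", GPT4_PARAMS)]:
    for dtype in BYTES_PER_PARAM:
        print(f"{name} {dtype}: {weights_size_bytes(n, dtype) / 1e9:,.0f} GB")
```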

k__ · on Dec 16, 2023

Does it?

IDK...

Nevertheless, it would be cool if it could be reduced 95% in the future, haha.

Der_Einzige · on Dec 15, 2023

Correct, people in the model optimization world will claim all day that it only marginally impacts performance, but I’ve extensively played with quantization and pruning methods, and can report that they do cripple models in ways that the authors either didn’t notice or on purpose omitted from their SOTA benchmark chasing.

Most claims of models being even equal to GPT-3.5 are also significantly overblown. I haven’t seen one yet below 70 billion parameters which even comes close.

Nothing is free in this world.

lolinder · on Dec 15, 2023

You don't need to get good performance on everything with a single model—Mixture of Experts models can outperform equivalently sized monolithic models. My understanding is that GPT-4 is structured this way.

If we can trim a model in different ways to get different specializations, that could be really effective.

novaRom · on Dec 15, 2023

I just tried the Mistral MoE 8x7B model and it works a bit faster than llama-2-70B, but it looks like it has almost the same skills. In fact, all the latest models of 13B-70B size are quite similar. Could it be that a large part of their training data is the same?

cyanydeez · on Dec 15, 2023

there might be a laws of average thing going on? Central Limit Theorem?

there's a finite number of relevant language tokens. the trick of current LLM is finding what's basically the center of a vast series of probability.

jmward01 · on Dec 15, 2023

We quantize models and get nearly the same performance, and we keep seeing new 'small' models performing on par with older larger models. Additionally, people don't need nearly the amount of data that models do in order to learn the same tasks. This implies a lot about how fast it may be possible to learn and how far models currently are from that optimum. Because of this I think the authors' conclusions about scaling data up really only apply to current models and training techniques. Based on how fast people can learn, I believe we are likely missing a few things that will allow models to learn vastly faster and deeper with fewer total parameters. My own work, as an example (github jmward01/lmplay), may show how simple changes to embedding training can make drastic improvements to model training.

cyanydeez · on Dec 15, 2023

I assume AGI will find some fractal way of training. models of models.

really what we're looking for is a way to bootstrap appropriate filters so a unit of model matches a unit of language scope.

renonce · on Dec 16, 2023

I think it's not that LLMs have redundant parameters in general - it's a specific problem with OPT-66B, not anything else.

A 2022 paper "Scaling Language Models: Methods, Analysis & Insights from Training Gopher" (http://arxiv.org/abs/2112.11446) has captured it well on page 103, Appendix G:

> The general finding is that whilst compressing models for a particular application has seen success, it is difficult to compress them for the objective of language modelling over a diverse corpus.

Appendix G explores various techniques like pruning and distillation but finds that neither method is an efficient way to obtain better loss at a lower number of parameters.

So why does pruning work for OPT-66B in particular? I'm not sure, but there is evidence that OPT-66B is an outlier: one piece of evidence is in the GPTQ paper ("GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers", https://arxiv.org/abs/2210.17323), which mentions in a footnote on its 7th page:

> [2] Upon closer inspection of the OPT-66B model, it appears that this is correlated with the fact that this trained model has a significant fraction of dead units in the early layers, which may make it harder to compress.

Since this article specifically targets OPT-66B and not anything else, I would remain skeptical until they generalize their findings to other language models, such as Llama2 or Mistral.

galaxyLogic · on Dec 16, 2023

I think a big part of the success of LLMs is they know what is good, and what would be absurd natural language. Most every domain will need that capability. Many sentences can be interpreted multiple ways but once you can perceive which interpretations are non-sensical, the LLMs are on the right track to provide reasonable correct answers.

Chomsky famously said "Colorless green ideas sleep furiously". An LLM can detect that such a sentence, such an interpretation, or translation, would make no sense --even if syntactically proper-- because its training materials never include such sentences (except perhaps this particular one). Then knowing it cannot be that "Colorless green ideas sleep furiously", it can prune out many otherwise possible interpretations. This is just my hunch, something like this may be going on.

https://en.wikipedia.org/wiki/Colorless_green_ideas_sleep_fu....

blamestross · on Dec 15, 2023

The answer is always "No it doesn't need all those nodes", but we do it that way because it makes doing the math easier. I bet they just removed a bunch of edge weights to nodes, but used the same matrices for calculating output.

I wish arbitrary topology networks were scalable (I love NEAT) but bipartite graphs crunch good in GPUs.

eurekin · on Dec 15, 2023

I'm quite surprised so much was retained; many presentations on convolutional networks suggest adding a lot of features, but pruning most of them in a post-learning step. Supposedly lots of features, which are initialized randomly, just raise the probability of being close to the real answer (the concept being learned). Once it's found, most of the features typically go unused or are redundant.

zaptrem · on Dec 15, 2023

Can you elaborate what you mean by "edge weights to nodes, but used the same matrices"?

feanaro · on Dec 15, 2023

The network retains the same topology, and therefore the same matrix dimensions, but some elements are set to zero, removing their contribution.
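A toy version of that idea, as a sketch only: simple magnitude pruning on a small made-up weight matrix (the values and the 50% fraction are arbitrary, and this is not claimed to be the method used in the paper). The pruned matrix has exactly the same shape, so the same dense matrix-vector product still runs; the zeros just stop contributing.

```python
def magnitude_prune(weights, fraction):
    """Zero out the smallest-magnitude `fraction` of entries; shape is unchanged.

    Ties at the threshold are all zeroed, so slightly more may be dropped.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    n_drop = int(len(flat) * fraction)
    if n_drop == 0:
        return [row[:] for row in weights]
    threshold = flat[n_drop - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def matvec(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Made-up weights: a few large entries, the rest near zero.
W = [
    [0.90, -0.02, 0.01],
    [0.03, -0.80, 0.05],
    [-0.01, 0.04, 0.70],
]
x = [1.0, 2.0, 3.0]

W_pruned = magnitude_prune(W, fraction=0.5)
dense_out = matvec(W, x)
pruned_out = matvec(W_pruned, x)
print(dense_out)
print(pruned_out)
```

Note that nothing gets faster here: the multiply still touches every entry, which is the "same matrices" point. Any speedup needs sparse kernels or actually removing rows and columns.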

AlexErrant · on Dec 15, 2023

Relevant excerpt from https://news.ycombinator.com/item?id=38656172

> ...if we meet the shape 512x512 (512 is a power of 2, therefore a very "round" number in computer science), then maybe some kernel will be very fast; but then on a 512x511matrix, the same kernel may need to add some padding first to transform it into a round 512x512 matrix with zeros at the end of each row. Adding those zeros means shifting all rows, which is a very costly operation.

visarga · on Dec 15, 2023

Warning! dated 18 Dec 2022 - Rethinking the Role of Scale for In Context Learning: An Interpretability based Case Study at 66 Billion Scale

https://arxiv.org/abs/2212.09095

since it is at its one year anniversary, we can check citations - it has 16 so far

https://scholar.google.com/scholar?start=0&hl=ro&as_sdt=2005...

fluidcruft · on Dec 15, 2023

I was at RSNA this year (major Radiology conference and trade show) and one of the presenters made the claim that their model was generalized because it works on different body parts. My intuition was they were claiming they trained it on one body part and it subsequently worked on different body parts (which could be convincing). So at first this seemed fine. But in reality they had trained the same model on all of those body parts. That really got me thinking about the old myth that we only use 10% of our brains. Anyway I think capacity would be when the model can no longer learn.

But anyway it made me wonder if there's a way to measure "what x% of a model is actually used" similar to the myths about human brains.

cyanydeez · on Dec 15, 2023

"junk" DNA is a similarly dark data

jakobnissen · on Dec 16, 2023

An often repeated claim, but untrue. We have a pretty good understanding of large parts of junk DNA, how it got there and what it is, and know that most of it really is junk.

That being said, only around 2% of the human genome is coding, with perhaps another one or two percent having a known non-coding function. And estimates of functional DNA are somewhere around 8% of the genome. So, most functional DNA is still unknown.

cicce19 · on Dec 16, 2023

I was also at RSNA. What was the model intended to do in this example?

fluidcruft · on Dec 16, 2023

IIRC it was about emerging MRI reconstructions for MSK in one of the educational sessions. It was more the ambiguous way the claim about transferable knowledge was stated initially that got me thinking about the "10% of brain" thing rather than anything specifically presented.

zoogeny · on Dec 15, 2023

Forgive me if this is not a good question, but is there a difference here between training and inference?

ImprobableTruth · on Dec 15, 2023

(I assume this question is about whether models need all those layers during training, even if they don't need them during inference)

Yes. There's the so-called "lottery ticket hypothesis". Essentially the idea is that large models start with many randomly initialized subnetworks ("lottery tickets") and that training finds which ones work best. Then it's only natural that during inference we can prune all the "losing tickets" away, even though we need them during training.

It's kind of an open question how large this effect is though. As the article mentions, if you can prune a lot away, this could also just mean that the network isn't optimally trained.

zoogeny · on Dec 15, 2023

I figured this must be a well-known property of neural networks. I'll do some reading on the lottery ticket hypothesis. That is almost exactly what I was thinking when reading the article: sure after you have trained it you can prune the layers that aren't used. But I wasn't sure you could know/guess which layers will be unused before you train.

It strikes me as an interesting open question since if it is the case that you need big networks for training but can use significantly smaller "pruned" networks for inference there are many, many reasons why that might be true. Determining which of the possible reasons is the actual reason may be a key in understanding how LLMs work.

jncfhnb · on Dec 15, 2023

Training is making the model (or rather going from something random and useless to something well calibrated and useful). Inference is using it to make a prediction.

This is saying that you don’t need the entire model to make good predictions for specific subsets of tasks. You can literally remove a large part of the model and it will do fine. Which is not very controversial. The model, after being trained, is a large collection of interacting nodes. When this is talking about dropping chunks of the model it means dropping nodes after training to make predictions. The advantage primarily being that smaller models are cheaper and faster to run or modify with further training.

You know that meme about how you only use 10% of your brain at a time? Well, yeah, but the idiot movies that suggest using 100% of your brain would make you impossibly smarter are not correct. 90% of your brain just isn’t relevant. More brain / model is not better than the relevant subset alone.

The important question to be asking is whether you can remove large chunks of the model without hurting its ability to generally do well on whatever you ask it.

As a very crude example, imagine you trained a simple model to predict rainfall using a weather monitor and the number of farts you did last week. The model will probably learn that the monitor is useful and the farts are irrelevant. If this were as simple as a linear regression, you could just remove the farts coefficient from the equation and the model would come out to the same outcomes. Neural nets are not so easily observed but it’s still just dropping the irrelevant bits to whatever you’re trying to do.
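The linear regression version of this can be checked directly. A minimal sketch with made-up data, where the target depends only on the first feature: ordinary least squares (solved here with a small hand-written Gaussian elimination so it needs no libraries) assigns the irrelevant feature a coefficient of essentially zero, and dropping that term leaves the predictions unchanged. All names and numbers are hypothetical.

```python
def solve(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w


def fit_least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


# Made-up data: rainfall depends only on the monitor reading.
monitor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
irrelevant = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
rainfall = [2.0 * m + 1.0 for m in monitor]

rows = [[1.0, m, z] for m, z in zip(monitor, irrelevant)]
bias, w_monitor, w_irrelevant = fit_least_squares(rows, rainfall)

# Predictions with the full model and with the irrelevant coefficient removed.
full = [bias + w_monitor * m + w_irrelevant * z for m, z in zip(monitor, irrelevant)]
pruned = [bias + w_monitor * m for m in monitor]
max_gap = max(abs(a - b) for a, b in zip(full, pruned))
print(w_monitor, w_irrelevant, max_gap)
```

With noisy real data the irrelevant coefficient would be small rather than exactly zero, which is where the judgment call about what counts as "prunable" comes in.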

quadrature · on Dec 15, 2023

Yes. Think of it in terms of fitting a line to some points y=mx+b. Training is finding the right slope m and intercept b of the line to get a good fit to the points. Inference is when you take an x coordinate and find the y value using the "trained" m and b in the line equation
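In code, the split looks roughly like this. A minimal sketch using the closed-form least-squares fit for a line; the data points are made up, and `train` / `infer` are just illustrative names.

```python
def train(xs, ys):
    """'Training': closed-form least-squares fit of slope m and intercept b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    m = cov / var
    b = mean_y - m * mean_x
    return m, b


def infer(m, b, x):
    """'Inference': plug a new x into the already-fitted line."""
    return m * x + b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]   # made-up points, roughly y = 2x + 1

m, b = train(xs, ys)             # done once, needs all the data
prediction = infer(m, b, 10.0)   # done many times, needs only m and b
print(m, b, prediction)
```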

zoogeny · on Dec 15, 2023

I'm not sure if that gives me an intuition on the title of the article: "Do large language models need all those layers"

Am I interpreting you correctly if I say: "Finding the slope (training) may require those extra layers but finding a particular y value given a known x coordinate (inference) may not require those extra layers".

What I mean is, does the answer to the article's question change if one is considering training vs. inference?

quadrature · on Dec 15, 2023

Apologies, I thought you were asking a general question about ML. Will let someone else comment on the specifics here.

xanderlewis · on Dec 15, 2023

I think I’ve misinterpreted it in the same way. I guess you’re asking something like: if we can excise parts of a model without affecting quality of inferences (in some particular domain), can we do the same with the training step? That is, is it necessary to train a model on a wide variety of topics in order to get high-quality ‘understanding’ for a particular application?

If we don’t need those weights at inference time, why do the computation to train them in the first place?

ska · on Dec 15, 2023

The real answer is we don't know yet but it's interesting.

To go back to your ax+b example, imagine instead you are fitting a much higher dimensional model, but you don't know how high. ax^n+bx^(n-1) ... where n might be in the millions, or hundreds of millions, or?? So we know if we make the model high enough order (e.g. n-1 training points will give "perfect") it will overfit, so we throw in some regularization and a bit of handwavy tuning and we end up with a model of say n=7213472123 and a set of a,b .. which behaves pretty well, but from its behavior we suspect most of them don't matter. And maybe n should be <= 2 million, or whatever.

So, a few obvious questions - one is can we find a way to throw out most of the a,b,c ... to get just the core, i.e. if we throw away all |k| <= 0.00001 does it change anything (for inference). A very different question is could we decide that ahead of time (during training). A different class of question looks more like "could we have figured this out from the data".

It's a lot harder to reason about the latter questions, because the former one is empirical: After training, this one doesn't seem to do anything. Ahead of time, how do you know? This has interesting offshoots, like how stable is the distribution of the parts that matter, etc.
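The first, empirical question is easy to try on a toy polynomial. A sketch with hypothetical coefficients (a handful that matter, the rest tiny): zero everything with |k| <= 0.00001 and compare outputs over a range of inputs.

```python
def evaluate(coeffs, x):
    """Horner evaluation; coeffs[0] is the highest-order coefficient."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result


def prune_small(coeffs, threshold=1e-5):
    """Zero every coefficient with |k| <= threshold, keeping the polynomial's order."""
    return [0.0 if abs(c) <= threshold else c for c in coeffs]


# Hypothetical fitted coefficients: a few matter, most are tiny.
coeffs = [3e-6, -8e-6, 0.5, 2e-6, -1.5, 9e-6, 0.25]
pruned = prune_small(coeffs)

grid = [i / 10 for i in range(-10, 11)]        # x in [-1, 1]
max_gap = max(abs(evaluate(coeffs, x) - evaluate(pruned, x)) for x in grid)
kept = sum(1 for c in pruned if c != 0.0)
print(f"kept {kept} of {len(coeffs)} coefficients, max output change {max_gap:.2e}")
```

The check only covers x in [-1, 1]. For larger |x| the tiny high-order terms can dominate, a small reminder that "doesn't change anything" always depends on which inputs you test.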

cyanydeez · on Dec 15, 2023

wouldn't the most accurate 2D analogy in geometry be in the Discrete Fourier Transform?

xanderlewis · on Dec 15, 2023

The use of the word ‘inference’ in this context can seem a bit weird, but I think it’s borrowed from statistics and it’s quite standard.

Training = optimising model parameters to ‘learn’ from data.

Inference = asking the model to make a prediction, usually assuming the model is already trained.

Instead of inference, you could say running/querying the model.

bjornsing · on Dec 15, 2023

> Finding that 70% of attention heads and 20% of feed-forward networks can be excised with minimal effect on in-context learning suggests that large language models are undertrained.

So why do the larger models perform so much better…?

sdenton4 · on Dec 15, 2023

Here's an explanation I hit on some time ago:

More parameters make it easier to find low-energy solutions.

Suppose we have a product of two variables z = x * y. And now suppose that the 'correct' product is z=2, and we're learning x and y. A very good analytical solution is x=1, y=2 (or vice versa), allowing us to eliminate either x or y from our learning problem. The total energy of (x, y) in this case is 1^2 + 2^2 = 5.

However, another solution is x = y = sqrt(2), which has energy 2 + 2 = 4: this solution is closer to the origin. The extra variable means that we have a /surface/ of solutions instead of a unique solution, so we can home in on ones that are easier to get to using our optimizer.

As you add more variables, you can find lower and lower energy solutions.

Consider that we initialize neural networks 'near' zero, and then walk with gradient descent in some direction towards a solution. Then adding lots of extra variables - wiggle room - makes it much easier to find a solution within walking distance of the (noisy) origin.
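The two-variable case is small enough to run. A sketch, assuming plain gradient descent on the squared error (x*y - 2)^2 and an arbitrary small starting point near the origin; "energy" is taken to mean x^2 + y^2 as in the example above.

```python
def descend(x, y, target=2.0, lr=0.01, steps=5000):
    """Plain gradient descent on the loss (x*y - target)^2."""
    for _ in range(steps):
        residual = x * y - target
        grad_x = 2.0 * residual * y
        grad_y = 2.0 * residual * x
        x, y = x - lr * grad_x, y - lr * grad_y
    return x, y


def energy(x, y):
    return x * x + y * y


# Small, slightly asymmetric initialization near the origin (arbitrary choice).
x_final, y_final = descend(0.1, 0.2)
found = energy(x_final, y_final)
analytic = energy(1.0, 2.0)
print(f"x={x_final:.4f} y={y_final:.4f} energy={found:.4f} (vs {analytic} for x=1, y=2)")
```

Starting near zero, the optimizer ends up close to the balanced solution x ≈ y ≈ sqrt(2) rather than at (1, 2), because the updates approximately preserve x^2 - y^2, which is tiny at initialization.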

bjornsing · on Dec 15, 2023

Fits with my first intuitive guess: it’s implicit regularization (that works as you describe).

Would be interesting to try some explicit regularization. But unfortunately you need a million bucks to run an experiment on LLMs. :/

gessha · on Dec 15, 2023

Do you know of any literature that looks into this? This is a pretty interesting hypothesis.

youngNed · on Dec 15, 2023

Because 70% of a big number is a lot more than 70% of a smaller number?

Not being facetious, I don't know the answer, but that's my best guess

danielmarkbruce · on Dec 15, 2023

lottery ticket hypothesis might be real

famouswaffles · on Dec 15, 2023

all LLMs are undertrained to some degree.

assuming the models are identical except one is bigger, then the bigger model is better because 70% of a bigger number is larger than 70% of a smaller number.

Now if you train a smaller model much longer than the bigger model (more tokens) then you are reducing the level of "under-trainedness" to some degree. at some point, you may have a smaller model that is better than that larger model.

70% of a bigger number may be larger than 70% of a smaller number but no guarantee 70% of a bigger number is larger than say 90% of a smaller number and so on.

sodality2 · on Dec 15, 2023

If they're all equally un-pruned, sounds like they still maintain their linear scale of performance.

Just like quantization!

cubefox · on Dec 15, 2023

How does this answer the question?

danielmarkbruce · on Dec 15, 2023

The question sort of implies you couldn't prune the smaller models and see the same thing. So, the answer given is to consider that in both cases, you sort of only use 30% of the model. Bigger is still bigger. The basic intuition of more parameters = better holds.

cubefox · on Dec 15, 2023

The question seems to be one of the Chinchilla scaling law. We could train smaller models (than recommended by the law) but with more training tokens, and achieve the same loss. But we would need more compute performance for that.

So the question is, perhaps: Why are big models required for compute-efficient training?
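That trade-off can be made concrete with the parametric loss from the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta. The sketch below uses the fitted constants as I recall them from that paper (treat the exact values as assumptions), the usual approximation of 6*N*D training FLOPs, and an arbitrary example budget. It finds the best model size for the budget, then asks how many tokens a half-size model needs to reach the same loss.

```python
# Fitted constants as reported for the Chinchilla parametric loss
# L(N, D) = E + A / N**ALPHA + B / D**BETA (treat the exact values as assumptions).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def train_flops(n_params, n_tokens):
    return 6.0 * n_params * n_tokens       # common approximation C ~ 6 N D


def best_size_for(compute):
    """Grid-search the model size that minimizes loss at a fixed compute budget."""
    candidates = [10 ** (8 + i * 0.01) for i in range(500)]   # 1e8 .. ~1e13 params
    return min(candidates, key=lambda n: loss(n, compute / (6.0 * n)))


def tokens_to_match(n_params, target_loss):
    """Tokens a model of size n_params needs to reach target_loss (if reachable)."""
    gap = target_loss - E - A / n_params**ALPHA
    if gap <= 0:
        raise ValueError("model too small to reach this loss at any token count")
    return (B / gap) ** (1.0 / BETA)


budget = 5.88e23                            # arbitrary example budget in FLOPs
n_opt = best_size_for(budget)
d_opt = budget / (6.0 * n_opt)
target = loss(n_opt, d_opt)

n_small = n_opt / 2                         # half-size model
d_small = tokens_to_match(n_small, target)
extra = train_flops(n_small, d_small) / budget
print(f"optimal: {n_opt:.2e} params, {d_opt:.2e} tokens")
print(f"half-size model needs {d_small:.2e} tokens, {extra:.2f}x the compute")
```

Under this fit the smaller model can match the loss, but only by spending more total training compute. In exchange, it is cheaper at inference time.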

danielmarkbruce · on Dec 16, 2023

I don't think so. Pruning a large model and training a smaller model isn't the same thing. It might appear to be the same thing, but it's not.

cubefox · on Dec 16, 2023

Do you expect a model which was overtrained (relative to the Chinchilla law) to be no more affected by pruning than a model of the same size that wasn't overtrained?

danielmarkbruce · on Dec 16, 2023

Can you reformulate this question? It's hard to know what you mean when you say "no more affected". How are you defining "more" ?

cubefox · on Dec 16, 2023

I mean stronger impact on loss or benchmark results.

danielmarkbruce · on Dec 16, 2023

I mean, in some relative (like 10%) or some absolute amount? I think I'd expect the "more trained" model to drop performance by less (as a %, which is hard to define here) but more in absolute sense. Which, is basically impossible to measure but even if it was measurable..I don't feel confident about that prediction, it's speculation.

bjornsing · on Dec 15, 2023

I don’t follow…

sodality2 · on Dec 15, 2023

The article mentions that models have a lot of extra information that is unnecessary. You asked why the large ones still outperform small ones. presumably they all have that inefficiency. But the large ones are still better. 30% of a big number is still bigger than 30% of a small number.

jncfhnb · on Dec 15, 2023

Undertrained does not mean bad. It means it could be better.

But I also disagree with the takeaway.

praveen9920 · on Dec 15, 2023

If it is true, does that mean that we can “compress” the models for efficient inference on smaller devices? That would be wonderful

CuriouslyC · on Dec 15, 2023

That is literally what quantization and distillation are

GaggiX · on Dec 15, 2023

They should do the same test on Mistral 7B or Phi-2 instead of OPT-66B.

earthboundkid · on Dec 15, 2023

You only use 10% of your LLM neural network.

dist-epoch · on Dec 15, 2023

Imagine if you could use 100% of your LLM. I think I saw a movie about that.

lgxz · on Dec 15, 2023

Lucy: https://en.wikipedia.org/wiki/Lucy_(2014_film)

usgroup · on Dec 15, 2023

Take my upvote!

tbruckner · on Dec 15, 2023

Accidentally read this as "Do large companies need all those lawyers?"