>Finding that 70% of attention heads and 20% of feed-forward networks can be excised with minimal effect on in-context learning suggests that large language models are undertrained.
I thought it was common knowledge that LLMs are undertrained: none of the publicly available loss graphs show any sign of convergence!
It's a disadvantage of current SOTA models: they are easy to train, but they must be large, wasting lots of weights, in order to generalize well. Maybe another architecture, a successor to the transformer, will be more economical, with fewer weights but more skills and knowledge.
I think I saw an article years ago claiming that size is needed for training. That probably means that after training, the model can be optimized to minimize its size. Pruning? However, a smaller model cannot be trained as well. In my experience with image processing, bigger is better, though they all have their limits. Nothing new here.
Maybe that's why the Phi models do so well for their size. I'm guessing they may have been trained close to convergence, but the loss graphs aren't published. Phi-1.5 (1.3B parameters) was trained on 150B tokens (5 epochs), yet phi-1.5-web was trained on 300B so they didn't stop for lack of compute. Phi-2 (2.7B params) was trained on 1.4T tokens, epochs unknown.
I don't think that's the case. TinyLlama has been trained on much more data already (2.5T for 1.1B params, and they aim for 3T), and while it didn't show signs of convergence, it also has much worse results than the Phi models. The main takeaway from the Microsoft models seems to be the one they got in "Textbooks Are All You Need", that is: data quality gets you very far, even if you use artificial data.
The most important paper to understand this issue is "Scaling Laws for Neural Language Models" by OpenAI in 2020 [1].
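To put the numbers in this subthread side by side, here is a quick sketch that computes tokens per parameter. The figures are taken as quoted in the comments above, not independently verified, and the 150B for Phi-1.5 counts repeated epochs.

```python
# (parameters, training tokens) as quoted in the comments above.
models = {
    "phi-1.5": (1.3e9, 150e9),
    "phi-2": (2.7e9, 1.4e12),
    "tinyllama": (1.1e9, 2.5e12),
}


def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params


ratios = {name: tokens_per_param(p, t) for name, (p, t) in models.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0f} tokens per parameter")
```

By this crude measure TinyLlama has seen far more tokens per parameter than either Phi model, which fits the point that token count alone doesn't explain the Phi results.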
Many consider it the most important paper that predicted the high performance of modern LLMs.
This paper shows how the loss decreases when you increase the model size, compute, or training dataset size.
From the article:
> Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence.
It clearly states that when you are limited by your training time compute, you should under-train your model.
Because the training data/model size/compute tradeoff derived from that paper is highly suboptimal (too many parameters) compared to the ones from the later DeepMind scaling laws [1]. And then Meta researchers recommended using even smaller models, to trade off training-time and inference-time compute [2] (which I thought was pretty obvious if you care about more than just benchmarks).
My understanding is that none of these are free and they all come with various trade offs.
For example MOE lets Mistral beat similarly sized models, and the inference performance stays close (only an incremental increase). But the training time is way more than a typical 7B model.
But which of these approaches gives the most bang for the buck?
Also consider that it’s not either/or: many of these techniques can be combined.
And maybe worst of all, some of the testing that can be done to find out doesn’t give the same answer with smaller/toy models.
I can believe that you can get good performance on 14 NLP tasks after pruning 70% of the model. But for things like chatgpt the use cases are way more diverse and you can’t keep high performance for everything when you prune that many of the weights.
>But for things like chatgpt the use cases are way more diverse and you can’t keep high performance for everything when you prune that many of the weights.
This has also become increasingly obvious from recent developments in the field. Today, we regularly see new models that come at a fraction of the size of GPT-3 and yet easily outperform it, especially on certain downstream tasks when fine-tuned correctly. These small models also retain some generality, but not as much as the really high-end, really large models like GPT-4. I'd say a sub-10B parameter model equal to or better than GPT-3 overall is achievable, but not for GPT-4. At least not with current technology. However, that would still imply that it's possible to reduce parameter counts in common approaches by 95%. I'm pretty sure in a few years people will look back and smirk at the crude methods we used to train LLMs today.
Correct, people in the model optimization world will claim all day that it only marginally impacts performance, but I’ve extensively played with quantization and pruning methods, and can report that they do cripple models in ways that the authors either didn’t notice or on purpose omitted from their SOTA benchmark chasing.
Most claims of models being even equal to GPT-3.5 are also significantly overblown. I haven’t seen one yet below 70 billion parameters which even comes close.
You don't need to get good performance on everything with a single model—Mixture of Experts models can outperform equivalently sized monolithic models. My understanding is that GPT-4 is structured this way.
If we can trim a model in different ways to get different specializations, that could be really effective.
I just tried the Mistral MoE 8x7B model and it works a bit faster than llama-2-70B, but it looks like it has almost the same skills. In fact, all the latest models of 13B-70B size are quite similar. Could it be that a large part of their training data is the same?
We quantize models and get nearly the same performance, and we keep seeing new 'small' models performing on par with older larger models. Additionally, people don't need nearly the amount of data that models do in order to learn the same tasks. This implies a lot about how fast it may be possible to learn and how far models currently are from that optimum. Because of this I think the authors' conclusions about scaling data up really only apply to current models and training techniques. Based on how fast people can learn, I believe we are likely missing a few things that will allow models to learn vastly faster and deeper with fewer total parameters. My own work, as an example, (github jmward01/lmplay) may show how simple changes to embedding training may make drastic improvements to model training.
I think it's not that LLMs have redundant parameters in general - it's a specific problem with OPT-66B, not anything else.
A 2022 paper "Scaling Language Models: Methods, Analysis & Insights from Training Gopher" (http://arxiv.org/abs/2112.11446) has captured it well on page 103, Appendix G:
> The general finding is that whilst compressing models for a particular application has seen success, it is difficult to compress them for the objective of language modelling over a diverse corpus.
Appendix G explores various techniques like pruning and distillation but finds that neither method was an efficient way to obtain better loss at a lower number of parameters.
So why does pruning work for OPT-66B in particular? I'm not sure, but there is evidence that OPT-66B is an outlier: one piece of evidence is in the GPTQ paper ("GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers", https://arxiv.org/abs/2210.17323), which mentions in a footnote on its 7th page:
> [2] Upon closer inspection of the OPT-66B model, it appears that this is correlated with the fact that this trained model has a significant fraction of dead units in the early layers, which may make it harder to compress.
Since this article specifically targets OPT-66B and nothing else, I would remain skeptical until they generalize their findings to other language models, such as Llama2 or Mistral.
I think a big part of the success of LLMs is they know what is good, and what would be absurd natural language. Most every domain will need that capability. Many sentences can be interpreted multiple ways but once you can perceive which interpretations are non-sensical, the LLMs are on the right track to provide reasonable correct answers.
Chomsky famously said "Colorless green ideas sleep furiously". An LLM can detect that such a sentence, such an interpretation, or translation, would make no sense --even if syntactically proper-- because its training materials never include such sentences (except perhaps this particular one). Then knowing it cannot be that "Colorless green ideas sleep furiously", it can prune out many otherwise possible interpretations. This is just my hunch, something like this may be going on.
The answer is always "No it doesn't need all those nodes", but we do it that way because it makes doing the math easier. I bet they just removed a bunch of edge weights to nodes, but used the same matrices for calculating output.
I wish arbitrary topology networks were scalable (I love NEAT) but bipartite graphs crunch well on GPUs.
I'm quite surprised so much was retained; many presentations on convolutional networks suggest adding a lot of features, but pruning most of them in a post-learning step. Supposedly lots of features, which are initialized randomly, just raise the probability of being close to the real answer (the concept being learned). Once it's found, most of the features typically go unused or are redundant.
> ...if we meet the shape 512x512 (512 is a power of 2, therefore a very "round" number in computer science), then maybe some kernel will be very fast; but then on a 512x511 matrix, the same kernel may need to add some padding first to transform it into a round 512x512 matrix with zeros at the end of each row. Adding those zeros means shifting all rows, which is a very costly operation.
I was at RSNA this year (major Radiology conference and trade show) and one of the presenters made the claim that their model was generalized because it works on different body parts. My intuition was they were claiming they trained it on one body part and it subsequently worked on different body parts (which could be convincing). So at first this seemed fine. But in reality they had trained the same model on all of those body parts. That really got me thinking about the old myth that we only use 10% of our brains. Anyway, I think capacity would be when the model can no longer learn.
But anyway it made me wonder if there's a way to measure "what x% of a model is actually used" similar to the myths about human brains.
An often repeated claim, but untrue. We have a pretty good understanding of large parts of junk DNA, how it got there and what it is, and know that most of it really is junk.
That being said, only around 2% of the human genome is coding, with perhaps another one or two percent having a known non-coding function. And estimates of functional DNA are somewhere around 8% of the genome. So, most functional DNA is still unknown.
IIRC it was about emerging MRI reconstructions for MSK in one of the educational sessions. It was more the ambiguous way the claim about transferable knowledge was initially stated that got me thinking about the "10% of brain" thing, rather than anything specifically presented.
(I assume this question is about whether models need all those layers during training, even if they don't need them during inference)
Yes. There's the so-called "lottery ticket hypothesis". Essentially the idea is that large models start with many randomly initialized subnetworks ("lottery tickets") and that training finds which ones work best. Then it's only natural that during inference we can prune all the "losing tickets" away, even though we need them during training.
It's kind of an open question how large this effect is though. As the article mentions, if you can prune a lot away, this could also just mean that the network isn't optimally trained.
I figured this must be a well-known property of neural networks. I'll do some reading on the lottery ticket hypothesis. That is almost exactly what I was thinking when reading the article: sure after you have trained it you can prune the layers that aren't used. But I wasn't sure you could know/guess which layers will be unused before you train.
It strikes me as an interesting open question since if it is the case that you need big networks for training but can use significantly smaller "pruned" networks for inference there are many, many reasons why that might be true. Determining which of the possible reasons is the actual reason may be a key in understanding how LLMs work.
Training is making the model (or rather going from something random and useless to something well calibrated and useful). Inference is using it to make a prediction.
This is saying that you don’t need the entire model to make good predictions for specific subsets of tasks. You can literally remove a large part of the model and it will do fine. Which is not very controversial. The model, after being trained, is a large collection of interacting nodes. When this is talking about dropping chunks of the model it means dropping nodes after training to make predictions. The advantage primarily being that smaller models are cheaper and faster to run or modify with further training.
You know that meme about how you only use 10% of your brain at a time? Well, yeah, but the idiot movies that suggest using 100% of your brain would make you impossibly smarter are not correct. 90% of your brain just isn’t relevant. More brain / model is not better than the relevant subset alone.
The important question to be asking is whether you can remove large chunks of the model without hurting its general ability to do well on whatever you ask it.
As a very crude example, imagine you trained a simple model to predict rainfall using a weather monitor and the number of farts you did last week. The model will probably learn that the monitor is useful and the farts are irrelevant. If this were as simple as a linear regression, you could just remove the farts coefficient from the equation and the model would come out to the same outcomes. Neural nets are not so easily observed, but it’s still just dropping the bits that are irrelevant to whatever you’re trying to do.
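A toy version of the rainfall example, as a rough sketch. The data, the feature names and the tiny two-feature least-squares solver are all made up for illustration; the "rainfall" is constructed to depend only on the monitor reading.

```python
def fit_two_features(f1, f2, target):
    """Least-squares fit of target ~ w1*f1 + w2*f2 (no intercept),
    solved with Cramer's rule on the 2x2 normal equations."""
    a11 = sum(a * a for a in f1)
    a12 = sum(a * b for a, b in zip(f1, f2))
    a22 = sum(b * b for b in f2)
    b1 = sum(a * t for a, t in zip(f1, target))
    b2 = sum(b * t for b, t in zip(f2, target))
    det = a11 * a22 - a12 * a12
    w1 = (b1 * a22 - a12 * b2) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2


# Hypothetical data: rainfall depends only on the monitor reading.
monitor = [1.0, 2.0, 3.0, 4.0]
farts = [2.0, 1.0, 4.0, 3.0]
rainfall = [3.0 * m for m in monitor]

w_monitor, w_farts = fit_two_features(monitor, farts, rainfall)

# "Pruning": drop the irrelevant coefficient and compare predictions.
full = [w_monitor * m + w_farts * f for m, f in zip(monitor, farts)]
pruned = [w_monitor * m for m in monitor]
print(w_monitor, w_farts)
```

In this contrived case the irrelevant coefficient comes out as zero, so dropping it changes nothing. With noisy real data it would be small rather than exactly zero.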
Yes. Think of it in terms of fitting a line to some points y=mx+b. Training is finding the right slope m and intercept b of the line to get a good fit to the points. Inference is when you take an x coordinate and find the y value using the "trained" m and b in the line equation
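A minimal sketch of that split, using ordinary least squares for the "training" step. The function names and the sample points are just for illustration.

```python
def fit_line(xs, ys):
    """'Training': find slope m and intercept b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def predict(m, b, x):
    """'Inference': plug a new x into the already-fitted line."""
    return m * x + b


# Made-up points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

m, b = fit_line(xs, ys)
print(m, b, predict(m, b, 10.0))
```

Training needs all the points; inference only needs the two numbers m and b.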
I'm not sure if that gives me an intuition on the title of the article: "Do large language models need all those layers"
Am I interpreting you correctly if I say: "Finding the slope (training) may require those extra layers but finding a particular y value given an known x coordinate (inference) may not require those extra layers".
What I mean is, does the answer to the article's question change if one is considering training vs. inference?
I think I’ve misinterpreted it in the same way. I guess you’re asking something like: if we can excise parts of a model without affecting the quality of inferences (in some particular domain), can we do the same with the training step? That is, is it necessary to train a model on a wide variety of topics in order to get high-quality ‘understanding’ for a particular application?
If we don’t need those weights at inference time, why do the computation to train them in the first place?
The real answer is we don't know yet but it's interesting.
To go back to your ax+b example, imagine instead you are fitting a much higher dimensional model, but you don't know how high: ax^n+bx^(n-1) ... where n might be in the millions, or hundreds of millions, or?? We know that if we make the model high enough order (e.g. order n-1 for n training points will give a "perfect" fit) it will overfit, so we throw in some regularization and a bit of handwavy tuning and we end up with a model of, say, n=7213472123 and a set of a, b, ... which behaves pretty well, but from its behavior we suspect most of them don't matter. And maybe n should be <= 2 million, or whatever.
So, a few obvious questions - one is can we find a way to throw out most of the a,b,c ... to get just the core, i.e. if we throw away all |k| <= 0.00001 does it change anything (for inference). A very different question is could we decide that ahead of time (during training). A different class of question looks more like "could we have figured this out from the data".
It's a lot harder to reason about the latter questions, because the former one is empirical: After training, this one doesn't seem to do anything. Ahead of time, how do you know? This has interesting offshoots, like how stable is the distribution of the parts that matter, etc.
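The first, empirical question can be sketched in a few lines: zero out every coefficient at or below a magnitude threshold and check how much the output moves. The coefficients, the threshold and the helper names here are made up for illustration; real pruning methods are more involved than this.

```python
def prune_by_magnitude(weights, threshold):
    """Zero out every coefficient with |w| <= threshold."""
    return [w if abs(w) > threshold else 0.0 for w in weights]


def evaluate(weights, x):
    """Evaluate the polynomial sum(w_i * x**i)."""
    return sum(w * x**i for i, w in enumerate(weights))


# Hypothetical trained coefficients: a few large ones, several tiny ones.
weights = [0.5, 2.0, 3e-6, -1.5, -8e-6, 1e-7]
pruned = prune_by_magnitude(weights, 1e-5)

kept = sum(1 for w in pruned if w != 0.0)
grid = [i / 10 for i in range(-10, 11)]  # test inputs in [-1, 1]
max_diff = max(abs(evaluate(weights, x) - evaluate(pruned, x)) for x in grid)

print(f"kept {kept} of {len(weights)} coefficients, max output change {max_diff:.2e}")
```

Note this only tells you what happened on the inputs you tested, and only after training. It says nothing about which coefficients you could have skipped ahead of time.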
> Finding that 70% of attention heads and 20% of feed-forward networks can be excised with minimal effect on in-context learning suggests that large language models are undertrained.
So why do the larger models perform so much better…?
More parameters makes it easier to find solutions with low-energy.
Suppose we have a product of two variables z = x * y. And now suppose that the 'correct' product is z=2, and we're learning x and y. A very good analytical solution is x=1, y=2 (or vice versa), allowing us to eliminate either x or y from our learning problem. The total energy of (x, y) in this case is 1^2 + 2^2 = 5.
However, another solution is x = y = sqrt(2), which has energy 2 + 2 = 4: this solution is closer to the origin. The extra variable means that we have a /surface/ of solutions instead of a unique solution, so we can home in on ones that are easier to get to using our optimizer.
As you add more variables, you can find lower and lower energy solutions.
Consider that we initialize neural networks 'near' zero, and then walk with gradient descent in some direction towards a solution. Then adding lots of extra variables - wiggle room - makes it much easier to find a solution within walking distance of the (noisy) origin.
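A small sketch of that walk, for the z = x * y toy problem. It assumes "energy" means x^2 + y^2 and uses plain gradient descent on the squared error (x*y - 2)^2; the starting point, learning rate and step count are arbitrary choices for illustration.

```python
def gradient_descent(x, y, target=2.0, lr=0.01, steps=2000):
    """Minimize (x*y - target)^2 with plain gradient descent."""
    for _ in range(steps):
        r = x * y - target                          # residual
        grad_x, grad_y = 2.0 * r * y, 2.0 * r * x   # d/dx and d/dy of r^2
        x, y = x - lr * grad_x, y - lr * grad_y     # simultaneous update
    return x, y


def energy(x, y):
    """Squared distance from the origin (assumed meaning of 'energy')."""
    return x * x + y * y


# Start near the origin, slightly off the diagonal.
x, y = gradient_descent(0.1, 0.2)

print(f"x={x:.4f}, y={y:.4f}, x*y={x * y:.6f}, energy={energy(x, y):.4f}")
print(f"energy of the (1, 2) solution: {energy(1.0, 2.0)}")
```

From this start the optimizer lands close to the balanced solution x ≈ y ≈ sqrt(2) rather than (1, 2), i.e. the lower-energy point on the surface of solutions.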
Assuming the models are identical except that one is bigger, the bigger model is better because 70% of a bigger number is larger than 70% of a smaller number.
Now if you train a smaller model much longer than the bigger model (more tokens), then you are reducing the level of "under-trainedness" to some degree. At some point, you may have a smaller model that is better than that larger model.
70% of a bigger number may be larger than 70% of a smaller number but no guarantee 70% of a bigger number is larger than say 90% of a smaller number and so on.
The question sort of implies you couldn't prune the smaller models and see the same thing. So, the answer given is to consider that in both cases, you sort of only use 30% of the model. Bigger is still bigger. The basic intuition of more parameters = better holds.
The question seems to be one of the Chinchilla scaling law. We could train smaller models (than recommended by the law) but with more training tokens, and achieve the same loss. But we would need more compute performance for that.
So the question is, perhaps: Why are big models required for compute-efficient training?
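For a rough feel of the tradeoff, here is a back-of-the-envelope sketch. It assumes the common approximation C ≈ 6 * N * D for training FLOPs and the roughly 20 tokens per parameter rule of thumb often attributed to the Chinchilla paper. Both are approximations, and the function names are mine.

```python
import math

TOKENS_PER_PARAM = 20.0   # rule-of-thumb ratio (assumption)
FLOPS_PER_PARAM_TOKEN = 6.0  # C ~= 6 * N * D approximation (assumption)


def training_flops(n_params, n_tokens):
    """Approximate training compute for N parameters and D tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal_split(flops):
    """Split a compute budget so that D = TOKENS_PER_PARAM * N."""
    n_params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params


def tokens_for_budget(flops, n_params):
    """Tokens a model of a given size can see within the same budget."""
    return flops / (FLOPS_PER_PARAM_TOKEN * n_params)


budget = training_flops(70e9, 1.4e12)          # a 70B model on 1.4T tokens
n_opt, d_opt = compute_optimal_split(budget)
d_half = tokens_for_budget(budget, n_opt / 2)  # half-size model, same budget

print(f"budget: {budget:.3g} FLOPs")
print(f"rule-of-thumb split: {n_opt:.3g} params, {d_opt:.3g} tokens")
print(f"half-size model on the same budget: {d_half:.3g} tokens")
```

This only does the compute bookkeeping. Whether the half-size model trained on twice the tokens reaches the same loss is exactly what the scaling-law fits are about, and the sketch doesn't model that.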
Do you expect a model which was overtrained (relative to the Chinchilla law) to be no more affected by pruning than a model of the same size that wasn't overtrained?
I mean, in some relative (like 10%) or some absolute amount? I think I'd expect the "more trained" model to drop performance by less (as a %, which is hard to define here) but by more in an absolute sense. Which is basically impossible to measure, but even if it were measurable... I don't feel confident about that prediction, it's speculation.
The article mentions that models have a lot of extra information that is unnecessary. You asked why the large ones still outperform small ones. Presumably they all have that inefficiency. But the large ones are still better. 30% of a big number is still bigger than 30% of a small number.