Exploring the Causes & Effects of Quantization-induced Degradation in LLMs
Reece Shuttleworth
MIT
Contents

Introduction

Background

Methods

Results & Analysis

Discussion

Figure 1: Reproducing Quantization-induced Degradation (QiD). Both AWQ and naive zero-point RTN are used to quantize models, and loss is measured on the C4 dataset. We see that as models are trained for longer, quantization error increases. This confirms the findings of [1].

Introduction

Current LLMs are pre-trained far into the overtrained regime: they are trained on many times more tokens than is considered compute optimal (around 20 tokens per model parameter [8]). For example, LLaMA-3 8B [7] was trained on more than 15 trillion tokens, approaching 2,000 tokens per model parameter. Even though it is not compute optimal, this extended pre-training stage has been shown to improve performance and to reduce inference costs at a given level of performance [6].

Another important method for reducing inference costs is quantization, which lowers the precision in which weights (and sometimes activations) are stored. This reduction in precision yields memory and compute savings: the model can be represented in fewer bits, weights can be moved into GPU memory more easily, and computation on lower-precision values is cheaper.

Generally, models are quantized after training, before being served to the public. However, recent work has observed an interesting phenomenon in which models that are pre-trained for longer have higher loss after post-training quantization (PTQ) than models trained on fewer tokens [1, 2]. This creates a clear tension, because pre-training for longer leads to better performance and enables smaller models to be used at a fixed level of performance.

This observation suggests that there is a limit to how much we can quantize a model, which may cap the efficiency gains available from quantization techniques. While the current findings do not impact existing methods, because the degradation appears at very low bit widths such as 3-bit, this observation has the potential to affect current methods as models are pre-trained for longer.

It is not clear why this phenomenon occurs or what its consequences will be. In this work, we aim to investigate the causes and effects of this phenomenon. Previous work has shown that better-performing LLMs exhibit large activations more frequently [4]. These large activations are difficult to quantize without error and are important for model performance [4]. Could they be the culprit behind this degradation? Furthermore, this phenomenon has only been observed in loss. What is its impact on downstream task performance? Lastly, can we find the specific part of the model that is breaking and causing this phenomenon?

Our results can be reproduced with our codebase, which can be found here.

Background

Quantization
Quantization converts values from one precision to a lower precision. This is a lossy process: for example, converting from FP32 to FP16 may cause a number to lose some of its least significant bits. Making this conversion is good for performance, because lower-precision values are easier to store and move, and multiply-accumulate operations (MACs) on them are more efficient.
The goal of model quantization is to reduce the size of the model while preserving performance as much as possible. While there are many ways to do quantization, this work focuses on two: zero-point round-to-nearest (RTN) and activation-aware weight quantization (AWQ). Zero-point RTN rounds every value to its nearest corresponding value in a new range, which is dictated by the maximum and minimum of the set of numbers being quantized, as illustrated below. Going forward, we call this method 'naive' because it is not informed by the activations: it acts only on the weights.
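To make the mechanics concrete, the following is a minimal PyTorch sketch of zero-point RTN. The function name is ours, and we quantize a whole tensor at once for illustration; practical implementations typically quantize per output channel or per group.

```python
import torch

def zero_point_rtn_quantize(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Asymmetric (zero-point) round-to-nearest quantization of a tensor.

    The quantization grid is set by the min and max of the values being
    quantized, so a single outlier stretches the range and coarsens the
    grid for every other value.
    """
    qmax = 2 ** n_bits - 1                    # e.g. 15 levels above zero for 4-bit
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero_point = torch.round(-w_min / scale)  # integer offset so w_min maps to 0
    q = torch.clamp(torch.round(w / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale           # dequantized ("fake-quantized") values

# Example: 4-bit quantize a random weight matrix and inspect the mean error.
w = torch.randn(512, 512)
print((w - zero_point_rtn_quantize(w, n_bits=4)).abs().mean())
```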
In contrast, AWQ implements the same rounding as zero-point RTN, but with a twist: before quantization, weights are scaled up and the corresponding activations are scaled down in order to reduce the quantization error associated with large activations. The scaling factors are typically found via a small search, using a calibration set to identify channels that frequently carry large activations. AWQ has been shown to reduce quantization loss for LLMs.
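Below is a rough sketch of the idea behind AWQ's scaling step, reusing zero_point_rtn_quantize from the sketch above. The per-channel scaling rule and the fixed alpha are simplifications (AWQ searches over the scaling exponent on a calibration set), so treat this as an illustration rather than a reference implementation.

```python
import torch

def awq_style_quantize(w: torch.Tensor, act_abs_mean: torch.Tensor,
                       alpha: float = 0.5, n_bits: int = 4) -> torch.Tensor:
    """Illustrative AWQ-style scaling before RTN quantization.

    `w` has shape (out_features, in_features); `act_abs_mean` has shape
    (in_features,) and holds per-input-channel activation magnitudes from a
    calibration set. Channels that see large activations get their weights
    scaled up, shrinking their relative quantization error.
    """
    s = act_abs_mean.clamp(min=1e-5) ** alpha
    s = s / s.mean()                          # keep scales centered around 1
    w_q = zero_point_rtn_quantize(w * s, n_bits=n_bits)  # quantize scaled weights
    # At inference, the inverse scale 1/s is folded into the preceding op so the
    # layer's output is unchanged: (x / s) @ (W * s)^T == x @ W^T (up to error).
    # Here we simply undo the scaling to compare against the original weights.
    return w_q / s
```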
Quantization-induced Degradation
Recent work has found that models pre-trained for longer have higher loss after post-training quantization (PTQ) than models trained on fewer tokens [1, 2]. Specifically, they find that with 3- to 5-bit weight-only quantization, loss increases more for models pre-trained for longer. This phenomenon is called Quantization-induced Degradation. Eventually, models trained for a long time actually have higher post-quantization loss than those trained for a shorter period of time. This can be observed in Figure 1, where we reproduce the finding.
Large Activations in LLMs
Previous work has found that LLMs, particularly those trained to low perplexity, have large activations. These large activations generally occur in specific channels [4] and for certain tokens, such as space or newline characters [15]. They are difficult to quantize because they stretch the quantization range (as described above) and increase the error. To combat this, techniques like mixed-precision decomposition [4] and AWQ [5] have been developed to handle these large activations and reduce the quantization error.
Figure 2: In contrast to Figure 1, when we measure task performance of the quantized models, we see that 4- and 5-bit quantization performs almost equivalently to full precision. This holds across three datasets and two quantization methods.

Methods

For all of our studies, we use the OLMo-1B [9] model. What makes it powerful for research is that it was trained on 3 trillion tokens and, importantly, checkpoints were taken frequently during training and released to the public. This gives us a sequence of models across pre-training that we can use to study QiD.
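As a sketch of how such intermediate checkpoints can be pulled in, assuming they are exposed as revisions of the Hugging Face repository (the repository name and revision strings below are illustrative and should be checked against the actual repo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "allenai/OLMo-1B-hf"                # assumed repo name
REVISIONS = ["step100000-tokens419B", "main"]  # assumed revision names

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
for rev in REVISIONS:
    # Each revision corresponds to a different point in pre-training.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, revision=rev)
    # ... quantize `model` here and evaluate its loss / task performance ...
```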
Testing the Effect of QiD on Task Performance
Previously, QiD has only been observed on language modelling loss. We are interested in how task performance is impacted by this effect. To measure this, we test the quantized OLMo models across training steps on several common LLM benchmarks. We select PIQA [11], Winogrande [12], and LAMBADA [13], because they have been found to be strong predictors of improved language modelling [3]. We use the lm-evaluation-harness [10] to evaluate the models and measure zero-shot accuracy for all tasks. We describe our results below and report them in Figure 2.
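For reference, a minimal sketch of this kind of zero-shot evaluation with lm-evaluation-harness might look as follows; the task names, exact API, and result keys vary between harness versions, so treat the specifics as assumptions.

```python
import lm_eval  # lm-evaluation-harness (v0.4.x-style API)

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=allenai/OLMo-1B-hf",  # or a locally quantized checkpoint
    tasks=["piqa", "winogrande", "lambada_openai"],
    num_fewshot=0,                               # zero-shot accuracy
)
for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"))
```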
Testing the Impact of Activations on QiD
We look for a link between the activations and QiD in two ways.
First, we study the activations of our OLMo models across training steps by tracking several statistics associated with activation size. The activations we study are produced from a single example of sequence length 512 and are recorded after every weight matrix in the model. We record and report five statistics across training steps: the average across layers of the absolute mean of the activations, the average across layers of the absolute maximum of the activations, and the raw number of activations above 1, 10, and 100, respectively. If the size of activations appears to increase significantly when QiD sets in, then we would have evidence that large activations contribute to the observed phenomenon of QiD. We report our findings below and in Figure 3; a sketch of the measurement follows.
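A minimal sketch of this measurement using forward hooks is below; hooking the output of every nn.Linear is our reading of "after every weight matrix", and the helper name is ours.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def activation_stats(model, input_ids, thresholds=(1.0, 10.0, 100.0)):
    """Record activation-size statistics at the output of every nn.Linear."""
    abs_means, abs_maxes = [], []
    counts = {t: 0 for t in thresholds}

    def hook(_module, _inputs, output):
        a = output.detach().abs()
        abs_means.append(a.mean().item())
        abs_maxes.append(a.max().item())
        for t in thresholds:
            counts[t] += (a > t).sum().item()

    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, nn.Linear)]
    model(input_ids)                      # single example, sequence length 512
    for h in handles:
        h.remove()

    return {
        "avg_abs_mean": sum(abs_means) / len(abs_means),
        "avg_abs_max": sum(abs_maxes) / len(abs_maxes),
        "num_above": counts,
    }
```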
Second, we implicitly study the impact of the activations on quantization error by measuring the difference in quantization error between naive zero-point RTN and AWQ. Since AWQ is activation-aware while our naive method is not, if activations play a big role in this phenomenon, we should expect AWQ to reduce the quantization error of the models. Is it the case that AWQ does not exhibit QiD while naive zero-point RTN does? To test this, we quantize the OLMo-1B models across training steps with both methods and compare the loss and task performance each achieves. If AWQ achieves much lower loss and much better task performance than our naive method, then we would have evidence of the importance of handling activations correctly in order to reduce QiD. These results are found in Figures 1 and 2.
Testing what Model Component is Causing QiD
We are interested in diagnosing exactly why QiD occurs. It may not be the case that the entire model degrades equally; certain components or aspects of the model may be the cause of our observations.
To test whether certain modules in the model show an increase in error compared to full precision, we measure the norm of the difference between the activations output by every weight matrix in the full-precision model and in each quantized model. This can be thought of as measuring the error in the activations incurred at a certain step of the forward pass because of quantization. If a module has constant activation quantization error across training steps, it likely isn't responsible for QiD. In contrast, if a module has increasing activation quantization error, it is likely contributing to QiD. We report this experiment in Figure 4.
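One possible way to compute this activation quantization error with forward hooks is sketched below. It assumes fake quantization, so the quantized model keeps the same nn.Linear modules and module names as the full-precision one; the function name is ours.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def activation_quantization_error(fp_model, q_model, input_ids):
    """Per-module L2 norm of the difference between full-precision and
    quantized activations, taken at the output of every nn.Linear."""
    fp_acts, q_acts, handles = {}, {}, []

    def make_hook(store, name):
        def hook(_module, _inputs, output):
            store[name] = output.detach().float()
        return hook

    for (name, m_fp), (_, m_q) in zip(fp_model.named_modules(),
                                      q_model.named_modules()):
        if isinstance(m_fp, nn.Linear):
            handles.append(m_fp.register_forward_hook(make_hook(fp_acts, name)))
            handles.append(m_q.register_forward_hook(make_hook(q_acts, name)))

    fp_model(input_ids)   # run the same input through both models
    q_model(input_ids)
    for h in handles:
        h.remove()

    return {name: (fp_acts[name] - q_acts[name]).norm().item() for name in fp_acts}
```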
Figure 3: We measure statistics of the activations across OLMo-1B training steps. We find that the size of the activations actually appears to decrease across training steps, suggesting that they are not responsible for QiD.

Results & Analysis

Here, we discuss the results of our experiments as described in the previous section.
Testing the Effect of QiD on Task Performance
Examining Figure 2, we see that 3-bit quantization causes a strong degradation in task performance on all three datasets, irrespective of the quantization method. However, for 4- and 5-bit quantization, both the naive method and AWQ match or nearly match the performance of full precision across OLMo's training steps, even when loss is not similar due to QiD. This is an interesting finding: it shows that QiD does not affect everything the model does, and it suggests that more research is needed into what QiD is doing to the model and what it impacts.
This also suggests that language modelling loss may not be a perfect metric for judging which practices are good or bad for LLMs: what we ultimately care about is task performance, and loss here may be leading us astray.
Testing the Impact of Activations on QiD
When examining the statistics we measured from the activations across training steps, we actually observe the reverse of what we expected: the size of the activations appears to decrease across training steps, not increase. This holds especially well for the average absolute mean of the activations and the raw number of activations above 1. Activations may not be the culprit behind QiD.

When examining the difference between naive quantization and AWQ, we find something similar: naive quantization and AWQ show very little difference in both loss (Figure 1) and task performance (Figure 2). This further suggests that activations are unlikely to be causing QiD, since naive quantization performs as well as AWQ.

Taken together, these two results give us strong reason to believe that activations are unlikely to be the cause of QiD.
Testing what Model Component is Causing QiD
Studying Figure 4, we look for patterns that change across training steps, since QiD worsens as training progresses. Across training steps and quantization levels, the norm of the activation error is quite similar for the attention output, feed-forward projection, and feed-forward output layers, regardless of the quantization technique used. However, the norm of the activation error of the attention projection module clearly grows across training steps. This holds independent of the quantization technique and the number of bits used. This suggests that the attention projection module is responsible for QiD, while the other three modules show no notable change in activation error across training steps.
This finding suggests several remedies for future work. Could keeping the attention projection layer in higher precision, say 8-bit, while keeping the other layers in low precision help alleviate QiD? A rough sketch of this idea is shown below.
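Purely as a speculative sketch, reusing zero_point_rtn_quantize from the Background section and assuming the attention projection modules can be matched by the substring "att_proj" (this follows OLMo's module naming and would differ in other checkpoint formats):

```python
import torch.nn as nn

def mixed_precision_quantize(model, high_bits=8, low_bits=3):
    """Keep attention projection weights at higher precision while quantizing
    all other linear layers aggressively (fake quantization, in place)."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            bits = high_bits if "att_proj" in name else low_bits  # assumed module name
            module.weight.data = zero_point_rtn_quantize(module.weight.data, n_bits=bits)
    return model
```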
Figure 4: We measure the norm of the difference between the full-precision and quantized activations for each layer number and type; we consider this the activation quantization error. Across training steps, the attention output, feed-forward projection, and feed-forward output layers have similar error, whereas the attention projection layer (red line) has error that grows across training steps. This holds across different quantization levels, suggesting that the attention projection layer is responsible for QiD.

Discussion

In this work, we investigated the impact of QiD on task performance, the impact of activations on QiD, and which model component is responsible for QiD. We found that although loss degrades across training steps when the model is quantized, task performance stays similar to that of full precision. We found that large activations are not responsible for QiD. Lastly, we found that the attention projection layer seems to be the culprit behind QiD.
If large activations are not responsible for QiD, what is? Completely speculatively, it may be that as the model is trained for longer and relies more and more on superposition [14], quantization deletes information in a way that harms the model more the longer it has been trained.
This work has several limitations. Most notably, we only studied one model. Further, this model is not very large, meaning certain phenomena may occur in it but not in larger, state-of-the-art models. Another limitation is that we have not isolated the causes of what we observed: currently, these are only observations. We leave replicating our study on OLMo-7B, and examining whether keeping the attention projection module in full precision eliminates QiD, to future work.

References:

[1] Scaling Laws For Precision, Kumar et al. 2024

[2] Low-Bit Quantization Favors Undertrained LLMs, Ouyang et al. 2024

[3] Calibrating the Mosaic Evaluation Gauntlet, Tessa Barton, 2024

[4] LLM.int8(), Dettmers et al. 2022

[5] AWQ, Lin et al. 2023

[6] LLaMA, Touvron et al. 2023

[7] LLaMA3, Grattafiori et al. 2024

[8] Chinchilla, Hoffmann et al. 2022

[9] OLMo, Groeneveld et al. 2024

[10] lm-evaluation-harness, EleutherAI 2024

[11] PIQA, Bisk et al. 2019

[12] Winogrande, Sakaguchi et al. 2019

[13] LAMBADA, Paperno et al. 2016

[14] Toy Models of Superposition, Elhage et al. 2022

[15] Massive Activations in Large Language Models, Sun et al. 2024