QLORA: Efficient Finetuning of Quantized LLMs
This paper introduces QLORA, a parameter-efficient finetuning approach for customising large language models (LLMs) with significantly reduced memory requirements.
QLORA combines 4-bit quantization of the pretrained model with Low-Rank Adapters (LoRA) to enable finetuning of a 65B-parameter model on a single 48GB GPU, without sacrificing performance compared to full 16-bit finetuning. It introduces three key innovations:
4-bit NormalFloat (NF4): An information-theoretically optimal data type for normally distributed weights.
Double Quantization: Quantizing the quantization constants to reduce memory footprint further.
Paged Optimizers: Managing memory spikes during training using NVIDIA unified memory.
The authors use QLORA to finetune over 1,000 models, demonstrating state-of-the-art results with their Guanaco model family.
Guanaco reaches 99.3% of ChatGPT's performance on the Vicuna benchmark while being trainable on a single GPU.
The extensive analysis reveals several key findings:
Data quality is more important than dataset size for instruction finetuning and chatbot performance.
Strong performance on the MMLU benchmark does not necessarily imply strong chatbot performance, highlighting the importance of task-specific datasets.
GPT-4 evaluations largely agree with human evaluations in ranking chatbot performance, offering a cheaper alternative to human annotation, albeit with some uncertainties.
The authors release their codebase, CUDA kernels, and integrate their methods into the Hugging Face transformers library, making QLORA accessible to the community. They also release 32 finetuned models across various sizes and instruction datasets.
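As a pointer to how this Hugging Face integration is typically used, here is a minimal finetuning-setup sketch; the base model name, LoRA rank, and target module names are illustrative assumptions rather than the paper's exact configuration, and the argument names follow the transformers/peft/bitsandbytes libraries.

```python
# Hypothetical usage sketch of 4-bit NF4 loading plus LoRA adapters via
# transformers + peft + bitsandbytes; model name and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # 4-bit NormalFloat storage data type
    bnb_4bit_use_double_quant=True,          # Double Quantization of the constants
    bnb_4bit_compute_dtype=torch.bfloat16,   # BF16 computation data type
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                   # assumed base model
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.1, task_type="CAUSAL_LM",
    # LoRA applied to all linear transformer block layers, as the paper recommends
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only the LoRA weights are trainable
```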
In summary, the QLORA paper introduces a groundbreaking approach to efficiently finetune large language models, democratising access to LLM finetuning and enabling in-depth analysis of instruction finetuning and chatbot performance at unprecedented scales.
The open-source release of the code and models further contributes to the advancement of the field.
To provide more context on quantization and its mathematical foundations, let's dive deeper into the background and explain dequantization and the potential risks involved.
Quantization is a technique used to reduce the precision of numerical representations, typically by mapping a larger set of values to a smaller set.
In the context of deep learning, quantization is often applied to model weights and activations, converting them from higher-precision data types (e.g., 32-bit floating-point) to lower-precision data types (e.g., 8-bit integers).
This reduces memory consumption and can accelerate computations, especially on hardware optimized for lower-precision arithmetic.
The quantization process involves scaling the input values to fit within the range of the target data type.
For example, when quantizing a 32-bit floating-point tensor $X^{\text{FP32}}$ to an 8-bit integer tensor with a range of $[-127, 127]$, the quantization formula is:

$$X^{\text{Int8}} = \operatorname{round}\!\left(\frac{127}{\operatorname{absmax}(X^{\text{FP32}})} \cdot X^{\text{FP32}}\right) = \operatorname{round}\!\left(c^{\text{FP32}} \cdot X^{\text{FP32}}\right)$$

Here, $\operatorname{absmax}(X^{\text{FP32}})$ represents the absolute maximum value in the input tensor, and the scaling factor $c^{\text{FP32}} = 127 / \operatorname{absmax}(X^{\text{FP32}})$ is called the quantization constant or quantization scale.
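As a quick illustration of this formula, here is a minimal NumPy sketch (the tensor values are arbitrary examples):

```python
# Minimal sketch of absmax Int8 quantization, following the formula above.
import numpy as np

def quantize_absmax_int8(x_fp32: np.ndarray):
    c = 127.0 / np.abs(x_fp32).max()                 # quantization constant c
    x_int8 = np.round(c * x_fp32).astype(np.int8)    # round and cast to Int8
    return x_int8, c

x = np.array([0.1, -0.5, 2.0, -0.03], dtype=np.float32)
x_q, c = quantize_absmax_int8(x)
print(x_q, c)   # [  6 -32 127  -2], c = 63.5
```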
To mitigate the impact of outliers on the quantization process, block-wise quantization is employed.
The input tensor is divided into smaller blocks, and each block is quantized independently with its own quantization constant.
This ensures better utilization of the available quantization bins.
Dequantization is the inverse process of quantization, where the quantized values are mapped back to their original data type. The dequantization formula for the example above is:

$$\operatorname{dequant}(c^{\text{FP32}}, X^{\text{Int8}}) = \frac{X^{\text{Int8}}}{c^{\text{FP32}}} \approx X^{\text{FP32}}$$

Here, $c^{\text{FP32}}$ is the quantization constant used during the quantization step.
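Building on the same idea, a minimal sketch of block-wise quantization and dequantization might look as follows (a blocksize of 4 is chosen only for illustration; QLORA uses 64 for the weight tensor):

```python
# Sketch of block-wise absmax quantization and its inverse (dequantization).
import numpy as np

def quantize_blockwise(x_fp32: np.ndarray, blocksize: int = 4):
    blocks = x_fp32.reshape(-1, blocksize)
    c = 127.0 / np.abs(blocks).max(axis=1, keepdims=True)   # one constant per block
    x_int8 = np.round(c * blocks).astype(np.int8)
    return x_int8, c

def dequantize_blockwise(x_int8: np.ndarray, c: np.ndarray) -> np.ndarray:
    return (x_int8.astype(np.float32) / c).reshape(-1)      # X_int8 / c ≈ X_fp32

x = np.random.randn(16).astype(np.float32)
x_q, c = quantize_blockwise(x)
x_hat = dequantize_blockwise(x_q, c)
print(np.abs(x - x_hat).max())   # small per-block quantization error
```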
Risks and Considerations:
Information Loss: Quantization inherently leads to a loss of information due to the reduced precision. This can affect the model's accuracy and performance, especially if the quantization is too aggressive.
Quantization Noise: The quantization process introduces noise into the model, as the original values are approximated by the quantized values. This noise can accumulate across layers and impact the model's behavior.
Outliers and Range: Outliers in the input tensor can significantly affect the quantization process, leading to poor utilization of the available quantization bins. Block-wise quantization helps mitigate this issue, but it's still important to consider the range of values in the tensor.
Hardware Compatibility: While quantization can lead to memory savings and computational speedups, the target hardware must support the specific quantized data types and operations. Not all hardware platforms have efficient support for low-precision arithmetic.
Quantization-Aware Training: To achieve optimal performance with quantized models, quantization-aware training techniques can be employed. These techniques simulate the quantization process during training, allowing the model to adapt to the quantization noise and minimize its impact on accuracy.
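For the quantization-aware training point above, a common implementation pattern is "fake quantization" with a straight-through estimator; the PyTorch sketch below is a generic illustration of that pattern, not the method used in the QLORA paper.

```python
# Fake quantization with a straight-through estimator (STE): the forward pass
# sees quantize-then-dequantize weights, while gradients flow to the FP weights.
import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
        qmax = 2 ** (num_bits - 1) - 1
        scale = w.abs().max() / qmax                       # absmax scaling, as above
        w_q = torch.round(w / scale).clamp(-qmax, qmax)    # simulate quantization
        return w_q * scale                                 # dequantized weights

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        return grad_output, None                           # pass gradients straight through

# Hypothetical usage inside a layer's forward pass:
# y = x @ FakeQuantSTE.apply(self.weight).t()
```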
Despite these risks, quantization remains a powerful technique for reducing the memory footprint and computational requirements of deep learning models.
By carefully considering the trade-offs and employing appropriate quantization strategies, such as block-wise quantization and quantization-aware training, the impact of quantization on model performance can be minimized while realizing significant efficiency gains.
4-bit NormalFloat Quantization
The authors observe that pretrained neural network weights usually follow a zero-centred normal distribution with a standard deviation σ.
This means that the weights are symmetrically distributed around zero, and the spread of the distribution is determined by the standard deviation.
To optimise the quantization process for such normally distributed weights, they introduce 4-bit NormalFloat (NF4) quantization.
The idea is to create a quantization scheme that is information-theoretically optimal for zero-mean normal distributions.
The process involves:
a. Estimating the quantiles of a standard normal distribution to obtain a k-bit quantile quantization data type
b. Normalizing the data type values into the range $[-1, 1]$
c. Quantizing the input weight tensor by normalizing it into the range [-1, 1] using absolute maximum rescaling.
The equation

$$q_i = \frac{1}{2}\left( Q_X\!\left(\frac{i}{2^k+1}\right) + Q_X\!\left(\frac{i+1}{2^k+1}\right) \right)$$

estimates the quantile values $q_i$ for the data type, where $Q_X(\cdot)$ is the quantile function of the standard normal distribution $N(0, 1)$.
To ensure an exact representation of zero, they create an asymmetric data type by estimating the quantiles separately for the negative and positive parts and then unifying the sets while removing one of the duplicate zeros.
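A minimal sketch of constructing the 16 NF4 levels in this way is shown below; it is a simplified approximation, and the `offset` value is borrowed from the reference implementation rather than derived here.

```python
# Sketch: estimate the 2**k quantile levels of N(0, 1), normalised to [-1, 1],
# with separate negative/positive halves so that zero is represented exactly.
import numpy as np
from scipy.stats import norm

def nf4_levels(k: int = 4, offset: float = 0.9677083) -> np.ndarray:
    n = 2 ** k
    # Positive half: 2**(k-1) + 1 quantiles, from the median (0) outwards.
    pos = norm.ppf(np.linspace(0.5, offset, n // 2 + 1))
    # Negative half: 2**(k-1) quantiles, mirrored below zero.
    neg = -norm.ppf(np.linspace(0.5, offset, n // 2))[::-1]
    levels = np.concatenate([neg[:-1], pos])   # unify, dropping the duplicate zero
    return levels / np.abs(levels).max()       # normalise into [-1, 1]

print(nf4_levels())   # 16 values in [-1, 1]; exactly one of them is 0.0
```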
Double Quantization
Double Quantization (DQ) is introduced to reduce the memory footprint of the quantization constants. It involves quantizing the quantization constants themselves.
The process works as follows:
a. The quantization constants $c_2^{\text{FP32}}$ from the first quantization are treated as inputs to a second quantization.
b. The second quantization yields the quantized quantization constants $c_2^{\text{FP8}}$ and the second level of quantization constants $c_1^{\text{FP32}}$.
c. 8-bit floats with a blocksize of 256 are used for the second quantization to avoid performance degradation.
d. Since the $c_2^{\text{FP32}}$ values are positive, the mean is subtracted from $c_2$ before quantization to centre the values around zero and enable symmetric quantization.
This double quantization reduces the memory footprint of the quantization constants from 0.5 bits per parameter (one 32-bit constant per 64-weight block) to 0.127 bits per parameter, a saving of 0.373 bits per parameter.
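The arithmetic behind these numbers can be checked directly:

```python
# Memory overhead of the quantization constants, per parameter.
blocksize_w, blocksize_c = 64, 256
before = 32 / blocksize_w                                    # one FP32 constant per 64 weights
after = 8 / blocksize_w + 32 / (blocksize_w * blocksize_c)   # FP8 constants + 2nd-level FP32
print(before, after, before - after)                         # 0.5, ~0.127, ~0.373 bits
```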
QLORA
QLORA combines the 4-bit NormalFloat quantization, Double Quantization, and Low-Rank Adapters (LoRA) to achieve efficient 4-bit quantization.
For a single linear layer in the quantized base model with a single LoRA adapter, QLORA is defined as:

$$Y^{\text{BF16}} = X^{\text{BF16}}\,\text{doubleDequant}(c_1^{\text{FP32}}, c_2^{\text{k-bit}}, W^{\text{NF4}}) + X^{\text{BF16}} L_1^{\text{BF16}} L_2^{\text{BF16}}$$

where doubleDequant(·) is the double dequantization process:

$$\text{doubleDequant}(c_1^{\text{FP32}}, c_2^{\text{k-bit}}, W^{\text{k-bit}}) = \text{dequant}(\text{dequant}(c_1^{\text{FP32}}, c_2^{\text{k-bit}}), W^{\text{4bit}}) = W^{\text{BF16}}$$
QLORA uses NF4 for the weights $W$ and FP8 for the quantization constants $c_2$.
The blocksize is set to 64 for $W$ for higher quantization precision and to 256 for $c_2$ to conserve memory.
During the backward pass, only the gradients with respect to the LoRA adapter weights, $\partial E / \partial L_i$, are computed, not the gradients with respect to the 4-bit weights, $\partial E / \partial W$.
However, computing $\partial E / \partial L_i$ involves calculating $\partial X / \partial W$, which requires dequantizing the storage data type $W^{\text{NF4}}$ to the computation data type $W^{\text{BF16}}$.
In summary, QLORA uses 4-bit NormalFloat as the storage data type and 16-bit BrainFloat as the computation data type.
The storage data type is dequantized to the computation data type for the forward and backward passes, but gradients are only computed for the LoRA parameters in 16-bit precision.
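To make this data-type split concrete, the sketch below is a schematic (not the bitsandbytes/PEFT implementation) of such a layer: the frozen base weight is represented by its already-dequantized BF16 form, standing in for doubleDequant(c1, c2, W_NF4), and only the LoRA factors are trainable.

```python
# Schematic QLORA linear layer: frozen (de)quantized base path + trainable LoRA path.
import torch
import torch.nn as nn

class QLoRALinearSketch(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 16):
        super().__init__()
        # Stand-in for doubleDequant(c1, c2, W_NF4): a real implementation stores
        # the 4-bit codes and their constants and dequantizes them on the fly.
        self.register_buffer(
            "w_bf16", torch.zeros(out_features, in_features, dtype=torch.bfloat16))
        # LoRA factors L1, L2 in BF16 -- the only trainable parameters.
        self.lora_A = nn.Parameter(
            0.01 * torch.randn(rank, in_features, dtype=torch.bfloat16))
        self.lora_B = nn.Parameter(
            torch.zeros(out_features, rank, dtype=torch.bfloat16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: BF16 activations
        base = x @ self.w_bf16.t()                        # frozen base-model path
        lora = (x @ self.lora_A.t()) @ self.lora_B.t()    # low-rank adapter path
        return base + lora
```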
To compare QLORA with standard finetuning, the authors conduct experiments on various architectures (encoder, encoder-decoder, and decoder-only) and model sizes (up to 3B parameters).
LoRA Adapters: The authors find that applying LoRA to all linear transformer block layers is crucial to match the performance of full finetuning. The number of LoRA adapters used is the most critical hyperparameter.
Hyperparameter Tuning: Default hyperparameters for fully finetuned baselines are often undertuned. The authors perform a hyperparameter search over learning rates (1e-6 to 5e-5) and batch sizes (8 to 128) to establish robust baselines.
NF4 significantly improves performance over FP4 and Int4 data types. Double quantization reduces the memory footprint without degrading performance.
The authors find that 4-bit QLORA with the NF4 data type matches the performance of both 16-bit full finetuning and 16-bit LoRA finetuning on academic benchmarks. This holds true for various model sizes (125M to 65B parameters) and datasets (GLUE, Super-Natural Instructions, Alpaca, and FLAN v2).
In line with previous work on quantization, the authors observe that, for a given finetuning and inference resource budget, it is beneficial to increase the number of parameters in the base model while decreasing their precision. This underscores the importance of the efficiency gains that QLORA provides.
Key Findings
QLORA with NF4 replicates both 16-bit full finetuning and 16-bit LoRA finetuning performance.
NF4 is superior to FP4 in terms of quantization precision.
Double quantization does not degrade performance.
The authors' results consistently show that 4-bit QLORA with the NF4 data type matches the performance of 16-bit methods while offering significant memory savings.
This allows for the exploration of instruction tuning at scales that would be impossible with full 16-bit finetuning on academic research hardware.
Lack of comparison with full 16-bit finetuning at larger scales
While the authors provide evidence that QLORA can replicate 16-bit full finetuning performance with a 4-bit base model and Low-rank Adapters (LoRA), they did not establish this at the 33B and 65B scales due to the immense resource costs involved.
The authors evaluated QLORA on MMLU, the Vicuna benchmark, and the OA benchmark, but did not evaluate on other benchmarks such as BigBench, RAFT, and HELM, so there is no guarantee that their findings generalise to those benchmarks.
Benchmark performance likely depends on how similar the finetuning data is to the benchmark dataset. This highlights the need for better benchmarks and evaluation methods, as well as careful consideration of what is being evaluated in the first place.
While the authors evaluate the likelihood that Guanaco-65B generates a socially biased sequence of tokens compared to other models, it is unclear whether Guanaco performs well when assessed on other types of bias.