
QLORA: Efficient Finetuning of Quantized LLMs

This paper introduces QLORA, a parameter-efficient finetuning approach for customising large language models (LLMs) with significantly reduced memory requirements.

QLORA combines 4-bit quantization of the pretrained model with Low Rank Adapters (LoRA) to enable finetuning of a 65B parameter model on a single 48GB GPU, without sacrificing performance compared to full 16-bit finetuning.

Key innovations of QLORA include:

4-bit NormalFloat (NF4): An information-theoretically optimal data type for normally distributed weights.

Double Quantization: Quantizing the quantization constants to reduce memory footprint further.

Paged Optimizers: Managing memory spikes during training using NVIDIA unified memory.

The authors use QLORA to finetune over 1,000 models, demonstrating state-of-the-art results with their Guanaco model family.

Guanaco reaches 99.3% of ChatGPT's performance on the Vicuna benchmark while being trainable on a single GPU.

The extensive analysis reveals several key findings:

  1. Data quality is more important than dataset size for instruction finetuning and chatbot performance.

  2. Strong performance on the MMLU benchmark does not necessarily imply strong chatbot performance, highlighting the importance of task-specific datasets.

  3. GPT-4 evaluations largely agree with human evaluations in ranking chatbot performance, offering a cheaper alternative to human annotation, albeit with some uncertainties.

The authors release their codebase, CUDA kernels, and integrate their methods into the Hugging Face transformers library, making QLORA accessible to the community. They also release 32 finetuned models across various sizes and instruction datasets.

In summary, the QLORA paper introduces a groundbreaking approach to efficiently finetune large language models, democratising access to LLM finetuning and enabling in-depth analysis of instruction finetuning and chatbot performance at unprecedented scales.

The open-source release of the code and models further contributes to the advancement of the field.

Background

To provide more context on quantization and its mathematical foundations, let's dive deeper into the background and explain dequantization and the potential risks involved.

Quantization is a technique used to reduce the precision of numerical representations, typically by mapping a larger set of values to a smaller set.

In the context of deep learning, quantization is often applied to model weights and activations, converting them from higher-precision data types (e.g., 32-bit floating-point) to lower-precision data types (e.g., 8-bit integers).

This reduces memory consumption and can accelerate computations, especially on hardware optimized for lower-precision arithmetic.

Block-wise k-bit Quantization

The quantization process involves scaling the input values to fit within the range of the target data type.

For example, when quantizing a 32-bit floating-point tensor to an 8-bit integer tensor with a range of [-127, 127], the quantization formula is:

$$X^{\text{Int8}} = \mathrm{round}\left(\frac{127}{\mathrm{absmax}(X^{\text{FP32}})} \cdot X^{\text{FP32}}\right)$$

Here, $\mathrm{absmax}(X^{\text{FP32}})$ represents the absolute maximum value in the input tensor.

The scaling factor, $127 / \mathrm{absmax}(X^{\text{FP32}})$, is called the quantization constant or quantization scale, denoted as $c$.

To mitigate the impact of outliers on the quantization process, block-wise quantization is employed.

The input tensor is divided into smaller blocks, and each block is quantized independently with its own quantization constant.

This ensures better utilization of the available quantization bins.

Dequantization

Dequantization is the inverse process of quantization, where the quantized values are mapped back to their original data type. The dequantization formula for the example above is:

$$X^{\text{FP32}} = X^{\text{Int8}} / c$$

Here, $c$ is the quantization constant used during the quantization step.
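To make this concrete, below is a minimal NumPy sketch of block-wise absmax quantization and dequantization. The function names, the padding strategy, and the blocksize of 64 are illustrative assumptions, not the paper's CUDA implementation.

```python
import numpy as np

def quantize_blockwise(x_fp32, block_size=64):
    """Block-wise absmax quantization of a float32 array to int8.

    Returns the int8 codes and one quantization constant per block.
    """
    x = np.asarray(x_fp32, dtype=np.float32).ravel()
    pad = (-len(x)) % block_size                      # pad so length is a multiple of block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    absmax[absmax == 0] = 1.0                         # avoid division by zero for all-zero blocks
    c = 127.0 / absmax                                # quantization constant per block
    q = np.round(blocks * c).astype(np.int8)          # X_Int8 = round(127 / absmax(X_FP32) * X_FP32)
    return q, c

def dequantize_blockwise(q_int8, c, original_len):
    """Inverse mapping: X_FP32 ≈ X_Int8 / c, then strip any padding."""
    return (q_int8.astype(np.float32) / c).ravel()[:original_len]

# Round-trip a random tensor and inspect the quantization error.
w = np.random.randn(1000).astype(np.float32)
q, c = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, c, len(w))
print("max abs error:", np.abs(w - w_hat).max())
```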

Risks and Considerations:

Information Loss: Quantization inherently leads to a loss of information due to the reduced precision. This can affect the model's accuracy and performance, especially if the quantization is too aggressive.

Quantization Noise: The quantization process introduces noise into the model, as the original values are approximated by the quantized values. This noise can accumulate across layers and impact the model's behavior.

Outliers and Range: Outliers in the input tensor can significantly affect the quantization process, leading to poor utilization of the available quantization bins. Block-wise quantization helps mitigate this issue, but it's still important to consider the range of values in the tensor.

Hardware Compatibility: While quantization can lead to memory savings and computational speedups, the target hardware must support the specific quantized data types and operations. Not all hardware platforms have efficient support for low-precision arithmetic.

Quantization-Aware Training: To achieve optimal performance with quantized models, quantization-aware training techniques can be employed. These techniques simulate the quantization process during training, allowing the model to adapt to the quantization noise and minimize its impact on accuracy.
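As an illustration of that last point, here is a minimal PyTorch sketch of the core idea behind quantization-aware training: simulate the quantize-then-dequantize round trip in the forward pass while letting gradients flow through unchanged (a straight-through estimator). Real QAT schemes additionally track activation ranges and per-channel scales; this is only a sketch.

```python
import torch

class FakeQuantInt8(torch.autograd.Function):
    """Simulated int8 absmax quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, x):
        scale = 127.0 / x.abs().max().clamp(min=1e-8)
        return torch.round(x * scale) / scale     # quantize, then immediately dequantize

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                        # straight-through: treat round() as identity

x = torch.randn(8, requires_grad=True)
loss = FakeQuantInt8.apply(x).sum()
loss.backward()
print(x.grad)                                     # gradients flow as if no quantization happened
```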

Despite these risks, quantization remains a powerful technique for reducing the memory footprint and computational requirements of deep learning models.

By carefully considering the trade-offs and employing appropriate quantization strategies, such as block-wise quantization and quantization-aware training, the impact of quantization on model performance can be minimized while realizing significant efficiency gains.

Different components of QLORA

4-bit NormalFloat Quantization

The authors observe that pretrained neural network weights usually follow a zero-centred normal distribution with a standard deviation σ.

This means that the weights are symmetrically distributed around zero, and the spread of the distribution is determined by the standard deviation.

To optimise the quantization process for such normally distributed weights, they introduce 4-bit NormalFloat (NF4) quantization.

The idea is to create a quantization scheme that is information-theoretically optimal for zero-mean normal distributions.

The process involves:

a. Estimating the $2^k + 1$ quantiles of a standard normal distribution $N(0, 1)$ to obtain a k-bit quantile quantization data type.

b. Normalizing the data type values into the range [-1, 1].

c. Quantizing the input weight tensor by normalizing it into the range [-1, 1] using absolute maximum rescaling.

The quantile values $q_i$ of the data type are estimated as $q_i = \frac{1}{2}\left(Q_X\!\left(\frac{i}{2^k+1}\right) + Q_X\!\left(\frac{i+1}{2^k+1}\right)\right)$, where $Q_X(\cdot)$ is the quantile function of the standard normal distribution.

To ensure an exact representation of zero, they create an asymmetric data type by estimating the quantiles separately for the negative and positive parts and then unifying the sets while removing one of the duplicate zeros.
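The construction above can be sketched in a few lines of Python. This is an illustrative approximation: it assumes SciPy's `norm.ppf` as the quantile function $Q_X$ and uses a simple probability offset to avoid the infinite quantiles at 0 and 1; the paper's exact offsets and the averaging of adjacent quantiles are omitted.

```python
import numpy as np
from scipy.stats import norm

def build_nf4_style_codebook(k=4, offset=0.99):
    # 2^(k-1) quantiles for the negative half (ending at zero) and 2^(k-1) + 1 for the
    # positive half (starting at zero); merging the halves and dropping the duplicate
    # zero yields 2^k levels with an exact zero, as described above.
    n_neg = 2 ** (k - 1)
    n_pos = 2 ** (k - 1) + 1
    neg = norm.ppf(np.linspace(1 - offset, 0.5, n_neg))   # last value is exactly 0
    pos = norm.ppf(np.linspace(0.5, offset, n_pos))       # first value is exactly 0
    levels = np.concatenate([neg[:-1], pos])              # drop the duplicate zero -> 2^k levels
    return levels / np.abs(levels).max()                  # normalise into [-1, 1]

def quantize_to_codebook(w_block, codebook):
    # Absmax-rescale a block into [-1, 1], then map each value to the nearest level.
    c = np.abs(w_block).max()
    idx = np.abs(w_block[:, None] / c - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), c

codebook = build_nf4_style_codebook()
print(codebook)                      # 16 values in [-1, 1], including an exact 0.0

w = np.random.randn(64).astype(np.float32)
idx, c = quantize_to_codebook(w, codebook)
w_hat = codebook[idx] * c            # dequantize: look up the level and undo the rescaling
print(np.abs(w - w_hat).max())
```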

Double Quantization

Double Quantization (DQ) is introduced to reduce the memory footprint of the quantization constants. It involves quantizing the quantization constants themselves.

The process works as follows:

a. The quantization constants $c_2^{\text{FP32}}$ from the first quantization are treated as inputs to a second quantization.

b. The second quantization yields the quantized quantization constants $c_2^{\text{FP8}}$ and the second-level quantization constants $c_1^{\text{FP32}}$.

c. 8-bit Floats with a blocksize of 256 are used for the second quantization to avoid performance degradation.

d. Since the $c_2^{\text{FP32}}$ values are positive, the mean is subtracted from $c_2$ before quantization to centre the values around zero and enable symmetric quantization.

This double quantization reduces the memory footprint of the quantization constants from 0.5 bits per parameter to 0.127 bits per parameter, a saving of 0.373 bits per parameter.
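As a quick check of these numbers: with a blocksize of 64, the 32-bit first-level constants cost 32/64 = 0.5 bits per parameter, while quantizing them to 8 bits and keeping one 32-bit second-level constant per block of 256 gives

$$\frac{8}{64} + \frac{32}{64 \cdot 256} = 0.125 + 0.002 \approx 0.127 \text{ bits per parameter.}$$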

QLORA

QLORA combines the 4-bit NormalFloat quantization, Double Quantization, and Low-Rank Adapters (LoRA) to achieve efficient 4-bit quantization.

For a single linear layer in the quantized base model with a single LoRA adapter, QLORA is defined as:

$$Y^{\text{BF16}} = X^{\text{BF16}}\,\mathrm{doubleDequant}(c_1^{\text{FP32}}, c_2^{\text{k-bit}}, W^{\text{NF4}}) + X^{\text{BF16}} L^{\text{BF16}}$$

where doubleDequant(·) is the double dequantization process:

$$\mathrm{doubleDequant}(c_1^{\text{FP32}}, c_2^{\text{k-bit}}, W^{\text{k-bit}}) = \mathrm{dequant}\big(\mathrm{dequant}(c_1^{\text{FP32}}, c_2^{\text{k-bit}}),\, W^{\text{4bit}}\big) = W^{\text{BF16}}$$

QLORA uses NF4 for the weights $W$ and FP8 for the quantization constants $c_2$.

The blocksize is set to 64 for $W$ for higher precision and 256 for $c_2$ to conserve memory.

During the backward pass, only the gradients with respect to the LoRA adapter weights, $\partial E/\partial L_i$, are computed, not the gradients with respect to the 4-bit weights, $\partial E/\partial W$.

However, computing $\partial E/\partial L_i$ involves calculating $\partial X/\partial W$, which requires dequantizing the storage data type $W^{\text{NF4}}$ to the computation data type $W^{\text{BF16}}$.

In summary, QLORA uses 4-bit NormalFloat as the storage data type and 16-bit BrainFloat as the computation data type.

The storage data type is dequantized to the computation data type for the forward and backward passes, but gradients are only computed for the LoRA parameters in 16-bit precision.
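Putting the pieces together, the forward pass of a single QLORA linear layer can be sketched as follows. This NumPy sketch is not the authors' implementation: it assumes the 4-bit codes have already been unpacked into integer codebook indices and that the per-block constants have already been double-dequantized, and all names, shapes, and the stand-in codebook are placeholders.

```python
import numpy as np

def dequantize_weight(w_idx, block_constants, codebook, shape):
    """Recover a floating-point weight matrix from codebook indices and per-block constants."""
    blocks = codebook[w_idx].reshape(len(block_constants), -1)   # look up levels block by block
    return (blocks * block_constants[:, None]).reshape(shape).astype(np.float32)

def qlora_linear_forward(x, w_idx, block_constants, codebook, w_shape, lora_a, lora_b, scaling=1.0):
    """Y = X · dequant(W) + X · A · B: frozen quantized base weight plus trainable LoRA path."""
    w = dequantize_weight(w_idx, block_constants, codebook, w_shape)
    return x @ w + (x @ lora_a) @ lora_b * scaling

# Toy usage: d_in = 64, d_out = 32, LoRA rank r = 8, blocksize 64.
rng = np.random.default_rng(0)
codebook = np.linspace(-1, 1, 16)                    # stand-in for the 16 NF4 levels
w_shape, block = (64, 32), 64
w_idx = rng.integers(0, 16, size=w_shape[0] * w_shape[1], dtype=np.uint8)
block_constants = rng.random(w_shape[0] * w_shape[1] // block).astype(np.float32)
lora_a = (0.01 * rng.standard_normal((64, 8))).astype(np.float32)
lora_b = np.zeros((8, 32), dtype=np.float32)         # LoRA B is initialised to zero
x = rng.standard_normal((4, 64)).astype(np.float32)
y = qlora_linear_forward(x, w_idx, block_constants, codebook, w_shape, lora_a, lora_b)
print(y.shape)                                       # (4, 32)
```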

QLORA vs. Standard Finetuning

To compare QLORA with standard finetuning, the authors conduct experiments on various architectures (encoder, encoder-decoder, and decoder-only) and model sizes (up to 3B parameters).

Best Practices

LoRA Adapters: The authors find that applying LoRA to all linear transformer block layers is crucial to match the performance of full finetuning. The number of LoRA adapters used is the most critical hyperparameter.

Hyperparameter Tuning: Default hyperparameters for fully finetuned baselines are often undertuned. The authors perform a hyperparameter search over learning rates (1e-6 to 5e-5) and batch sizes (8 to 128) to establish robust baselines.
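To make the first point concrete, here is a hedged sketch of this setup using the Hugging Face transformers, peft, and bitsandbytes integration mentioned earlier. Exact argument names vary across library versions, and both the model id and the target module names are placeholders for a LLaMA-style decoder.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 storage, double quantization, BF16 computation -- the QLORA recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Apply LoRA to all linear layers of the transformer block, as recommended above.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the LoRA adapters are trainable
```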

Comparison

4-bit NormalFloat (NF4) vs. 4-bit Floating Point (FP4)

NF4 significantly improves performance over FP4 and Int4 data types. Double quantization reduces the memory footprint without degrading performance.

QLORA vs. 16-bit Full Finetuning and 16-bit LoRA

The authors find that 4-bit QLORA with the NF4 data type matches the performance of both 16-bit full finetuning and 16-bit LoRA finetuning on academic benchmarks. This holds true for various model sizes (125M to 65B parameters) and datasets (GLUE, Super-Natural Instructions, Alpaca, and FLAN v2).

Performance-Precision Trade-off

In line with previous work on quantization, the authors observe that, for a given finetuning and inference resource budget, it is beneficial to increase the number of parameters in the base model while decreasing their precision. This underscores the importance of the efficiency gains QLORA provides.

Key Findings

  1. QLORA with NF4 replicates both 16-bit full finetuning and 16-bit LoRA finetuning performance.

  2. NF4 is superior to FP4 in terms of quantization precision.

  3. Double quantization does not degrade performance.

The authors' results consistently show that 4-bit QLORA with the NF4 data type matches the performance of 16-bit methods while offering significant memory savings.

This allows for the exploration of instruction tuning at scales that would be impossible with full 16-bit finetuning on academic research hardware.

Limitations

Lack of comparison with full 16-bit finetuning at larger scales

While the authors provide evidence that QLORA can replicate 16-bit full finetuning performance with a 4-bit base model and Low-rank Adapters (LoRA), they did not establish this at the 33B and 65B scales due to the immense resource costs involved.

Limited evaluation on instruction finetuning models

The authors evaluated QLORA on MMLU, the Vicuna benchmark, and the OA benchmark, but did not evaluate on other benchmarks such as BigBench, RAFT, and HELM, so it is not guaranteed that their findings generalize to those benchmarks.

Benchmarks?

The performance of models against these benchmarks is measured using various methods and metrics, depending on the specific focus of each benchmark. Here is an overview of how performance is typically measured for each:

MMLU (Massive Multitask Language Understanding)

  • Measurement Method: Multiple-choice questions.

  • Metrics: Accuracy is the primary metric, calculated as the percentage of correct answers out of the total questions.

  • Details: Performance is evaluated across 57 tasks from different domains, and the overall accuracy provides a comprehensive measure of the model's general knowledge and understanding.

Vicuna Benchmark

  • Measurement Method: Evaluation of conversational tasks and scenarios.

  • Metrics: Judge-based quality scores. In the QLORA paper, responses are scored by GPT-4 relative to ChatGPT and additionally compared by human annotators, with results aggregated into rankings and Elo ratings.

  • Details: Judges (human or GPT-4) rate the quality of responses based on coherence, relevance, informativeness, and fluency; the paper finds that GPT-4 judgements largely agree with human rankings, though with some uncertainty.

OA (OpenAssistant) Benchmark

  • Measurement Method: Open-ended user prompts drawn from the OpenAssistant conversation dataset.

  • Metrics: Relative response quality, judged by GPT-4 and human annotators and summarized as rankings or Elo scores, as in the Vicuna benchmark.

  • Details: Because the prompts are crowd-sourced real user queries rather than multiple-choice questions, the benchmark probes instruction-following and chat quality rather than narrow task accuracy.

BigBench (Beyond the Imitation Game Benchmark)

  • Measurement Method: A wide range of tasks developed by the research community.

  • Metrics: Varies by task; common metrics include accuracy, F1 score, and others relevant to the specific task.

  • Details: The benchmark covers reasoning, commonsense understanding, and other advanced skills. Performance is evaluated task by task, and an aggregate score may be used to summarize overall performance.

RAFT (Real-World Annotated Few-Shot Tasks)

  • Measurement Method: Real-world few-shot text classification tasks with only a small number of labelled training examples per task.

  • Metrics: Classification accuracy or F1 per task, typically averaged across tasks.

  • Details: Models are evaluated on practically motivated classification problems (e.g., content moderation or document labelling) with very limited supervision, measuring how well they generalize from few examples.

HELM (Holistic Evaluation of Language Models)

  • Measurement Method: A comprehensive set of evaluations across various dimensions.

  • Metrics: Accuracy, fairness metrics, robustness metrics, efficiency (e.g., speed, computational resources), and others.

  • Details: The benchmark aims to provide a holistic view of performance, considering multiple aspects beyond just accuracy. Metrics are chosen to reflect the model's performance in terms of fairness, robustness, and efficiency.

In summary, each benchmark employs specific methods and metrics tailored to its focus area, providing a nuanced and detailed assessment of language model performance across different tasks and dimensions.

Dependency on similarity between finetuning data and benchmark data

Performance on these benchmarks likely depends on how similar the finetuning data is to the benchmark data. This highlights the need for better benchmarks and evaluation methods, as well as careful consideration of what is being evaluated in the first place.

Limited responsible AI evaluation

While the authors evaluate the likelihood of Guanaco-65B to generate a socially biased sequence of tokens compared to other models, it is unclear if Guanaco performs well when assessed on other types of biases.
