Copyright Continuum Labs - 2023

QLORA: Efficient Finetuning of Quantized LLMs

Last updated 11 months ago


This paper introduces QLORA, a parameter-efficient fine-tuning approach that customises large language models (LLMs) while significantly reducing memory requirements.

QLORA combines 4-bit quantization of the pretrained model with Low Rank Adapters (LoRA) to enable finetuning of a 65B parameter model on a single 48GB GPU, without sacrificing performance compared to full 16-bit finetuning.
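As a rough sanity check on the memory claim, here is a back-of-envelope sketch covering the weights alone (it ignores activations, gradients, optimizer state, and quantization-constant overhead, which the paper's 48GB figure has to accommodate as well):

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

# 65B parameters stored in 16-bit vs 4-bit:
fp16_gib = weight_memory_gib(65e9, 16)  # ~121 GiB: far beyond a single 48GB GPU
nf4_gib = weight_memory_gib(65e9, 4)    # ~30 GiB: leaves headroom for LoRA and activations
```

The 4x reduction in weight storage is what makes a single-GPU 65B finetune plausible at all; the LoRA adapters and paged optimizers then keep the remaining training state small.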

Key innovations of QLORA include

4-bit NormalFloat (NF4): An information-theoretically optimal data type for normally distributed weights.

Double Quantization: Quantizing the quantization constants to reduce memory footprint further.

Paged Optimizers: Managing memory spikes during training using NVIDIA unified memory.

The authors use QLORA to finetune over 1,000 models, demonstrating state-of-the-art results with their Guanaco model family.

Guanaco reaches 99.3% of ChatGPT's performance on the Vicuna benchmark while being trainable on a single GPU.

The authors' extensive analysis reveals several key findings:

  1. Data quality is more important than dataset size for instruction finetuning and chatbot performance.

  2. Strong performance on the MMLU benchmark does not necessarily imply strong chatbot performance, highlighting the importance of task-specific datasets.

  3. GPT-4 evaluations largely agree with human evaluations in ranking chatbot performance, offering a cheaper alternative to human annotation, albeit with some uncertainties.

The authors release their codebase, CUDA kernels, and integrate their methods into the Hugging Face transformers library, making QLORA accessible to the community. They also release 32 finetuned models across various sizes and instruction datasets.

In summary, the QLORA paper introduces a groundbreaking approach to efficiently finetune large language models, democratising access to LLM finetuning and enabling in-depth analysis of instruction finetuning and chatbot performance at unprecedented scales.

The open-source release of the code and models further contributes to the advancement of the field.

Background

To provide more context on quantization and its mathematical foundations, let's dive deeper into the background and explain dequantization and the potential risks involved.

Quantization is a technique used to reduce the precision of numerical representations, typically by mapping a larger set of values to a smaller set.

In the context of deep learning, quantization is often applied to model weights and activations, converting them from higher-precision data types (e.g., 32-bit floating-point) to lower-precision data types (e.g., 8-bit integers).

This reduces memory consumption and can accelerate computations, especially on hardware optimized for lower-precision arithmetic.

Block-wise k-bit Quantization

The quantization process involves scaling the input values to fit within the range of the target data type.

For example, when quantizing a 32-bit floating-point tensor to an 8-bit integer tensor with a range of [-127, 127], the quantization formula is:

XInt8 = round(127 / absmax(XFP32) * XFP32)

where the scaling factor 127 / absmax(XFP32) is the quantization constant, denoted c.

To mitigate the impact of outliers on the quantization process, block-wise quantization is employed.

The input tensor is divided into smaller blocks, and each block is quantized independently with its own quantization constant.

This ensures better utilization of the available quantization bins.

Dequantization

Dequantization is the inverse process of quantization, where the quantized values are mapped back to their original data type. The dequantization formula for the example above is:

XFP32 = XInt8 / c

where c is the quantization constant used during quantization.
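To make the block-wise scheme concrete, here is a minimal pure-Python sketch of absmax int8 quantization and its inverse. The function names and the default block size are illustrative, not taken from the paper's code:

```python
import random

def quantize_blockwise(x, block_size=64):
    """Block-wise absmax quantization of a float sequence to int8-range codes.

    Each block gets its own quantization constant c = 127 / absmax(block),
    so an outlier only degrades resolution inside its own block.
    """
    codes, constants = [], []
    for i in range(0, len(x), block_size):
        block = x[i:i + block_size]
        c = 127.0 / max(abs(v) for v in block)       # quantization constant
        codes.append([round(v * c) for v in block])  # XInt8 = round(c * XFP32)
        constants.append(c)
    return codes, constants

def dequantize_blockwise(codes, constants):
    """Inverse mapping, block by block: XFP32 ~= XInt8 / c."""
    return [q / c for block, c in zip(codes, constants) for q in block]

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(256)]
codes, constants = quantize_blockwise(x)
x_hat = dequantize_blockwise(codes, constants)
# The round trip is lossy: the worst-case error in a block is absmax(block) / 254.
```

Note how the per-block constants are exactly the values that Double Quantization (below) later compresses further.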

Risks and Considerations

Information Loss: Quantization inherently leads to a loss of information due to the reduced precision. This can affect the model's accuracy and performance, especially if the quantization is too aggressive.

Quantization Noise: The quantization process introduces noise into the model, as the original values are approximated by the quantized values. This noise can accumulate across layers and impact the model's behavior.

Outliers and Range: Outliers in the input tensor can significantly affect the quantization process, leading to poor utilization of the available quantization bins. Block-wise quantization helps mitigate this issue, but it's still important to consider the range of values in the tensor.

Hardware Compatibility: While quantization can lead to memory savings and computational speedups, the target hardware must support the specific quantized data types and operations. Not all hardware platforms have efficient support for low-precision arithmetic.

Quantization-Aware Training: To achieve optimal performance with quantized models, quantization-aware training techniques can be employed. These techniques simulate the quantization process during training, allowing the model to adapt to the quantization noise and minimize its impact on accuracy.
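The simulation described above is usually implemented as a "fake quantization" step: quantize and immediately dequantize in the forward pass, so downstream layers train against the quantization noise. This is a simplified illustration, not the paper's method; real QAT also needs a straight-through estimator so gradients flow through the rounding:

```python
def fake_quantize(values, num_levels=256):
    """Quantize-then-dequantize in one step, as used in quantization-aware
    training: the forward pass sees values snapped to the quantization grid,
    while the stored weights remain full precision."""
    half = num_levels // 2 - 1                 # 127 for the int8 grid
    absmax = max(abs(v) for v in values)
    c = half / absmax                          # absmax scaling constant
    return [round(v * c) / c for v in values]

weights = [0.31, -1.20, 0.05, 0.88]
noisy = fake_quantize(weights)   # close to `weights`, but snapped to the grid
```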

Despite these risks, quantization remains a powerful technique for reducing the memory footprint and computational requirements of deep learning models.

By carefully considering the trade-offs and employing appropriate quantization strategies, such as block-wise quantization and quantization-aware training, the impact of quantization on model performance can be minimized while realizing significant efficiency gains.

Different components of QLORA

4-bit NormalFloat Quantization

The authors observe that pretrained neural network weights usually follow a zero-centred normal distribution with a standard deviation σ.

This means that the weights are symmetrically distributed around zero, and the spread of the distribution is determined by the standard deviation.

To optimise the quantization process for such normally distributed weights, they introduce 4-bit NormalFloat (NF4) quantization.

The idea is to create a quantization scheme that is information-theoretically optimal for zero-mean normal distributions.

The process involves:

a. Estimating the 2^k + 1 quantiles of a standard normal distribution N(0, 1) to obtain a k-bit quantile quantization data type.

b. Normalizing the data type values into the range [-1, 1].

c. Quantizing the input weight tensor by normalizing it into the range [-1, 1] using absolute maximum rescaling.

To ensure an exact representation of zero, they create an asymmetric data type by estimating the quantiles separately for the negative and positive parts and then unifying the sets while removing one of the duplicate zeros.
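The construction can be sketched with the standard library's NormalDist. The probability offset and the exact asymmetric split below are illustrative assumptions, not the constants used in the paper:

```python
from statistics import NormalDist

def linspace(a, b, n):
    return [a + (b - a) * i / (n - 1) for i in range(n)]

def quantile_code(k=4, offset=0.02):
    """Sketch of a k-bit quantile-quantization data type for N(0, 1) weights.

    The probability offset keeps the quantile function away from 0 and 1,
    where it diverges; 0.02 is an illustrative value only.
    """
    q = NormalDist().inv_cdf
    # Asymmetric split: 2^(k-1) quantiles on the negative side and
    # 2^(k-1) + 1 on the positive side, sharing one exact zero --
    # the "duplicate zero" is removed by construction here.
    neg = [q(p) for p in linspace(offset, 0.5, 2 ** (k - 1))[:-1]]
    pos = [q(p) for p in linspace(0.5, 1 - offset, 2 ** (k - 1) + 1)[1:]]
    levels = neg + [0.0] + pos
    m = max(abs(v) for v in levels)
    return [v / m for v in levels]          # normalized into [-1, 1]

code = quantile_code()   # 16 levels, denser near zero, with an exact 0.0
```

Because normal quantiles cluster where the density is highest, the resulting levels are denser near zero, which is exactly where most pretrained weights live.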

Double Quantization

Double Quantization (DQ) is introduced to reduce the memory footprint of the quantization constants. It involves quantizing the quantization constants themselves.

The process works as follows:

a. The quantization constants cFP32 from the first quantization are treated as inputs to a second quantization.

b. The second quantization yields the quantized quantization constants cFP8 and the second level of quantization constants cFP32.

c. 8-bit floats with a block size of 256 are used for the second quantization to avoid performance degradation.

d. Since the cFP32 values are positive, the mean is subtracted from c2 before quantization to centre the values around zero and enable symmetric quantization.

This double quantization reduces the memory footprint of the quantization constants from 0.5 bits to 0.127 bits per parameter, a saving of 0.373 bits per parameter.
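These footprint numbers can be reproduced directly from the block sizes stated above (FP32 first-level constants with block size 64; FP8 second-level constants with block size 256):

```python
# Bits spent on quantization constants, per weight parameter.
single_quant = 32 / 64                    # one FP32 constant per 64 weights: 0.5 bits
double_quant = 8 / 64 + 32 / (64 * 256)   # FP8 constants plus second-level FP32 constants
saving = single_quant - double_quant      # ~0.373 bits per parameter
```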

QLORA

QLORA combines the 4-bit NormalFloat quantization, Double Quantization, and Low-Rank Adapters (LoRA) to achieve efficient 4-bit quantization.

For a single linear layer in the quantized base model with a single LoRA adapter, QLORA is defined as:

YBF16 = XBF16 * doubleDequant(c1_FP32, c2_k-bit, WNF4) + XBF16 * LBF16

where doubleDequant(·) is the double dequantization process:

doubleDequant(c1_FP32, c2_k-bit, Wk-bit) = dequant(dequant(c1_FP32, c2_k-bit), W4bit) = WBF16

The block size is set to 64 for W for higher precision and 256 for c2 to conserve memory.

In summary, QLORA uses 4-bit NormalFloat as the storage data type and 16-bit BrainFloat as the computation data type.

The storage data type is dequantized to the computation data type for the forward and backward passes, but gradients are only computed for the LoRA parameters in 16-bit precision.
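Putting the pieces together, here is a toy pure-Python sketch of one QLORA-style linear layer. The codebook, the dimensions, and the single per-tensor constant are simplifications of my own (the real method uses block-wise constants, double dequantization, and BF16 compute):

```python
def matmul(a, b):
    """Dense matmul on lists of lists, for illustration only."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def qlora_linear(x, w_codes, codebook, c, lora_a, lora_b):
    """Y = X . dequant(W_stored) + X . A . B for one linear layer.

    w_codes        : 4-bit-style indices into `codebook` (the NF4-like data type)
    c              : quantization constant (block-wise in the real method)
    lora_a, lora_b : the low-rank adapter -- the only trainable weights;
                     the quantized base weights receive no gradients.
    """
    # Dequantize the storage data type to the computation data type.
    w = [[codebook[i] / c for i in row] for row in w_codes]
    base = matmul(x, w)
    lora = matmul(matmul(x, lora_a), lora_b)
    return [[p + q for p, q in zip(rb, rl)] for rb, rl in zip(base, lora)]

x = [[1.0, 2.0]]                        # one input row
codebook = [-1.0, -0.5, 0.0, 0.5, 1.0]  # toy 5-level code with an exact zero
w_codes = [[4, 2], [0, 3]]              # stored 2x2 weight indices
lora_a, lora_b = [[0.1], [0.2]], [[0.0, 0.0]]  # rank-1 adapter, B initialised to zero
y = qlora_linear(x, w_codes, codebook, 2.0, lora_a, lora_b)
# With B initialised to zero (standard LoRA init), y equals the base-layer output.
```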

QLORA vs. Standard Finetuning

To compare QLORA with standard finetuning, the authors conduct experiments on various architectures (encoder, encoder-decoder, and decoder-only) and model sizes (up to 3B parameters).

Best Practices

LoRA Adapters: The authors find that applying LoRA to all linear transformer block layers is crucial to match the performance of full finetuning. The number of LoRA adapters used is the most critical hyperparameter.

Hyperparameter Tuning: Default hyperparameters for fully finetuned baselines are often undertuned. The authors perform a hyperparameter search over learning rates (1e-6 to 5e-5) and batch sizes (8 to 128) to establish robust baselines.

Comparison

4-bit NormalFloat (NF4) vs. 4-bit Floating Point (FP4)

NF4 significantly improves performance over FP4 and Int4 data types. Double quantization reduces the memory footprint without degrading performance.

QLORA vs. 16-bit Full Finetuning and 16-bit LoRA

The authors find that 4-bit QLORA with the NF4 data type matches the performance of both 16-bit full finetuning and 16-bit LoRA finetuning on academic benchmarks. This holds true for various model sizes (125M to 65B parameters) and datasets (GLUE, Super-Natural Instructions, Alpaca, and FLAN v2).

Performance-Precision Trade-off

In line with previous work on quantization, the authors observe that, for a given finetuning and inference resource budget, it is beneficial to increase the number of parameters in the base model while decreasing their precision. This underlines the practical value of QLORA's efficiency gains.

Key Findings

  1. QLORA with NF4 replicates both 16-bit full finetuning and 16-bit LoRA finetuning performance.

  2. NF4 is superior to FP4 in terms of quantization precision.

  3. Double quantization does not degrade performance.

The authors' results consistently show that 4-bit QLORA with the NF4 data type matches the performance of 16-bit methods while offering significant memory savings.

This allows for the exploration of instruction tuning at scales that would be impossible with full 16-bit finetuning on academic research hardware.

Limitations

Lack of comparison with full 16-bit finetuning at larger scales

While the authors provide evidence that QLORA can replicate 16-bit full finetuning performance with a 4-bit base model and Low-rank Adapters (LoRA), they did not establish this at the 33B and 65B scales due to the immense resource costs involved.

Limited evaluation on instruction finetuning models

The authors evaluated QLORA on MMLU, the Vicuna benchmark, and the OA benchmark, but did not evaluate on other benchmarks such as BigBench, RAFT, and HELM, so it is not certain that their findings generalise to those benchmarks.

Benchmarks?

The performance of models against these benchmarks is measured using various methods and metrics, depending on the specific focus of each benchmark. Here is an overview of how performance is typically measured for each:

MMLU (Massive Multitask Language Understanding)

  • Measurement Method: Multiple-choice questions.

  • Metrics: Accuracy is the primary metric, calculated as the percentage of correct answers out of the total questions.

  • Details: Performance is evaluated across 57 tasks from different domains, and the overall accuracy provides a comprehensive measure of the model's general knowledge and understanding.

Vicuna Benchmark

  • Measurement Method: Evaluation of conversational tasks and scenarios.

  • Metrics: Human evaluation scores, BLEU (Bilingual Evaluation Understudy), ROUGE (Recall-Oriented Understudy for Gisting Evaluation), and other dialogue-specific metrics.

  • Details: Human judges often rate the quality of responses based on coherence, relevance, informativeness, and fluency. Automated metrics may also be used to compare generated text against reference responses.

OA (OpenAssistant) Benchmark

  • Measurement Method: Open-ended conversational prompts drawn from the OpenAssistant dataset.

  • Metrics: Relative response quality, typically scored by human annotators or by GPT-4 acting as a judge.

  • Details: Responses from competing models are compared or rated against each other, giving a view of instruction-following and chat quality rather than closed-form task accuracy.

BigBench (Beyond the Imitation Game Benchmark)

  • Measurement Method: A wide range of tasks developed by the research community.

  • Metrics: Varies by task; common metrics include accuracy, F1 score, and others relevant to the specific task.

  • Details: The benchmark covers reasoning, commonsense understanding, and other advanced skills. Performance is evaluated task by task, and an aggregate score may be used to summarize overall performance.

RAFT (Real-world Annotated Few-shot Tasks)

  • Measurement Method: Few-shot text classification on naturally occurring, real-world tasks.

  • Metrics: Accuracy and F1 score, depending on the task.

  • Details: Models receive only a small number of labelled examples per task (50 in the original benchmark), so performance reflects how well they adapt to realistic applications from limited data.

HELM (Holistic Evaluation of Language Models)

  • Measurement Method: A comprehensive set of evaluations across various dimensions.

  • Metrics: Accuracy, fairness metrics, robustness metrics, efficiency (e.g., speed, computational resources), and others.

  • Details: The benchmark aims to provide a holistic view of performance, considering multiple aspects beyond just accuracy. Metrics are chosen to reflect the model's performance in terms of fairness, robustness, and efficiency.

In summary, each benchmark employs specific methods and metrics tailored to its focus area, providing a nuanced and detailed assessment of language model performance across different tasks and dimensions.
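Most of the metrics above reduce to simple counts. A minimal sketch of accuracy and F1 (the counts below are made-up examples):

```python
def accuracy(correct, total):
    """Fraction of correct answers -- the primary MMLU-style metric."""
    return correct / total

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, used by several benchmarks."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

acc = accuracy(41, 57)            # e.g. 41 of 57 questions answered correctly
f1 = f1_score(tp=8, fp=2, fn=4)   # precision 0.8, recall 2/3
```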

Dependency on similarity between finetuning data and benchmark data

The performance of the benchmarks likely depends on how similar the finetuning data is to the benchmark dataset. This highlights the need for better benchmarks and evaluation methods, as well as careful consideration of what is being evaluated in the first place.

Limited responsible AI evaluation

While the authors evaluate the likelihood of Guanaco-65B to generate a socially biased sequence of tokens compared to other models, it is unclear if Guanaco performs well when assessed on other types of biases.

XInt8 = round(127 / absmax(XFP32) * XFP32)

Here, absmax(XFP32) is the absolute maximum value in the input tensor, and the scaling factor 127 / absmax(XFP32) is the quantization constant (or quantization scale), denoted c.

XFP32 = XInt8 / c

Here, c is the quantization constant used during the quantization step.

a. Estimating the 2^k + 1 quantiles of a standard normal distribution N(0, 1) to obtain a k-bit quantile quantization data type.

b. Normalizing the data type values into the range [-1, 1].

The equation

qi = 1/2 * (QX(i / (2^k + 1)) + QX((i + 1) / (2^k + 1)))

estimates the quantile values qi for the data type, where QX(·) is the quantile function of the standard normal distribution.

a. The quantization constants cFP32 from the first quantization are treated as inputs to a second quantization.

b. The second quantization yields the quantized quantization constants cFP8 and the second level of quantization constants cFP32.

d. Since the cFP32 values are positive, the mean is subtracted from c2 before quantization to centre the values around zero and enable symmetric quantization.

YBF16 = XBF16 * doubleDequant(c1_FP32, c2_k-bit, WNF4) + XBF16 * LBF16

doubleDequant(c1_FP32, c2_k-bit, Wk-bit) = dequant(dequant(c1_FP32, c2_k-bit), W4bit) = WBF16

QLORA uses NF4 for the weights (W) and FP8 for the quantization constants (c2).

During the backward pass, only the gradients with respect to the LoRA adapter weights (∂E/∂Li) are computed, not for the 4-bit weights (∂E/∂W).

However, computing ∂E/∂Li involves calculating ∂X/∂W, which requires dequantizing the storage data type WNF4 to the computation data type WBF16.

Paper: QLoRA: Efficient Finetuning of Quantized LLMs (arXiv.org)