# Low-Rank Adaptation (LoRA)

Fine-tuning large language models has become increasingly challenging due to the vast number of parameters involved and the computational resources required.

As models grow in size, traditional fine-tuning methods become impractical and costly.

**Low-Rank Adaptation (LoRA)** addresses this problem by efficiently adapting pre-trained language models to downstream tasks while significantly reducing the number of trainable parameters.

This document breaks down the well-known **October 2021** paper describing the technique.

### The Intrinsic Rank Hypothesis that underpins LoRA

In a 2020 paper called **"Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning"**, a team at Facebook found that pre-trained language models can still learn efficiently *even when their parameters are randomly projected onto a smaller subspace*.

The intrinsic rank hypothesis is a crucial concept that underlies the development of LoRA.

The hypothesis suggests that the model's parameters lie in a lower-dimensional subspace than the full parameter space, referred to as the **"intrinsic dimension"** of the model.

The intrinsic dimension represents the minimum number of dimensions required to accurately capture the important information in the model's parameter space.

The intrinsic rank hypothesis *extends this idea to the weight updates that occur during fine-tuning of language models*.

It posits that the updates to the weights also have a low **"intrinsic rank"**, meaning they can be well-approximated by a **low-rank matrix**.

This hypothesis is significant because it implies that the necessary adjustments to the model during fine-tuning can be captured using a lower-dimensional representation, *rather than updating the entire weight matrix*.

By exploiting the **intrinsic low rank** of the **weight updates**, LoRA can significantly reduce the number of trainable parameters while still allowing for effective adaptation to downstream tasks.

This not only makes fine-tuning more computationally efficient but also enables the adaptation of large language models on resource-constrained devices.

The **intrinsic rank hypothesis** is a fundamental principle that guides the design and implementation of LoRA.

**Weight Matrices in Transformers**


In the Transformer architecture, the **self-attention layer** is a key component that allows the model to attend to different positions of the **input sequence**.

The self-attention layer consists of **multiple attention heads that operate in parallel**.

Each **attention head** performs the following steps:

1. Compute the **attention scores** by taking the **dot product** of the **query** and **key** representations.
2. Scale the attention scores and apply a **softmax function** to obtain the **attention weights**.
3. **Multiply the attention weights** with the **value representations** to get the **weighted values**.
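The three steps above can be sketched for a single attention head in NumPy. This is a minimal illustration rather than a reference implementation: the function names and the toy sizes (`seq_len`, `d_head`) are assumptions made for the example.

```python
import numpy as np

def softmax(scores):
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_head(query, key, value):
    """One attention head: dot product, scale, softmax, weighted sum."""
    d_head = query.shape[-1]
    scores = query @ key.T                        # step 1: attention scores
    weights = softmax(scores / np.sqrt(d_head))   # step 2: scale, then softmax
    return weights @ value, weights               # step 3: weighted values

rng = np.random.default_rng(0)
seq_len, d_head = 4, 8          # illustrative sizes only
q = rng.normal(size=(seq_len, d_head))
k = rng.normal(size=(seq_len, d_head))
v = rng.normal(size=(seq_len, d_head))

out, attn = attention_head(q, k, v)
print(out.shape)                # (4, 8)
print(attn.sum(axis=-1))        # every row of attention weights sums to 1
```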

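In the Transformer architecture, there are **four weight matrices** in the self-attention module:

- the query projection matrix, W_q
- the key projection matrix, W_k
- the value projection matrix, W_v
- the output projection matrix, W_o

Now, let's focus on the **dimensions of the weight matrices**. Each of the four is commonly treated as a single d_model × d_model matrix, where d_model is the dimensionality of the model, even though the query, key and value outputs are then split across the attention heads.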

**Weight matrices** enable the Transformer model to learn and capture complex relationships and dependencies within the input sequence. The learned weights in these matrices are crucial for the model's ability to attend to relevant information and generate meaningful representations.

In summary, the self-attention mechanism in the Transformer architecture relies on four key weight matrices: query, key, value, and output projection matrices.

These matrices linearly project the input into different subspaces, enabling the model to capture dependencies and learn meaningful representations. The dimensions of these matrices are typically determined by the model's hyperparameters, such as the dimensionality of the model and the number of attention heads.
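To make the shapes concrete, the sketch below builds the four projections for a toy layer and pushes a short sequence through multi-head self-attention. It assumes, as is common, that each projection is a square `d_model × d_model` matrix whose output is sliced across heads; the sizes and the helper `split_heads` are illustrative choices, not values from the paper.

```python
import numpy as np

d_model, n_heads = 64, 8                 # illustrative hyperparameters
d_head = d_model // n_heads

rng = np.random.default_rng(0)
# The four self-attention projections, each one d_model x d_model matrix.
W_q, W_k, W_v, W_o = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
                      for _ in range(4))

x = rng.normal(size=(10, d_model))       # 10 tokens, each a d_model-dimensional vector

def split_heads(t):
    # (seq_len, d_model) -> (n_heads, seq_len, d_head)
    return t.reshape(t.shape[0], n_heads, d_head).transpose(1, 0, 2)

q, k, v = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)        # (n_heads, 10, 10)
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

heads = weights @ v                                        # (n_heads, 10, d_head)
merged = heads.transpose(1, 0, 2).reshape(10, d_model)     # concatenate the heads
out = merged @ W_o                                         # output projection

print(W_q.shape, q.shape, out.shape)     # (64, 64) (8, 10, 8) (10, 64)
```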

Understanding the roles and dimensions of these weight matrices is crucial for grasping how LoRA builds on the Intrinsic Rank Hypothesis, which we explore in the next section.

### How does LoRA work?

LoRA (Low-Rank Adaptation) introduces a *modification to the weight matrices* to efficiently adapt the pre-trained model to downstream tasks.

Let's see how LoRA gets involved in the process and influences the weights.

Recall that the **self-attention mechanism** has four **weight matrices**: the query (W_q), key (W_k), value (W_v), and output projection (W_o) matrices.

These matrices are typically learned during the pre-training phase and have **full rank**.

In linear algebra, a *matrix is said to have **full rank** if its rank is equal to the smaller of its number of rows or columns*.

In the context of neural networks, this means that the **weight matrices** in dense layers are typically not low-rank - they cannot be exactly represented as a product of two smaller matrices.
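The distinction can be checked numerically. The sketch below compares the rank of a generic dense matrix with that of a matrix built as a product of two thin factors; random Gaussian matrices stand in for real weights, which is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4                      # illustrative sizes

dense = rng.normal(size=(d, k))          # a generic dense matrix
low_rank = rng.normal(size=(d, r)) @ rng.normal(size=(r, k))   # product of two thin factors

rank_dense = np.linalg.matrix_rank(dense)
rank_low = np.linalg.matrix_rank(low_rank)

# Full rank here means min(d, k) = 48; the product has rank at most r = 4.
print(rank_dense, rank_low)
```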

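LoRA **modifies the weight matrices** by introducing a low-rank decomposition of the weight updates:

$$W = W_0 + \Delta W = W_0 + BA$$

where:

- $W_0 \in \mathbb{R}^{d \times k}$ is the pre-trained weight matrix, which stays frozen.
- $\Delta W = BA$ is the weight update learned during fine-tuning.
- $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the two low-rank matrices.
- $r \ll \min(d, k)$ is the rank of the decomposition.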

The key idea behind LoRA is to use a low-rank decomposition of the weight updates. By representing the weight updates as the product of two smaller matrices, A and B, LoRA allows for a more compact and efficient representation.

The **rank (r)** of the decomposition determines the size of these matrices and controls the expressiveness of the adaptation.

A lower rank results in fewer trainable parameters, making the adaptation more memory-efficient, while a higher rank allows for more flexibility in adapting the weights.
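As a minimal sketch of this decomposition, the following NumPy snippet applies a rank-`r` update to a single layer and compares the number of stored values. The shapes follow the usual LoRA convention (`B` is `d × r`, `A` is `r × k`); the random values, and the non-zero `B`, are placeholders chosen so the update is visible.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4                      # illustrative sizes

W0 = rng.normal(size=(d, k))             # frozen pre-trained weights
A = rng.normal(scale=0.01, size=(r, k))  # trainable, r x k
B = rng.normal(scale=0.01, size=(d, r))  # trainable, d x r (non-zero here for illustration)

x = rng.normal(size=(k,))

h_factored = W0 @ x + B @ (A @ x)        # never materialises the d x k update
h_dense = (W0 + B @ A) @ x               # same result via the explicit update matrix

print(np.allclose(h_factored, h_dense))  # True
print(W0.size, A.size + B.size)          # 4096 512
```

Computing `B @ (A @ x)` keeps the extra cost proportional to `r`, which is why a small rank is cheap in both memory and compute.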

### During the **fine-tuning process** with LoRA

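During fine-tuning, the pre-trained weights W_0 are frozen and receive no gradient updates. Only the low-rank matrices A and B are trained.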

A smaller **rank r** results in fewer trainable parameters and more efficient adaptation, while a larger **rank r** allows for more flexibility in adapting the weights.
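A toy training loop can illustrate what fine-tuning with LoRA looks like in practice. The sketch below is an assumption-laden illustration, not the authors' training code: it uses a synthetic regression task, a squared-error loss, hand-derived gradients and plain gradient descent. The point it shows is that only `A` and `B` are updated while `W0` is never touched.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, n = 16, 16, 2, 256              # illustrative sizes

W0 = rng.normal(size=(d, k))             # frozen pre-trained weights
W0_before = W0.copy()
A = rng.normal(scale=0.1, size=(r, k))   # trainable, Gaussian initialisation
B = np.zeros((d, r))                     # trainable, zero initialisation

# Synthetic task: targets come from W0 plus a hidden rank-r shift.
shift = 0.1 * rng.normal(size=(d, r)) @ rng.normal(size=(r, k))
X = rng.normal(size=(n, k))
Y = X @ (W0 + shift).T

def loss(A, B):
    pred = X @ (W0 + B @ A).T
    return 0.5 * np.mean(np.sum((pred - Y) ** 2, axis=1))

initial_loss = loss(A, B)
lr = 0.02
for _ in range(2000):
    err = X @ (W0 + B @ A).T - Y         # (n, d) prediction error
    grad_delta = err.T @ X / n           # gradient w.r.t. the full update BA, (d, k)
    grad_B = grad_delta @ A.T            # chain rule through BA
    grad_A = B.T @ grad_delta
    B -= lr * grad_B                     # only A and B receive updates
    A -= lr * grad_A
final_loss = loss(A, B)

print(f"loss: {initial_loss:.4f} -> {final_loss:.6f}")
print(np.array_equal(W0, W0_before))     # True: the base weights are unchanged
```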

### What are the low rank matrices?

### Matrices A and B

This means that significant changes to the neural network can be captured **using a lower-dimensional representation**.

### Impact of Lower Rank on Trainable Parameters

The reduction in the number of trainable parameters achieved through Low-Rank Adaptation (LoRA) offers several significant benefits when fine-tuning large-scale neural networks, including lower memory requirements, lower training cost, and small task-specific modules that are cheap to store and swap.
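To put numbers on the reduction, the sketch below counts trainable parameters for one weight matrix at several ranks. The hidden size `d = k = 12288` is chosen to match GPT-3 175B; the helper name `lora_trainable_params` is made up for this example.

```python
def lora_trainable_params(d, k, r):
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return d * r + r * k

d = k = 12288          # assumed hidden size, matching GPT-3 175B
full = d * k           # parameters updated by full fine-tuning of one matrix

for r in (1, 2, 4, 8, 64):
    lora = lora_trainable_params(d, k, r)
    print(f"r={r:>2}: {lora:>9,} trainable vs {full:,} ({100 * lora / full:.3f}%)")
```

At `r = 8`, for example, the two factors hold 196,608 values against 150,994,944 for the full matrix, roughly 0.13%.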

### What does 'rank' mean?

The idea here is to exploit the fact that matrices can contain “duplicate information” in the form of linear dependence.

This means we can use factorisation to **represent a large matrix in terms of two smaller matrices**.

Just as a large number can be written as the product of two smaller numbers, a matrix can be written as the product of two smaller matrices.

The **rank** of a **matrix** is the maximum *number of linearly independent rows or columns in the matrix*.
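As an illustration of this factorisation, the sketch below builds a 6 × 5 matrix whose rows are all combinations of two basis rows, then recovers two smaller factors from a truncated SVD. The matrix values are arbitrary, and SVD is just one convenient way to obtain the factors.

```python
import numpy as np

# A 6 x 5 matrix built so that every row is a combination of two basis rows.
basis = np.array([[1.0, 0.0, 2.0, -1.0, 3.0],
                  [0.0, 1.0, 1.0, 4.0, -2.0]])
coeffs = np.array([[1, 0], [0, 1], [1, 1], [2, -1], [3, 2], [-1, 4]], dtype=float)
M = coeffs @ basis                       # 30 stored numbers, but only rank 2

U, s, Vt = np.linalg.svd(M, full_matrices=False)
r = int(np.sum(s > 1e-10))               # numerical rank
left = U[:, :r] * s[:r]                  # 6 x 2 factor
right = Vt[:r, :]                        # 2 x 5 factor

print(r)                                 # 2
print(np.allclose(left @ right, M))      # True
print(M.size, left.size + right.size)    # 30 22
```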

In other words, it is the dimension of the vector space spanned by the rows or columns of the matrix.

**Example:** Consider the following **matrix** M:

To find the rank of the matrix, we can use **Gaussian elimination** to convert the matrix into **row echelon form**.

**Row echelon form** is a type of matrix in which:

- All nonzero rows are above any rows of all zeros.
- The leading entry of each nonzero row after the first occurs to the right of the leading entry of the previous row.
- The leading entry in any nonzero row is 1.
- All entries in the column below a leading 1 are zeros.

Using **Gaussian elimination**, a matrix is transformed into row echelon form. This helps determine properties such as the rank and solve linear systems.

**Gaussian elimination** is a method used to **solve systems of linear equations**, find the **rank r**, and calculate the determinant of a matrix.

The process involves three main steps:

1. **Forward Elimination:** Transform the matrix into an upper triangular form.
2. **Pivoting:** Swap rows to position the highest absolute value as the pivot element to avoid division by zero and improve numerical stability.
3. **Back Substitution:** Solve for the variables starting from the last row upwards.

By converting the **matrix** into **row echelon form**, Gaussian elimination **simplifies the system**, making it easier to understand its properties and solutions.

After performing Gaussian elimination, we get **matrix M**:

The number of non-zero rows in the **row echelon form** is the rank of the **matrix**.

In this case, the rank is 2, which is the maximum number of linearly independent rows or columns in the **matrix** M.
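The procedure can be sketched in code. The function below performs forward elimination with partial pivoting and counts the non-zero rows of the result. The 3 × 3 matrix is a hypothetical example, chosen so that its third row is the sum of the first two and its rank is therefore 2; the function name and tolerance are likewise assumptions.

```python
import numpy as np

def row_echelon(matrix, tol=1e-12):
    """Forward elimination with partial pivoting; returns a row echelon form."""
    m = np.array(matrix, dtype=float)
    rows, cols = m.shape
    pivot_row = 0
    for col in range(cols):
        if pivot_row == rows:
            break
        # Partial pivoting: pick the largest absolute entry in this column.
        best = pivot_row + int(np.argmax(np.abs(m[pivot_row:, col])))
        if abs(m[best, col]) < tol:
            continue                                  # no pivot in this column
        m[[pivot_row, best]] = m[[best, pivot_row]]   # swap rows
        m[pivot_row] /= m[pivot_row, col]             # make the leading entry 1
        for below in range(pivot_row + 1, rows):
            m[below] -= m[below, col] * m[pivot_row]  # zero out entries below the pivot
        pivot_row += 1
    m[np.abs(m) < tol] = 0.0                          # clean up rounding noise
    return m

# Hypothetical example: the third row is the sum of the first two.
M = np.array([[1, 2, 3],
              [4, 5, 6],
              [5, 7, 9]])

echelon = row_echelon(M)
rank = int(np.sum(np.any(echelon != 0, axis=1)))      # count non-zero rows
print(echelon)
print(rank)                                           # 2
```

In practice `np.linalg.matrix_rank`, which is based on the SVD, is the more robust choice for floating-point matrices; the hand-written elimination is shown only to mirror the steps described above.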

By choosing a lower **rank r**, we enforce a low-dimensional structure on the weight update matrix, reducing the number of trainable parameters while still capturing the most important aspects of the adaptation.

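So to sum up, in LoRA, the **rank r** is the hyperparameter that determines the **size of the matrices A and B**.

**Specifically:**

- $B \in \mathbb{R}^{d \times r}$ has $d \cdot r$ entries.
- $A \in \mathbb{R}^{r \times k}$ has $r \cdot k$ entries.

So, the updated equation is:

$$h = W_0 x + \Delta W x = W_0 x + BAx$$

**where:**

- $x$ is the input to the layer and $h$ is its output.
- $W_0 \in \mathbb{R}^{d \times k}$ is the frozen pre-trained weight matrix.
- $\Delta W = BA$ is the low-rank weight update.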

**Expressiveness and Rank**


The **rank r** of the **weight update matrix** ∆W controls the **expressiveness** of the adaptation.

A *higher rank allows for more flexibility in adapting the weights*, as it can capture more complex patterns and transformations. However, increasing the rank also increases the number of trainable parameters.

On the other hand, a *lower rank restricts the expressiveness of the adaptation* but results in fewer trainable parameters.

A low-rank approximation of the **weight update matrix** can still capture the most important aspects of the adaptation while being more parameter-efficient.

The choice of **rank r** depends on the complexity of the downstream task and the available computational resources. It is a trade-off between expressiveness and efficiency.

A common rule of thumb is that when the fine-tuning data is similar to the data used in pre-training, a low **rank r** is probably sufficient. When fine-tuning on very new tasks, which might require substantial changes to the model's behaviour, a high **rank r** may work better.

**Applying LoRA to Transformer Self Attention Weights**


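In a Transformer, LoRA is typically applied to the self-attention weight matrices (often the query and value projections), with the remaining weights left frozen. This allows for efficient adaptation **without introducing additional inference latency**, as the low-rank matrices can be merged with the pre-trained weights during deployment.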

This makes it suitable for adapting large pre-trained language models, where standard fine-tuning would be prohibitively expensive in terms of storage and computation.
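The merge step can be shown in a few lines. In this sketch, random matrices stand in for trained LoRA factors (an assumption for illustration). It checks that folding `BA` into the base weights gives the same output as running the adapter alongside them, while leaving the deployed layer the same shape as the original.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 32, 32, 4                      # illustrative sizes

W0 = rng.normal(size=(d, k))             # pre-trained weights
A = rng.normal(size=(r, k))              # stand-ins for trained LoRA factors
B = rng.normal(size=(d, r))

W_merged = W0 + B @ A                    # one-off merge before deployment

x = rng.normal(size=(5, k))              # a batch of 5 inputs
h_adapter = x @ W0.T + (x @ A.T) @ B.T   # unmerged: two extra matmuls per call
h_merged = x @ W_merged.T                # merged: a single matmul, as in the base model

print(np.allclose(h_adapter, h_merged))  # True
print(W_merged.shape == W0.shape)        # True
```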

### The problem with full fine tuning

To recall the problem LoRA is solving, below is a brief description of the language modeling objective under full fine-tuning: in particular, the maximisation of conditional probabilities given a task-specific prompt.

**Language Modeling Objective**


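The goal is to adapt a pre-trained language model to downstream conditional text generation tasks.

The model learns to **maximise the conditional probability of the target sequence** given the context. In full fine-tuning, the model starts from the pre-trained weights Φ0 and is updated to Φ0 + ∆Φ by maximising:

$$\max_{\Phi} \sum_{(x, y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log P_{\Phi}(y_t \mid x, y_{<t})$$

where $\mathcal{Z}$ is the set of context-target pairs $(x, y)$ for the task.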

**Parameter-Efficient Approach**


The main drawback of full fine-tuning is that for each downstream task, a **separate set of parameters** ∆Φ is learned, which **has the same dimension as the pre-trained weights** Φ0.

This can be challenging to store and deploy, especially for large models.

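To address this, the authors propose a parameter-efficient approach where the **task-specific parameter increment** ∆Φ is encoded by a much **smaller set of parameters** Θ, such that |Θ| << |Φ0|.

The objective, summed over the context-target pairs (x, y) in the task dataset, becomes:

$$\max_{\Theta} \sum_{(x, y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log P_{\Phi_0 + \Delta\Phi(\Theta)}(y_t \mid x, y_{<t})$$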

Here, ∆Φ is a function of Θ, denoted as ∆Φ(Θ).

The goal is to find the optimal Θ that maximises the **conditional language modeling objective**.

**Low-Rank Representation**

LoRA uses a low-rank representation to encode ∆Φ, which makes the approach both compute- and memory-efficient.

### Issues with existing solutions

Other **Parameter-Efficient Fine-Tuning (PEFT)** solutions for efficient model adaptation in transfer learning, such as **adding adapter layers** or **optimising input layer activations**, have limitations, especially in large-scale and latency-sensitive production scenarios.

**We discuss these below:**

### Adapter Layers

Adapter layers, as proposed by Houlsby et al. (2019) and Lin et al. (2020), are **additional layers inserted into the Transformer architecture** to enable parameter-efficient fine-tuning.

While adapters have fewer parameters compared to the original model, they **introduce extra computation** that must be processed sequentially, leading to increased inference latency.

The latency issue becomes more prominent in online inference settings where the batch size is small (for example a batch size of 1).

Even with a small bottleneck dimension in the adapter layers, the added latency can be significant (e.g., 20-30% slower for GPT-2 medium on a single GPU). This problem is amplified when using model parallelism techniques, as the additional depth requires more synchronous GPU operations.
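For contrast with LoRA, here is a minimal sketch of a bottleneck adapter of the general kind described above: a down-projection, a non-linearity, an up-projection and a residual connection. The ReLU activation, the sizes and the weight scales are assumptions for illustration. Because of the non-linearity, the adapter cannot be folded into a neighbouring weight matrix, so it remains an extra sequential step at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, bottleneck = 64, 8              # illustrative sizes

W_down = rng.normal(scale=0.1, size=(d_model, bottleneck))
W_up = rng.normal(scale=0.1, size=(bottleneck, d_model))

def adapter(h):
    # Bottleneck adapter: down-project, non-linearity (ReLU assumed), up-project, residual.
    return h + np.maximum(h @ W_down, 0.0) @ W_up

h = rng.normal(size=(1, d_model))        # batch size 1, as in online inference
out = adapter(h)
print(out.shape)                         # (1, 64)

# A linear map f satisfies f(h) + f(-h) = 0; the adapter does not,
# so it cannot be merged into a single weight matrix.
print(np.allclose(adapter(h) + adapter(-h), 0.0))   # False
```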

**Optimising Input Layer Activations (Prompt Tuning)**


Another PEFT approach is prefix tuning (Li & Liang, 2021), which *directly optimises a portion of the input layer activations* (the prompt) while keeping the pre-trained model unchanged.

However, this method faces optimisation challenges and exhibits **non-monotonic performance** changes with respect to the number of trainable parameters.

#### Non-Monotonic Performance Changes in Prompt Tuning

**Non-monotonic performance** changes refer to **fluctuations in model performance** that do not consistently improve or degrade as the number of trainable parameters increases.

In the context of prompt tuning, this means that *increasing the number of trainable parameters does not guarantee a corresponding increase in model performance*.

Instead, performance may improve up to a point, then degrade, and potentially improve again, creating a non-linear relationship. These irregular performance patterns can complicate the optimisation process and make it challenging to determine the optimal parameter configuration for prompt tuning.

### LoRA in Practice

**What subset of weight matrices should be adapted for maximum downstream performance?**


The authors experimented with applying LoRA to different subsets of the **self-attention weight matrices** when fine-tuning GPT-3, with a budget of 18 million trainable parameters (compared with GPT-3's 175 billion parameters). The candidate **weight matrices** are the query, key, value, and output projections: W_q, W_k, W_v, and W_o.

The results suggest that even with a **low rank** (e.g., r=4), adapting multiple **weight matrices** captures more useful information than adapting a **single weight matrix** with a **higher rank**.
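The arithmetic behind a fixed budget of roughly 18 million parameters can be checked directly. The sketch assumes GPT-3 175B's dimensions (96 layers, hidden size 12288) and shows that adapting one matrix type at r=8, two at r=4, or all four at r=2 costs the same number of trainable parameters. The helper name `lora_budget` is made up for this example.

```python
n_layers, d_model = 96, 12288     # assumed GPT-3 175B dimensions

def lora_budget(n_matrix_types, r):
    """Trainable parameters when LoRA adapts n_matrix_types attention matrices per layer."""
    per_matrix = 2 * d_model * r          # B is d_model x r, A is r x d_model
    return n_layers * n_matrix_types * per_matrix

for n_types, r in [(1, 8), (2, 4), (4, 2)]:
    print(f"{n_types} matrix type(s) at r={r}: {lora_budget(n_types, r):,} parameters")
```

Each configuration comes to 18,874,368 parameters, so the comparison is about how to spend a fixed budget: on more matrices at lower rank, or on fewer matrices at higher rank.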

**What is the optimal rank for the adaptation matrix ∆W?**


The authors investigated the effect of the LoRA **rank r** on downstream performance.

They further analysed the subspace similarity between the learned adaptation matrices with different ranks (r=8 and r=64) and found that the *top singular vector directions overlapped significantly*, while the other directions did not.

This suggests that the additional directions learned with higher ranks might contain mostly random noise.

The authors conclude that the optimal **adaptation matrix** ∆W can indeed have a very low intrinsic rank, although they note that this may not hold for every task or dataset.
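One way to quantify this kind of overlap is a normalised subspace similarity, computed from the top singular vectors of the two matrices: the squared Frobenius norm of their inner-product matrix divided by min(i, j), which lies between 0 and 1. The sketch below assumes that form of the measure and uses synthetic matrices (one strong shared direction plus independent noise) in place of real learned adapters, so the numbers it prints are purely illustrative.

```python
import numpy as np

def subspace_similarity(M1, M2, i, j):
    """Normalised overlap between the top-i and top-j right singular subspaces."""
    _, _, Vt1 = np.linalg.svd(M1, full_matrices=False)
    _, _, Vt2 = np.linalg.svd(M2, full_matrices=False)
    overlap = Vt1[:i] @ Vt2[:j].T                 # (i, j) matrix of inner products
    return float(np.sum(overlap ** 2) / min(i, j))

rng = np.random.default_rng(0)
k = 128
shared = rng.normal(size=(1, k))
shared /= np.linalg.norm(shared)                  # one direction both matrices share

# Synthetic stand-ins for A learned with r=8 and r=64.
A_r8 = 100.0 * rng.normal(size=(8, 1)) @ shared + rng.normal(size=(8, k))
A_r64 = 100.0 * rng.normal(size=(64, 1)) @ shared + rng.normal(size=(64, k))

top = subspace_similarity(A_r8, A_r64, 1, 1)      # top direction only
rest = subspace_similarity(A_r8, A_r64, 8, 8)     # top-8 subspaces
print(f"top-1 overlap: {top:.3f}, top-8 overlap: {rest:.3f}")
```

In this synthetic setup the top direction overlaps almost perfectly while the wider subspaces overlap far less, which mirrors the qualitative pattern described above.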

**Connection between ∆W and W**


The authors found that ∆W amplifies directions that are already present in W but not emphasised there. The amplification factor is quite large (e.g., 21.5 for r=4 in the 48th layer of GPT-3).

This suggests that the low-rank adaptation matrix **amplifies important features for specific downstream tasks** that were *learned but not emphasised in the general pre-training model*.
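A sketch of how such an amplification factor could be computed is shown below. It assumes the factor is the ratio of the Frobenius norm of ∆W to the Frobenius norm of W projected onto the subspace spanned by ∆W's top singular vectors. The matrices are random stand-ins, so the printed value has no relation to the GPT-3 figure quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4                                           # illustrative sizes

W = rng.normal(size=(d, k))                                   # stand-in for pre-trained weights
delta_W = rng.normal(size=(d, r)) @ rng.normal(size=(r, k))   # stand-in rank-r update

# The top-r left/right singular vectors of the update span its subspace.
U, s, Vt = np.linalg.svd(delta_W, full_matrices=False)
U_r, Vt_r = U[:, :r], Vt[:r, :]

W_projected = U_r.T @ W @ Vt_r.T          # the part of W that lies in that subspace (r x r)
amplification = np.linalg.norm(delta_W) / np.linalg.norm(W_projected)
print(f"amplification factor: {amplification:.2f}")
```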

**Process for determining the optimal rank r for LoRA when fine-tuning**


1. Start with a low **rank r** (e.g., r=1 or r=2) and fine-tune the model on the downstream task.
2. Gradually increase the **rank r** (e.g., r=4, r=8) and compare the performance on a validation set.
3. If increasing the rank leads to significant improvements, continue increasing **rank r** until the performance gains plateau or the computational cost becomes too high.
4. Compare the performance and computational cost of different combinations of rank and adapted weight matrices to find the optimal configuration for the specific downstream task and resource constraints.

Keep in mind that the optimal rank may vary depending on the complexity of the downstream task, the size of the dataset, and the similarity between the pre-training and downstream domains.

It's also important to consider the trade-off between performance and computational efficiency when choosing the rank.
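The search procedure above can be sketched as a small loop. Here `evaluate` is a hypothetical callable that would fine-tune with a given rank and return a validation score, and `min_gain` is an assumed threshold for what counts as a worthwhile improvement. The toy scores are made-up numbers used only to exercise the loop.

```python
def select_rank(candidate_ranks, evaluate, min_gain=0.005):
    """Increase the rank until the validation gain falls below min_gain.

    `evaluate(rank)` is a hypothetical callable that fine-tunes with the given
    rank and returns a validation score (higher is better).
    """
    best_rank, best_score = None, float("-inf")
    for rank in sorted(candidate_ranks):
        score = evaluate(rank)
        if best_rank is not None and score - best_score < min_gain:
            break                          # gains have plateaued
        best_rank, best_score = rank, score
    return best_rank, best_score

# Toy stand-in for real fine-tuning runs: scores saturate as the rank grows.
toy_scores = {1: 0.700, 2: 0.730, 4: 0.741, 8: 0.743, 16: 0.744}
print(select_rank(toy_scores, toy_scores.get))   # (4, 0.741)
```

A real search would also vary which weight matrices are adapted and record the training cost of each run alongside its score.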

### Contents of A and B

Matrix A is initialised with a random Gaussian distribution, while matrix B is initialised to zero, so the update ∆W = BA is zero at the start of training.

These matrices learn to adapt the **pre-trained weights** to the specific downstream task by capturing the important patterns and transformations needed for the adaptation.

The learned values in these matrices represent the low-rank adaptation that modifies the pre-trained weights to better suit the downstream task.
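The effect of the initialisation can be checked directly. The sketch assumes the scheme used in the LoRA paper, with A drawn from a Gaussian and B set to zero; the scale of 0.01 and the sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 32, 32, 4                       # illustrative sizes

W0 = rng.normal(size=(d, k))              # pre-trained weights
A = rng.normal(scale=0.01, size=(r, k))   # random Gaussian initialisation
B = np.zeros((d, r))                      # zero initialisation

delta_W = B @ A
x = rng.normal(size=(k,))

print(np.count_nonzero(delta_W))                     # 0
print(np.array_equal((W0 + delta_W) @ x, W0 @ x))    # True
```

Because the update starts at exactly zero, the adapted model initially behaves identically to the pre-trained model, and training moves it away from that starting point only as far as the task requires.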

### Why LoRA is Better!

LoRA addresses the limitations of existing solutions by introducing a more efficient and effective approach to model adaptation:

**No Inference Latency:** Unlike adapter layers, LoRA does not introduce additional depth to the model. The low-rank adaptation matrices can be **merged with the pre-trained weights after fine-tuning**, resulting in no extra inference latency compared to a fully fine-tuned model.

**Compute and Memory Efficiency:** LoRA uses a low-rank representation to encode the task-specific parameter increments, significantly reducing the number of trainable parameters. This makes fine-tuning more compute- and memory-efficient, especially for large models.

**Optimisation Stability:** Compared to prompt tuning, LoRA optimises the model parameters directly, which leads to more stable optimisation and monotonic performance improvements with increased model capacity.

**Sequence Length Preservation:** LoRA does not require reserving a portion of the sequence length for adaptation, allowing the full sequence length to be used for downstream tasks, potentially leading to better performance.

**Flexibility and Composability:** LoRA is agnostic to the training objective and can be easily integrated with existing models and architectures. It is also composable with other adaptation techniques, such as prefix tuning, offering further flexibility.

**Enhanced Compatibility**: Works well alongside other fine-tuning techniques like adapters and prefix tuning.

## Conclusion

LoRA (Low-Rank Adaptation) is a technique for fine-tuning large language models that takes advantage of the intrinsically low rank of the weight updates.

Instead of updating the entire weight matrix during fine-tuning, LoRA decomposes the weight update matrix into two smaller matrices (A and B) with a lower **rank r**.

This significantly reduces the number of trainable parameters and memory requirements while maintaining performance comparable to full fine-tuning.

### Key Insights

**Swappable LoRA Modules**


One of the most significant advantages of LoRA is the ability to swap different fine-tuned LoRA modules for various tasks on a single base model.

This allows for a more flexible and efficient deployment of models, as you can easily switch between tasks without needing separate fully fine-tuned models for each task.

**Inference Time Swapping**


The swappable nature of LoRA modules can be used even at inference time.

This means that customers can choose which task they want the model to perform on-the-fly, without requiring multiple models to be running simultaneously. This is a powerful feature that sets LoRA apart from other adaptation methods.
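A minimal sketch of adapter swapping on a single layer follows. The task names and the random adapter values are hypothetical. The point is that one copy of the base weights is shared, and each task contributes only a small `(B, A)` pair that is selected per request.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 32, 32, 4                               # illustrative sizes
W0 = rng.normal(size=(d, k))                      # shared base weights

# Hypothetical task names; each adapter is just a (B, A) pair.
adapters = {
    "summarise": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
    "translate": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
}

def forward(x, task=None):
    """Run the layer with the base weights plus the chosen task's update."""
    h = x @ W0.T
    if task is not None:
        B, A = adapters[task]
        h = h + (x @ A.T) @ B.T
    return h

x = rng.normal(size=(1, k))
outputs = {task: forward(x, task) for task in adapters}

per_adapter = sum(m.size for m in adapters["summarise"])
print(W0.size, per_adapter)                       # 1024 256
```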

**Potential for Further Optimisation**


While LoRA is often applied to the attention weights (specifically query and value matrices) in transformer models, the technique could potentially be *applied to other weight matrices in the model*.

Exploring the application of LoRA to different components of the model architecture could lead to further optimisations and improvements.

**Balancing Rank and Performance**


The *rank of the low-rank matrices (A and B) in LoRA is a crucial hyperparameter* that determines the trade-off between model performance and efficiency. While lower ranks lead to greater memory savings and faster training, it's essential to find the right balance for each specific task. Experimenting with different rank values and evaluating the results on a validation set can help determine the optimal configuration.

**Implications for Model Accessibility**


By significantly reducing the memory requirements and training costs, LoRA makes fine-tuning large language models more accessible to a wider range of researchers and practitioners.

This could accelerate the development and deployment of specialized models for various tasks and domains.

**Handling large datasets**

Fine-tuning with LoRA on large datasets may still require significant computational resources, even though the number of trainable parameters is reduced. Strategies such as data parallelism, model parallelism, or gradient accumulation can be employed to handle large datasets efficiently.
