# Prefix-Tuning: Optimizing Continuous Prompts for Generation

This highly cited <mark style="color:blue;">**January 2021**</mark> paper introduced a new technique for efficiently fine-tuning language models (LMs) called <mark style="color:blue;">**prefix-tuning**</mark>.

This method addressed the challenges of efficiently adapting LMs to specific tasks while maintaining their generalisation capabilities and minimising the storage requirements for task-specific parameters.

Prefix tuning adapts pre-trained language models to specific tasks without modifying the original model's weights.

Prefix tuning draws inspiration from the concept of prompting, where task instructions and examples are prepended to the input to steer the LM's generation. However, instead of using discrete tokens, <mark style="color:yellow;">**prefix tuning uses a continuous prefix vector**</mark>.

{% embed url="https://arxiv.org/abs/2101.00190" %}
"Prefix-Tuning: Optimizing Continuous Prompts for Generation" by Xiang Lisa Li and Percy Liang
{% endembed %}

Prefix-tuning involves prepending a sequence of <mark style="color:blue;">**continuous task-specific vectors**</mark>, called a prefix, to the input of the LM.&#x20;

The Transformer can attend to these <mark style="color:blue;">**prefix vectors**</mark> as if they were a sequence of <mark style="color:yellow;">**"virtual tokens"**</mark>. Unlike prompting, the <mark style="color:blue;">**prefix vectors**</mark> do not correspond to real tokens but are learned during training.

### <mark style="color:blue;">Here's a step-by-step explanation</mark>

#### <mark style="color:green;">Soft Prompt Creation</mark>

* In prefix tuning, we create a <mark style="color:blue;">**tensor**</mark> called a "soft prompt" for each transformer block in the model.
* This soft prompt is a set of <mark style="color:blue;">**learnable parameters**</mark> that are specific to the task we want to adapt the model for.

#### <mark style="color:green;">Soft Prompt Processing</mark>

* Before using the soft prompt, it is passed through a set of <mark style="color:blue;">**fully connected layers**</mark>.
* These layers transform the soft prompt into a suitable representation that can be combined with the main input to the transformer block.
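To make this concrete, here is a minimal PyTorch sketch of a per-block soft prompt with its fully connected reparameterisation layers. The module name and dimensions are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

PREFIX_LEN = 10      # number of "virtual tokens" per block (assumed value)
HIDDEN_DIM = 768     # hidden size of the frozen base model (assumed value)
BOTTLENECK = 512     # width of the reparameterisation MLP (assumed value)

class SoftPrompt(nn.Module):
    """Learnable soft prompt for a single transformer block (illustrative)."""
    def __init__(self):
        super().__init__()
        # Trainable parameters: one row per prefix position.
        self.prompt = nn.Parameter(torch.randn(PREFIX_LEN, HIDDEN_DIM) * 0.02)
        # Fully connected layers that map the raw prompt into the
        # representation actually prepended to the block's input.
        self.mlp = nn.Sequential(
            nn.Linear(HIDDEN_DIM, BOTTLENECK),
            nn.Tanh(),
            nn.Linear(BOTTLENECK, HIDDEN_DIM),
        )

    def forward(self, batch_size: int) -> torch.Tensor:
        # Returns a (batch, prefix_len, hidden_dim) tensor ready for concatenation.
        return self.mlp(self.prompt).unsqueeze(0).expand(batch_size, -1, -1)
```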

#### <mark style="color:green;">Input Modification</mark>

* The transformed soft prompt is then concatenated with the main input to the transformer block.
* This concatenation happens along the sequence length dimension, meaning the soft prompt is added as additional tokens at the beginning of the input sequence.

#### <mark style="color:green;">Transformer Block Processing</mark>

* The modified input, which now includes the soft prompt, is passed through the standard transformer block operations.
* These operations include self-attention, layer normalisation, and feed-forward neural network layers, along with residual connections.
* The transformer block processes the input as usual, but now it also takes into account the information provided by the soft prompt.
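A sketch of how the transformed soft prompt is prepended along the sequence dimension and passed through an otherwise standard transformer block; PyTorch's `nn.TransformerEncoderLayer` stands in for the pre-trained block here, so treat this as an assumption-laden illustration rather than the paper's implementation.

```python
import torch
import torch.nn as nn

def prefixed_block_forward(block: nn.Module,
                           prefix: torch.Tensor,
                           hidden_states: torch.Tensor) -> torch.Tensor:
    # prefix:        (prefix_len, hidden_dim) — the transformed soft prompt.
    # hidden_states: (batch, seq_len, hidden_dim) — the block's normal input.
    batch = hidden_states.size(0)
    prefix = prefix.unsqueeze(0).expand(batch, -1, -1)
    # Concatenate along the sequence dimension: the prefix becomes extra
    # "virtual tokens" at the start of the sequence.
    extended = torch.cat([prefix, hidden_states], dim=1)
    # Self-attention, layer normalisation, feed-forward layers and residual
    # connections run exactly as usual over the extended sequence.
    out = block(extended)
    # Drop the prefix positions so the block's output shape matches its input.
    return out[:, prefix.size(1):, :]

# Example with a stand-in block; in practice this is a frozen pre-trained block.
block = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
prefix = torch.randn(10, 768)            # a transformed soft prompt
x = torch.randn(2, 16, 768)              # dummy block input
y = prefixed_block_forward(block, prefix, x)   # shape: (2, 16, 768)
```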

#### <mark style="color:green;">Training</mark>

* During training, only the soft prompts are updated, while the pre-trained model's weights remain frozen.
* The model learns to adapt to the specific task by adjusting the soft prompts based on the task-specific training data.
* By keeping the original model's weights unchanged, prefix tuning allows for efficient adaptation without the need for fine-tuning the entire model.
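One way this frozen-weights training setup might look in PyTorch; `base_model` and `soft_prompts` are placeholders for the pre-trained network and the collection of per-block prompts, not names from the paper.

```python
import torch
from torch import nn

def make_prefix_optimizer(base_model: nn.Module,
                          soft_prompts: nn.ModuleList,
                          lr: float = 5e-5) -> torch.optim.Optimizer:
    # Freeze every pre-trained weight: gradients still flow *through* the
    # frozen layers, but only the soft prompts accumulate updates.
    for param in base_model.parameters():
        param.requires_grad = False
    # Hand only the soft-prompt parameters to the optimiser.
    return torch.optim.AdamW(soft_prompts.parameters(), lr=lr)
```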

The key idea behind prefix tuning is that by adding task-specific soft prompts to each transformer block, the model can learn to condition its behavior based on the prompts.

The soft prompts act as a "prefix" that guides the model's attention and computation towards the relevant information for the task at hand.

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2Fdkhgf7D0JNx86ICcludX%2Fimage.png?alt=media&#x26;token=4f229d51-fc58-465a-83ec-61517f44ee0a" alt="" width="563"><figcaption></figcaption></figure>

### <mark style="color:purple;">A more granular example</mark>

#### <mark style="color:green;">Define the prefix</mark>

* The prefix is a sequence of <mark style="color:blue;">**continuous vectors**</mark> that are <mark style="color:yellow;">**prepended to the input sequence**</mark>.
* The length of the prefix is a <mark style="color:blue;">**hyperparameter**</mark> that you can choose based on the complexity of the target personality and the available computational resources. In the paper, prefixes of roughly 10 tokens suffice for table-to-text generation, while summarisation benefits from longer prefixes of around 200 tokens.
* The prefix is initialised as a trainable matrix $$P$$ of size $$(\text{prefix\_length}, \text{embedding\_dimension})$$, where $$\text{prefix\_length}$$ is the <mark style="color:blue;">**number of prefix tokens**</mark> and $$\text{embedding\_dimension}$$ is the <mark style="color:blue;">**size of the model's word embeddings**</mark>.
* Each row of the <mark style="color:blue;">**prefix matrix**</mark> $$P$$ corresponds to a <mark style="color:blue;">**prefix token**</mark>, and the values in that row represent the embedding of that token.
* The <mark style="color:blue;">**prefix matrix**</mark> $$P$$ is randomly initialised or initialised using the activations of real words that are relevant to the target personality. Initialising with relevant words can provide a good starting point for the prefix and potentially speed up convergence during training.

#### <mark style="color:green;">Modify the model architecture</mark>

* In the Transformer architecture, the <mark style="color:blue;">**input sequence**</mark> is typically represented as a <mark style="color:blue;">**matrix**</mark> of <mark style="color:blue;">**word embeddings**</mark>, where each row corresponds to a <mark style="color:blue;">**token in the sequence**</mark>.
* To incorporate the prefix, you concatenate the prefix matrix $$P$$ with the input embeddings matrix along the sequence dimension (usually axis 1). This results in a <mark style="color:blue;">**new input matrix**</mark> $$[P; X]$$, where $$X$$ is the <mark style="color:blue;">**original input embeddings matrix**</mark>.
* During the <mark style="color:blue;">**forward pass**</mark>, the concatenated matrix $$[P; X]$$ is passed through the Transformer layers, which include self-attention and feed-forward layers.
* The <mark style="color:blue;">**self-attention mechanism**</mark> in the Transformer layers allows the prefix tokens to attend to and influence the representations of the input tokens, effectively steering the model's behavior.
* Importantly, during training, only the <mark style="color:blue;">**prefix matrix**</mark> $$P$$ is updated, while the pre-trained model's parameters (i.e., the weight matrices in the Transformer layers) remain frozen. This ensures that the prefix adapts to the target personality while preserving the general language understanding captured by the pre-trained model.
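A sketch of this modified forward pass, with $$[P; X]$$ concatenated along the sequence dimension and the pre-trained layers frozen; a small `nn.TransformerEncoder` stands in for the real model, which is an assumption of the sketch.

```python
import torch
import torch.nn as nn

def forward_with_prefix(P: nn.Parameter,
                        X: torch.Tensor,
                        transformer: nn.Module) -> torch.Tensor:
    # X: (batch, seq_len, embedding_dim) — the original input embeddings matrix.
    prefix = P.unsqueeze(0).expand(X.size(0), -1, -1)   # broadcast P over the batch
    hidden = torch.cat([prefix, X], dim=1)              # [P; X] along axis 1
    # Self-attention lets the prefix tokens influence every input token;
    # the transformer's weights stay frozen, but gradients still flow
    # back through them to P.
    return transformer(hidden)

# Stand-in for a pre-trained model, with its parameters frozen.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
for p in encoder.parameters():
    p.requires_grad = False

P = nn.Parameter(torch.randn(10, 768) * 0.02)
X = torch.randn(4, 32, 768)                  # dummy input embeddings
out = forward_with_prefix(P, X, encoder)     # shape: (4, 10 + 32, 768)
```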

#### <mark style="color:green;">Train the prefix</mark>

* To train the prefix, you use a prepared dataset that <mark style="color:yellow;">**consists of input-output pairs**</mark>, where the input is a prompt or context and the output is the corresponding response that reflects the desired personality.
* During training, you feed the <mark style="color:blue;">**input sequence**</mark> through the modified model architecture, which includes the <mark style="color:blue;">**prefix matrix**</mark> $$P$$ concatenated with the <mark style="color:blue;">**input embeddings**</mark>.
* The model generates a probability distribution over the vocabulary for each position in the output sequence, and you compute a language modeling loss (e.g., cross-entropy loss) between the predicted probabilities and the true output tokens.
* The <mark style="color:blue;">**gradients of the loss**</mark> with respect to the <mark style="color:blue;">**prefix matrix**</mark> $$P$$ are computed using <mark style="color:blue;">**backpropagation**</mark>, and the prefix matrix is updated using an optimisation algorithm like Adam.
* The pre-trained model's parameters remain fixed during training, so only the prefix matrix $$P$$ is updated to minimise the language modeling loss.
* You can experiment with different hyperparameters such as the learning rate, batch size, and number of training epochs to find the optimal configuration that achieves the best performance on a validation set.
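Putting these pieces together, here is a minimal training loop under the same assumptions: toy dimensions, a stand-in transformer, and random tensors in place of real input-output pairs.

```python
import torch
import torch.nn as nn

vocab_size, d_model, prefix_len = 1000, 768, 10

# Stand-ins for the pre-trained model's components; all of them stay frozen.
embeddings = nn.Embedding(vocab_size, d_model)
layers = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True), num_layers=2)
lm_head = nn.Linear(d_model, vocab_size)
for module in (embeddings, layers, lm_head):
    for p in module.parameters():
        p.requires_grad = False

# Only the prefix matrix P is trainable.
P = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)
optimizer = torch.optim.Adam([P], lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Dummy input-output pairs standing in for the personality dataset.
input_ids = torch.randint(0, vocab_size, (4, 32))
target_ids = torch.randint(0, vocab_size, (4, 32))

for step in range(100):
    X = embeddings(input_ids)                              # (batch, seq, d_model)
    prefix = P.unsqueeze(0).expand(X.size(0), -1, -1)
    hidden = layers(torch.cat([prefix, X], dim=1))         # forward over [P; X]
    logits = lm_head(hidden[:, prefix_len:, :])            # drop prefix positions
    loss = loss_fn(logits.reshape(-1, vocab_size), target_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                        # gradients reach only P
    optimizer.step()
```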

By training the <mark style="color:blue;">**prefix matrix**</mark> $$P$$ while keeping the pre-trained model's parameters frozen, you allow the prefix to adapt to the target personality while leveraging the general language understanding captured by the pre-trained model.&#x20;

The prefix acts as a <mark style="color:yellow;">**"soft prompt"**</mark> that steers the model's behavior towards generating responses that align with the desired personality.

### <mark style="color:purple;">Prefix tuning has several advantages</mark>

* It allows for efficient adaptation of pre-trained models to new tasks without modifying the original model's weights.
* It requires fewer trainable parameters compared to fine-tuning the entire model, making it more computationally efficient.
* It can be applied to any pre-trained transformer-based model without the need for task-specific architectures.

The diagram below illustrates how a transformer block is modified for prefix tuning:

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FMgHauNjbSGw8qmgUSxUn%2Fimage.png?alt=media&#x26;token=1d0b7508-abde-4cba-9e9d-c59d0e9d37ce" alt="" width="375"><figcaption><p><em>A transformer block modified for prefix tuning</em></p></figcaption></figure>

### <mark style="color:purple;">Utility Benefits of Prefix Tuning</mark>

Prefix-tuning allows for the independent training of tasks, enabling scalable personalisation without data cross-contamination. &#x20;

Each user's data can be isolated, and a *<mark style="color:yellow;">**personalised prefix can be trained for each user**</mark>*, ensuring privacy and modularity.  The independence of tasks also enables efficient batching across users and the creation of ensembles of multiple prefixes trained on the same task.
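As an illustration of this modularity, per-user prefixes can be kept in a small table and swapped in at batch-construction time while a single frozen base model is shared; the names and sizes below are assumptions for the sketch, not part of the paper.

```python
import torch
import torch.nn as nn

prefix_len, d_model = 10, 768

# One small prefix per user; the frozen base model is shared by everyone.
user_prefixes = nn.ParameterDict({
    "user_a": nn.Parameter(torch.randn(prefix_len, d_model) * 0.02),
    "user_b": nn.Parameter(torch.randn(prefix_len, d_model) * 0.02),
})

def build_batch(user_ids: list, input_embeddings: torch.Tensor) -> torch.Tensor:
    # Stack each example's own prefix, then prepend it: requests from different
    # users can share a batch because only the prefixes differ between them.
    prefixes = torch.stack([user_prefixes[u] for u in user_ids])
    return torch.cat([prefixes, input_embeddings], dim=1)

batch = build_batch(["user_a", "user_b"], torch.randn(2, 32, d_model))  # (2, 42, 768)
```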

### <mark style="color:purple;">Experimental Proof</mark>

The paper demonstrates the effectiveness of prefix-tuning through extensive experiments on various natural language generation tasks, such as table-to-text generation and summarisation. &#x20;

The results show that prefix-tuning outperforms other lightweight fine-tuning methods, such as <mark style="color:blue;">**adapter-tuning**</mark>, while using substantially fewer parameters.  It achieves performance comparable to full fine-tuning, especially in low-data regimes and when generalising to unseen topics.

The authors also explore the impact of prefix length on the model's performance, revealing that there is an <mark style="color:yellow;">**optimal prefix length for each task**</mark>.

Increasing the prefix length up to a certain threshold improves performance, but further increases lead to diminishing returns and potential overfitting.

Furthermore, the paper compares prefix-tuning with an embedding-only approach, where only the embeddings of the virtual tokens are optimised.&#x20;

The results demonstrate that the <mark style="color:yellow;">embedding-only approach lacks the expressiveness necessary to achieve optimal performance</mark>, highlighting the importance of optimising the prefix vectors across all layers of the LM.

The discussion section of the paper emphasises the potential of prefix-tuning for real-world applications, particularly in scenarios requiring personalisation, privacy, efficiency, and scalability.&#x20;

The modularity and independence of tasks make <mark style="color:yellow;">prefix-tuning suitable for enterprise-level applications where customer-specific interactions and computational efficiency are crucial.</mark>

In conclusion, the paper introduces a powerful and efficient method for fine-tuning large language models.  By optimising continuous prompts in the form of prefix vectors, prefix-tuning achieves strong performance while significantly reducing the storage requirements for task-specific adaptations.&#x20;

The technique's modularity, privacy-preserving nature, and scalability make it particularly suitable for real-world applications and enterprise-level deployments.&#x20;

