# P-Tuning

This <mark style="color:blue;">**March 2021**</mark> paper introduced a method called <mark style="color:blue;">**P-Tuning**</mark>.

P-Tuning aims to improve and stabilise the performance of prompting in natural language tasks by *<mark style="color:yellow;">**using continuous prompt embeddings instead of discrete prompt tokens.**</mark>*

The main idea is to *<mark style="color:yellow;">**concatenate learnable continuous prompt embeddings with the input tokens**</mark>* and optimise them through backpropagation to achieve better task performance and reduce the instability caused by discrete prompts.

To add the continuous prompt tokens to the model, you *<mark style="color:yellow;">**modify the embedding layer of the Transformer to include the additional learnable embeddings**</mark>*.  These embeddings are then concatenated with the input token embeddings before being passed through the self-attention layers.

{% embed url="https://arxiv.org/abs/2103.10385" %}
"P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks" by Xiao Liu et al
{% endembed %}
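
As a rough illustration of the idea above (a minimal PyTorch sketch with assumed dimensions, not the authors' implementation), the extra learnable embeddings can live in a small parameter table that is prepended to the ordinary token embeddings before the self-attention layers:

```python
import torch
import torch.nn as nn

class PromptedEmbedding(nn.Module):
    """Ordinary token embeddings plus a block of learnable 'virtual token' embeddings.

    `vocab_size`, `hidden_dim` and `num_prompt_tokens` are illustrative values,
    not numbers taken from the paper.
    """

    def __init__(self, vocab_size=30522, hidden_dim=768, num_prompt_tokens=20):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, hidden_dim)
        # Continuous prompt: randomly initialised and updated by backpropagation.
        self.prompt_embeddings = nn.Parameter(torch.randn(num_prompt_tokens, hidden_dim))

    def forward(self, input_ids):                               # (B, seq_len)
        tokens = self.token_embeddings(input_ids)               # (B, seq_len, H)
        prompts = self.prompt_embeddings.unsqueeze(0).expand(input_ids.size(0), -1, -1)
        # Prepend the virtual tokens; the result is what the Transformer blocks see.
        return torch.cat([prompts, tokens], dim=1)              # (B, P + seq_len, H)
```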

### <mark style="color:purple;">How does P-Tuning differ from traditional prompting?</mark>

In traditional prompting, you would use *<mark style="color:yellow;">**fixed, manually-created prompts**</mark>* to guide the language model to perform a specific task.&#x20;

For example, if you want the model to answer a question about a country's capital, you might use a prompt like:

"The capital of \[country] is \[answer]."&#x20;

Here, "\[country]" and "\[answer]" are placeholders that will be replaced with the actual country and the model's predicted answer, respectively.

However, *<mark style="color:yellow;">**creating these prompts manually can be time-consuming and may not always lead to the best performance on the task.**</mark>*  This is where P-Tuning comes in.

Instead of using fixed, discrete prompts, P-Tuning introduces learnable, continuous prompt embeddings. These *<mark style="color:yellow;">**embeddings are like a set of "virtual" words that are learned during the training process.**</mark>*&#x20;

They are called "continuous" because they are represented as real-valued vectors, as opposed to discrete tokens like words.

### <mark style="color:purple;">**Here's a simplified step-by-step explanation of how P-Tuning works**</mark>

1. You <mark style="color:yellow;">define a prompt template</mark> that includes placeholders for the input (e.g., the question), the output (e.g., the answer), and the continuous prompt embeddings. These embeddings are randomly initialised at the beginning.
2. The continuous <mark style="color:yellow;">prompt embeddings are added to the embedding layer</mark> of the Transformer model, along with the embeddings of the actual input tokens and output labels.
3. An <mark style="color:yellow;">additional mapping function is used</mark> to map the continuous prompt embeddings to the hidden states of the model. This function can be a simple neural network such as a <mark style="color:blue;">**Long Short-Term Memory (LSTM)**</mark> network or a <mark style="color:blue;">**Multilayer Perceptron (MLP)**</mark>.
4. During training, the <mark style="color:yellow;">continuous prompt embeddings are updated based on the model's performance on the task</mark>. The model learns to adjust these embeddings to minimise the task-specific loss, just like it learns to adjust its other parameters.
5. At inference time, the learned <mark style="color:yellow;">continuous prompt embeddings are combined with the input tokens</mark> and fed into the Transformer model to generate predictions.
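
Put together, the steps above might look roughly like the sketch below. This is an illustration only, not the authors' code; it assumes a Hugging Face `transformers`-style backbone (which exposes `get_input_embeddings()` and accepts `inputs_embeds` and `labels`), and the prompt length and encoder are arbitrary choices.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForMaskedLM

backbone = AutoModelForMaskedLM.from_pretrained("bert-base-cased")  # illustrative backbone
hidden_dim = backbone.config.hidden_size
num_prompt_tokens = 20                                              # illustrative length

# Step 1: randomly initialised continuous prompt embeddings ("virtual tokens").
prompt_embeds = nn.Parameter(torch.randn(num_prompt_tokens, hidden_dim))

# Step 3: a small prompt encoder (here an MLP) maps them into the model's input space.
prompt_encoder = nn.Sequential(
    nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim)
)

def p_tuning_loss(input_ids, labels):
    # Step 2: look up the embeddings of the real input tokens.
    token_embeds = backbone.get_input_embeddings()(input_ids)        # (B, L, H)
    # Step 3: encode the prompt and broadcast it across the batch.
    prompts = prompt_encoder(prompt_embeds).unsqueeze(0)             # (1, P, H)
    prompts = prompts.expand(input_ids.size(0), -1, -1)              # (B, P, H)
    # Concatenate the virtual tokens with the real token embeddings.
    inputs_embeds = torch.cat([prompts, token_embeds], dim=1)        # (B, P + L, H)
    # Align the labels with the longer sequence; -100 positions are ignored by the loss.
    pad = torch.full((input_ids.size(0), num_prompt_tokens), -100,
                     dtype=labels.dtype, device=labels.device)
    labels = torch.cat([pad, labels], dim=1)
    # Steps 4-5: run the Transformer and compute the task loss; backpropagation
    # updates prompt_embeds and prompt_encoder so the prompt adapts to the task.
    return backbone(inputs_embeds=inputs_embeds, labels=labels).loss
```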

The key idea is that *<mark style="color:yellow;">**by learning these continuous prompt embeddings, the model can automatically discover the best "prompts" for the task during training**</mark>*, rather than relying on manually-created, fixed prompts.&#x20;

This can lead to better performance and more flexibility in adapting the model to different tasks.

<details>

<summary><mark style="color:blue;"><strong>Definition:</strong></mark> <mark style="color:blue;"><strong>LSTM (Long Short-Term Memory) and MLP (Multilayer Perceptron)</strong></mark> </summary>

LSTM (Long Short-Term Memory) and MLP (Multilayer Perceptron) are two types of neural network architectures that can be used as the mapping function in P-Tuning to transform the continuous prompt embeddings into the hidden states of the model.

<mark style="color:green;">**LSTM (Long Short-Term Memory)**</mark>

* LSTM is a type of recurrent neural network (RNN) architecture designed to handle sequential data and capture long-term dependencies.
* It consists of a unique cell state and multiple gating mechanisms (input gate, forget gate, and output gate) that regulate the flow of information in and out of the cell.
* The cell state acts as a memory unit, allowing the LSTM to selectively remember or forget information over long sequences.
* LSTMs are particularly effective in tasks involving sequential data, such as natural language processing, speech recognition, and time series analysis.
* In the context of P-Tuning, an LSTM can be used to process the continuous prompt embeddings and generate hidden states that capture the contextual information and long-term dependencies within the prompts.

<mark style="color:green;">**MLP (Multilayer Perceptron)**</mark>

* MLP is a feedforward neural network architecture consisting of multiple layers of interconnected nodes (neurons).
* It has an input layer, one or more hidden layers, and an output layer.
* Each neuron in an MLP applies a nonlinear activation function to a weighted sum of its inputs, allowing the network to learn complex nonlinear mappings between the input and output.
* MLPs are versatile and can be used for a wide range of tasks, including classification, regression, and feature learning.
* In the context of P-Tuning, an MLP can be used to transform the continuous prompt embeddings into hidden states by applying a series of linear transformations and nonlinear activations.

Both LSTM and MLP can be used as the mapping function in P-Tuning, depending on the specific requirements of the task and the nature of the prompt embeddings.&#x20;

LSTMs are particularly suitable when the prompts have a sequential structure and capturing long-term dependencies is important.&#x20;

MLPs, on the other hand, are simpler and more straightforward, making them a good choice when the prompts do not have a strong sequential nature or when computational efficiency is a priority.

The choice between LSTM and MLP as the mapping function in P-Tuning ultimately depends on the characteristics of the task, the complexity of the prompts, and the available computational resources.&#x20;

Experimenting with both architectures and comparing their performance can help determine the most suitable choice for a given application.
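
As a loose sketch (illustrative PyTorch with an assumed hidden size, not the paper's exact code), the two options could look like this:

```python
import torch
import torch.nn as nn

class LSTMPromptEncoder(nn.Module):
    """Re-encodes the prompt sequence with a bidirectional LSTM plus an MLP head,
    so each output position can depend on the whole prompt."""

    def __init__(self, hidden_dim=768):
        super().__init__()
        self.lstm = nn.LSTM(hidden_dim, hidden_dim // 2, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden_dim, hidden_dim),
                                  nn.ReLU(),
                                  nn.Linear(hidden_dim, hidden_dim))

    def forward(self, prompt_embeds):              # (num_prompt_tokens, hidden_dim)
        out, _ = self.lstm(prompt_embeds.unsqueeze(0))
        return self.head(out.squeeze(0))           # same shape as the input


class MLPPromptEncoder(nn.Module):
    """Transforms each prompt position independently with a feedforward network."""

    def __init__(self, hidden_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim),
                                 nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))

    def forward(self, prompt_embeds):
        return self.mlp(prompt_embeds)
```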

</details>

### <mark style="color:purple;">What does "concatenate learnable continuous prompt embeddings" mean?</mark>

It means we are combining <mark style="color:yellow;">**two types**</mark> of embeddings:

#### <mark style="color:green;">**Input token embeddings**</mark>

These are the embeddings of the <mark style="color:yellow;">**actual input tokens**</mark> (words or subwords) that represent the text data we want to process.  In the Transformer architecture, each input token is mapped to a dense vector representation (embedding) that captures its semantic meaning.

#### <mark style="color:green;">Learnable continuous prompt embeddings</mark>

These are <mark style="color:yellow;">**additional embeddings**</mark> that are not associated with any specific input token but are learned during the training process.&#x20;

They are called "continuous" because they are *<mark style="color:yellow;">**represented as dense vectors in a continuous space**</mark>*, as opposed to discrete tokens. These embeddings serve as a "prompt" that guides the model to perform better on the specific task.

The process of concatenation involves *<mark style="color:yellow;">**joining these two types of embeddings together to form a single input sequence**</mark>*.&#x20;

The key difference between using learnable continuous prompt embeddings and discrete prompts is that the *<mark style="color:yellow;">**continuous embeddings are optimised through backpropagation during training.**</mark>*&#x20;

This means that the model can learn to adjust these embeddings based on the specific task and the training data, allowing for more flexibility and adaptability. In contrast, discrete prompts are fixed and cannot be optimised during training.

By optimising the continuous prompt embeddings through backpropagation, the model can learn to generate more informative and stable prompts, which can lead to better task performance and reduce the instability caused by manually-crafted discrete prompts.
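
To make that concrete, here is a hedged sketch of the optimisation loop, reusing `backbone`, `prompt_embeds`, `prompt_encoder` and `p_tuning_loss` from the earlier sketch and a hypothetical `train_loader`. Freezing the backbone is shown as one common setting; P-Tuning can also be combined with fine-tuning the backbone itself.

```python
import torch

# Freeze the language model so that only the prompt parameters receive gradients.
for param in backbone.parameters():
    param.requires_grad_(False)

optimizer = torch.optim.Adam(
    [prompt_embeds] + list(prompt_encoder.parameters()), lr=1e-4  # illustrative lr
)

for input_ids, labels in train_loader:        # hypothetical DataLoader of task examples
    loss = p_tuning_loss(input_ids, labels)
    optimizer.zero_grad()
    loss.backward()                           # gradients flow only into the prompts
    optimizer.step()
```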

### <mark style="color:purple;">An example of concatenation in P-Tuning</mark>

Let's break it down into a simple, everyday example to better understand the concept of concatenating input token embeddings and learnable continuous prompt embeddings.

Imagine you're planning a trip and have a list of essential items you need to pack:

* Toothbrush
* Toothpaste
* Shampoo
* Conditioner
* Clothes

These items are like the *<mark style="color:yellow;">**input token embeddings**</mark>* - they are the basic elements you need for your trip.

Now, to make your trip more organised and enjoyable, you decide to add some additional items to your list:

* Travel-sized toothbrush case
* Travel-sized toothpaste tube
* Travel-sized shampoo bottle
* Travel-sized conditioner bottle
* Laundry bag for dirty clothes

These additional items are like the *<mark style="color:yellow;">**learnable continuous prompt embeddings**</mark>* - they enhance and support the basic elements of your trip.

These embeddings *<mark style="color:yellow;">**are not tied to specific words but are learned during the training process to guide the language model in generating relevant and coherent packing lists.**</mark>*

The process of concatenation is like combining these two lists into a single, comprehensive packing list:

\[Travel-sized toothbrush case, Toothbrush, Travel-sized toothpaste tube, Toothpaste, Travel-sized shampoo bottle, Shampoo, Travel-sized conditioner bottle, Conditioner, Laundry bag for dirty clothes, Clothes]

By concatenating the additional items with the essential items, you create a single, organised list that helps you better prepare for your trip.

### <mark style="color:purple;">The process</mark>

The concatenated input sequence, containing both the input token embeddings and the learnable continuous prompt embeddings, is then fed into the language model.

The language model processes this unified input sequence and learns to generate coherent, relevant outputs based on the provided context and prompts (in our analogy, a sensible travel packing list).

By concatenating the learnable continuous prompt embeddings with the input token embeddings, P-Tuning allows the language model to leverage both the semantic information from the actual input tokens and the guiding information from the learned prompts.&#x20;

This concatenation helps the model generate more accurate and context-aware outputs for the specific task at hand.

### <mark style="color:purple;">Diagram of Concept from the Paper</mark>

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FzSwA5PL0DnshI6LADoYk%2Fimage.png?alt=media&#x26;token=68d1aadd-13d7-4623-861b-09d63fbb3212" alt=""><figcaption><p>An example of prompt search for “The capital of Britain is [MASK]”. Given the context (blue zone, “Britain”) and target (red zone, “[MASK]”), the orange zone refer to the prompt. In (a), the prompt generator only receives discrete rewards; on the contrary, in (b) the continuous prompt embeddings and prompt encoder can be optimized in a differentiable way.</p></figcaption></figure>

### <mark style="color:purple;">The key advantages of P-Tuning include</mark>

<mark style="color:blue;">**Improved performance:**</mark> By learning optimal prompt embeddings during training, P-Tuning enables language models to achieve better results on a wide range of natural language understanding tasks.

<mark style="color:blue;">**Increased flexibility:**</mark> P-Tuning allows language models to adapt more effectively to different tasks and domains by learning task-specific prompts, reducing the need for extensive fine-tuning or manual prompt engineering.

<mark style="color:blue;">**Enhanced interpretability:**</mark> The learned continuous prompt embeddings provide insights into the language model's behaviour and the important aspects of the task, making the model's decisions more interpretable and explainable.

<mark style="color:blue;">**Efficient adaptation:**</mark> P-Tuning offers a more efficient way to adapt language models to new tasks, as it focuses on learning prompts rather than modifying the entire model architecture or weights.
