
The Power of Scale for Parameter-Efficient Prompt Tuning

This highly cited September 2021 paper introduced "prompt tuning," a method for adapting large pre-trained language models to perform specific downstream tasks by learning "soft prompts" that condition the model's behaviour.

Prompt tuning was one of the first parameter-efficient fine-tuning (PEFT) techniques.

Instead of using discrete text prompts like GPT-3, the authors propose learning continuous "soft prompts" through backpropagation. These soft prompts can incorporate signals from labeled examples and outperform GPT-3's few-shot learning.

Through experiments with the T5 model, the authors show that prompt tuning becomes more competitive with model tuning (where all model weights are tuned) as the model size increases. With billion-parameter models, prompt tuning can match the performance of model tuning.

Prompt tuning is more parameter-efficient than model tuning, as a single frozen model can be reused for multiple downstream tasks by learning task-specific prompts. This is especially beneficial for large models that are costly to share and serve.

The authors compare prompt tuning to similar approaches like "prefix tuning" (Li and Liang, 2021) and show that prompt tuning alone, without intermediate-layer prefixes or task-specific output layers, is sufficient to be competitive with model tuning.

Prompt tuning has additional benefits, such as better resilience to domain shifts compared to model tuning, and the ability to perform efficient "prompt ensembling" by learning multiple prompts for the same task.

Key Features of Prompt Tuning

Parameter efficiency

Prompt tuning is highly parameter-efficient compared to other methods. For models with over a billion parameters, the task-specific prompt amounts to less than 0.01% of the model's parameters, making prompt tuning the most parameter-efficient of the compared methods that have learnable parameters.

In contrast, model tuning requires a separate copy of the entire model for each task, while related methods such as prefix tuning, WARP, and adapters introduce more task-specific parameters.
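
To make the "less than 0.01%" figure concrete, here is a rough back-of-the-envelope calculation; the prompt length, embedding size, and model size below are illustrative assumptions in the spirit of the paper's T5-XXL setting, not exact figures from the paper.

```python
# Task-specific parameters trained by prompt tuning vs. total model size.
prompt_length = 100              # number of soft-prompt tokens (assumed)
embedding_dim = 4096             # embedding size of an XXL-scale model (assumed)
model_params = 11_000_000_000    # ~11B parameters in the frozen backbone (assumed)

prompt_params = prompt_length * embedding_dim        # all that is trained per task
share = prompt_params / model_params * 100

print(f"Soft-prompt parameters per task: {prompt_params:,}")   # 409,600
print(f"Share of total model parameters: {share:.4f}%")        # ~0.0037%, well under 0.01%
```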

Continuous vs. discrete prompts

Rather than the hand-crafted, discrete text prompts used by GPT-3, prompt tuning learns continuous prompt embeddings via backpropagation. Because these soft prompts are trained on labeled examples, they can condense task signal in a way that outperforms GPT-3's few-shot learning.

Prompt location

Prompt tuning prepends the soft prompts only to the input embeddings, whereas methods like prefix tuning (Li and Liang, 2021) prepend trainable prefixes at every transformer layer.

Because only the input representation is modified, the frozen transformer remains free to update its intermediate-layer task representations as contextualised by the input example.
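
A minimal PyTorch-style sketch of this prepending step is shown below; the module name, shapes, and initialisation are my own illustration rather than the paper's actual code.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt embeddings prepended to the input embeddings."""

    def __init__(self, prompt_length: int, embedding_dim: int):
        super().__init__()
        # The only trainable parameters: one embedding vector per prompt token.
        self.prompt = nn.Parameter(torch.randn(prompt_length, embedding_dim) * 0.5)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, dim) produced by the frozen embedding table
        batch_size = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        # Concatenate along the sequence dimension: [P_1 ... P_k, x_1 ... x_n]
        return torch.cat([prompt, input_embeds], dim=1)

# Usage: embed the tokens with the frozen model's embedding table, then prepend.
vocab_embedding = nn.Embedding(32128, 512)        # stand-in for a frozen embedding table
soft_prompt = SoftPrompt(prompt_length=20, embedding_dim=512)

token_ids = torch.randint(0, 32128, (2, 16))      # dummy batch of token ids
extended = soft_prompt(vocab_embedding(token_ids))
print(extended.shape)                             # (2, 20 + 16, 512)
```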

Frozen language model

Prompt tuning keeps the pre-trained language model frozen and only tunes the soft prompts. This prevents the model from overfitting to specific datasets by memorising spurious correlations, leading to improved robustness to domain shifts compared to model tuning.
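
In training terms, freezing amounts to disabling gradients on every pre-trained weight and handing only the prompt parameters to the optimiser. A minimal sketch, using a placeholder module in place of the real pre-trained model and an illustrative learning rate:

```python
import torch
import torch.nn as nn

backbone = nn.TransformerEncoderLayer(d_model=512, nhead=8)  # placeholder for the frozen LM
soft_prompt = nn.Parameter(torch.randn(20, 512))             # the only task-specific weights

for param in backbone.parameters():
    param.requires_grad = False        # no gradients flow into the pre-trained weights

# Only the prompt is optimised; the backbone never changes.
optimizer = torch.optim.Adam([soft_prompt], lr=0.3)
```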

Efficient ensembling

Prompt tuning enables efficient "prompt ensembling" by learning multiple prompts for the same task while sharing the core language model parameters. This improves performance and reduces storage and inference costs compared to traditional model ensembling.

What is ensembling?

Prompt ensembling is a technique that involves training multiple sets of soft prompts for the same task using a single frozen pre-trained language model.

Each set of prompts can be viewed as a separate "model" that adapts the language model to the specific task.

By combining the predictions from these prompt-based models, we can build an ensemble that typically beats the average individual prompt and matches or exceeds the best single prompt.

The main advantages of prompt ensembling are:

Improved performance: Ensembling multiple prompts leads to better task performance compared to the average single prompt and often matches or exceeds the best individual prompt.

Parameter efficiency: Prompt ensembling allows for the creation of multiple task-specific models while sharing the same core language model parameters. This drastically reduces storage costs compared to traditional model ensembling, where each model in the ensemble is a separate copy of the entire model.

Inference efficiency: During inference, instead of running multiple forward passes for each model in the ensemble, we can process the input with a single forward pass using a batch size equal to the number of prompts in the ensemble. This makes inference more efficient compared to traditional model ensembling.
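
A sketch of that single-forward-pass trick, with random placeholder tensors and a stand-in encoder layer instead of the paper's frozen T5:

```python
import torch
import torch.nn as nn

num_prompts, prompt_len, seq_len, dim = 5, 20, 16, 512

prompts = torch.randn(num_prompts, prompt_len, dim)    # 5 independently trained soft prompts
example_embeds = torch.randn(1, seq_len, dim)          # one embedded input example

# Replicate the example across the ensemble dimension and prepend a different prompt to each copy.
batch = torch.cat([prompts, example_embeds.expand(num_prompts, -1, -1)], dim=1)

frozen_encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
with torch.no_grad():
    outputs = frozen_encoder(batch)    # (5, 36, 512): the whole ensemble in one forward pass
```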

Here's an example of how prompt ensembling works

Let's say we have a sentiment analysis task where we need to classify movie reviews as positive or negative.

We start by training five different sets of soft prompts (P1, P2, P3, P4, P5) on the same training data using a single frozen pre-trained language model (e.g., T5-XXL).

During inference, given a new movie review, we prepend each set of prompts to the input and run a single forward pass with a batch size of five. This gives us five different sentiment predictions, one for each prompt:

  • P1: Positive

  • P2: Positive

  • P3: Negative

  • P4: Positive

  • P5: Positive

To get the final ensemble prediction, we can use a simple majority voting scheme. In this case, four out of five prompts predict "Positive," so the final ensemble prediction is "Positive."
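
In code, the vote is a one-liner over the five hypothetical predictions above:

```python
from collections import Counter

predictions = ["Positive", "Positive", "Negative", "Positive", "Positive"]

final_prediction, votes = Counter(predictions).most_common(1)[0]
print(final_prediction, votes)   # Positive 4
```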

Another example is in question answering tasks, such as SQuAD.

We can train multiple sets of prompts on the SQuAD dataset and use them to generate multiple answers for a given question.

The ensemble prediction can be obtained by combining the answers generated by each prompt, either by voting or by taking the answer with the highest average confidence score.
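
For instance, taking the answer with the highest average confidence could look like the sketch below; the answer strings and scores are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (answer, confidence) pairs produced by five different prompts.
answers = [
    ("1969", 0.91), ("1969", 0.87), ("July 1969", 0.62),
    ("1969", 0.94), ("July 1969", 0.58),
]

scores = defaultdict(list)
for answer, confidence in answers:
    scores[answer].append(confidence)

best = max(scores, key=lambda a: sum(scores[a]) / len(scores[a]))
print(best)   # "1969" wins with the highest mean confidence
```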

Prompt ensembling is a technique that leverages the parameter efficiency of prompt tuning to create multiple task-specific models while sharing the same core language model.

This allows for improved performance, reduced storage costs, and efficient inference compared to traditional model ensembling.

Interpretability

Although the learned soft prompts are less interpretable than discrete text prompts, the authors find that the nearest neighbours of prompt tokens form semantic clusters, suggesting that the prompts learn "word-like" representations.
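
That nearest-neighbour probe can be sketched as a cosine-similarity lookup against the frozen vocabulary embeddings; the tensors below are random placeholders standing in for a trained prompt and a real embedding table.

```python
import torch
import torch.nn.functional as F

vocab_embeddings = torch.randn(32128, 512)   # frozen model's token embedding table (placeholder)
learned_prompt = torch.randn(20, 512)        # trained soft prompt of 20 tokens (placeholder)

# Cosine similarity between every prompt token and every vocabulary token.
prompt_norm = F.normalize(learned_prompt, dim=-1)
vocab_norm = F.normalize(vocab_embeddings, dim=-1)
similarity = prompt_norm @ vocab_norm.T      # (20, 32128)

# Five nearest vocabulary tokens per prompt token; map the ids back through the
# tokenizer to inspect whether they form semantic clusters.
top_scores, top_ids = similarity.topk(k=5, dim=-1)
print(top_ids.shape)                         # (20, 5)
```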

Summary

In summary, prompt tuning is a simple yet effective method for adapting large pre-trained language models to downstream tasks.

By learning continuous soft prompts through backpropagation, prompt tuning can match the performance of model tuning while being more parameter-efficient and enabling the reuse of a single frozen model for multiple tasks.

The effectiveness of prompt tuning increases with model scale, making it a promising approach for efficiently leveraging large language models in various applications.
