P-Tuning

The highly cited "GPT Understands, Too" paper, first submitted in March 2021, introduced P-Tuning.


This March 2021 paper introduced P-Tuning, a method that aims to improve and stabilise the performance of prompting in natural language tasks by using continuous prompt embeddings instead of discrete prompt tokens.

The main idea is to concatenate learnable continuous prompt embeddings with the input tokens and optimise them through backpropagation to achieve better task performance and reduce the instability caused by discrete prompts.

To add the continuous prompt tokens to the model, you modify the embedding layer of the Transformer to include the additional learnable embeddings. These embeddings are then concatenated with the input token embeddings before being passed through the self-attention layers.
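
A minimal sketch of this step in PyTorch (an illustration under assumptions, not the paper's exact code; the names num_virtual_tokens, hidden_size and build_inputs are invented for this example):

    import torch
    import torch.nn as nn

    num_virtual_tokens = 20   # number of continuous prompt embeddings
    hidden_size = 768         # must match the model's embedding width

    # Learnable prompt embeddings, randomly initialised
    prompt_embeddings = nn.Embedding(num_virtual_tokens, hidden_size)

    def build_inputs(input_ids, word_embedding_layer):
        batch_size = input_ids.size(0)
        # Embed the real input tokens as usual
        token_embeds = word_embedding_layer(input_ids)               # (B, T, H)
        # Look up every virtual token and repeat for the batch
        prompt_ids = torch.arange(num_virtual_tokens)
        prompt_embeds = prompt_embeddings(prompt_ids)                # (P, H)
        prompt_embeds = prompt_embeds.unsqueeze(0).expand(batch_size, -1, -1)
        # Concatenate: the continuous prompt precedes the input sequence
        return torch.cat([prompt_embeds, token_embeds], dim=1)      # (B, P+T, H)

The concatenated sequence can then be fed to the model through its embeddings interface (for example, the inputs_embeds argument of Hugging Face Transformers models), so the self-attention layers process the virtual tokens alongside the real ones.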

How does P-Tuning differ from traditional prompting?

In traditional prompting, you would use fixed, manually-created prompts to guide the language model to perform a specific task.

For example, if you want the model to answer a question about a country's capital, you might use a prompt like:

"The capital of [country] is [answer]."

Here, "[country]" and "[answer]" are placeholders that will be replaced with the actual country and the model's predicted answer, respectively.

However, creating these prompts manually can be time-consuming and may not always lead to the best performance on the task. This is where P-Tuning comes in.

Instead of using fixed, discrete prompts, P-Tuning introduces learnable, continuous prompt embeddings. These embeddings are like a set of "virtual" words that are learned during the training process.

They are called "continuous" because they are represented as real-valued vectors, as opposed to discrete tokens like words.
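
The distinction is easy to see in code (a tiny illustration; the token ids shown are hypothetical):

    import torch

    # A discrete prompt is a fixed sequence of vocabulary token ids
    discrete_prompt = [464, 3139, 286]      # hypothetical ids for "The capital of"
    # A continuous prompt is a trainable real-valued tensor with no vocabulary entry
    continuous_prompt = torch.randn(3, 768, requires_grad=True)   # 3 virtual tokens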

Here's a simplified step-by-step explanation of how P-Tuning works

  1. You define a prompt template that includes placeholders for the input (e.g., the question), the output (e.g., the answer), and the continuous prompt embeddings. These embeddings are randomly initialised at the beginning.

  2. The continuous prompt embeddings are added to the embedding layer of the Transformer model, along with the embeddings of the actual input tokens and output labels.

  3. An additional mapping function is used to map the continuous prompt embeddings to the hidden states of the model. This function can be a simple neural network such as a Long Short-Term Memory (LSTM) network or a Multilayer Perceptron (MLP).

  4. During training, the continuous prompt embeddings are updated based on the model's performance on the task. The model learns to adjust these embeddings to minimise the task-specific loss, just like it learns to adjust its other parameters.

  5. At inference time, the learned continuous prompt embeddings are combined with the input tokens and fed into the Transformer model to generate predictions (a minimal code sketch of these steps follows below).
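
A minimal training-loop sketch of these steps, assuming a Hugging Face-style causal language model that accepts inputs_embeds and reusing prompt_embeddings, num_virtual_tokens and build_inputs from the earlier sketch (model and dataloader are placeholders):

    import torch

    # Freeze the base model: only the prompt parameters receive gradients
    for p in model.parameters():
        p.requires_grad = False
    optimizer = torch.optim.AdamW(prompt_embeddings.parameters(), lr=3e-4)

    for input_ids, labels in dataloader:
        inputs_embeds = build_inputs(input_ids, model.get_input_embeddings())
        # Mask the virtual-token positions with the loss ignore index (-100)
        prompt_pad = torch.full((labels.size(0), num_virtual_tokens), -100)
        outputs = model(inputs_embeds=inputs_embeds,
                        labels=torch.cat([prompt_pad, labels], dim=1))
        outputs.loss.backward()   # gradients flow only into the prompt embeddings
        optimizer.step()
        optimizer.zero_grad()

Because every base-model parameter is frozen, gradients and optimiser state exist only for the small set of prompt parameters, which is what makes the method parameter-efficient.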

The key idea is that by learning these continuous prompt embeddings, the model can automatically discover the best "prompts" for the task during training, rather than relying on manually-created, fixed prompts.

This can lead to better performance and more flexibility in adapting the model to different tasks.

Definition: LSTM (Long Short-Term Memory) and MLP (Multilayer Perceptron)

LSTM (Long Short-Term Memory) and MLP (Multilayer Perceptron) are two types of neural network architectures that can be used as the mapping function in P-Tuning to transform the continuous prompt embeddings into the hidden states of the model.

LSTM (Long Short-Term Memory)

  • LSTM is a type of recurrent neural network (RNN) architecture designed to handle sequential data and capture long-term dependencies.

  • It consists of a unique cell state and multiple gating mechanisms (input gate, forget gate, and output gate) that regulate the flow of information in and out of the cell.

  • The cell state acts as a memory unit, allowing the LSTM to selectively remember or forget information over long sequences.

  • LSTMs are particularly effective in tasks involving sequential data, such as natural language processing, speech recognition, and time series analysis.

  • In the context of P-Tuning, an LSTM can be used to process the continuous prompt embeddings and generate hidden states that capture the contextual information and long-term dependencies within the prompts.

MLP (Multilayer Perceptron)

  • MLP is a feedforward neural network architecture consisting of multiple layers of interconnected nodes (neurons).

  • It has an input layer, one or more hidden layers, and an output layer.

  • Each neuron in an MLP applies a nonlinear activation function to a weighted sum of its inputs, allowing the network to learn complex nonlinear mappings between the input and output.

  • MLPs are versatile and can be used for a wide range of tasks, including classification, regression, and feature learning.

  • In the context of P-Tuning, an MLP can be used to transform the continuous prompt embeddings into hidden states by applying a series of linear transformations and nonlinear activations.

Both LSTM and MLP can be used as the mapping function in P-Tuning, depending on the specific requirements of the task and the nature of the prompt embeddings.

LSTMs are particularly suitable when the prompts have a sequential structure and capturing long-term dependencies is important.

MLPs, on the other hand, are simpler and more straightforward, making them a good choice when the prompts do not have a strong sequential nature or when computational efficiency is a priority.

The choice between LSTM and MLP as the mapping function in P-Tuning ultimately depends on the characteristics of the task, the complexity of the prompts, and the available computational resources.

Experimenting with both architectures and comparing their performance can help determine the most suitable choice for a given application.
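
Both options can be sketched as small PyTorch modules (a hedged illustration; the paper's actual encoder differs in detail, and hidden_size is assumed to be even for the bidirectional LSTM):

    import torch
    import torch.nn as nn

    class MLPPromptEncoder(nn.Module):
        """Position-wise MLP: simple and cheap, no sequential modelling."""
        def __init__(self, hidden_size):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, hidden_size),
            )

        def forward(self, prompt_embeds):        # (P, H)
            return self.mlp(prompt_embeds)

    class LSTMPromptEncoder(nn.Module):
        """Bidirectional LSTM: captures dependencies between virtual tokens."""
        def __init__(self, hidden_size):
            super().__init__()
            self.lstm = nn.LSTM(hidden_size, hidden_size // 2,
                                bidirectional=True, batch_first=True)
            self.head = nn.Linear(hidden_size, hidden_size)

        def forward(self, prompt_embeds):        # (P, H)
            out, _ = self.lstm(prompt_embeds.unsqueeze(0))  # add a batch dim
            return self.head(out.squeeze(0))

Whichever encoder is chosen, it is applied to the raw learnable embeddings before they are concatenated with the input token embeddings.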

What does "concatenate learnable continuous prompt embeddings" mean?

It means we are combining two types of embeddings:

Input token embeddings

These are the embeddings of the actual input tokens (words or subwords) that represent the text data we want to process. In the Transformer architecture, each input token is mapped to a dense vector representation (embedding) that captures its semantic meaning.

Learnable continuous prompt embeddings

These are additional embeddings that are not associated with any specific input token but are learned during the training process.

They are called "continuous" because they are represented as dense vectors in a continuous space, as opposed to discrete tokens. These embeddings serve as a "prompt" that guides the model to perform better on the specific task.

The process of concatenation involves joining these two types of embeddings together to form a single input sequence.

The key difference between using learnable continuous prompt embeddings and discrete prompts is that the continuous embeddings are optimised through backpropagation during training.

This means that the model can learn to adjust these embeddings based on the specific task and the training data, allowing for more flexibility and adaptability. In contrast, discrete prompts are fixed and cannot be optimised during training.

By optimising the continuous prompt embeddings through backpropagation, the model can learn to generate more informative and stable prompts, which can lead to better task performance and reduce the instability caused by manually-crafted discrete prompts.

An example of concatenation in P-Tuning

Let's break it down into a simple, everyday example to better understand the concept of concatenating input token embeddings and learnable continuous prompt embeddings.

Imagine you're planning a trip and have a list of essential items you need to pack:

  • Toothbrush

  • Toothpaste

  • Shampoo

  • Conditioner

  • Clothes

These items are like the input token embeddings - they are the basic elements you need for your trip.

Now, to make your trip more organised and enjoyable, you decide to add some additional items to your list:

  • Travel-sized toothbrush case

  • Travel-sized toothpaste tube

  • Travel-sized shampoo bottle

  • Travel-sized conditioner bottle

  • Laundry bag for dirty clothes

These additional items are like the learnable continuous prompt embeddings - they enhance and support the basic elements of your trip.

These embeddings are not tied to specific words but are learned during the training process to guide the language model in generating relevant and coherent packing lists.

The process of concatenation is like combining these two lists into a single, comprehensive packing list:

[Travel-sized toothbrush case, Toothbrush, Travel-sized toothpaste tube, Toothpaste, Travel-sized shampoo bottle, Shampoo, Travel-sized conditioner bottle, Conditioner, Laundry bag for dirty clothes, Clothes]

By concatenating the additional items with the essential items, you create a single, organised list that helps you better prepare for your trip.

The process

The concatenated input sequence, containing both the input token embeddings and the learnable continuous prompt embeddings, is then fed into the language model.

The language model processes this unified input sequence and learns to generate coherent and relevant travel packing lists based on the provided context and prompts.

By concatenating the learnable continuous prompt embeddings with the input token embeddings, P-Tuning allows the language model to leverage both the semantic information from the actual input tokens and the guiding information from the learned prompts.

This concatenation helps the model generate more accurate and context-aware outputs for the specific task at hand.

Diagram of Concept from the Paper

Figure (from the paper): an example of prompt search for "The capital of Britain is [MASK]". Given the context (blue zone, "Britain") and target (red zone, "[MASK]"), the orange zone refers to the prompt. In (a), the prompt generator only receives discrete rewards; on the contrary, in (b) the continuous prompt embeddings and prompt encoder can be optimised in a differentiable way.

The key advantages of P-Tuning include

Improved performance: By learning optimal prompt embeddings during training, P-Tuning enables language models to achieve better results on a wide range of natural language understanding tasks.

Increased flexibility: P-Tuning allows language models to adapt more effectively to different tasks and domains by learning task-specific prompts, reducing the need for extensive fine-tuning or manual prompt engineering.

Enhanced interpretability: The learned continuous prompt embeddings provide insights into the language model's behaviour and the important aspects of the task, making the model's decisions more interpretable and explainable.

Efficient adaptation: P-Tuning offers a more efficient way to adapt language models to new tasks, as it focuses on learning prompts rather than modifying the entire model architecture or weights.
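
For practical use, Hugging Face's peft library includes a P-Tuning implementation via PromptEncoderConfig; a brief usage sketch (the base model and hyperparameter values here are illustrative, not prescriptive):

    from transformers import AutoModelForCausalLM
    from peft import get_peft_model, PromptEncoderConfig

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    peft_config = PromptEncoderConfig(
        task_type="CAUSAL_LM",
        num_virtual_tokens=20,                  # number of continuous prompts
        encoder_reparameterization_type="MLP",  # or "LSTM"
        encoder_hidden_size=128,
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()  # only the prompt encoder is trainable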

Reference: "GPT Understands, Too", Xiao Liu et al., arXiv.org

See also: "P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks" by Xiao Liu et al.