Extending the context window

The paper introduces Position Interpolation (PI) as a method to extend the context window sizes of RoPE-based pretrained Large Language Models (LLMs) like LLaMA.

This technique allows these models to handle significantly longer text sequences (up to 32,768 tokens) with minimal fine-tuning, showing strong performance on tasks requiring long contexts such as passkey retrieval, language modeling, and long document summarisation.

LLMs have a predefined context window size, often limiting their applicability in scenarios requiring longer text analysis. Traditional methods to extend these windows involve extensive fine-tuning, which is resource-intensive and often ineffective.

Position Interpolation Method

Unlike extrapolation methods that can lead to unstable attention scores, PI scales down input position indices to fit within the original pre-trained context window. This method maintains the stability of the self-attention mechanism and allows the LLM to handle longer sequences without significant architectural changes or extensive retraining.
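
Concretely, if f(x, m) denotes the RoPE encoding of an embedding x at position index m, Position Interpolation can be written compactly as a rescaling of the position index by the ratio of the original context length L to the extended length L':

$$
f'(\mathbf{x}, m) = f\!\left(\mathbf{x}, \frac{mL}{L'}\right)
$$

For LLaMA, L = 2,048 and L' is the target window (for example 8,192 or 32,768), so every rescaled index stays inside the range seen during pre-training.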

Position Interpolation vs. Extrapolation

  • Extrapolation feeds the model position indices beyond the range it saw during pre-training, in effect asking it to stretch its existing knowledge to cover new, unseen positions. This can lead to unstable or inaccurate attention scores, because the model is guessing at encodings it has never been trained on.

  • Position Interpolation, on the other hand, is the process introduced in this paper. Instead of guessing beyond known data, it compresses or scales down larger inputs to fit within the model's original context window. Imagine trying to fit a long sentence into a small box by slightly reducing the size of each word rather than guessing what words might fit at the end of the sentence if the box were bigger.

How Position Interpolation Works

If you have more text than the model can handle (say 4,096 tokens against a 2,048-token window), position interpolation rescales the position indices of those tokens so they all fall within the original 2,048-position range, letting the model process the longer text without ever seeing a position index outside its training range. It's like zooming out on a picture to see more of the scene within the same frame.
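
As a rough illustration (a toy sketch, not the authors' code; the helper function is an assumption, and the 2,048 and 4,096 figures mirror the example above), this is what rescaling position indices looks like for a RoPE-style encoding:

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int = 128, base: float = 10000.0) -> torch.Tensor:
    """Rotary-embedding angles for (possibly fractional) position indices."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(positions.float(), inv_freq)   # shape: (seq_len, dim // 2)

orig_window = 2048   # context window the model was pre-trained with
seq_len = 4096       # longer input we want to process

positions = torch.arange(seq_len)

# Extrapolation: raw indices 0..4095, half of which were never seen in training.
angles_extrapolated = rope_angles(positions)

# Position Interpolation: rescale indices into [0, 2048), so the model only ever
# sees positions inside its pre-trained range, just at finer granularity.
scale = orig_window / seq_len                         # 0.5 here
angles_interpolated = rope_angles(positions * scale)

print(angles_interpolated.shape)                      # torch.Size([4096, 64])
```

RoPE accepts fractional position indices naturally, because the angles are continuous functions of the index; that is what makes this rescaling possible without any architectural changes.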

Theoretical Foundation

The paper presents a theoretical analysis showing that the upper bound of the interpolated attention score is substantially smaller than that of extrapolation, which supports the stability and effectiveness of the PI method.

Empirical Validation

The researchers demonstrate that, using PI, they can extend the context window of LLaMA models to 32,768 tokens with only around 1,000 steps of fine-tuning, a small cost compared with the expense of pre-training.

Results

Models extended via PI not only perform well in tasks requiring long contexts but also maintain their performance on tasks within the original context window size. This demonstrates that PI does not compromise the model's original capabilities while extending its applicability to longer texts.

Application and Performance

The extended models show significant gains in tasks like language modeling and text summarization, leveraging the extended context windows to improve performance.

Preservation of Original Quality

Despite the significant extension of the context window, the models preserve their quality on standard benchmarks within the original context limits, indicating the method's reliability.

In practice, this advancement means that users can employ LLMs for a broader range of applications involving longer text sequences without the need for extensive retraining or compromising the model's original performance, making LLMs more versatile and efficient in handling diverse NLP tasks.

How were the experiments performed?

In the experiments section of the paper, the authors demonstrate how Position Interpolation (PI) can extend the context window of pre-trained Large Language Models (LLMs) like LLaMA to 16 times its original size, from 2,048 to 32,768 tokens, with only a modest amount of fine-tuning (on the order of a thousand steps).

They highlight the effectiveness and efficiency of this method in enhancing the model's performance on various NLP tasks.

Model Variants

The authors applied their method to different variants of the LLaMA model (7B, 13B, 33B, and 65B), extending their context window sizes up to 32,768. They compared the performance of models extended using Position Interpolation with those extended through direct fine-tuning.

Training Procedure

They fine-tuned all model variants using the next token prediction objective, a common approach in language modeling. They used the AdamW optimizer with specific hyperparameters (like learning rate and weight decay) and employed a linear learning rate warm-up strategy.
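
A minimal sketch of that optimizer setup in PyTorch is shown below; the specific learning rate, betas and warm-up length are illustrative assumptions rather than the paper's exact reported values, and the linear module is just a stand-in for the LLaMA variant:

```python
import torch

model = torch.nn.Linear(4096, 4096)      # stand-in for the LLaMA variant being fine-tuned

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,                             # assumed value for illustration
    betas=(0.9, 0.95),                   # assumed value for illustration
    weight_decay=0.0,
)

warmup_steps = 20                        # assumed warm-up length

def lr_lambda(step: int) -> float:
    # Linear warm-up to the base learning rate, then hold it constant.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop (next-token prediction loss on the long-context data):
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```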

Computational Resources

The number of GPUs and the global batch size varied depending on the model size and the target context window size. They used PyTorch for training, along with Fully Sharded Data Parallel and FlashAttention to manage memory efficiency and training speed.
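
A bare-bones sketch of that setup with PyTorch FSDP is below (launched via torchrun); the wrapped module is a stand-in, and FlashAttention would normally be enabled separately inside the model's attention implementation:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Stand-in module; in practice this would be the LLaMA variant, with FlashAttention
# kernels used inside its attention layers.
model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=8)

# FSDP shards parameters, gradients and optimizer state across GPUs, which is what
# makes fine-tuning the larger variants at long sequence lengths feasible.
model = FSDP(model, device_id=torch.cuda.current_device())
```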

Fine-tuning Steps

For models extended with Position Interpolation, they fine-tuned for 1,000 steps, which is relatively short, indicating the efficiency of the method. For direct fine-tuning, they used 10,000 steps, highlighting the more intensive training required without Position Interpolation.

Datasets

The primary dataset for fine-tuning was the Pile, with additional comparisons using the RedPajama dataset. Both were used to adapt the models to handle longer context windows effectively.

Results

The extended models showed strong performance on tasks like language modeling, passkey retrieval, and long document summarisation.

Furthermore, the models extended using Position Interpolation maintained their performance on the original LLaMA evaluation benchmarks, indicating that the method preserves model quality while significantly expanding its capabilities.

Overall, the experiments demonstrate the potential of Position Interpolation to efficiently extend the context window of LLMs, enabling them to handle longer sequences with minimal additional training, thereby enhancing their applicability to a broader range of tasks.

Evaluation

The experiment evaluates the language modeling capabilities of LLaMA models extended with Position Interpolation on two datasets: the PG-19 book corpus and the cleaned ArXiv math proof-pile dataset.

Here's a detailed breakdown of the findings and the methodology:

Datasets and Preparation: The researchers used the test splits of PG-19 and the proof-pile dataset, ensuring the evaluation documents were long enough (32,768 tokens or more) to cover the largest context window sizes.

Perplexity Evaluation: Perplexity, a measure of model performance in language modeling, was assessed at various context window sizes. A sliding window approach was used for this evaluation, allowing the researchers to observe how well the models perform as the context window increases.
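
The sliding-window idea can be sketched roughly as follows (a generic recipe assuming a Hugging Face-style causal LM that accepts labels, not the paper's evaluation script):

```python
import math
import torch

def sliding_window_perplexity(model, token_ids: torch.Tensor, window: int, stride: int) -> float:
    """Approximate perplexity of `token_ids` (shape (1, seq_len)) with a sliding window.

    Each window scores only its newest tokens (earlier ones are masked with -100),
    so every token is evaluated once with up to `window` tokens of left context.
    """
    seq_len = token_ids.size(1)
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        trg_len = end - prev_end                 # new tokens scored in this window
        input_ids = token_ids[:, begin:end]
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100          # ignore tokens scored previously
        with torch.no_grad():
            loss = model(input_ids, labels=target_ids).loss
        nll_sum += loss.item() * trg_len         # close approximation of summed NLL
        n_scored += trg_len
        prev_end = end
        if end == seq_len:
            break
    return math.exp(nll_sum / n_scored)

# e.g. sliding_window_perplexity(model, ids, window=8192, stride=8192) to probe an
# extended window, versus window=2048 for the original one.
```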

Results Overview: Models extended with Position Interpolation showed significant improvements in perplexity, especially as the context window size increased. This indicates that the models could effectively utilize the longer context to improve language modeling performance.

Comparative Analysis: When comparing models extended with Position Interpolation to those extended via direct fine-tuning, the former outperformed the latter, particularly at longer context window sizes. This suggests that Position Interpolation is more effective in leveraging extended context windows.

Minor Performance Degradation: Some degradation in performance was observed for the extended models when evaluated within the original context window size. This was expected: Position Interpolation compresses the position indices into a narrower, denser range, which can slightly affect behaviour on sequences that fit the original window.

Fine-Tuning Impact: Without any fine-tuning, the models already demonstrated some language modeling capability at extended context sizes. However, after a minimal number of fine-tuning steps (around 200), the models exceeded the performance of the original models at the 2048 context window size. This rapid improvement underscores the efficiency of Position Interpolation in adapting the models to longer contexts.

Detailed Results: The reported figures show a clear trend: models fine-tuned with Position Interpolation consistently achieve lower perplexity as the context window size increases, highlighting the method's ability to leverage longer contexts effectively.

In summary, the experiments validate that Position Interpolation is an effective and efficient method to extend the context window size of LLaMA models, enhancing their language modeling capabilities over longer sequences without requiring extensive fine-tuning.

Related Work

The related work section discusses various approaches that extend the capabilities of large language models (LLMs) and how the current work complements or differs from these methods:

Retrieval-Augmented LLMs: This line of research involves enhancing LLMs with retrieval modules that fetch related documents to include in the LLM's input context, improving the model's performance by providing it with additional relevant information. The current work is complementary to these methods as the extended context window allows for more documents to be included in the input, offering broader applicability beyond just retrieval-oriented tasks.

Recurrent and Memory Transformers: These works add memory capabilities to Transformers, allowing them to handle longer sequences by attending to a compressed version of past inputs. However, this compression may result in loss of specific details. In contrast, the current work enables attending to all previous tokens without any loss of detail, although it may incur higher inference costs.

Approximated Multi-Head Attention: Research in this area focuses on reducing the computational and memory complexity of the multi-head attention mechanism through various approximation or sparsification techniques. While not directly related to the current paper's focus, the authors note that their method is compatible with these approaches since their changes are limited to position encodings.

Length Extrapolation: Some recent studies aim to train Transformers on short sequences and apply them to longer ones. However, these methods have not been applied to some of the largest models like LLaMA, limiting their ability to extend the context window of these pre-trained models. The current work focuses on extending existing LLMs to save on pre-training costs while preserving the original model's quality.

Interpolation in Vision Transformers: A technique proposed by Dosovitskiy et al. interpolates learned position embeddings to support higher input resolutions. That idea inspired the current work, which instead interpolates position indices, a more suitable approach for RoPE-like encodings. The current research extends the context window to 16 times its original size, well beyond the roughly 4-times extension explored for Vision Transformers, and demonstrates the method's effectiveness for language models, hinting at the Transformer's capability to handle much longer sequences than those encountered during training.
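
To make the contrast concrete, here is a toy 1D sketch of the two ideas (arbitrary dimensions, not code from either paper): interpolating a learned position-embedding table, as in the Vision Transformer work, versus rescaling position indices, as in Position Interpolation:

```python
import torch
import torch.nn.functional as F

# ViT-style: interpolate the learned position-embedding table to a new length
# (a 1D analogue of the 2D interpolation used for higher image resolutions).
pos_emb = torch.randn(1, 2048, 768)                  # (batch, positions, dim), stand-in weights
pos_emb_4096 = F.interpolate(
    pos_emb.transpose(1, 2), size=4096, mode="linear", align_corners=False
).transpose(1, 2)                                    # -> (1, 4096, 768)

# PI-style: the rotary encoding itself is untouched; only the position indices
# fed into it are rescaled back into the pre-trained range.
positions_4096 = torch.arange(4096) * (2048 / 4096)  # fractional indices in [0, 2048)
```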

In summary, this work builds upon and extends existing methods by offering a novel approach to extend the context window of LLMs through position interpolation, enabling more effective handling of longer sequences and preserving the quality of the original models.

Paper: Extending Context Window of Large Language Models via Positional Interpolation (arXiv.org)