# Extending the context window

The paper introduces Position Interpolation (PI) as a method to <mark style="color:green;">extend the context window sizes</mark> of RoPE-based pretrained Large Language Models (LLMs) like LLaMA.

This technique allows these models to *<mark style="color:yellow;">handle significantly longer text sequences</mark>* (up to 32,768 tokens) with minimal fine-tuning, showing strong performance on tasks requiring long contexts such as passkey retrieval, language modeling, and long document summarisation.

{% embed url="https://arxiv.org/abs/2306.15595" %}

LLMs have a predefined context window size, often limiting their applicability in scenarios requiring longer text analysis.  Traditional methods to extend these windows involve extensive fine-tuning, which is resource-intensive and often ineffective.

<mark style="color:green;">**Position Interpolation Method**</mark>

Unlike extrapolation methods that can lead to unstable attention scores, PI scales down input position indices to fit within the original pre-trained context window. This method maintains the stability of the self-attention mechanism and allows the LLM to handle longer sequences without significant architectural changes or extensive retraining.
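In the paper's notation, with RoPE encoding $f$, original context length $L$, and extended context length $L' > L$, PI simply rescales every position index before encoding:

```latex
f'(\mathbf{x}, m) = f\!\left(\mathbf{x}, \frac{mL}{L'}\right), \qquad 0 \le m < L'
```

so every rescaled index $mL/L'$ falls inside the pre-trained range $[0, L)$, and the model never encounters an out-of-distribution position.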

<mark style="color:green;">**Position Interpolation vs. Extrapolation**</mark>

* <mark style="color:blue;">Extrapolation</mark> involves <mark style="color:yellow;">stretching the model's existing knowledge to cover new, unseen data points</mark>. This can lead to unstable or inaccurate results because *<mark style="color:yellow;">**the model is guessing based on its existing knowledge.**</mark>*
* <mark style="color:blue;">Position Interpolation</mark>, on the other hand, is the process introduced in this paper.  Instead of guessing beyond known data, it *<mark style="color:yellow;">**compresses or scales down larger inputs to fit within the model's original context window**</mark>*.  Imagine trying to fit a long sentence into a small box by slightly reducing the size of each word rather than guessing what words might fit at the end of the sentence if the box were bigger.

<mark style="color:green;">**How Position Interpolation Works**</mark>

If you have more text than the model can handle (say 4,096 tokens against a 2,048-token window), position interpolation rescales the position indices of those tokens so they fit within the original 2,048-token limit, allowing the model to process longer texts without ever seeing a position beyond its training range. It's like zooming out on a picture to see more of the scene within the same frame.
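The rescaling above can be sketched in a few lines. This is a minimal illustration of interpolating position indices before computing RoPE rotation angles; the function and argument names are illustrative, not taken from the paper's code:

```python
import math

def interpolated_rope_angles(position, dim, orig_ctx=2048, new_ctx=4096, base=10000.0):
    """RoPE rotation angles for one token position, with the index linearly
    rescaled (Position Interpolation) so that any index from the extended
    window [0, new_ctx) lands inside the pre-trained range [0, orig_ctx)."""
    scaled_pos = position * orig_ctx / new_ctx   # e.g. index 4095 -> 2047.5
    # Standard RoPE inverse frequencies, one per pair of embedding dimensions.
    return [scaled_pos / (base ** (2 * i / dim)) for i in range(dim // 2)]

# With a 2x extension, even the largest extended index maps below 2048,
# so attention scores stay within the distribution seen during pre-training.
angles = interpolated_rope_angles(position=4095, dim=8)
```

The key design point is that only the position indices change; the model weights and the RoPE formula itself are untouched, which is why so little fine-tuning is needed afterwards.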

<mark style="color:green;">**Theoretical Foundation**</mark>

The paper presents a theoretical analysis showing that the upper bound of the interpolated attention score is substantially smaller than that of extrapolation, which supports the stability and effectiveness of the PI method.

<mark style="color:green;">**Empirical Validation**</mark>

The researchers demonstrate that using PI, they can extend the context window of LLaMA models up to 32,768 tokens with only around 1,000 steps of fine-tuning. This process is shown to be cost-effective and efficient compared to the pre-training expenses.

<mark style="color:green;">**Results**</mark>

Models extended via PI not only perform well in tasks requiring long contexts but also maintain their performance on tasks within the original context window size. This demonstrates that PI does not compromise the model's original capabilities while extending its applicability to longer texts.

<mark style="color:green;">**Application and Performance**</mark>

The extended models show significant gains in tasks like language modeling and text summarisation, leveraging the extended context windows to improve performance.

<mark style="color:green;">**Preservation of Original Quality**</mark>

Despite the significant extension of the context window, the models preserve their quality on standard benchmarks within the original context limits, indicating the method's reliability.

In practice, this advancement means that users can employ LLMs for a broader range of applications involving longer text sequences, without extensive retraining and without compromising the model's original performance. This makes LLMs more versatile and efficient across diverse NLP tasks.

### <mark style="color:purple;">How were the experiments performed?</mark>

In the experiments section of the paper, the authors demonstrate how <mark style="color:yellow;">Position Interpolation (PI) can significantly extend the context window size</mark> of pre-trained Large Language Models (LLMs) like LLaMA, up to 16 times the original size (from 2,048 to 32,768 tokens), *<mark style="color:yellow;">**with only around 1,000 training steps.**</mark>*

They highlight the effectiveness and efficiency of this method in enhancing the model's performance on various NLP tasks.

<mark style="color:green;">**Model Variants**</mark>

The authors applied their method to <mark style="color:yellow;">different variants of the LLaMA model</mark> (7B, 13B, 33B, and 65B), extending their context window sizes up to 32,768.  They compared the performance of models extended using Position Interpolation with those extended through direct fine-tuning.

<mark style="color:green;">**Training Procedure**</mark>

They <mark style="color:yellow;">fine-tuned all model variants using the next token prediction objective</mark>, a common approach in language modeling. They used the <mark style="color:green;">AdamW optimizer</mark> with specific hyperparameters (like learning rate and weight decay) and employed a linear learning rate warm-up strategy.
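A linear warm-up schedule of this kind can be sketched as follows. The specific values (`base_lr`, `warmup_steps`) are assumptions for illustration, not necessarily the paper's hyperparameters:

```python
def linear_warmup_lr(step, base_lr=2e-5, warmup_steps=20):
    """Linearly ramp the learning rate from ~0 up to base_lr over the first
    warmup_steps, then hold it constant; typically queried once per step
    before calling the (AdamW) optimizer. Values here are illustrative."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Learning rate over the first 40 steps: rises linearly, then plateaus.
lrs = [linear_warmup_lr(s) for s in range(40)]
```

Warm-up of this form is a common safeguard when fine-tuning large pretrained models, preventing large early updates from disrupting the pretrained weights.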

<mark style="color:green;">**Computational Resources**</mark>

The number of GPUs and the global batch size varied depending on the model size and the target context window size.  They used PyTorch for training, along with <mark style="color:green;">Fully Sharded Data Parallel</mark> and <mark style="color:green;">FlashAttention</mark> to manage memory efficiency and training speed.

<mark style="color:green;">**Fine-tuning Steps**</mark>

For models extended with Position Interpolation, they fine-tuned for 1,000 steps, which is relatively short, indicating the efficiency of the method.  For direct fine-tuning, they used 10,000 steps, highlighting the more intensive training required without Position Interpolation.

<mark style="color:green;">**Datasets**</mark>

The <mark style="color:yellow;">primary dataset for fine-tuning was the Pile dataset</mark>, with additional comparisons using the <mark style="color:yellow;">RedPajama dataset</mark>. These datasets are used to adapt the models to handle longer context windows effectively.

<mark style="color:green;">**Results**</mark>

The extended models showed strong performance on tasks like language modeling, passkey retrieval, and long document summarisation.

Furthermore, the models extended using Position Interpolation maintained their performance on the original LLaMA evaluation benchmarks, indicating that the method preserves model quality while significantly expanding its capabilities.

Overall, the experiments demonstrate the potential of Position Interpolation to efficiently extend the context window of LLMs, enabling them to handle longer sequences with minimal additional training, thereby enhancing their applicability to a broader range of tasks.

### <mark style="color:purple;">Evaluation</mark>

The experiment evaluates the language modeling capabilities of extended LLaMA models using Position Interpolation on two datasets: the book corpus (PG-19) and the cleaned Arxiv Math proof-pile dataset.

Here's a detailed breakdown of the findings and the methodology:

<mark style="color:green;">**Datasets and Preparation**</mark><mark style="color:green;">:</mark> The researchers used the test splits of PG-19 and the proof-pile dataset, ensuring the documents had a sufficient number of tokens (up to 32,768) for the evaluation.

<mark style="color:green;">**Perplexity Evaluation**</mark><mark style="color:green;">:</mark> Perplexity, a measure of model performance in language modeling, was assessed at various context window sizes.  A sliding window approach was used for this evaluation, allowing the researchers to observe how well the models perform as the context window increases.
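A sliding-window perplexity evaluation can be sketched as follows. This is a simplified illustration, not the authors' evaluation code: each token is scored conditioned on at most its preceding `window - 1` tokens, and perplexity is the exponentiated mean negative log-likelihood:

```python
import math

def sliding_window_perplexity(logprob_fn, tokens, window=2048):
    """Score each token given at most (window - 1) preceding tokens, then
    return exp of the mean negative log-likelihood. `logprob_fn(context,
    token)` stands in for a model call returning log P(token | context);
    the first token is skipped since it has no context."""
    nll = 0.0
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - window + 1):i]
        nll -= logprob_fn(context, tokens[i])
    return math.exp(nll / (len(tokens) - 1))

# Toy "model" assigning probability 1/4 to every token: perplexity is
# exactly 4, regardless of the window size.
ppl = sliding_window_perplexity(lambda ctx, t: math.log(0.25),
                                list(range(100)), window=8)
```

Varying `window` is what lets the evaluation show whether a model actually benefits from longer context: a model that exploits long-range dependencies should achieve lower perplexity as the window grows.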

<mark style="color:green;">**Results Overview**</mark><mark style="color:green;">:</mark> Models extended with Position Interpolation showed significant improvements in perplexity, especially as the context window size increased. This indicates that the models could effectively utilize the longer context to improve language modeling performance.

<mark style="color:green;">**Comparative Analysis**</mark><mark style="color:green;">:</mark> When comparing models extended with Position Interpolation to those extended via direct fine-tuning, the former outperformed the latter, particularly at longer context window sizes. This suggests that *<mark style="color:yellow;">**Position Interpolation is more effective in leveraging extended context windows**</mark>*.

<mark style="color:green;">**Minor Performance Degradation**</mark><mark style="color:green;">:</mark> Some degradation in performance was observed for extended models within the original context window size. This was expected due to the narrowing of position encoding regions through Position Interpolation, which might have slightly impacted performance.

<mark style="color:green;">**Fine-Tuning Impact**</mark><mark style="color:green;">:</mark> Without any fine-tuning, the models already demonstrated some language modeling capability at extended context sizes. However, after a minimal number of fine-tuning steps (around 200), the models exceeded the performance of the original models at the 2048 context window size. This rapid improvement underscores the efficiency of Position Interpolation in adapting the models to longer contexts.

<mark style="color:green;">**Detailed Results**</mark><mark style="color:green;">:</mark> The tables provided show a clear trend where models fine-tuned with Position Interpolation consistently achieve lower perplexity scores as the context window size increases, highlighting the method's ability to effectively leverage longer contexts.

In summary, the experiments validate that Position Interpolation is an effective and efficient method to extend the context window size of LLaMA models, enhancing their language modeling capabilities over longer sequences without requiring extensive fine-tuning.

### <mark style="color:purple;">Related Work</mark>

The related work section discusses various approaches that extend the capabilities of large language models (LLMs) and how the current work complements or differs from these methods:

<mark style="color:green;">**Retrieval-Augmented LLMs**</mark><mark style="color:green;">:</mark> This line of research involves enhancing LLMs with retrieval modules that fetch related documents to include in the LLM's input context, improving the model's performance by providing it with additional relevant information. *<mark style="color:yellow;">**The current work is complementary to these methods as the extended context window allows for more documents to be included in the input**</mark>*, offering broader applicability beyond just retrieval-oriented tasks.

<mark style="color:green;">**Recurrent and Memory Transformers**</mark><mark style="color:green;">:</mark> These works add memory capabilities to Transformers, allowing them to handle longer sequences by attending to a compressed version of past inputs. However, this compression may result in loss of specific details. In contrast, the current work enables attending to all previous tokens without any loss of detail, although it may incur higher inference costs.

<mark style="color:green;">**Approximated Multi-Head Attention**</mark><mark style="color:green;">:</mark> Research in this area focuses on reducing the computational and memory complexity of the multi-head attention mechanism through various approximation or sparsification techniques. While not directly related to the current paper's focus, the authors note that their method is compatible with these approaches since their changes are limited to position encodings.

<mark style="color:green;">**Length Extrapolation**</mark><mark style="color:green;">:</mark> Some recent studies aim to train Transformers on short sequences and apply them to longer ones. However, these methods have not been applied to some of the largest models like LLaMA, limiting their ability to extend the context window of these pre-trained models. The current work focuses on extending existing LLMs to save on pre-training costs while preserving the original model's quality.

<mark style="color:green;">**Interpolation in Vision Transformers**</mark><mark style="color:green;">:</mark> Dosovitskiy et al. proposed interpolating learned position embeddings to support higher input resolutions, and this technique inspired the current work, which instead interpolates position indices, an approach better suited to RoPE-like encodings. Whereas Dosovitskiy et al. explored extensions of up to 4 times, the current research extends the context window up to 16 times the original size (2,048 to 32,768 tokens). This demonstrates the method's effectiveness for language models and hints at the Transformer's capability to handle much longer sequences than encountered during training.

In summary, this work builds upon and extends existing methods by offering a novel approach to extend the context window of LLMs through position interpolation, enabling more effective handling of longer sequences and preserving the quality of the original models.
