YaRN: Efficient Context Window Extension of Large Language Models
Nous Research, EleutherAI, University of Geneva
In this November 2023 paper, the authors present YaRN (Yet another RoPE extensioN method), a computationally efficient approach to extend the context window of transformer-based language models that employ Rotary Position Embeddings (RoPE) for positional encoding.
As you may know, the context window, which determines the maximum sequence length a model can process, is a critical limitation in large language models.
While some positional encoding schemes like ALiBi allow for limited generalisation beyond the pre-trained context window, most models struggle to effectively handle sequences significantly longer than what they were trained on.
Previous works have attempted to address this limitation by modifying RoPE, most notably Position Interpolation (PI) combined with fine-tuning on small amounts of data. Building on that, "NTK-aware" interpolation and its refinements, "Dynamic NTK" and "NTK-by-parts," have been proposed to extend the context window, some targeting models extended at inference time without any fine-tuning and others targeting fine-tuned models.
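To make the distinction concrete, here is a minimal sketch (NumPy, with illustrative function names; not the authors' code) of how PI and "NTK-aware" interpolation modify the RoPE frequencies for a scale factor `scale`:

```python
import numpy as np

def rope_inv_freq(dim: int, base: float = 10000.0) -> np.ndarray:
    # Standard RoPE inverse frequencies: theta_d = base^(-2d/|D|) for d = 0, 1, ...
    return base ** (-np.arange(0, dim, 2) / dim)

def pi_inv_freq(dim: int, scale: float, base: float = 10000.0) -> np.ndarray:
    # Position Interpolation: compress positions by 1/scale, which is equivalent
    # to dividing every RoPE frequency uniformly by the scale factor.
    return rope_inv_freq(dim, base) / scale

def ntk_aware_inv_freq(dim: int, scale: float, base: float = 10000.0) -> np.ndarray:
    # "NTK-aware" interpolation: keep the RoPE formula but enlarge the base,
    # so low frequencies are interpolated more strongly than high frequencies.
    return rope_inv_freq(dim, base * scale ** (dim / (dim - 2)))
```

"Dynamic NTK" applies the same base adjustment but recomputes the scale factor from the current sequence length at inference time, an idea revisited below.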
The authors of this paper introduce YaRN, which builds upon these previous works and offers a more efficient solution. YaRN achieves state-of-the-art performance in context window extension while requiring only a small fraction (∼0.1%) of the original pre-training data for fine-tuning.
This is a significant improvement over earlier methods: the authors report that YaRN needs roughly 10x fewer tokens and 2.5x fewer training steps than previous context-extension approaches.
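For intuition, the following sketch (again NumPy, illustrative names; the default α = 1, β = 32, and the 4096-token original context are taken as assumptions) shows YaRN's two ingredients: the "NTK-by-parts" per-dimension interpolation of RoPE frequencies and the attention "temperature" correction tied to the scale factor:

```python
import numpy as np

def yarn_inv_freq(dim, scale, orig_ctx_len=4096, base=10000.0, alpha=1.0, beta=32.0):
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    # r(d): number of full rotations dimension d completes over the original context,
    # i.e. the original context length divided by the dimension's wavelength.
    r = orig_ctx_len * inv_freq / (2 * np.pi)
    # Ramp from 0 (interpolate fully, low-frequency dims) to 1 (leave untouched,
    # high-frequency dims) between alpha and beta rotations.
    gamma = np.clip((r - alpha) / (beta - alpha), 0.0, 1.0)
    return (1.0 - gamma) * inv_freq / scale + gamma * inv_freq

def yarn_attention_scale(scale):
    # The attention "temperature" fix: scale q and k by sqrt(1/t), using the
    # paper's recommended fit sqrt(1/t) = 0.1 * ln(s) + 1.
    return 0.1 * np.log(scale) + 1.0 if scale > 1.0 else 1.0
```

The per-dimension ramp is what distinguishes YaRN from PI's uniform stretching: dimensions that already rotate many times within the original context are left alone, while slow-rotating dimensions are fully interpolated.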
Furthermore, the authors demonstrate that by combining YaRN with an inference-time technique called Dynamic Scaling, the resulting variant, Dynamic-YaRN, can extend the context window by more than 2x without any fine-tuning at all.
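A rough illustration of that inference-time idea (hypothetical helper names, building on the sketch above): Dynamic Scaling recomputes the scale factor from the sequence length seen so far at every forward pass, rather than fixing it in advance:

```python
def dynamic_scale(current_len: int, orig_ctx_len: int = 4096) -> float:
    # Dynamic Scaling: no interpolation while the sequence still fits the original
    # window, then grow the scale factor smoothly as the sequence gets longer.
    return max(1.0, current_len / orig_ctx_len)

# Dynamic-YaRN: recompute the RoPE frequencies with the current scale at each step.
# inv_freq = yarn_inv_freq(dim=128, scale=dynamic_scale(seq_len))
```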
The effectiveness of YaRN is validated through experiments on the LLaMA family of models; RoPE itself is the positional encoding used by popular models such as LLaMA, GPT-NeoX, and PaLM. The authors also provide a comprehensive account of the previously unpublished "NTK-aware," "Dynamic NTK," and "NTK-by-parts" interpolation methods.
In summary, YaRN presents a significant advancement in extending the context window of transformer-based language models, offering a more efficient and effective solution compared to previous approaches. The ability to handle longer sequences without substantial additional training is a valuable contribution to the field of natural language processing and could have wide-ranging implications for various applications.