Sliding Window Attention
Iz Beltagy, Matthew E. Peters, and Arman Cohan
This December 2020 paper introduces sliding window attention as part of the Longformer architecture for processing long documents.
The authors highlight the limitations of the standard Transformer architecture, particularly its inability to process long sequences due to the quadratic complexity of self-attention with respect to the sequence length.
This limitation makes it computationally infeasible to apply Transformers to tasks involving long documents, such as long document classification, question answering (QA), and coreference resolution. Existing approaches often resort to shortening or chunking the long context, which can result in loss of important cross-chunk information.
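For a rough sense of scale, the sketch below compares the number of attention scores a single head computes under full self-attention versus a sliding window. The sequence length and window size are illustrative values chosen for a typical long-document setting, not results from the paper.

```python
# Back-of-the-envelope comparison of attention-score counts per head.
# n and w below are illustrative values.
n = 4096   # sequence length in tokens
w = 512    # sliding-window size (total tokens each position attends to)

full_scores = n * n       # quadratic in n
windowed_scores = n * w   # linear in n for a fixed window

print(f"full self-attention:    {full_scores:,} scores")      # 16,777,216
print(f"sliding window (w=512): {windowed_scores:,} scores")  # 2,097,152
```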
To address this limitation, the authors propose Longformer, a modified Transformer architecture with a novel attention mechanism that scales linearly with the sequence length, enabling efficient processing of long documents.
Related Work
The authors discuss prior work on adapting Transformers for long documents, categorizing it into two main approaches: (a) left-to-right (ltr) approaches that process documents in chunks moving from left to right, which work well for autoregressive language modeling but are unsuitable for tasks that benefit from bidirectional context, and (b) sparse attention approaches that avoid computing the full quadratic attention matrix. Longformer falls into the second category.
The authors also mention task-specific models developed to work around the limitations of pretrained Transformer models such as BERT, which typically have a 512-token limit.
These approaches include truncation, chunking, and two-stage retrieval-then-extraction models. However, these methods may suffer from information loss or cascading errors.
Longformer Architecture
The core idea behind Longformer is to replace the full self-attention mechanism in the Transformer with a combination of local and global attention patterns that scale linearly with the sequence length.
Sliding Window Attention: Longformer places a fixed-size attention window around each token, so every token attends to its local context. Stacking multiple layers of windowed attention yields a large receptive field, allowing the top layers to build representations that incorporate information from across the entire input.
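A minimal sketch of the idea, using a dense mask for clarity; the paper's actual implementation uses a banded, memory-efficient computation rather than materializing the full score matrix.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to positions within window // 2
    tokens on either side of i (plus itself)."""
    idx = torch.arange(seq_len)
    offset = (idx[None, :] - idx[:, None]).abs()
    return offset <= window // 2

def windowed_attention(q, k, v, window):
    # q, k, v: (seq_len, d). Scores outside the window are masked out before softmax.
    scores = q @ k.T / k.shape[-1] ** 0.5
    mask = sliding_window_mask(q.shape[0], window)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(16, 8)
out = windowed_attention(q, k, v, window=4)  # each token sees two neighbors on each side
```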
Dilated Sliding Window: To further increase the receptive field without increasing computation, the authors propose using a dilated sliding window, similar to dilated CNNs. This allows the model to attend to distant tokens without sacrificing local context.
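Dilation can be expressed as a small change to the mask above: with dilation d, a token attends only to neighbors whose offset is a multiple of d, so the same number of attended positions covers a span d times wider. A sketch under the same dense-mask simplification:

```python
import torch

def dilated_window_mask(seq_len: int, window: int, dilation: int) -> torch.Tensor:
    """With dilation d, position i attends to offsets 0, ±d, ±2d, ...,
    up to ±(window // 2) * d, widening the receptive field at the same cost."""
    idx = torch.arange(seq_len)
    offset = (idx[None, :] - idx[:, None]).abs()
    within_reach = offset <= (window // 2) * dilation
    on_grid = offset % dilation == 0
    return within_reach & on_grid

print(dilated_window_mask(8, 4, 2).int())  # small example: window 4, dilation 2
```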
Global Attention: To capture task-specific global information, Longformer adds "global attention" on a few pre-selected input locations. These locations attend to all tokens across the sequence, and all tokens in the sequence attend to them. The authors use separate linear projections for global attention to provide flexibility in modeling different types of attention.
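The sketch below illustrates the combined local+global pattern with separate projections for global attention. It is a single-head, dense-mask simplification for readability, not the paper's memory-efficient implementation; in particular, the choice of value projection per key is simplified.

```python
import torch
import torch.nn as nn

class LocalGlobalAttention(nn.Module):
    """Single-head, dense-mask sketch of Longformer's local+global pattern with
    separate linear projections for global attention."""
    def __init__(self, d: int, window: int):
        super().__init__()
        self.window = window
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)     # local projections
        self.qg, self.kg, self.vg = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)  # global projections

    def forward(self, x, is_global):
        # x: (seq_len, d); is_global: bool tensor (seq_len,), e.g. True for [CLS] or question tokens
        n, d = x.shape
        idx = torch.arange(n)
        local = (idx[None, :] - idx[:, None]).abs() <= self.window // 2
        global_pair = is_global[:, None] | is_global[None, :]  # either query or key is a global token

        scores_local = self.q(x) @ self.k(x).T / d ** 0.5
        scores_global = self.qg(x) @ self.kg(x).T / d ** 0.5
        scores = torch.where(global_pair, scores_global, scores_local)
        scores = scores.masked_fill(~(local | global_pair), float("-inf"))

        # Simplification: the value projection is chosen per key (global keys use
        # the global projection); the real model handles this asymmetrically.
        values = torch.where(is_global[:, None], self.vg(x), self.v(x))
        return torch.softmax(scores, dim=-1) @ values

x = torch.randn(32, 64)
is_global = torch.zeros(32, dtype=torch.bool)
is_global[0] = True  # give the first ([CLS]-style) token global attention
out = LocalGlobalAttention(64, window=8)(x, is_global)
```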
Autoregressive Language Modeling Experiments
The authors first evaluate Longformer on autoregressive character-level language modeling tasks (text8 and enwik8). They use a combination of sliding window attention and dilated sliding window attention, with varying window sizes across layers. The model is trained using a staged training procedure, gradually increasing the sequence length and window size across multiple phases. Longformer achieves state-of-the-art results on both datasets, demonstrating its effectiveness in modeling long sequences.
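The staged procedure can be pictured as a simple phase schedule; the numbers below are illustrative placeholders rather than the paper's exact hyperparameters, and train_one_phase is a hypothetical stand-in for the training loop.

```python
# Hypothetical phase schedule in the spirit of the staged training procedure:
# later phases double the window size and sequence length and reduce the
# learning rate. Values are illustrative, not the paper's settings.
phases = [
    {"seq_len": 2048,  "window": 64,  "lr": 3.0e-4},
    {"seq_len": 4096,  "window": 128, "lr": 1.5e-4},
    {"seq_len": 8192,  "window": 256, "lr": 7.5e-5},
    {"seq_len": 16384, "window": 512, "lr": 3.75e-5},
]

for phase in phases:
    train_one_phase(**phase)  # hypothetical helper: trains one phase with these settings
```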
Pretraining and Finetuning
To make Longformer suitable for various downstream tasks, the authors pretrain it using masked language modeling (MLM), starting from the RoBERTa checkpoint. They make minimal changes to support Longformer's attention mechanism and add extra position embeddings to handle longer sequences (up to 4,096 tokens). The new position embeddings are initialized by copying the learned embeddings from RoBERTa.
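A minimal sketch of this initialization, assuming a pretrained table of 512 position embeddings is tiled out to cover 4,096 positions:

```python
import torch

def extend_position_embeddings(pretrained: torch.Tensor, new_max_pos: int) -> torch.Tensor:
    """Initialize a longer position-embedding table by repeatedly copying the
    pretrained one, preserving its locally learned structure."""
    old_max_pos, hidden = pretrained.shape
    extended = pretrained.new_empty(new_max_pos, hidden)
    for start in range(0, new_max_pos, old_max_pos):
        end = min(start + old_max_pos, new_max_pos)
        extended[start:end] = pretrained[: end - start]
    return extended

# e.g. tiling RoBERTa's 512 learned positions out to Longformer's 4,096
extended = extend_position_embeddings(torch.randn(512, 768), 4096)
```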
After pretraining, Longformer is finetuned on various downstream tasks, including question answering (WikiHop, TriviaQA, HotpotQA), coreference resolution (OntoNotes), and document classification (IMDB, Hyperpartisan news detection). Longformer consistently outperforms the RoBERTa baseline, especially on tasks that require long-range context. In some cases, Longformer achieves state-of-the-art results, demonstrating its ability to effectively capture long-range dependencies.
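For readers who want to experiment, the Hugging Face transformers library (a later, third-party implementation, not part of the paper) exposes the pretrained checkpoint and the global attention mask directly. The snippet below shows an illustrative classification setup, not the paper's exact finetuning configuration.

```python
import torch
from transformers import LongformerTokenizer, LongformerForSequenceClassification

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("A very long document ...", return_tensors="pt",
                   truncation=True, max_length=4096)

# Give the first ([CLS]-style) token global attention; all other tokens keep
# the local sliding-window pattern.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

logits = model(**inputs, global_attention_mask=global_attention_mask).logits
```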
Longformer-Encoder-Decoder (LED)
The authors propose a variant of Longformer called Longformer-Encoder-Decoder (LED), which extends the encoder-decoder architecture of the original Transformer for sequence-to-sequence tasks like summarization. LED uses Longformer's efficient local+global attention pattern in the encoder and full self-attention in the decoder. The model is initialized from BART and evaluated on the arXiv summarization dataset, which contains long documents. LED achieves state-of-the-art results on this dataset, outperforming the contemporaneous BigBird model.
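An illustrative summarization call using the Hugging Face transformers LED implementation (again a later, third-party library; the checkpoint name is an example rather than the exact model evaluated in the paper):

```python
from transformers import LEDTokenizer, LEDForConditionalGeneration

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

long_document = "..."  # placeholder for the article to summarize
inputs = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=16384)

summary_ids = model.generate(**inputs, num_beams=4, max_length=512)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```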
Conclusion and Future Work
Longformer demonstrates the effectiveness of combining local and global attention patterns to process long documents efficiently. The model achieves state-of-the-art results on various tasks, including language modeling, question answering, coreference resolution, and summarization. The authors suggest that future work could explore the application of Longformer to other document-level tasks and the integration of Longformer with other efficient Transformer variants.
In summary, the Longformer paper presents a novel attention mechanism that enables Transformers to process long documents efficiently, addressing a significant limitation of the standard self-attention mechanism. The proposed architecture achieves impressive results on a range of benchmarks and has the potential to facilitate the application of Transformers to a wider variety of long-document NLP tasks.