# StreamingLLM

The <mark style="color:blue;">**September 2023**</mark> paper "Efficient Streaming Language Models with Attention Sinks" introduces StreamingLLM, a framework that enables Large Language Models (LLMs) trained with a finite attention window to generalise to infinite sequence lengths without fine-tuning.&#x20;

Applying LLMs to infinite input streams is difficult for two reasons: caching the Key and Value states of every previous token consumes ever-growing memory, and attention scales quadratically with sequence length, so the model cannot efficiently handle sequences longer than those it was trained on.

{% embed url="https://arxiv.org/abs/2309.17453" %}
Efficient Streaming Language Models with Attention Sinks
{% endembed %}

{% embed url="https://github.com/mit-han-lab/streaming-llm" %}
GitHub repository - over 6,000 stars and counting
{% endembed %}

### <mark style="color:purple;">Technical terms</mark>

* <mark style="color:blue;">**Key and Value states (KV):**</mark> In Transformer-based LLMs, the Key and Value states are cached for all previous tokens during the decoding stage.
* <mark style="color:blue;">**Attention window:**</mark> The maximum sequence length that the model is trained on, constraining the model's ability to generalise to longer sequences.
* <mark style="color:blue;">**Quadratic attention:**</mark> The computational complexity of the attention mechanism, which scales quadratically with the sequence length.
* <mark style="color:blue;">**Softmax operation:**</mark> A function that normalises the attention scores, ensuring they sum to one across all contextual tokens.
* <mark style="color:blue;">**Autoregressive language modeling:**</mark> A type of language modeling where the model predicts the next token based on the previous tokens in the sequence.
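The role of the softmax in attention can be illustrated in a few lines of NumPy (a minimal sketch for intuition, not code from the paper):

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Normalise raw attention scores into weights that sum to one."""
    exp = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return exp / exp.sum()

# Even a token with a very low raw score receives a strictly positive weight,
# so the model must always "spend" some attention on every visible token.
weights = softmax(np.array([4.0, 1.0, -2.0, -2.0]))
print(weights.sum())        # 1.0, up to floating-point error
print((weights > 0).all())  # True: no token can receive exactly zero attention
```

This property, that no attended token can ever receive exactly zero weight, is what forces the excess attention described below to land somewhere.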

### <mark style="color:purple;">The authors discuss two existing approaches</mark>

#### <mark style="color:green;">**Window attention**</mark>

Window attention maintains a fixed-size sliding window over the Key-Value (KV) states of the most recent tokens.&#x20;

While this approach ensures constant memory usage and decoding speed, the model's performance collapses once the sequence length exceeds the cache size, and the initial tokens are evicted.&#x20;
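The eviction behaviour that causes this collapse can be sketched with a toy cache (an illustration, not the authors' implementation):

```python
from collections import deque

class WindowKVCache:
    """Toy KV cache with a fixed-size sliding window.

    Once the cache is full, the oldest entries, including the initial
    tokens, are evicted, which is exactly what triggers the performance
    collapse the authors observe.
    """

    def __init__(self, window_size: int):
        # deque with maxlen drops the oldest entry automatically on overflow
        self.cache = deque(maxlen=window_size)

    def append(self, token_id: int, key: object, value: object) -> None:
        self.cache.append((token_id, key, value))

    def tokens(self) -> list[int]:
        return [t for t, _, _ in self.cache]

cache = WindowKVCache(window_size=4)
for t in range(6):
    cache.append(t, f"k{t}", f"v{t}")
print(cache.tokens())  # [2, 3, 4, 5]: tokens 0 and 1, the would-be sinks, are gone
```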

#### <mark style="color:green;">Sliding window with re-computation</mark>

This approach rebuilds the KV states of the most recent tokens for each newly generated token. It offers strong performance but is significantly slower, because quadratic attention must be re-computed within the window at every step.
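A rough back-of-the-envelope cost model (a sketch for intuition, not from the paper) makes the gap concrete: with caching, each generated token pays roughly $$O(L)$$ for attention over the window, while re-computation pays roughly $$O(L^2)$$ per token:

```python
def window_attention_cost(T: int, L: int) -> int:
    # With a KV cache, each new token attends to at most L cached pairs: O(T * L) total.
    return sum(min(t + 1, L) for t in range(T))

def recomputation_cost(T: int, L: int) -> int:
    # Without a cache, each new token recomputes attention over its whole
    # L-token window: O(T * L^2) total.
    return sum(min(t + 1, L) ** 2 for t in range(T))

T, L = 10_000, 1_024
ratio = recomputation_cost(T, L) / window_attention_cost(T, L)
print(f"re-computation is ~{ratio:.0f}x more work")  # roughly a factor of L
```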

### <mark style="color:purple;">The Concept of Attention Sinks</mark>

The authors discover an interesting phenomenon called "attention sink," where a surprisingly *<mark style="color:yellow;">**large amount of attention score is allocated to the initial tokens**</mark>*, irrespective of their relevance to the language modeling task.&#x20;

They attribute this to the <mark style="color:blue;">**softmax operation**</mark>.

The <mark style="color:blue;">**softmax function**</mark> prevents any attended token from receiving exactly zero weight, so the model must aggregate information from other tokens even when the current embedding already contains sufficient information for the prediction.&#x20;

Consequently, the model tends to <mark style="color:yellow;">**dump unnecessary attention values to specific tokens**</mark>, which are typically the initial tokens due to their visibility to all subsequent tokens in autoregressive language modeling.

Based on these insights, the authors propose StreamingLLM, which keeps the attention sink tokens' KV (just 4 initial tokens suffice) together with the sliding window's KV to anchor the attention computation and stabilise the model's performance.&#x20;
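The cache policy can be sketched as follows (a toy illustration of the idea, not the official `streaming-llm` code): always retain the first few "sink" tokens, and evict only from the sliding window that follows them.

```python
class StreamingKVCache:
    """Toy sketch of StreamingLLM's cache policy: keep the first `n_sinks`
    tokens permanently, plus a rolling window of the most recent tokens."""

    def __init__(self, n_sinks: int = 4, window_size: int = 8):
        self.n_sinks = n_sinks
        self.window_size = window_size
        self.entries: list[tuple[int, object, object]] = []

    def append(self, token_id: int, key: object, value: object) -> None:
        self.entries.append((token_id, key, value))
        if len(self.entries) > self.n_sinks + self.window_size:
            # Evict the oldest entry *after* the protected sink tokens.
            del self.entries[self.n_sinks]

    def tokens(self) -> list[int]:
        return [t for t, _, _ in self.entries]

cache = StreamingKVCache(n_sinks=4, window_size=4)
for t in range(12):
    cache.append(t, f"k{t}", f"v{t}")
print(cache.tokens())  # [0, 1, 2, 3, 8, 9, 10, 11]
```

Note that in the paper, positional information is assigned relative to positions within the cache rather than positions in the original text, so the model always sees a contiguous range of positions.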

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2Fzrph5WI6Jm2c6jks2nTQ%2Fimage.png?alt=media&#x26;token=a9ec50ff-910b-4cf5-a000-2b7c544b30d1" alt=""><figcaption></figcaption></figure>

The diagram above compares StreamingLLM with existing methods for handling long input sequences in language models.&#x20;

The language model is pre-trained on texts of length $$L$$ and is tasked with predicting the $$T$$-th token, where $$T$$ is much greater than $$L$$.&#x20;

The comparison includes:

<mark style="color:blue;">**(a) Dense Attention:**</mark> It has a time complexity of $$O(T^2)$$ and an increasing cache size. The model's performance decreases when the text length exceeds the pre-training text length.

<mark style="color:blue;">**(b) Window Attention:**</mark> It caches the Key and Value (KV) states of the most recent $$L$$ tokens. While efficient in inference, the performance declines sharply once the starting tokens' keys and values are evicted from the cache.

<mark style="color:blue;">**(c) Sliding Window with Re-computation:**</mark> It rebuilds the KV states from the $$L$$ most recent tokens for each new token. Although it performs well on long texts, its $$O(TL^2)$$ complexity, stemming from quadratic attention during context re-computation, makes it considerably slower.

<mark style="color:blue;">**(d) StreamingLLM:**</mark> It keeps the attention sink (several initial tokens) for stable attention computation, combined with the recent tokens. It is efficient and offers stable performance on extended texts.

StreamingLLM addresses the limitations of existing methods by introducing an attention sink, which consists of *<mark style="color:yellow;">**several initial tokens that stabilise the attention computation**</mark>*.&#x20;

By combining the attention sink with the most recent tokens, StreamingLLM achieves efficient and stable performance on long input sequences, outperforming dense attention, window attention, and sliding window with re-computation approaches.

### <mark style="color:purple;">Key Advantages of StreamingLLM</mark>

* <mark style="color:green;">**Extended Context Without Re-Training**</mark><mark style="color:green;">:</mark> StreamingLLM allows models to handle text sequences of virtually unlimited length without the need for model retraining or modification.
* <mark style="color:green;">**Efficient and High-Quality Inference**</mark><mark style="color:green;">:</mark> It addresses the challenges of previous methods, offering a solution that is fast, maintains high quality, and requires low memory.
* <mark style="color:green;">**Model Compatibility**</mark><mark style="color:green;">:</mark> StreamingLLM is compatible with various LLMs like Llama-2, Falcon, and Pythia, enabling them to model up to 4 million tokens effectively.

### <mark style="color:purple;">Implementation and Future Potential</mark>

* <mark style="color:green;">**Publicly Accessible Code**</mark><mark style="color:green;">:</mark> The code for StreamingLLM is available on GitHub, offering compatibility with several LLMs and integration with Hugging Face transformers libraries.
* <mark style="color:green;">**Enhanced Language Modeling Applications**</mark><mark style="color:green;">:</mark> With StreamingLLM, LLMs can be applied to tasks requiring processing of much longer text sequences, such as prolonged chat sessions or comprehensive document analysis, without compromising on performance or incurring prohibitive costs.

StreamingLLM presents an innovative approach to extend the context window of Large Language Models (LLMs) like Transformers, but it's not without potential challenges or drawbacks.

### <mark style="color:blue;">Problems that might arise with StreamingLLM</mark>

### <mark style="color:purple;">Dependency on Initial Tokens (Attention Sinks)</mark>

* <mark style="color:green;">**Reliance on Specific Tokens**</mark><mark style="color:green;">:</mark> StreamingLLM relies heavily on maintaining the initial tokens (attention sinks) in the model's KV (Key-Value) cache. This reliance could be problematic if the <mark style="color:yellow;">initial tokens are not sufficiently representative or relevant to the ongoing context</mark>.
* <mark style="color:green;">**Potential for Irrelevant Context Preservation**</mark><mark style="color:green;">:</mark> If the initial tokens are not closely related to the current topic of discussion or text, their preservation may not contribute meaningfully to the model's understanding and could even introduce noise or irrelevant context.

### <mark style="color:purple;">Handling of Evolving Contexts in Long Conversations</mark>

* <mark style="color:green;">**Contextual Relevance Over Time**</mark>: In prolonged conversations or text sequences, the relevance of initial tokens might diminish as the subject evolves. StreamingLLM’s mechanism might struggle to adapt to these changes, potentially leading to less accurate or relevant outputs.
* <mark style="color:green;">**Complexity in Dynamic Conversations**</mark><mark style="color:green;">:</mark> The model might face challenges in dynamically evolving conversations where new information significantly changes the context or where the conversation shifts to entirely different topics.

### <mark style="color:purple;">Computational Efficiency and Throughput</mark>

* <mark style="color:green;">**Trade-Offs in Efficiency**</mark><mark style="color:green;">:</mark> While StreamingLLM aims to be computationally efficient, the process of maintaining a rolling KV cache and managing the attention sinks could still introduce computational overhead, especially in very long sequences.
* <mark style="color:green;">**Throughput Concerns**</mark><mark style="color:green;">:</mark> The need to constantly update and manage the KV cache for attention sinks might impact the throughput of the model, affecting its real-time responsiveness in applications like interactive chatbots or live document editing.

### <mark style="color:purple;">Model Generalisation and Training</mark>

* <mark style="color:green;">**Pre-Training Constraints**</mark><mark style="color:green;">:</mark> The authors suggest that pre-training with a dedicated, globally trainable attention-sink token further improves streaming performance. Adopting this recommendation imposes an extra consideration on the standard pre-training recipe of LLMs.
* <mark style="color:green;">**Potential Impact on Model Flexibility**</mark><mark style="color:green;">:</mark> The specific design choices and architecture adjustments required for StreamingLLM might impact the model's flexibility and generalisation capabilities across different types of tasks and datasets.

### <mark style="color:purple;">Quality and Consistency of Outputs</mark>

* <mark style="color:green;">**Quality Maintenance in Extended Contexts**</mark><mark style="color:green;">:</mark> There’s a potential challenge in maintaining the quality and consistency of the model’s outputs as the context window extends significantly. Ensuring that the model remains coherent and contextually accurate over long text sequences is crucial.
* <mark style="color:green;">**Balancing Context and Relevance**</mark><mark style="color:green;">:</mark> StreamingLLM must balance the retention of old context (through attention sinks) with the incorporation of new information. Achieving this balance without losing relevance or coherence can be challenging, especially in complex or nuanced text sequences.

While StreamingLLM offers a promising solution to the context window limitation of Transformers, these potential challenges highlight the complexity and nuances involved in implementing such a system effectively.&#x20;
