StreamingLLM

The September 2023 paper "Efficient Streaming Language Models with Attention Sinks" introduces StreamingLLM, a framework that enables Large Language Models (LLMs) trained with a finite attention window to generalise to infinite sequence lengths without fine-tuning.

The main challenge in applying LLMs to infinite input streams is the quadratic memory and computational complexity of the attention mechanism, which limits the model's ability to handle longer sequences than it was trained on.

Technical terms

  • Key and Value states (KV): In Transformer-based LLMs, the Key and Value states of all previous tokens are cached during the decoding stage (see the short sketch after this list).

  • Attention window: The maximum sequence length the model is trained on, which constrains its ability to generalise to longer sequences.

  • Quadratic attention: The computational complexity of the attention mechanism, which scales quadratically with sequence length.

  • Softmax operation: A function that normalises the attention scores, ensuring they sum to one across all contextual tokens.

  • Autoregressive language modeling: A type of language modeling where the model predicts the next token based on the previous tokens in the sequence.
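
To make these terms concrete, here is a minimal sketch of single-head autoregressive decoding with a growing KV cache. It is illustrative only (the head dimension, sequence length, and random vectors are placeholders, not anything from the paper): each step attends over every cached token, which is where the growing memory footprint and the overall quadratic cost come from, and the softmax guarantees that the scores over the cached tokens sum to one.

```python
import numpy as np

d = 8                                  # head dimension (placeholder)
rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()                 # positive scores that sum to one

k_cache, v_cache = [], []              # Key/Value states cached for all previous tokens

for t in range(16):                    # decode 16 tokens
    q = rng.normal(size=d)             # query for the current token (placeholder values)
    k = rng.normal(size=d)
    v = rng.normal(size=d)
    k_cache.append(k)
    v_cache.append(v)

    K = np.stack(k_cache)              # shape (t + 1, d): the cache grows by one entry per step
    V = np.stack(v_cache)
    scores = softmax(K @ q / np.sqrt(d))   # attention over every cached token
    out = scores @ V                       # context vector for the current step
    # Per-step cost grows with t; summed over T generated tokens this gives the
    # quadratic behaviour described above, while the cache keeps KV states for
    # every token seen so far.
```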

The authors discuss two existing approaches

Window attention

Window attention is a technique that maintains a fixed-size sliding window over the Key-Value (KV) states of the most recent tokens.

While this approach ensures constant memory usage and decoding speed, the model's performance collapses once the sequence length exceeds the cache size, and the initial tokens are evicted.

Sliding window with re-computation

This approach rebuilds the KV states of the most recent tokens for each generated token. It offers strong performance but is significantly slower, because it recomputes quadratic attention within its window.
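
The cache policy behind window attention can be sketched in a few lines (illustrative only; the cache size and helper name are placeholders). The key point is that eviction is purely positional, so the very first tokens are the first to be dropped, which is exactly the condition under which window attention's quality collapses.

```python
from collections import deque

CACHE_SIZE = 1024                      # window size L: maximum number of cached (key, value) pairs

kv_cache = deque(maxlen=CACHE_SIZE)    # a full deque silently drops its oldest entry

def cache_token(key, value):
    # Window attention: append the newest token's KV states; once the cache is full,
    # the earliest tokens (including the initial ones) are evicted.
    kv_cache.append((key, value))

# Sliding window with re-computation instead rebuilds the KV states of the most recent
# CACHE_SIZE tokens from scratch for every generated token: strong quality on long text,
# but quadratic attention inside the window makes each decoding step considerably slower.
```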

The Concept of Attention Sinks

The authors discover an interesting phenomenon called "attention sink," where a surprisingly large amount of attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task.

They attribute this to the softmax operation.

Because the softmax forces the attention scores to sum to one, no attended token can receive exactly zero attention; the model is therefore made to aggregate information from other tokens even when the current embedding already contains enough information for its prediction.

Consequently, the model tends to dump unnecessary attention values to specific tokens, which are typically the initial tokens due to their visibility to all subsequent tokens in autoregressive language modeling.

Based on these insights, the authors propose StreamingLLM, which keeps the attention sink tokens' KV (just 4 initial tokens suffice) together with the sliding window's KV to anchor the attention computation and stabilise the model's performance.
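
A minimal sketch of this cache policy (illustrative only; the names and sizes below are placeholders rather than the authors' code): the KV states of the first few tokens are pinned as attention sinks, and eviction only ever removes the oldest token in the rolling window of recent tokens.

```python
NUM_SINK = 4       # the paper reports that keeping roughly four initial "sink" tokens suffices
WINDOW   = 1020    # rolling window of recent tokens; the cache holds NUM_SINK + WINDOW entries

kv_cache = []      # list of (key, value) pairs in token order

def cache_token(key, value):
    kv_cache.append((key, value))
    if len(kv_cache) > NUM_SINK + WINDOW:
        # Evict the oldest non-sink entry: positions [0, NUM_SINK) are never dropped,
        # so every attention computation keeps its sink tokens to anchor on.
        del kv_cache[NUM_SINK]
```

One related detail from the paper: when computing attention, positions are assigned according to a token's place inside this rolled cache rather than its position in the original text, so the model always operates within the positional range it saw during training.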

The paper's overview figure compares StreamingLLM with existing methods for handling long input sequences. In this comparison, the language model is pre-trained on texts of length L and is then asked to predict the T-th token, where T is much greater than L.

The comparison includes:

(a) Dense Attention: It has O(T^2) time complexity and an ever-growing cache size. The model's performance degrades once the text length exceeds the pre-training text length.

(b) Window Attention: It caches the Key and Value (KV) states of the most recent L tokens. While efficient at inference, performance declines sharply once the starting tokens' keys and values are evicted from the cache.

(c) Sliding Window with Re-computation: It rebuilds the KV states from the L most recent tokens for each new token. Although it performs well on long texts, its O(TL^2) complexity, stemming from quadratic attention in context re-computation, makes it considerably slow.

(d) StreamingLLM: It keeps the attention sink (several initial tokens) for stable attention computation, combined with the recent tokens. It is efficient and offers stable performance on extended texts.

StreamingLLM addresses the limitations of existing methods by introducing an attention sink, which consists of several initial tokens that stabilise the attention computation.

By combining the attention sink with the most recent tokens, StreamingLLM achieves efficient and stable performance on long input sequences, outperforming dense attention, window attention, and sliding window with re-computation.

Key Advantages of StreamingLLM

  • Extended Context Without Re-Training: StreamingLLM allows models to handle text sequences of virtually unlimited length without the need for model retraining or modification.

  • Efficient and High-Quality Inference: It addresses the challenges of previous methods, offering a solution that is fast, maintains high quality, and requires low memory.

  • Model Compatibility: StreamingLLM is compatible with various LLMs like Llama-2, Falcon, and Pythia, enabling them to model up to 4 million tokens effectively.

Implementation and Future Potential

  • Publicly Accessible Code: The code for StreamingLLM is available on GitHub (mit-han-lab/streaming-llm), offering compatibility with several LLMs and integration with the Hugging Face Transformers library; a hedged usage sketch follows this list.

  • Enhanced Language Modeling Applications: With StreamingLLM, LLMs can be applied to tasks requiring processing of much longer text sequences, such as prolonged chat sessions or comprehensive document analysis, without compromising on performance or incurring prohibitive costs.
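
As one way to try the attention-sink idea from Python, some releases of Hugging Face Transformers (starting with the v4.36 cache refactor; the class has since been deprecated in newer versions) shipped a SinkCache implementing this policy. The sketch below is an illustration under those assumptions, not the authors' reference implementation: the model name, window length, and prompt are placeholders, and the exact API may differ in your Transformers version. The authors' own repository also contains example scripts for streaming inference.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, SinkCache  # SinkCache availability depends on version

model_id = "meta-llama/Llama-2-7b-chat-hf"   # placeholder; any supported causal LM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Tell me a very long story.", return_tensors="pt").to(model.device)

# Keep 4 attention-sink tokens plus a rolling window of recent tokens in the KV cache.
past_key_values = SinkCache(window_length=256, num_sink_tokens=4)

output_ids = model.generate(**inputs, max_new_tokens=128, past_key_values=past_key_values)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```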

StreamingLLM presents an innovative approach to extending the context window of Transformer-based Large Language Models (LLMs), but it is not without potential challenges or drawbacks.

Here are problems that might arise with StreamingLLM

Dependency on Initial Tokens (Attention Sinks)

  • Reliance on Specific Tokens: StreamingLLM relies heavily on maintaining the initial tokens (attention sinks) in the model's KV (Key-Value) cache. This reliance could be problematic if the initial tokens are not sufficiently representative or relevant to the ongoing context.

  • Potential for Irrelevant Context Preservation: If the initial tokens are not closely related to the current topic of discussion or text, their preservation may not contribute meaningfully to the model's understanding and could even introduce noise or irrelevant context.

Handling of Evolving Contexts in Long Conversations

  • Contextual Relevance Over Time: In prolonged conversations or text sequences, the relevance of initial tokens might diminish as the subject evolves. StreamingLLM’s mechanism might struggle to adapt to these changes, potentially leading to less accurate or relevant outputs.

  • Complexity in Dynamic Conversations: The model might face challenges in dynamically evolving conversations where new information significantly changes the context or where the conversation shifts to entirely different topics.

Computational Efficiency and Throughput

  • Trade-Offs in Efficiency: While StreamingLLM aims to be computationally efficient, the process of maintaining a rolling KV cache and managing the attention sinks could still introduce computational overhead, especially in very long sequences.

  • Throughput Concerns: The need to constantly update and manage the KV cache for attention sinks might impact the throughput of the model, affecting its real-time responsiveness in applications like interactive chatbots or live document editing.

Model Generalisation and Training

  • Pre-Training Constraints: StreamingLLM can benefit from certain considerations during the pre-training phase, such as the inclusion of a dedicated, globally trainable attention sink token. Such a requirement could impose constraints on the general pre-training process of LLMs.

  • Potential Impact on Model Flexibility: The specific design choices and architecture adjustments required for StreamingLLM might impact the model's flexibility and generalisation capabilities across different types of tasks and datasets.

Quality and Consistency of Outputs

  • Quality Maintenance in Extended Contexts: There’s a potential challenge in maintaining the quality and consistency of the model’s outputs as the context window extends significantly. Ensuring that the model remains coherent and contextually accurate over long text sequences is crucial.

  • Balancing Context and Relevance: StreamingLLM must balance the retention of old context (through attention sinks) with the incorporation of new information. Achieving this balance without losing relevance or coherence can be challenging, especially in complex or nuanced text sequences.

While StreamingLLM offers a promising solution to the context window limitation of Transformers, these potential challenges highlight the complexity and nuances involved in implementing such a system effectively.

Efficient Streaming Language Models with Attention Sinks (arXiv.org)

GitHub: mit-han-lab/streaming-llm (over 6,000 stars and counting)