# Flash Attention

The authors of this <mark style="color:blue;">**June 2022**</mark> paper propose FlashAttention, an approach to computing exact attention that optimises memory usage and computational efficiency by leveraging the memory hierarchy of modern hardware.

The key idea behind FlashAttention is to <mark style="color:yellow;">exploit the fast on-chip memory (SRAM) of GPUs to store intermediate computations and minimise data movement between different levels of the memory hierarchy</mark>.

By carefully designing the attention computation to fit within the SRAM and optimising the data layout, FlashAttention achieves significant speedups and memory savings compared to traditional attention implementations.

{% embed url="<https://arxiv.org/abs/2205.14135>" %}
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
{% endembed %}

One of the strengths of this paper is the comprehensive analysis of the memory hierarchy and bandwidth considerations in the context of attention computation.

The authors provide a <mark style="color:yellow;">detailed breakdown of the memory requirements and data movement patterns for different attention variants, such as scaled dot-product attention and masked attention</mark>.

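They demonstrate that by carefully managing data movement and maximising the use of fast on-chip memory, FlashAttention achieves up to a 7.6 times speedup on the attention computation in GPT-2 compared to the standard PyTorch implementation. The paper also reports a 2.4 times speedup on the Long-Range Arena benchmark and a memory footprint that grows linearly, rather than quadratically, with sequence length.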

### <mark style="color:purple;">**The Essence of FlashAttention**</mark>

The evaluation of FlashAttention on the GPT-2 language model showcases its practical impact.

The authors show that FlashAttention can significantly reduce the memory footprint and computational time of attention layers, enabling the training and inference of larger models with limited GPU memory.

This is particularly relevant in the context of recent advancements in large-scale language models, where the <mark style="color:yellow;">attention mechanism is a major bottleneck in terms of memory and computational efficiency.</mark>

Attention mechanisms are crucial for these models as they help determine which parts of the input data the model should focus on. However, they can be resource-intensive, requiring significant memory and processing power.
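To make the cost concrete, here is a minimal sketch of standard scaled dot-product attention in plain Python. It is an illustration only, not the PyTorch implementation the paper benchmarks against; the function names `softmax` and `naive_attention` are hypothetical. The point to notice is that the full N x N score matrix is built before it is used.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def naive_attention(Q, K, V):
    """Standard scaled dot-product attention (illustrative sketch).

    Q, K and V are lists of N rows, each row a list of floats.
    Returns the output rows and the N x N attention weights.
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    # N x N matrix of scaled dot products, materialised in full
    scores = [[scale * sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]
    # Row-wise softmax: how strongly each query attends to every key
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted sum of the value rows
    out = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]
    return out, weights
```

Both `scores` and `weights` hold N x N entries, so doubling the sequence length quadruples the intermediate storage.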

The paper introduces a novel approach to managing the memory hierarchy during the attention process, focusing on three main types of memory:

* <mark style="color:green;">**SRAM (Static Random-Access Memory):**</mark> Fast and located on the GPU, but limited in size (19 TB/s bandwidth, 20 MB size).
* <mark style="color:green;">**HBM (High Bandwidth Memory):**</mark> Also on the GPU, slower than SRAM but with a larger capacity (1.5 TB/s bandwidth, 40 GB size).
* <mark style="color:green;">**DRAM (Dynamic Random-Access Memory):**</mark> Main memory attached to the CPU rather than the GPU, with the largest capacity but the lowest bandwidth (12.8 GB/s bandwidth, over 1 TB size).
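A back-of-envelope calculation with the figures above shows why the attention matrix is the problem. This sketch assumes 2-byte (fp16) elements, a single head and a single sequence, and decimal units; the helper names are hypothetical, and the timing ignores latency and any overlap with compute.

```python
# Figures quoted in the list above (decimal units assumed)
SRAM_BYTES = 20 * 10**6            # 20 MB
SRAM_BANDWIDTH = 19 * 10**12       # 19 TB/s
HBM_BANDWIDTH = 1.5 * 10**12       # 1.5 TB/s

def score_matrix_bytes(seq_len, bytes_per_element=2):
    """Bytes needed for one seq_len x seq_len score matrix (assumes fp16)."""
    return seq_len * seq_len * bytes_per_element

def transfer_seconds(num_bytes, bandwidth):
    """Idealised time to move num_bytes once at the given bandwidth (bytes/s)."""
    return num_bytes / bandwidth

if __name__ == "__main__":
    for seq_len in (1024, 4096, 16384):
        size = score_matrix_bytes(seq_len)
        print(
            f"N={seq_len}: {size / 1e6:.1f} MB, "
            f"fits in SRAM: {size <= SRAM_BYTES}, "
            f"one HBM pass: {transfer_seconds(size, HBM_BANDWIDTH) * 1e6:.1f} us"
        )
```

Under these assumptions a single score matrix is about 2.1 MB at a sequence length of 1,024 and about 33.6 MB at 4,096, which already exceeds the 20 MB of SRAM. Treating SRAM as one pool is optimistic, since on-chip memory is divided among the GPU's streaming multiprocessors, and a real model multiplies the cost by the number of heads and the batch size.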

### <mark style="color:purple;">**How FlashAttention Works**</mark>

FlashAttention optimises the use of these memory types by carefully managing where and how data is stored and accessed during the computation process.

It minimises reads and writes to the slower HBM by keeping intermediate results in SRAM, so the full attention matrix never has to be written out to HBM.

This is achieved through tiling and recomputation: the inputs are processed block by block in SRAM, and intermediate values needed for the backward pass are recomputed rather than stored. Both reduce data movement across memory types and significantly speed up the attention computation.

The technical details involve breaking down the computation into smaller blocks that can be efficiently processed within the faster, but smaller, memory spaces (SRAM), and then combining the results. This process involves several steps, including:

* Copying necessary data blocks to SRAM.
* Computing attention blocks within SRAM.
* Efficiently outputting results to HBM for further processing or storage.
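The steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the paper's fused GPU kernel: list slices stand in for copies into SRAM, the loop is organised per query row rather than in the paper's block order, and the names `tiled_attention`, `reference_attention` and `block_size` are hypothetical. Partial results are combined with a running maximum and a running softmax denominator, so the full N x N score matrix is never stored.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reference_attention(Q, K, V):
    """Full-row attention, used only to check the tiled version."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * dot(q, k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e * v[j] for e, v in zip(exps, V)) / total
                    for j in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block_size=2):
    """Attention computed one key/value block at a time (illustrative sketch)."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    d_v = len(V[0])
    out = []
    for q in Q:
        running_max = -math.inf   # largest score seen so far
        denom = 0.0               # running softmax denominator
        acc = [0.0] * d_v         # unnormalised output accumulator
        for start in range(0, len(K), block_size):
            # Step 1: "copy" one block of keys and values into fast memory
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            # Step 2: compute this block's scores and fold them in
            scores = [scale * dot(q, k) for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial sums to the new maximum
            correction = math.exp(running_max - new_max)
            exps = [math.exp(s - new_max) for s in scores]
            denom = denom * correction + sum(exps)
            acc = [a * correction + sum(e * v[j] for e, v in zip(exps, v_blk))
                   for j, a in enumerate(acc)]
            running_max = new_max
        # Step 3: normalise once and "write" the finished row out
        out.append([a / denom for a in acc])
    return out
```

Because only `running_max`, `denom` and `acc` persist between blocks, the working set per query row depends on `block_size` and the value dimension rather than on the sequence length. The result should match `reference_attention` up to floating-point rounding for any positive `block_size`.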

### <mark style="color:purple;">**Impact and Applications**</mark>

The FlashAttention method significantly reduces the time it takes to compute the attention mechanism in GPT-2. This not only means faster training times for large models but also opens up possibilities for their deployment in environments where computational resources are limited.

