Inference & Optimization

FlashAttention

A memory-efficient attention algorithm that speeds up transformer training and inference by avoiding materialization of the full attention matrix.

FlashAttention, introduced by Tri Dao and collaborators in 2022, is an IO-aware implementation of attention that computes results in tiles without ever materializing the full N×N attention matrix in memory. This dramatically reduces memory usage and improves speed.

It's mathematically identical to standard attention — an exact computation, not an approximation — but it uses the memory hierarchy of modern GPUs more efficiently: intermediate results stay in fast on-chip SRAM instead of being written to and re-read from slow HBM. The key trick is an online softmax that processes keys and values block by block, carrying only per-row running statistics (the running max and the normalizer) rather than the full score matrix.
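The tiling idea can be sketched in plain NumPy. This is a minimal illustrative sketch, not the real kernel (the actual implementation is a fused GPU kernel operating in SRAM); function names and the block size are chosen here for illustration:

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: materializes the full N x N score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=4):
    # FlashAttention-style tiling with an online softmax:
    # K and V are consumed in blocks, and only per-row running
    # statistics (max m, normalizer l) and the output accumulator
    # are kept -- never the full N x N matrix.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    m = np.full(N, -np.inf)   # running row-wise max of the scores
    l = np.zeros(N)           # running softmax denominator
    for j in range(0, N, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T * scale                  # N x block score tile
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])        # unnormalized tile probabilities
        alpha = np.exp(m - m_new)             # rescale earlier accumulators
        l = l * alpha + P.sum(axis=-1)
        O = O * alpha[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]
```

Because the softmax statistics are updated incrementally, the tiled version produces bit-for-bit the same mathematical result as the naive one while its working set depends on the block size, not on N.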

Impact: 2-4x faster training and 5-20x memory savings, enabling much longer context windows on the same hardware.

FlashAttention is now standard in virtually all transformer training and inference. Later versions (FlashAttention-2, FlashAttention-3) pushed speed and efficiency further. It's one of the most impactful systems-level optimizations in modern AI.
