Inference & Optimization

KV Cache

A memory structure that stores previously computed attention keys and values, allowing LLMs to generate tokens without recomputing from scratch.

The KV cache stores the key and value projections computed for each token during LLM generation. Without it, every new token would require recomputing the key and value projections for every previous token, so total work grows quadratically with sequence length and generation becomes painfully slow.

With the KV cache, generation becomes linear in sequence length: each new token only attends to cached keys and values from previous tokens, plus its own new projection.
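The idea above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration, not a real implementation: the projection matrices `Wq`, `Wk`, `Wv` and the helper names are hypothetical, and real models add multiple heads, layers, and positional encodings. The point is that each token's key and value are computed once, appended to the cache, and reused.

```python
import numpy as np

def attention(q, K, V):
    """Single-query attention: q is (d,), K and V are (t, d)."""
    scores = K @ q / np.sqrt(q.shape[-1])      # similarity to each cached key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over the prefix
    return weights @ V                         # weighted sum of cached values

def generate_with_cache(xs, Wq, Wk, Wv):
    """Process token states one at a time, growing the KV cache each step."""
    K_cache, V_cache, outputs = [], [], []
    for x in xs:                               # x: hidden state for one token
        q = Wq @ x
        K_cache.append(Wk @ x)                 # this token's key, computed once
        V_cache.append(Wv @ x)                 # ...and its value
        out = attention(q, np.stack(K_cache), np.stack(V_cache))
        outputs.append(out)
    return np.stack(outputs)

def generate_without_cache(xs, Wq, Wk, Wv):
    """Same result, but re-projects the entire prefix at every step."""
    outputs = []
    for t in range(1, len(xs) + 1):
        K = xs[:t] @ Wk.T                      # recomputed from scratch: O(t) work
        V = xs[:t] @ Wv.T
        outputs.append(attention(Wq @ xs[t - 1], K, V))
    return np.stack(outputs)
```

Both functions produce identical outputs; the cached version just avoids redoing the `Wk`/`Wv` projections for the prefix at every step.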

Tradeoff: the KV cache consumes substantial GPU memory, growing linearly with sequence length and batch size, and with the model's depth and width. Long contexts can require gigabytes for the cache alone.
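To make the tradeoff concrete, here is a back-of-the-envelope calculation. The cache holds two tensors (K and V) per layer per token; the example configuration below (32 layers, 32 KV heads, head dimension 128, fp16) is a hypothetical 7B-class model chosen for illustration, not a quote of any specific model's specs.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Total KV cache size: 2 tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 7B-class config in fp16 (2 bytes per element):
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)     # 524,288 B = 512 KiB/token
full_ctx  = kv_cache_bytes(32, 32, 128, seq_len=4096)  # 2 GiB at a 4K context
```

At roughly half a mebibyte per token, a single 4K-token sequence ties up about 2 GiB of GPU memory before counting the model weights, which is why serving many long sequences concurrently is memory-bound.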

Optimizations like PagedAttention, KV cache compression (e.g., quantization), and prefix sharing reduce this memory consumption. The KV cache is why context length and memory are tightly linked in LLM deployment.
