Models & Architecture

Multi-Head Attention

A transformer technique that runs multiple attention operations in parallel so the model can capture different kinds of relationships at once.

Multi-head attention extends self-attention by running several attention calculations in parallel. Each "head" can learn to focus on a different kind of pattern, such as grammar, entity relationships, or topic structure.

This parallel attention makes transformers much more expressive. One head might connect pronouns to nouns, another might track sentence boundaries, and another might capture semantic similarity. The results are then combined into a single richer representation.

Why use multiple heads? A single attention map must compress every relationship between tokens into one set of weights. Multiple heads let the model examine the same sequence from several perspectives at the same time, each with its own learned projections.
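The idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the projection matrices are random placeholders standing in for learned weights, and the function name and shapes are chosen for this example only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    # x: (seq_len, d_model). In a real model the four projection
    # matrices below are learned; here they are random stand-ins.
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))

    # Project, then split the model dimension into independent heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    def split_heads(m):
        return m.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

    # Each head computes its own attention map over the same sequence
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                  # (heads, seq, d_head)

    # Concatenate the heads and mix them with a final projection,
    # producing the single combined representation
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))          # 5 tokens, model width 16
out = multi_head_attention(x, num_heads=4, rng=rng)
print(out.shape)  # (5, 16)
```

Each head gets its own slice of the model width (`d_head = d_model / num_heads`), so adding heads does not increase the total computation much; it only changes how the same capacity is partitioned.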

Benefits of Multi-Head Attention

  • Richer representations — learn multiple contextual patterns simultaneously
  • Better language understanding — capture syntax and meaning together
  • Stronger generation quality — improves coherence and relevance
  • Core transformer block — appears in nearly every modern LLM

Multi-head attention is one reason transformers outperform older architectures so decisively. It is now a standard component in text, image, audio, and multimodal AI systems.
