Models & Architecture

Multi-Head Attention

A transformer technique that runs multiple attention operations in parallel so the model can capture different kinds of relationships at once.

Multi-head attention extends self-attention by running several attention calculations in parallel. Each "head" can learn to focus on a different kind of pattern, such as grammar, entity relationships, or topic structure.

This parallel attention makes transformers much more expressive. One head might connect pronouns to nouns, another might track sentence boundaries, and another might capture semantic similarity. The results are then combined into a single richer representation.

Why use multiple heads? A single attention map must compress every relationship between tokens into one set of weights. Multiple heads let the model examine the same sequence from several perspectives at the same time, each with its own learned projections.
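The idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the projection matrices are random placeholders standing in for learned weights, and the function name and shapes are chosen for this example only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    # x: (seq_len, d_model). In a real model the four projection
    # matrices below are learned; here they are random stand-ins.
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))

    # Project, then split the model dimension into independent heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    def split_heads(m):
        return m.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)

    # Each head computes its own attention map over the same sequence
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                  # (heads, seq, d_head)

    # Concatenate the heads and mix them with a final projection,
    # producing the single combined representation
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))          # 5 tokens, model width 16
out = multi_head_attention(x, num_heads=4, rng=rng)
print(out.shape)  # (5, 16)
```

Each head gets its own slice of the model width (`d_head = d_model / num_heads`), so adding heads does not increase the total computation much; it only changes how the same capacity is partitioned.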

Benefits of Multi-Head Attention

  • Richer representations — learn multiple contextual patterns simultaneously
  • Better language understanding — capture syntax and meaning together
  • Stronger generation quality — improves coherence and relevance
  • Core transformer block — appears in nearly every modern LLM

Multi-head attention is one reason transformers outperform older architectures so decisively. It is now a standard component in text, image, audio, and multimodal AI systems.
