Self-Attention
A form of attention where each token in a sequence looks at every other token in the same sequence to build context-aware representations.
Self-attention is the operation that lets each token in a sequence inspect all tokens in that same sequence, including itself, and decide which ones are most relevant. The output for each token is a weighted combination of all the tokens' representations, which gives the model a context-aware view of the sequence.
In practice, self-attention helps language models resolve ambiguity. For example, in the sentence "The animal didn't cross the street because it was too tired," self-attention helps the model infer that "it" refers to "animal," not "street."
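The weighted combination described above can be sketched as scaled dot-product self-attention. This is a minimal NumPy illustration, not any particular library's implementation; the function and weight names (`self_attention`, `w_q`, `w_k`, `w_v`) are illustrative.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over x of shape (seq_len, d_model)."""
    q = x @ w_q  # queries: what each token is looking for
    k = x @ w_k  # keys: what each token offers
    v = x @ w_v  # values: the content that gets mixed together
    d_k = q.shape[-1]
    # Each token scores every token in the same sequence (itself included).
    scores = q @ k.T / np.sqrt(d_k)                   # (seq_len, seq_len)
    # Softmax turns each row of scores into attention weights summing to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each token's output is a weighted combination of all value vectors.
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8): one context-aware vector per input token
```

Note that the attention-weight matrix is square in the sequence length, which is why every token can attend to every other token in a single step.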
Why Self-Attention Is Powerful
- Context awareness — words are interpreted based on surrounding tokens
- Flexible dependency modeling — relates any two tokens directly, regardless of how far apart they are in the sequence
- Scalable training — all positions are processed in parallel on GPUs, unlike recurrent models that step through the sequence one token at a time
- Foundation for LLMs — central to GPT, Claude, Gemini, and similar models
Self-attention is applied repeatedly across many stacked transformer layers. Combined with feedforward blocks and residual connections, it forms the backbone of modern large language models.
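How self-attention combines with the feedforward block and residual connections can be sketched as a single transformer layer. This is a simplified illustration under several assumptions: identity Q/K/V projections inside the attention step, a plain two-layer ReLU MLP, and a post-norm arrangement; the names `transformer_block`, `w1`, and `w2` are hypothetical.

```python
import numpy as np

def self_attention(x):
    # Simplified attention with identity Q/K/V projections, for illustration only.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector to zero mean and (roughly) unit variance.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def feed_forward(x, w1, w2):
    # Position-wise MLP with a ReLU, applied to each token independently.
    return np.maximum(x @ w1, 0) @ w2

def transformer_block(x, w1, w2):
    # Residual connection around attention, then around the feedforward block.
    x = layer_norm(x + self_attention(x))
    x = layer_norm(x + feed_forward(x, w1, w2))
    return x

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))          # 4 tokens, model dimension 8
w1 = rng.normal(size=(8, 32))        # expand to a wider hidden dimension
w2 = rng.normal(size=(32, 8))        # project back to the model dimension
y = transformer_block(x, w1, w2)
print(y.shape)  # (4, 8): same shape in and out, so blocks can be stacked
```

Because the block preserves the input shape, a full model is just this block repeated many times, which is what "many transformer layers" means in practice.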