Self-Attention
A form of attention where each token in a sequence looks at every other token in the same sequence to build context-aware representations.
Self-attention is the operation that lets each token in a sequence inspect all tokens in that same sequence, including itself, and decide which ones are most relevant. The output for each token is a weighted combination of all the tokens' representations, which gives the model a context-aware view of the sequence.
In practice, self-attention helps language models resolve ambiguity. For example, in the sentence "The animal didn't cross the street because it was too tired," self-attention helps the model infer that "it" refers to "animal," not "street."
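The weighted combination described above can be sketched as scaled dot-product self-attention. This is a minimal NumPy illustration, not any particular library's implementation; the function and weight names (`self_attention`, `w_q`, `w_k`, `w_v`) are illustrative.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over x of shape (seq_len, d_model)."""
    q = x @ w_q  # queries: what each token is looking for
    k = x @ w_k  # keys: what each token offers
    v = x @ w_v  # values: the content that gets mixed together
    d_k = q.shape[-1]
    # Each token scores every token in the same sequence (itself included).
    scores = q @ k.T / np.sqrt(d_k)                   # (seq_len, seq_len)
    # Softmax turns each row of scores into attention weights summing to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each token's output is a weighted combination of all value vectors.
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8): one context-aware vector per input token
```

Note that the attention-weight matrix is square in the sequence length, which is why every token can attend to every other token in a single step.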
Why Self-Attention Is Powerful
- Context awareness — words are interpreted based on surrounding tokens
- Flexible dependency modeling — relates any two tokens directly, regardless of how far apart they are in the sequence
- Scalable training — all positions are processed in parallel on GPUs, unlike recurrent models that step through the sequence one token at a time
- Foundation for LLMs — central to GPT, Claude, Gemini, and similar models
Self-attention is applied repeatedly across many stacked transformer layers. Combined with feedforward blocks and residual connections, it forms the backbone of modern large language models.
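How self-attention combines with the feedforward block and residual connections can be sketched as a single transformer layer. This is a simplified illustration under several assumptions: identity Q/K/V projections inside the attention step, a plain two-layer ReLU MLP, and a post-norm arrangement; the names `transformer_block`, `w1`, and `w2` are hypothetical.

```python
import numpy as np

def self_attention(x):
    # Simplified attention with identity Q/K/V projections, for illustration only.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector to zero mean and (roughly) unit variance.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def feed_forward(x, w1, w2):
    # Position-wise MLP with a ReLU, applied to each token independently.
    return np.maximum(x @ w1, 0) @ w2

def transformer_block(x, w1, w2):
    # Residual connection around attention, then around the feedforward block.
    x = layer_norm(x + self_attention(x))
    x = layer_norm(x + feed_forward(x, w1, w2))
    return x

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))          # 4 tokens, model dimension 8
w1 = rng.normal(size=(8, 32))        # expand to a wider hidden dimension
w2 = rng.normal(size=(32, 8))        # project back to the model dimension
y = transformer_block(x, w1, w2)
print(y.shape)  # (4, 8): same shape in and out, so blocks can be stacked
```

Because the block preserves the input shape, a full model is just this block repeated many times, which is what "many transformer layers" means in practice.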