Models & Architecture

Transformer

The neural network architecture behind most modern AI — uses attention mechanisms to process sequences in parallel.

The Transformer is the neural network architecture introduced in the 2017 paper "Attention Is All You Need" by Google researchers. It replaced recurrent networks (RNNs) as the dominant approach to sequence modeling and became the foundation for virtually every state-of-the-art AI model — from GPT and Claude to DALL-E and Whisper.

The core innovation is the attention mechanism, which allows the model to relate any position in a sequence to any other position directly, regardless of distance. This makes transformers far better at capturing long-range dependencies than RNNs, and allows training to be fully parallelized on GPUs.
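The attention computation described above can be sketched in a few lines of NumPy. This is a minimal single-head, scaled dot-product attention, assuming query, key, and value matrices are already projected; the function and variable names are illustrative, not from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # scores[i, j]: how strongly position i attends to position j,
    # computed for every pair at once -- no recurrence, fully parallel
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d = 4, 8
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Because `scores` relates every position to every other in one matrix product, the distance between two tokens has no effect on how directly they can interact.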

Architecture: A transformer consists of stacked encoder and/or decoder blocks. Each block contains a multi-head attention layer and a position-wise feedforward network, with a residual connection and layer normalization around each sublayer.
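The sublayer wiring can be sketched as follows. This is a simplified post-norm block (the original paper's ordering; many modern models apply the norm before each sublayer instead), and the identity function stands in for a real multi-head attention layer; all names here are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each position's feature vector to zero mean, unit variance
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, self_attn, w1, w2):
    """One post-norm block: attention sublayer, then feedforward sublayer,
    each wrapped in a residual connection followed by layer normalization."""
    x = layer_norm(x + self_attn(x))   # residual + norm around attention
    h = np.maximum(0.0, x @ w1)        # feedforward expansion with ReLU
    return layer_norm(x + h @ w2)      # residual + norm around feedforward

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5
w1 = rng.normal(size=(d_model, d_ff)) * 0.1
w2 = rng.normal(size=(d_ff, d_model)) * 0.1
x = rng.normal(size=(seq_len, d_model))
identity_attn = lambda x: x  # placeholder for real multi-head attention
y = transformer_block(x, identity_attn, w1, w2)
print(y.shape)  # (5, 8)
```

The residual connections let gradients flow through deep stacks of these blocks, which is part of why transformers scale to dozens or hundreds of layers.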

Transformer Variants

  • Encoder-only — BERT, RoBERTa; good for classification and understanding
  • Decoder-only — GPT series, Claude, Llama; best for text generation
  • Encoder-Decoder — T5, BART; used for translation and summarization
  • Vision Transformer (ViT) — applies transformer to image patches

Positional encoding is added to token embeddings so the model knows the order of the sequence. Without it, the transformer would treat sequences as unordered sets. Modern variants use Rotary Position Embedding (RoPE) or ALiBi for better generalization to longer sequences.
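The original sinusoidal scheme from "Attention Is All You Need" can be sketched as below: each position gets a unique pattern of sine and cosine values at geometrically spaced frequencies, which is then added to the token embeddings (function name is illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # angle rate per dimension pair: 1 / 10000^(2i / d_model)
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(10, 16)
print(pe.shape)  # (10, 16)
```

RoPE and ALiBi differ in that they inject position inside the attention computation itself (by rotating query/key vectors, or biasing attention scores by distance) rather than adding a vector to the embeddings.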
