Distillation
A technique for transferring knowledge from a large "teacher" model to a smaller "student" model that can run faster and cheaper.
Knowledge distillation trains a small student model to mimic the outputs of a large teacher model. The student learns from the teacher's full probability distributions, not just hard labels, capturing richer information about how the teacher reasons.
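The training signal described above is usually a KL divergence between the teacher's and student's temperature-softened distributions. A minimal sketch in plain Python (function names and the temperature value are illustrative, following Hinton et al.'s formulation where the loss is scaled by T²):

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature scaling; higher T softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions.

    The T**2 factor keeps gradient magnitudes comparable as the
    temperature changes.
    """
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return (temperature ** 2) * kl

teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.8, -0.5]
loss = distillation_loss(student, teacher)
```

A perfectly matching student yields zero loss; in practice this soft-label term is combined with a standard cross-entropy loss on the hard labels.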
Distilled students are often several times smaller than their teachers while retaining most of their measured capability; DistilBERT, for example, is roughly 40% smaller than BERT yet keeps about 97% of its language-understanding performance. This is how companies ship small, fast models derived from massive frontier models.
Examples include DistilBERT (distilled from BERT), Alpaca (LLaMA fine-tuned on GPT-generated instruction data, a form of distillation from GPT), and countless production models distilled from proprietary frontier LLMs. Distillation is a core technique for making AI efficient at scale.