Distillation
A technique for transferring knowledge from a large "teacher" model to a smaller "student" model that can run faster and cheaper.
Knowledge distillation trains a small student model to mimic the outputs of a large teacher model. The student learns from the teacher's full probability distributions, not just hard labels, capturing richer information about how the teacher reasons.
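The training signal described above is usually a KL divergence between the teacher's and student's temperature-softened distributions. A minimal sketch in plain Python (function names and the temperature value are illustrative, following Hinton et al.'s formulation where the loss is scaled by T²):

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature scaling; higher T softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions.

    The T**2 factor keeps gradient magnitudes comparable as the
    temperature changes.
    """
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return (temperature ** 2) * kl

teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.8, -0.5]
loss = distillation_loss(student, teacher)
```

A perfectly matching student yields zero loss; in practice this soft-label term is combined with a standard cross-entropy loss on the hard labels.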
Distilled students are often several times smaller than their teachers while retaining most of their measured capability; DistilBERT, for example, is roughly 40% smaller than BERT yet keeps about 97% of its language-understanding performance. This is how companies ship small, fast models derived from massive frontier models.
Examples include DistilBERT (distilled from BERT), Alpaca (LLaMA fine-tuned on GPT-generated instruction data, a form of distillation from GPT), and countless production models distilled from proprietary frontier LLMs. Distillation is a core technique for making AI efficient at scale.