Inference & Optimization

Speculative Decoding

An inference acceleration technique where a small draft model predicts multiple tokens that a larger model then verifies in parallel.

Speculative decoding speeds up LLM inference by having a small, fast "draft" model generate candidate tokens, which a larger model then verifies in a single parallel pass. When the draft tokens are accepted, you get several tokens for the cost of one large-model forward pass.

This works because most next-token predictions are easy enough that a small model gets them right. The large model only needs to step in at the first position where the draft diverges from what it would have generated itself.
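
The draft-then-verify loop described above can be sketched as follows. This is an illustrative greedy variant with a hypothetical toy interface (`draft_next` and `target_next` are stand-ins for real model calls, and the per-position verification calls shown sequentially here would be one batched forward pass in a real serving system):

```python
def speculative_decode(draft_next, target_next, prompt, k=4, max_new=16):
    """Greedy speculative decoding sketch.

    draft_next(seq)  -> the small model's next token for `seq`
    target_next(seq) -> the large model's next token for `seq`
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. Draft: the small model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. Verify: compare each draft token with the target's own choice.
        accepted = []
        for tok in draft:
            tgt = target_next(seq + accepted)
            if tgt == tok:
                accepted.append(tok)   # draft agreed: a "free" token
            else:
                accepted.append(tgt)   # correct the draft and stop this round
                break
        else:
            # All k drafts accepted; the same verification pass also
            # yields one bonus token beyond the draft.
            accepted.append(target_next(seq + accepted))
        seq += accepted
    return seq[len(prompt):len(prompt) + max_new]

# Toy models: the target greedily emits "abcabc..."; the draft usually
# agrees but is wrong at every fifth position.
target_next = lambda seq: "abc"[len(seq) % 3]
draft_next = lambda seq: "x" if len(seq) % 5 == 4 else "abc"[len(seq) % 3]

out = speculative_decode(draft_next, target_next, [], k=3, max_new=9)
# The output is identical to running the target greedily on its own.
```

Note the key property the verification step enforces: wherever the draft is wrong, the target's own token is emitted instead, so the final sequence is exactly what the large model alone would have produced.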

Speedup: typically 2-3x faster inference with no loss in output quality, because the acceptance rule guarantees the final output matches what the large model alone would have produced.
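
To see where a 2-3x figure can come from, here is a back-of-envelope model (an illustrative simplification, not from this page: it assumes each draft token is accepted independently with probability `alpha`):

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens gained per large-model verification pass, assuming
    each of the gamma draft tokens is accepted independently with
    probability alpha. The geometric sum includes the one token the
    target always contributes (a correction or a bonus token)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# e.g. with an 80% acceptance rate and 4 draft tokens per round,
# each expensive target pass yields about 3.36 tokens on average.
print(round(expected_tokens_per_pass(0.8, 4), 2))
```

After subtracting the (much cheaper) cost of running the draft model, a per-pass yield in this range is consistent with end-to-end speedups of roughly 2-3x.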

Speculative decoding is now standard in production LLM serving. Variants such as Medusa and EAGLE push the technique further by attaching lightweight draft heads to the target model itself, removing the need for a separate draft model.
