Speculative Decoding
An inference acceleration technique where a small draft model predicts multiple tokens that a larger model then verifies in parallel.
Speculative decoding speeds up LLM inference by having a small, fast "draft" model generate several candidate tokens, which the larger "target" model then verifies in a single parallel forward pass. Every accepted prefix of the draft yields multiple tokens for roughly the cost of one large-model call; on a mismatch, decoding resumes from the target model's own prediction.
This works because most next-token predictions are easy: a small model gets them right most of the time, so long runs of draft tokens are accepted. The large model only needs to step in when the draft is wrong, and with the standard acceptance rule the output distribution is provably identical to decoding with the large model alone.
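The loop above can be sketched in a few lines. This is a minimal illustration of greedy speculative decoding using toy stand-in functions (`draft_model`, `target_model`, and the `k` draft length are all hypothetical, not a real LLM API); a production system would replace them with batched forward passes and probabilistic acceptance.

```python
def draft_model(tokens):
    # Toy draft model: fast but imperfect. Predicts the next token
    # as (last + 1) mod 10, which usually matches the target model.
    return (tokens[-1] + 1) % 10

def target_model(tokens):
    # Toy target model: slow but authoritative. Disagrees with the
    # draft model whenever the last token is 7, to exercise rejection.
    return 0 if tokens[-1] == 7 else (tokens[-1] + 1) % 10

def speculative_decode(tokens, n_new, k=4):
    """Append n_new tokens using greedy speculative decoding."""
    tokens = list(tokens)
    target_len = len(tokens) + n_new
    while len(tokens) < target_len:
        # 1) Draft model proposes k candidate tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2) Target model checks each position. In a real system all k
        #    positions are scored in ONE parallel forward pass; here we
        #    loop for clarity. Accept the longest matching prefix.
        accepted = 0
        for i in range(k):
            if target_model(tokens + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        tokens.extend(draft[:accepted])
        # 3) The target model's own prediction at the first mismatch
        #    (or after full acceptance) supplies one guaranteed token,
        #    so progress is made even when every draft token is rejected.
        if len(tokens) < target_len:
            tokens.append(target_model(tokens))
    return tokens[:target_len]
```

Because verification accepts a token only when it matches the target model's greedy choice, the output is exactly what the target model would have produced on its own, just computed with fewer sequential large-model calls.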
Speculative decoding is now standard in production LLM serving. Variants like Medusa and EAGLE push the technique further by attaching multiple lightweight draft heads to the large model itself, so candidate tokens are generated without a separate draft model.