Inference & Optimization
Top-K Sampling
A text generation strategy that restricts sampling to the K most likely next tokens at each step.
Top-K sampling is a decoding strategy in which the model considers only the K tokens with the highest probability at each generation step, renormalizes their probabilities, and samples from that restricted set.
It strikes a balance between greedy decoding (always picking the single most probable token) and full sampling (sampling over the entire vocabulary). Small K values make output more deterministic; larger K values increase diversity.
Typical values: K=40 to K=100 for general generation. K=1 is equivalent to greedy decoding.
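The procedure above can be sketched as follows; this is a minimal NumPy illustration (function name and signature are hypothetical, not from any particular library), showing the two key steps: truncating the distribution to the K largest logits, then renormalizing and sampling.

```python
import numpy as np

def top_k_sample(logits, k, temperature=1.0, rng=None):
    """Sample one token index from the k highest-scoring logits."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Step 1: keep only the indices of the k largest logits.
    top_indices = np.argpartition(logits, -k)[-k:]
    top_logits = logits[top_indices]
    # Step 2: softmax over the surviving k logits, which
    # renormalizes their probabilities to sum to 1.
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()
    return int(rng.choice(top_indices, p=probs))
```

With `k=1` this reduces to greedy decoding, since the single largest logit receives probability 1; with `k` equal to the vocabulary size it reduces to full (temperature-scaled) sampling.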
Top-K is often combined with top-P sampling and temperature for fine-grained control over generation. Modern LLMs typically use top-P as the primary cutoff.