Inference & Generation
Inference
The process of using a trained AI model to generate predictions, classifications, or responses on new input data.
Inference is what happens after training: the model is given new input and produces an output. In an LLM, inference means reading the prompt and then generating output tokens one at a time, each conditioned on the tokens before it (autoregressive decoding). In an image classifier, it means assigning a label to a new image.
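The token-by-token loop can be sketched in a few lines. This is a toy: `next_token_logits` is a hypothetical stand-in for a trained model's forward pass, but the loop structure mirrors how an LLM generates output, appending each new token and feeding the sequence back in.

```python
def next_token_logits(tokens):
    # Stand-in for a trained model's forward pass: deterministically
    # favors (last_token + 1) in a tiny 10-token vocabulary.
    vocab_size = 10
    logits = [0.0] * vocab_size
    logits[(tokens[-1] + 1) % vocab_size] = 1.0
    return logits

def generate(prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)  # one forward pass per new token
        # Greedy decoding: pick the highest-scoring token.
        next_token = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_token)           # feed the output back as input
    return tokens

print(generate([3], 4))  # [3, 4, 5, 6, 7]
```

Because each new token requires a full forward pass over the sequence so far, generation cost grows with output length; this is why latency and throughput dominate LLM serving concerns.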
Inference is the stage users actually interact with. It is also where latency, throughput, memory usage, and serving cost matter most in production environments.
Training teaches the model. Inference uses what it learned. These are distinct phases with different engineering concerns.
Why Inference Matters in Production
- User experience — affects response speed and reliability
- Infrastructure cost — serving large models can be expensive
- Scalability — production systems must handle many requests at once
- Optimization opportunity — quantization and batching can reduce cost
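Batching, in particular, amortizes fixed per-call overhead across many requests. The sketch below uses made-up costs (`model_call` and its 10 ms overhead are illustrative, not measurements) to show why grouping requests reduces total serving time.

```python
def model_call(batch):
    # Pretend each call costs 10 ms of fixed overhead plus 1 ms per item.
    # Returns (simulated cost in ms, outputs).
    return 10 + len(batch), [x * 2 for x in batch]

def serve_unbatched(requests):
    total_ms, outputs = 0, []
    for r in requests:
        cost, out = model_call([r])   # one model call per request
        total_ms += cost
        outputs.extend(out)
    return total_ms, outputs

def serve_batched(requests, batch_size=8):
    total_ms, outputs = 0, []
    for i in range(0, len(requests), batch_size):
        cost, out = model_call(requests[i:i + batch_size])  # overhead paid once per batch
        total_ms += cost
        outputs.extend(out)
    return total_ms, outputs

reqs = list(range(16))
print(serve_unbatched(reqs)[0])  # 176 (simulated ms)
print(serve_batched(reqs)[0])    # 36 (simulated ms)
```

Real serving systems extend this idea with dynamic batching, where incoming requests are held briefly so they can be grouped, trading a little latency for much higher throughput.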
Teams often spend as much time optimizing inference as they do training. Techniques like caching, batching, and quantization can make the difference between an experimental model and a deployable product.
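Quantization is a good illustration of these techniques. A rough sketch of post-training quantization, assuming simple symmetric per-tensor int8 scaling (the numbers are illustrative): weights are stored as 8-bit integers plus a scale factor, cutting memory roughly 4x versus 32-bit floats at a small cost in precision.

```python
def quantize_int8(weights):
    # Map floats to integers in [-127, 127] using one shared scale.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Approximate reconstruction; error is bounded by half a scale step.
    return [x * scale for x in q]

weights = [0.12, -0.54, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

At inference time the model runs on the integer weights (or dequantizes them on the fly), which shrinks memory bandwidth and often speeds up serving on hardware with fast integer arithmetic.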