Inference & Generation
Inference
The process of using a trained AI model to generate predictions, classifications, or responses on new input data.
Inference is what happens after training: the model is given new input and produces an output. In an LLM, inference means reading the prompt and then generating output tokens one at a time, each conditioned on the tokens before it (autoregressive decoding). In an image classifier, it means assigning a label to a new image.
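The token-by-token loop can be sketched in a few lines. This is a toy: `next_token_logits` is a hypothetical stand-in for a trained model's forward pass, but the loop structure mirrors how an LLM generates output, appending each new token and feeding the sequence back in.

```python
def next_token_logits(tokens):
    # Stand-in for a trained model's forward pass: deterministically
    # favors (last_token + 1) in a tiny 10-token vocabulary.
    vocab_size = 10
    logits = [0.0] * vocab_size
    logits[(tokens[-1] + 1) % vocab_size] = 1.0
    return logits

def generate(prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)  # one forward pass per new token
        # Greedy decoding: pick the highest-scoring token.
        next_token = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_token)           # feed the output back as input
    return tokens

print(generate([3], 4))  # [3, 4, 5, 6, 7]
```

Because each new token requires a full forward pass over the sequence so far, generation cost grows with output length; this is why latency and throughput dominate LLM serving concerns.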
Inference is the stage users actually interact with. It is also where latency, throughput, memory usage, and serving cost matter most in production environments.
Training teaches the model. Inference uses what it learned. These are distinct phases with different engineering concerns.
Why Inference Matters in Production
- User experience — affects response speed and reliability
- Infrastructure cost — serving large models can be expensive
- Scalability — production systems must handle many requests at once
- Optimization opportunity — quantization and batching can reduce cost
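Batching, in particular, amortizes fixed per-call overhead across many requests. The sketch below uses made-up costs (`model_call` and its 10 ms overhead are illustrative, not measurements) to show why grouping requests reduces total serving time.

```python
def model_call(batch):
    # Pretend each call costs 10 ms of fixed overhead plus 1 ms per item.
    # Returns (simulated cost in ms, outputs).
    return 10 + len(batch), [x * 2 for x in batch]

def serve_unbatched(requests):
    total_ms, outputs = 0, []
    for r in requests:
        cost, out = model_call([r])   # one model call per request
        total_ms += cost
        outputs.extend(out)
    return total_ms, outputs

def serve_batched(requests, batch_size=8):
    total_ms, outputs = 0, []
    for i in range(0, len(requests), batch_size):
        cost, out = model_call(requests[i:i + batch_size])  # overhead paid once per batch
        total_ms += cost
        outputs.extend(out)
    return total_ms, outputs

reqs = list(range(16))
print(serve_unbatched(reqs)[0])  # 176 (simulated ms)
print(serve_batched(reqs)[0])    # 36 (simulated ms)
```

Real serving systems extend this idea with dynamic batching, where incoming requests are held briefly so they can be grouped, trading a little latency for much higher throughput.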
Teams often spend as much time optimizing inference as they do training. Techniques like caching, batching, and quantization can make the difference between an experimental model and a deployable product.
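Quantization is a good illustration of these techniques. A rough sketch of post-training quantization, assuming simple symmetric per-tensor int8 scaling (the numbers are illustrative): weights are stored as 8-bit integers plus a scale factor, cutting memory roughly 4x versus 32-bit floats at a small cost in precision.

```python
def quantize_int8(weights):
    # Map floats to integers in [-127, 127] using one shared scale.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Approximate reconstruction; error is bounded by half a scale step.
    return [x * scale for x in q]

weights = [0.12, -0.54, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

At inference time the model runs on the integer weights (or dequantizes them on the fly), which shrinks memory bandwidth and often speeds up serving on hardware with fast integer arithmetic.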