Inference & Generation

Quantization

A technique that reduces model size and inference cost by storing weights and activations with lower numerical precision.

Quantization makes AI models smaller and faster by representing numbers with fewer bits. Instead of storing weights in 16-bit or 32-bit precision, a quantized model might use 8-bit, 4-bit, or even lower precision formats.
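The mapping from floats to low-bit integers can be sketched as symmetric per-tensor int8 quantization, one of the simplest schemes: scale values into the int8 range, round, and later multiply back by the scale. This is an illustrative sketch (function names are our own), not the specific method any one framework uses.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    # Symmetric per-tensor quantization: map floats into [-127, 127]
    # using a single scale derived from the largest absolute value.
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float values; precision lost to rounding stays lost.
    return q.astype(np.float32) * scale

w = np.array([0.05, -0.31, 0.27, -0.12], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
# Rounding error per weight is at most half a quantization step (scale / 2).
```

Real deployments typically use finer-grained variants (per-channel or per-group scales, zero points for asymmetric ranges) to reduce this rounding error further.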

The result is lower memory usage and cheaper inference, which is especially important for deploying large models on limited hardware or serving many users in production.
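The memory savings follow directly from the bit width. A rough back-of-the-envelope for the weights of a 7-billion-parameter model (illustrative, ignoring activations, KV cache, and per-group scale overhead):

```python
# Approximate weight-only memory footprint at different precisions.
params = 7e9  # 7B-parameter model, chosen purely for illustration

footprints = {}
for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    gib = params * bits / 8 / 2**30  # bits -> bytes -> GiB
    footprints[name] = gib
    print(f"{name}: {gib:.1f} GiB")
```

Halving the bit width halves the weight memory, which is why int4 variants of models that need a data-center GPU in fp16 can run on a single consumer card.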

Tradeoff: quantization improves efficiency, but aggressive precision reduction can degrade output quality or numerical stability if not applied carefully.

Why Teams Quantize Models

  • Lower memory footprint — fit larger models onto smaller devices
  • Faster inference — reduce compute and bandwidth needs
  • Lower serving cost — especially important at scale
  • Broader deployment — enables local and edge use cases

Quantization is one of the most practical optimization techniques in AI deployment. It is widely used in both local inference stacks and commercial serving systems, especially for LLMs.
