Training & Learning

Batch Size

The number of training examples processed together in one forward/backward pass of the model.

Batch size determines how many training examples the model sees before updating its weights. Larger batches give smoother gradient estimates but require more memory; smaller batches give noisier estimates, and that noise can act as a regularizer.
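To make the trade-off concrete, here is a minimal sketch of mini-batch gradient descent on a toy linear fit. The function name and hyperparameters are illustrative, not from any framework; the point is only that `batch_size` controls how many examples contribute to each gradient estimate, and therefore how many (noisier or smoother) updates happen per epoch.

```python
import random

def sgd_fit(data, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x with mini-batch SGD (hypothetical helper).
    batch_size sets how many examples go into each gradient estimate."""
    random.seed(seed)
    w = 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Gradient of mean squared error over this batch only:
            # one weight update per batch, so a larger batch means
            # fewer, smoother updates per epoch.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

# True relation: y = 3x. Both settings recover w near 3, but the
# small-batch run takes many noisy steps and the full-batch run
# takes a few smooth ones.
data = [(0.1 * i, 0.3 * i) for i in range(1, 21)]
w_small = sgd_fit(list(data), batch_size=2)
w_large = sgd_fit(list(data), batch_size=20)
```

Either extreme reaches roughly the same answer on this trivial problem; on real models the noise level and memory cost are what differ.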

For LLMs, effective batch sizes can reach thousands or millions of tokens via gradient accumulation, where gradients from many small micro-batches are summed before a single weight update.
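The accumulation idea can be sketched in a few lines. This is a toy version for the same `y = w * x` squared-error setup, not any particular framework's API: each micro-batch's gradient is accumulated instead of applied, and one update is made at the end, matching what a single large batch would do.

```python
def accumulated_update(w, micro_batches, lr):
    """One weight update built from several micro-batches (illustrative
    sketch of gradient accumulation for y = w * x with squared error)."""
    total = sum(len(mb) for mb in micro_batches)
    grad = 0.0
    for mb in micro_batches:
        # Each micro-batch fits in memory on its own; its gradient
        # contribution is accumulated rather than applied immediately.
        grad += sum(2 * (w * x - y) * x for x, y in mb)
    # Single update with the averaged gradient, as if one large batch.
    return w - lr * grad / total

# Two micro-batches of 2 produce the same update as one batch of 4:
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_accum = accumulated_update(0.0, [data[:2], data[2:]], lr=0.01)
w_full = accumulated_update(0.0, [data], lr=0.01)
```

Because the summed gradient is identical either way, accumulation trades extra forward/backward passes for a larger effective batch at the same memory footprint.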

Rule of thumb: larger batch sizes pair with higher learning rates, and halving the batch typically calls for halving the learning rate (a linear scaling heuristic).
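The rule of thumb amounts to one line of arithmetic. The helper below is hypothetical, and the scaling is a heuristic starting point rather than a guarantee:

```python
def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling heuristic: learning rate scales with batch size.
    (Hypothetical helper; a starting point to tune from, not a law.)"""
    return base_lr * new_batch / base_batch

# Halving the batch halves the learning rate:
lr = scaled_lr(3e-4, base_batch=512, new_batch=256)
```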

Batch size affects training speed, memory consumption, generalization, and final model quality. It's one of the key knobs to tune for efficient training.
