Training Data

The dataset used to teach a machine learning model the patterns it needs to make predictions or generate outputs.

Training data is the foundation of any machine learning system. It consists of examples the model learns from — the raw material that shapes what the model knows and how it behaves.

For supervised learning, each example includes an input and the correct output. For self-supervised models like LLMs, the data is raw text where the model predicts masked or next tokens.

Quality over quantity: clean, diverse, well-curated training data often beats massive noisy datasets. Bad data produces bad models.

Modern LLMs are trained on trillions of tokens from books, websites, code, and articles. The dataset composition — what's in it and what isn't — deeply influences the final model's capabilities and biases.

Related Terms

← Back to Glossary