RLHF
Reinforcement Learning from Human Feedback — a training technique that aligns AI models with human preferences by learning from human judgments of model outputs.
Reinforcement Learning from Human Feedback (RLHF) is the training technique largely responsible for the dramatic improvement in AI assistant usability from 2022 onward. It takes a pre-trained language model and makes it helpful, harmless, and honest by incorporating human preferences directly into the training loop.
RLHF has three stages. First, fine-tune the base model on human-written demonstrations. Second, train a reward model on human comparisons: humans are shown pairs of model outputs and choose the better one, teaching the reward model what "good" looks like. Third, use reinforcement learning, typically Proximal Policy Optimization (PPO), to optimize the language model to score highly under the reward model, usually with a KL penalty that keeps the policy close to the supervised fine-tuned model.
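The reward model in stage two is typically trained with a pairwise (Bradley-Terry) loss on each human comparison. A minimal sketch, with an illustrative function name of my own (`reward_model_loss`) and plain floats standing in for the reward model's scalar outputs:

```python
import math

def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss for one human comparison:
    -log sigmoid(r_chosen - r_rejected).

    The loss shrinks as the reward model scores the human-preferred
    output higher than the rejected one.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)) for stability
    return math.log1p(math.exp(-margin))

# A correctly ordered pair incurs low loss; a reversed pair incurs high loss.
low = reward_model_loss(2.0, -1.0)   # model agrees with the human label
high = reward_model_loss(-1.0, 2.0)  # model disagrees with the human label
```

Summed over many comparisons, this objective turns binary "A is better than B" judgments into a scalar reward signal usable for RL.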
RLHF Stages
- Stage 1 — SFT: Supervised fine-tuning on human-written demonstrations
- Stage 2 — Reward Model: Train on human preference comparisons
- Stage 3 — RL: Optimize policy with PPO against reward model
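Stage 3's PPO update relies on a clipped surrogate objective that limits how far each update can move the policy from the one that generated the data. A minimal per-token sketch (the function name and scalar interface are illustrative; real implementations operate on batched tensors):

```python
import math

def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, eps: float = 0.2) -> float:
    """PPO's clipped surrogate for a single action (token).

    ratio = pi_new(a|s) / pi_old(a|s); clipping the ratio to
    [1 - eps, 1 + eps] keeps each policy update conservative.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # PPO maximizes the minimum of the unclipped and clipped terms
    return min(ratio * advantage, clipped * advantage)
```

With an unchanged policy (ratio = 1) the objective equals the advantage; large ratio swings are capped, so a single favorable sample cannot drag the policy arbitrarily far.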
Variants of RLHF include DPO (Direct Preference Optimization), which trains directly on preference pairs and skips the separate reward model and RL stages; Constitutional AI, Anthropic's approach that uses AI feedback guided by a written set of principles; and RLAIF, which substitutes AI-generated ratings for human ones. All aim to produce models that behave as humans intend: aligned, helpful, and safe.
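DPO's per-pair loss makes the "skips the reward model" claim concrete: the policy is pushed to raise the chosen completion's likelihood, relative to a frozen reference (SFT) model, more than the rejected one's. A minimal sketch with illustrative names and plain floats standing in for summed token log-probabilities:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))).

    beta controls how strongly the policy may deviate from the
    reference model; no reward model or RL loop is needed.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), same stable form as the pairwise loss
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference on both completions the margin is zero and the loss is log 2; increasing the chosen completion's log-probability lowers it, which is the gradient signal that replaces PPO.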