RLHF
Reinforcement Learning from Human Feedback — a training technique that aligns AI models with human preferences by learning from human judgments of model outputs.
Reinforcement Learning from Human Feedback (RLHF) is the training technique largely responsible for the dramatic improvement in AI assistant usability from 2022 onward. It takes a pre-trained language model and makes it helpful, harmless, and honest by incorporating human preferences directly into the training loop.
RLHF has three stages. First, fine-tune the base model on human-written demonstrations. Second, train a reward model on human comparisons: humans are shown pairs of model outputs and choose the better one, teaching the reward model what "good" looks like. Third, use reinforcement learning, typically Proximal Policy Optimization (PPO), to optimize the language model to score highly under the reward model, usually with a KL penalty that keeps the policy close to the supervised fine-tuned model.
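The reward model in stage two is typically trained with a pairwise (Bradley-Terry) loss on each human comparison. A minimal sketch, with an illustrative function name of my own (`reward_model_loss`) and plain floats standing in for the reward model's scalar outputs:

```python
import math

def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss for one human comparison:
    -log sigmoid(r_chosen - r_rejected).

    The loss shrinks as the reward model scores the human-preferred
    output higher than the rejected one.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)) for stability
    return math.log1p(math.exp(-margin))

# A correctly ordered pair incurs low loss; a reversed pair incurs high loss.
low = reward_model_loss(2.0, -1.0)   # model agrees with the human label
high = reward_model_loss(-1.0, 2.0)  # model disagrees with the human label
```

Summed over many comparisons, this objective turns binary "A is better than B" judgments into a scalar reward signal usable for RL.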
RLHF Stages
- Stage 1 — SFT: Supervised fine-tuning on human-written demonstrations
- Stage 2 — Reward Model: Train on human preference comparisons
- Stage 3 — RL: Optimize policy with PPO against reward model
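Stage 3's PPO update relies on a clipped surrogate objective that limits how far each update can move the policy from the one that generated the data. A minimal per-token sketch (the function name and scalar interface are illustrative; real implementations operate on batched tensors):

```python
import math

def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, eps: float = 0.2) -> float:
    """PPO's clipped surrogate for a single action (token).

    ratio = pi_new(a|s) / pi_old(a|s); clipping the ratio to
    [1 - eps, 1 + eps] keeps each policy update conservative.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # PPO maximizes the minimum of the unclipped and clipped terms
    return min(ratio * advantage, clipped * advantage)
```

With an unchanged policy (ratio = 1) the objective equals the advantage; large ratio swings are capped, so a single favorable sample cannot drag the policy arbitrarily far.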
Variants of RLHF include DPO (Direct Preference Optimization), which trains directly on preference pairs and skips the separate reward model and RL stages; Constitutional AI, Anthropic's approach that uses AI feedback guided by a written set of principles; and RLAIF, which substitutes AI-generated ratings for human ones. All aim to produce models that behave as humans intend: aligned, helpful, and safe.
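DPO's per-pair loss makes the "skips the reward model" claim concrete: the policy is pushed to raise the chosen completion's likelihood, relative to a frozen reference (SFT) model, more than the rejected one's. A minimal sketch with illustrative names and plain floats standing in for summed token log-probabilities:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))).

    beta controls how strongly the policy may deviate from the
    reference model; no reward model or RL loop is needed.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), same stable form as the pairwise loss
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference on both completions the margin is zero and the loss is log 2; increasing the chosen completion's log-probability lowers it, which is the gradient signal that replaces PPO.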