Reinforcement Learning from Human Feedback (RLHF) is a training technique in which humans rank model outputs and the model learns to maximize human preference. The pipeline has four steps: generate multiple outputs for the same prompt, have humans rank them from best to worst, train a reward model that predicts those human preferences, then use the reward model to fine-tune the original model toward outputs humans prefer.
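A minimal sketch of the reward-model training step, assuming the common Bradley-Terry pairwise objective (the source doesn't name a specific loss): the reward model is pushed to score the human-preferred output higher than the rejected one.

```python
import math

def preference_loss(r_preferred: float, r_rejected: float) -> float:
    # Bradley-Terry style loss: -log sigmoid(r_preferred - r_rejected).
    # Minimizing it maximizes the predicted probability that the
    # human-preferred output outscores the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

# When the reward model already ranks the pair correctly, loss is small:
print(round(preference_loss(2.0, -1.0), 4))
# When it ranks the pair backwards, loss is large:
print(round(preference_loss(-1.0, 2.0), 4))
```

In a real pipeline the scores come from a neural reward model and the loss is backpropagated through it; the scalar version above just shows the shape of the objective.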
RLHF is a large part of why instruction-tuned models such as ChatGPT and GPT-4 are more useful than the base models they were built from. A base model generates text that is technically coherent but often unhelpful, misleading, or harmful; RLHF teaches it to generate outputs that humans find more useful and trustworthy. The technique works because it aligns the model's optimization target with human preferences.
Without RLHF, a model optimized purely for next-token prediction likelihood imitates its training data, undesirable patterns included. With RLHF, the model learns that certain kinds of outputs earn higher reward signals. The limitations are real, though: human preferences vary and can be wrong, so you encode whatever biases your labelers hold, and you take on a labor-intensive preference-collection pipeline.
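During the fine-tuning step, RLHF implementations typically don't optimize the raw reward alone: they subtract a KL penalty that keeps the policy close to the original model, which limits reward hacking and preserves fluency. A toy sketch of that shaped reward (the function name, `beta` value, and log-probabilities below are illustrative assumptions, not from the source):

```python
def shaped_reward(reward_model_score: float,
                  logp_policy: float,
                  logp_reference: float,
                  beta: float = 0.1) -> float:
    # Effective reward = reward model score minus a KL penalty.
    # (logp_policy - logp_reference) is a single-sample KL estimate.
    kl_estimate = logp_policy - logp_reference
    return reward_model_score - beta * kl_estimate

# An output the reward model likes, generated with probabilities close
# to the reference model, keeps most of its reward:
print(round(shaped_reward(1.5, -12.0, -12.5), 2))  # 1.45
# The same score with a large divergence from the reference is penalized:
print(round(shaped_reward(1.5, -5.0, -12.5), 2))   # 0.75
```

The penalty is why an RLHF-tuned model doesn't simply collapse onto whatever degenerate text the reward model happens to overrate.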
[Interactive visualizer: Reinforcement Learning from Human Feedback (RLHF) — a demonstration of how AI models learn from human preferences]
Step 1: Generate Multiple Outputs
The model generates multiple responses to the same prompt:
1. "The cat sat on the mat. It was a nice day."
2. "The graceful feline settled comfortably upon the woven rug, enjoying the warm afternoon sunlight."
3. "Cat mat sit good yes very much so indeed."
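The next step would be for a human to rank these candidates. A ranking of n outputs expands into n*(n-1)/2 (preferred, rejected) pairs, which is the format a pairwise reward model trains on. A sketch using the three outputs above, with a hypothetical human ranking:

```python
from itertools import combinations

outputs = [
    "The cat sat on the mat. It was a nice day.",
    "The graceful feline settled comfortably upon the woven rug, "
    "enjoying the warm afternoon sunlight.",
    "Cat mat sit good yes very much so indeed.",
]

# Hypothetical human ranking, best to worst (indices into `outputs`).
ranking = [1, 0, 2]

# Every earlier-ranked output is preferred over every later-ranked one.
pairs = [(outputs[a], outputs[b]) for a, b in combinations(ranking, 2)]
for preferred, rejected in pairs:
    print(f"PREFER: {preferred[:30]}...  OVER: {rejected[:30]}...")
```

Three outputs yield three training pairs; each pair would then feed the reward model's pairwise loss.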