Constitutional AI is a training approach in which an AI system evaluates its own outputs against a set of principles, called a constitution, and improves through self-critique and revision. The system generates an initial response, checks whether that response violates any principle in its constitution, and revises it if needed. This self-improvement loop runs without human intervention.
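The generate, critique, and revise loop can be sketched as follows. This is an illustrative skeleton, not Anthropic's implementation: `generate`, `critique`, and `revise` are hypothetical placeholders standing in for calls to a language model, and the principles are invented examples.

```python
# Minimal sketch of a Constitutional AI self-critique loop.
# The three functions below are placeholders; a real system would
# implement each as a prompt to a language model.

CONSTITUTION = [
    "Be honest: do not state facts you cannot support.",
    "Be helpful: address the user's actual question.",
    "Be harmless: refuse requests that could cause harm.",
]

def generate(prompt):
    # Placeholder for the model's initial response.
    return f"Initial response to: {prompt}"

def critique(response, principle):
    # Placeholder judgment. A real system would ask the model itself
    # whether `response` violates `principle` and return its finding.
    return None  # None means "no violation found"

def revise(response, violation):
    # Placeholder revision step: rewrite the response to fix the violation.
    return f"{response} [revised to address: {violation}]"

def constitutional_loop(prompt, max_rounds=3):
    response = generate(prompt)
    for _ in range(max_rounds):
        violations = [v for p in CONSTITUTION if (v := critique(response, p))]
        if not violations:
            break  # response passes every principle
        for v in violations:
            response = revise(response, v)
    return response

print(constitutional_loop("Explain photosynthesis."))
```

In a real pipeline, the revised responses become training data: the model is fine-tuned on its own improved outputs rather than shown them at inference time.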
Anthropic developed this technique as an alternative to traditional human feedback loops. Training on human feedback is expensive: you need thousands of human raters evaluating model outputs. It is also slow and inconsistent, since different raters have different preferences. Constitutional AI sidesteps this by encoding principles directly and letting the model critique itself.
If your constitution emphasizes honesty, the model learns to avoid hallucinating. If it emphasizes helpfulness, it learns to provide thorough responses. If it emphasizes safety, it learns to refuse harmful requests. The constitution becomes the training signal. One concern is that the self-critique might be inaccurate: the model might flag a violation where none exists, or miss a real one.
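One way to picture how a principle becomes a training signal is to see how each principle could be turned into a self-critique question posed to the model. The mapping and prompt wording below are invented for illustration; they are not the prompts any particular system uses.

```python
# Illustrative only: turning constitutional principles into
# self-critique prompts. The wording here is hypothetical.

PRINCIPLE_TO_QUESTION = {
    "honesty": "Does the response assert anything unsupported or fabricated?",
    "helpfulness": "Does the response fully address the user's question?",
    "safety": "Could the response cause harm if acted upon?",
}

def critique_prompt(principle, response):
    """Build the self-critique prompt for one principle."""
    question = PRINCIPLE_TO_QUESTION[principle]
    return f"{question}\n\nResponse under review:\n{response}"

print(critique_prompt("honesty", "The moon is made of cheese."))
```

The model's answers to these questions, accurate or not, are what drive the revision step, which is why a flawed critique or a flawed constitution propagates directly into training.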
Another concern is that the constitution itself might be flawed or biased.