
Knowledge Distillation


Knowledge distillation is a model compression technique in which a smaller student model learns to replicate the behavior of a larger teacher model by training on the teacher's output probability distributions rather than just hard labels. The near-miss classifications, the confidence levels, and the relationships between classes all transfer to the student.

The student learns not just what the right answer is, but how the teacher reasons about uncertainty. This produces smaller models that punch above their weight: a distilled model with 10% of the parameters might achieve 95% of the teacher's accuracy. The technique was formalized by Geoffrey Hinton and colleagues in 2015 and has become essential for deploying AI in resource-constrained environments.
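A minimal NumPy sketch of the standard distillation objective: a temperature-softened KL divergence term that imitates the teacher, plus an ordinary cross-entropy term against the true label. The function names, example logits, and the alpha/T values are illustrative, not from any particular library.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T flattens the distribution.
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, true_label, T=3.0, alpha=0.5):
    # Soft-target term: KL divergence between the softened teacher and
    # student distributions. The T**2 factor keeps its gradient magnitude
    # comparable to the hard-label term (as in Hinton et al., 2015).
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    soft_term = np.sum(p_t * (np.log(p_t) - np.log(p_s)))
    # Hard-label term: ordinary cross-entropy against the true class.
    hard_term = -np.log(softmax(student_logits)[true_label])
    # alpha balances imitating the teacher vs. fitting the ground truth.
    return alpha * T**2 * soft_term + (1 - alpha) * hard_term
```

A student whose logits match the teacher's incurs only the small hard-label term; an uninformed student is penalized by both terms.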

Mobile applications, edge devices, and real-time systems all benefit from distilled models that run faster and use less memory. Modern applications include distilling large language models like GPT-4 into smaller models, distilling vision transformers into efficient CNNs, and multi-teacher distillation where students learn from ensembles.

The temperature parameter in distillation controls how much the teacher's output distribution is softened, with higher temperatures revealing more of the structure the teacher has learned about the relationships between classes.
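A small sketch of temperature scaling in NumPy. The logits here are hypothetical teacher outputs for a three-class cat/dog/rabbit problem, chosen to show how raising the temperature spreads probability mass onto the non-argmax classes.

```python
import numpy as np

def softmax_T(logits, T):
    # Divide logits by the temperature before the softmax; larger T
    # yields a flatter, "softer" distribution.
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = [5.0, 2.6, 1.3]  # hypothetical: cat, dog, rabbit
for T in (1.0, 3.0, 10.0):
    print(T, np.round(softmax_T(teacher_logits, T), 3))
# At T=1 the teacher is sharply confident in "cat"; at higher
# temperatures the dog/rabbit structure becomes visible.
```

At T=1 these logits give roughly 90/8/2, while at T=3 they soften to roughly 57/26/17, so the student can see that this cat looks more dog-like than rabbit-like.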

Interactive Visualizer

Watch how a student model learns from soft probability distributions instead of hard labels. The teacher's uncertainty and near-miss predictions contain valuable knowledge.

Input sample: 🐱 (true label: Cat)

🎓 Teacher Model (Large)

Class    Confident predictions (T=1)    Soft targets (T=3)
Cat      89.7%                          57.4%
Dog      8.1%                           25.8%
Rabbit   2.2%                           16.7%

🎒 Student Model (Small) — learning progress over 35 training steps

Class    Prediction
Cat      55.1%
Dog      26.8%
Rabbit   18.1%
💡 Key Insights:
- Temperature scaling: higher temperatures (T=3 here) soften the teacher's predictions, revealing more about the relationships between classes.
- Rich supervision: the student learns from full probability distributions, not just hard labels.
- Dark knowledge: the teacher's uncertainty and near-miss predictions carry valuable information to the student.
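The student's learning progress can be sketched as gradient descent on the KL divergence toward the teacher's soft targets. The teacher logits, learning rate, and step count below are illustrative (the step count mirrors the 35 steps in the demo above).

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

T = 3.0
teacher_logits = [5.0, 2.6, 1.3]   # hypothetical cat/dog/rabbit logits
soft_targets = softmax(teacher_logits, T)

student_logits = np.zeros(3)       # student starts with no preference
lr = 10.0
for step in range(35):
    p_s = softmax(student_logits, T)
    # Gradient of KL(teacher || student) w.r.t. the student's logits.
    grad = (p_s - soft_targets) / T
    student_logits -= lr * grad

final = softmax(student_logits, T)  # ends close to the soft targets
```

After training, the student's softened distribution approaches the teacher's soft targets, including the dog-versus-rabbit ordering that a hard "Cat" label alone would never convey.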