
Knowledge Distillation


Knowledge distillation is a model compression technique in which a smaller student model learns to replicate the behavior of a larger teacher model by training on the teacher's output probability distributions rather than just hard labels. The near-miss classifications, the confidence levels, and the relationships between classes all transfer to the student.

The student learns not just what the right answer is, but how the teacher reasons about uncertainty. This produces smaller models that punch above their weight: a distilled model with 10% of the parameters might achieve 95% of the teacher's accuracy. The technique was formalized by Geoffrey Hinton and colleagues in 2015 and has become essential for deploying AI in resource-constrained environments.
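A minimal NumPy sketch of the standard distillation objective: a temperature-softened KL divergence term that imitates the teacher, plus an ordinary cross-entropy term against the true label. The function names, example logits, and the alpha/T values are illustrative, not from any particular library.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T flattens the distribution.
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, true_label, T=3.0, alpha=0.5):
    # Soft-target term: KL divergence between the softened teacher and
    # student distributions. The T**2 factor keeps its gradient magnitude
    # comparable to the hard-label term (as in Hinton et al., 2015).
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    soft_term = np.sum(p_t * (np.log(p_t) - np.log(p_s)))
    # Hard-label term: ordinary cross-entropy against the true class.
    hard_term = -np.log(softmax(student_logits)[true_label])
    # alpha balances imitating the teacher vs. fitting the ground truth.
    return alpha * T**2 * soft_term + (1 - alpha) * hard_term
```

A student whose logits match the teacher's incurs only the small hard-label term; an uninformed student is penalized by both terms.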

Mobile applications, edge devices, and real-time systems all benefit from distilled models that run faster and use less memory. Modern applications include distilling large language models like GPT-4 into smaller models, distilling vision transformers into efficient CNNs, and multi-teacher distillation where students learn from ensembles.

The temperature parameter in distillation controls how much the teacher's output distribution is softened, with higher temperatures revealing more of the structure the teacher has learned about the relationships between classes.
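A small sketch of temperature scaling in NumPy. The logits here are hypothetical teacher outputs for a three-class cat/dog/rabbit problem, chosen to show how raising the temperature spreads probability mass onto the non-argmax classes.

```python
import numpy as np

def softmax_T(logits, T):
    # Divide logits by the temperature before the softmax; larger T
    # yields a flatter, "softer" distribution.
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = [5.0, 2.6, 1.3]  # hypothetical: cat, dog, rabbit
for T in (1.0, 3.0, 10.0):
    print(T, np.round(softmax_T(teacher_logits, T), 3))
# At T=1 the teacher is sharply confident in "cat"; at higher
# temperatures the dog/rabbit structure becomes visible.
```

At T=1 these logits give roughly 90/8/2, while at T=3 they soften to roughly 57/26/17, so the student can see that this cat looks more dog-like than rabbit-like.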

Interactive Visualizer

Watch how a student model learns from soft probability distributions instead of hard labels. The teacher's uncertainty and near-miss predictions contain valuable knowledge.

Input sample: 🐱 (true label: Cat)

🎓 Teacher Model (Large)

Class    Confident predictions (T=1)    Soft targets (T=3)
Cat      89.7%                          57.4%
Dog      8.1%                           25.8%
Rabbit   2.2%                           16.7%

🎒 Student Model (Small) — learning progress over 35 training steps

Class    Prediction
Cat      55.1%
Dog      26.8%
Rabbit   18.1%
💡 Key Insights:
- Temperature scaling: higher temperatures (T=3 here) soften the teacher's predictions, revealing more about the relationships between classes.
- Rich supervision: the student learns from full probability distributions, not just hard labels.
- Dark knowledge: the teacher's uncertainty and near-miss predictions carry valuable information to the student.
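The student's learning progress can be sketched as gradient descent on the KL divergence toward the teacher's soft targets. The teacher logits, learning rate, and step count below are illustrative (the step count mirrors the 35 steps in the demo above).

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

T = 3.0
teacher_logits = [5.0, 2.6, 1.3]   # hypothetical cat/dog/rabbit logits
soft_targets = softmax(teacher_logits, T)

student_logits = np.zeros(3)       # student starts with no preference
lr = 10.0
for step in range(35):
    p_s = softmax(student_logits, T)
    # Gradient of KL(teacher || student) w.r.t. the student's logits.
    grad = (p_s - soft_targets) / T
    student_logits -= lr * grad

final = softmax(student_logits, T)  # ends close to the soft targets
```

After training, the student's softened distribution approaches the teacher's soft targets, including the dog-versus-rabbit ordering that a hard "Cat" label alone would never convey.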