
Model Distillation


Model distillation is a technique for creating a smaller, faster model that approximates the behavior of a larger, more capable one. The large model is called the teacher, the smaller model the student. During distillation, the student model is trained not just on labeled data, but on the output probability distributions of the teacher.
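The soft targets the student trains on are typically produced by applying a temperature-scaled softmax to the teacher's logits, in the style of Hinton et al.'s classic distillation setup. The sketch below is illustrative: the class names and logit values are made up, and the temperature is an assumed hyperparameter.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution.

    A higher temperature flattens the distribution, exposing more of the
    teacher's knowledge about secondary classes than a hard label would.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for classes [cat, dog, fox]
teacher_logits = [6.0, 3.0, 1.0]

hard = softmax(teacher_logits, temperature=1.0)  # peaked, close to a hard label
soft = softmax(teacher_logits, temperature=4.0)  # softened targets for the student
```

At temperature 1 the teacher's distribution is sharply peaked on "cat"; at temperature 4 the secondary classes receive meaningful mass, which is exactly the extra signal the student learns from.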

This richer signal transfers more of the teacher's knowledge than hard labels alone. The result is a student model that performs close to the teacher on most tasks while requiring far less compute to run. Distillation is what makes deployment practical on mobile devices, edge hardware, and in cost-sensitive applications.

Many small, fast models available today, including variants of Llama and Mistral, use distillation in their training pipelines. Distillation is also how reasoning capabilities are transferred: DeepSeek-R1 distilled its reasoning behavior into smaller models by training them on the reasoning traces generated by the full R1 model.

Worked Example

[Interactive visualizer: knowledge transfer from a teacher model to a student model]

Teacher Model (Large), prediction for an image of a cat:
Cat 85.0%
Dog 10.0%
Fox 5.0%
This rich probability distribution provides the soft targets for student learning.

Student Model (Small), partway through training:
Cat 27.5%
Dog 50.0%
Fox 22.5%
The student learns to mimic the teacher's confidence patterns, not just the hard labels.

Key Insight: Instead of learning "the answer is Cat", the student learns the teacher's full probability distribution. This transfers more nuanced knowledge about relationships between classes and uncertainty patterns.
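In practice the student is usually trained on a blend of two objectives: a KL-divergence term pulling its softened distribution toward the teacher's, and an ordinary cross-entropy term on the hard label. The sketch below follows that common recipe; the temperature `t`, the mixing weight `alpha`, and the logit values are illustrative assumptions, not values from the original article.

```python
import math

def softmax(logits, t=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(z / t for z in logits)
    exps = [math.exp(z / t - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label, t=4.0, alpha=0.7):
    """Blend of a soft-target KL term and a hard-label cross-entropy term.

    The t**2 factor is the standard correction that keeps the soft-target
    gradients on a comparable scale as the temperature changes.
    """
    p_teacher = softmax(teacher_logits, t)
    p_student = softmax(student_logits, t)
    # KL(teacher || student) over the temperature-softened distributions
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))
    # Cross-entropy of the student's ordinary prediction vs. the hard label
    ce = -math.log(softmax(student_logits)[hard_label])
    return alpha * (t ** 2) * kl + (1 - alpha) * ce
```

A student whose logits match the teacher's incurs zero KL penalty, so minimizing this loss drives the student toward the teacher's full distribution, not merely the argmax class.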