
SLM series - Domino Data Lab: Distillation brings LLM power to SLMs

This is a guest post for the Computer Weekly Developer Network written by Jarrod Vawdrey in his capacity as field chief data scientist at Domino Data Lab.

Vawdrey writes in full as follows…

Small Language Models (SLMs) are compact AI models typically containing fewer than 10 billion parameters, designed to run efficiently on devices with limited resources, while large language models (LLMs) contain tens to hundreds of billions of parameters. As organisations seek to implement AI capabilities on edge devices, in mobile applications and in privacy-sensitive contexts, SLMs have gained significant attention.

Using an educational analogy: LLMs provide the knowledge of a college graduate; fine-tuning and RAG applied to LLMs give that graduate a PhD in specific domains; while SLMs are like talented high school students with limited knowledge depth.

Knowledge distillation enables effective transfer from LLMs to SLMs, helping these “high school students” perform beyond what their size would otherwise allow by learning from their “college graduate” counterparts. This knowledge transfer represents one of the most promising approaches to democratising advanced language capabilities without the computational burden of billion-parameter models.

Background

As LLMs scale, creating efficient SLMs for resource-constrained applications has become critical. Knowledge distillation, introduced by Hinton et al. (2015), enables effective knowledge transfer from large models (teachers) to smaller models (students).

Unlike direct pretraining, which is computationally efficient but limited, or parameter-efficient fine-tuning, which adapts models for specific tasks, distillation transfers broad capabilities, making SLMs more robust. By learning from the teacher’s probability distributions, rather than just hard labels, student models gain nuanced decision boundaries and improved generalisation – enhancing performance while maintaining efficiency.
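The soft-target idea above can be made concrete with a minimal sketch of the Hinton-style distillation loss. This is an illustrative NumPy toy, not Domino's implementation: the logits, temperature value and three-class setup are all assumptions for the example.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher T softens the distribution,
    exposing the teacher's relative confidence across wrong answers."""
    z = logits / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between teacher and student soft targets
    (Hinton et al., 2015), scaled by T^2 so gradient magnitudes
    stay comparable as the temperature changes."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = np.sum(p * (np.log(p) - np.log(q)))
    return temperature ** 2 * kl

# Hypothetical teacher logits over three classes.
teacher = np.array([3.0, 1.0, 0.2])
print(distillation_loss(teacher, teacher))      # → 0.0 (perfect match)
print(distillation_loss(teacher, np.zeros(3)))  # positive for a mismatch
```

In practice this soft-target term is usually combined with a standard cross-entropy loss on the hard labels, weighted by a mixing coefficient.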

Methodology

Knowledge distillation from LLMs to SLMs begins with two key components: a pre-trained LLM that serves as the “teacher,” and a smaller architecture that will become the SLM “student.” The smaller architecture is typically initialised either randomly or with basic pre-training.

The distillation process can be implemented through different methods using both structured data (like labelled datasets with clear categories) and unstructured data (such as text corpora, conversations, or code):

Response-based distillation trains the SLM to match the output probability distribution of the LLM across a large corpus, focusing on final outputs.

Feature-based distillation goes beyond just copying answers – it helps the smaller student model learn how the larger teacher thinks by mimicking its reasoning process at different stages.

Multi-stage distillation represents a sequential approach where knowledge is transferred through intermediate models of decreasing size. This works like a tutoring system where a college graduate first teaches a bright high school senior, who then simplifies and passes down that knowledge to a younger student. This step-by-step approach makes it easier for smaller models to learn, since jumping straight from a massive LLM to a tiny SLM would be like expecting a high school freshman to grasp an upper college-level lecture in one go.
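Of the methods above, feature-based distillation is the least obvious in code, so here is a minimal sketch of its core operation: projecting a student hidden state into the teacher's wider space and penalising the difference. The layer dimensions, random states and projection matrix are all hypothetical; in a real system the projection would be learned jointly with the student.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden sizes: teacher layers are wider than student layers.
TEACHER_DIM, STUDENT_DIM = 1024, 256

def feature_loss(teacher_hidden, student_hidden, projection):
    """Feature-based distillation term: map the student's hidden state
    into the teacher's space and take the mean squared error, nudging
    the student to mimic the teacher's intermediate representations."""
    projected = student_hidden @ projection  # shape (batch, TEACHER_DIM)
    return np.mean((projected - teacher_hidden) ** 2)

# Toy batch of hidden states from matched layers of each model.
teacher_h = rng.standard_normal((4, TEACHER_DIM))
student_h = rng.standard_normal((4, STUDENT_DIM))
W = rng.standard_normal((STUDENT_DIM, TEACHER_DIM)) * 0.01  # learned in practice

print(feature_loss(teacher_h, student_h, W))  # positive until trained
```

Summing this term over several matched layer pairs, alongside the response-based loss, is one common way the "how the teacher thinks" signal is transferred.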

Distilled SLMs improve response quality and reasoning while using a fraction of the compute of LLMs. Unlike fine-tuning or RAG, which specialise LLMs, distillation transfers broad capabilities to smaller, more efficient models. Benchmarks show that well-distilled SLMs with 1 billion parameters perform comparably to much larger models on tasks like classification, summarisation and question answering.

Compared to traditional ML models, distilled SLMs offer superior language understanding while maintaining lower inference costs than full LLMs – balancing efficiency and capability.

Enterprises are adopting on-premise and edge-deployed SLMs for customer service and data-sensitive applications, reducing costs while ensuring data sovereignty and security without relying on cloud-based inference. In healthcare, SLMs deployed on medical devices enable real-time patient monitoring and diagnostic assistance while maintaining HIPAA compliance through on-device processing that keeps sensitive patient data local.

The financial sector has further expanded SLM applications through advanced fraud and AML detection systems that operate directly on transaction processing servers, allowing for immediate analysis without exposing sensitive financial data to external services. In remote defence applications, SLMs enable intelligence analysis and communication systems that can function effectively in contested environments with degraded or denied connectivity.

Challenges & limitations

Despite their effectiveness, distillation approaches face key challenges. The fundamental capacity constraint of smaller models means certain capabilities require a minimum model size to function effectively. Complex reasoning and extensive factual knowledge often degrade even in well-distilled SLMs.

Additionally, knowledge distillation can suffer from catastrophic forgetting, where the student model retains only a subset of the teacher’s knowledge, potentially losing critical capabilities in the process – especially when transferring diverse skills.

The distillation process itself also presents challenges. While the resulting SLMs offer significant advantages during inference – lower infrastructure costs, reduced computational requirements and smaller environmental footprints – the initial distillation process involves substantial computational expense. This creates a complex trade-off: invest considerable upfront resources to realise the long-term benefits of more efficient inference. High-volume applications can yield positive returns across all dimensions, but with variable break-even points based on deployment scale.
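The break-even point described above is simple arithmetic once costs are estimated. All the figures below are assumed for illustration only, not benchmarks from the article:

```python
# Hypothetical figures: distillation is paid once, while per-request
# inference savings accrue with deployment volume.
DISTILLATION_COST = 50_000.0  # one-off compute spend (assumed, USD)
LLM_COST = 0.020              # per-request LLM inference cost (assumed)
SLM_COST = 0.002              # per-request SLM inference cost (assumed)

# Requests needed before cumulative savings cover the upfront spend.
break_even = DISTILLATION_COST / (LLM_COST - SLM_COST)
print(f"Break-even after ~{break_even:,.0f} requests")
```

Under these assumed numbers the investment pays back after a few million requests, which is why the article notes that high-volume applications reach positive returns soonest.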

Advancing AI efficiency

Promising research directions include adaptive frameworks that dynamically adjust the learning signal based on the student’s progress, task-specific distillation focused on transferring the most relevant capabilities, and multi-teacher approaches that combine insights from diverse LLMs. Additionally, reinforcement learning-based distillation is emerging as a powerful technique to refine model knowledge dynamically through iterative feedback and reward mechanisms, helping optimise learning efficiency.

Knowledge distillation offers a compelling pathway for enhancing SLMs with LLM-derived capabilities. While these “high school student” models may never completely match their “graduate” LLM counterparts or “PhD” fine-tuned LLMs with RAG, distillation efficiently narrows the gap – an important advance in democratising AI capabilities, making advanced language technologies accessible where deploying full LLMs remains impractical.

