computerweekly.com

SLM series - Digital Workforce: Beyond hallucinations & anthropomorphisation

This is a guest post for the Computer Weekly Developer Network written by Rami Luisto, PhD, the healthcare AI lead data scientist at Digital Workforce.

Luisto writes in full as follows…

Large Language Model (LLM) capabilities have been dominating tech news for the past few years.

But one should not forget that there is more to Natural Language Processing (NLP) than just LLMs. Though Small Language Models (SLMs) get less media coverage, their capabilities have also been marching steadily forward and they are nowadays part of the standard NLP toolbox.

Everything is a tradeoff and it is important to be aware of the differences between SLMs and LLMs when choosing the best tool for a job. Our purpose here is to discuss where, when and why each should be used.

Where do the differences lie?

Here by LLMs we mean models like GPTs, LLaMa or DeepSeek, which tend to be language-generating models with more than a billion parameters. By SLMs we mean models like BERT and its variants, which are usually language-understanding models with a few hundred million parameters.

We’ll next look at some more specific dimensions on which these model types differ in practice.

Jack of all trades, master of none

Few people really seem to appreciate the breadth of skill of modern LLMs. We often fixate on the mistakes they make when we use them in our particular narrow setting. We might notice the suboptimal algorithm choice, or a weird turn of phrase in translation, not remembering that the model is actually doing reasonable work in all of the languages. It can write poetry in French, algorithms in Python and create Japanese cooking instructions in the format of a fantasy story for children. So even though there are very few, if any, specific fields where LLMs outperform humans, their breadth of skill is very much beyond any human realm of possibility.

This capability of attacking almost any issue in any field is naturally one of the main reasons why LLMs can be so quickly adapted for almost any new task that relies on using language. It does come at a cost, though. The scope of things you can do with a given SLM is much narrower compared to the almost universal capabilities of a modern LLM, but this narrowed scope is compensated, among other things, by potentially much better performance.

While an LLM can do an okay job on a given task on the fly, an SLM can be trained to be hyperfocused on that one particular task, running fast and at a fraction of the cost.

Black boxes & Vantablack boxes

It is a truth universally acknowledged that deep neural networks can be very much opaque to human understanding. But opaqueness does come in shades. Even though we do not fully understand how e.g. the SLM called BERT functions, we understand it much better than any LLM; see e.g. (Rogers, Kovaleva and Rumshisky 2020).

The fact that we know so much about BERT owes both to the openness of the model and to its size. BERT strikes a good balance of being an industrial-grade LM while still being trainable and usable on household-level hardware. We naturally do not know exactly how well e.g. OpenAI has been able to grok the inner workings of GPT-4o, but we doubt that they fare better than the full scientific community that has been studying the openly available models.

Besides understanding the models, trust can often be crucial. For example, in the healthcare sector, many systems need to be vetted and approved by medical professionals. An SLM trained specifically on data related to the task at hand can be easier to trust, as we can know (and control) exactly what data goes into training and we can modify unwanted behaviour on the fly. Furthermore, the narrower scope of an SLM also yields better explainability – to understand even GPT-3 we need to understand how it works with all of human language, while for an SLM it might suffice to understand how it processes referral data related to hip replacements.

Luisto: Language models can write poetry in French, algorithms in Python & create Japanese cooking instructions in the format of a fantasy story for children.

Let’s next look at whether we should be fitting models to local machines, or lifting them to the clouds.

One of the major questions in any AI project is about where to run your models. All of the major cloud providers are nowadays offering solutions specifically aimed at various types of secure AI-development and hosting, but sometimes we want or need to run them locally.

In the setting of LMs, a dominant factor is the massive size and compute requirements of LLMs. Setting up a local supercomputer is doable, but requires expert knowledge. We also run into the issue of economies of scale: running an LLM locally might be particularly expensive if you only use the system 10% of the time.

Large actors like Azure, on the other hand, will in any case need teams of high-performance computing specialists, and they might more easily be able to use idle server time for other compute purposes. Smaller LLM variants, and SLMs in particular, require vastly smaller resources and especially offer the possibility of running on very limited local hardware.

Hallucinations & anthropomorphisation

As a rule of thumb, all AI systems make mistakes. Their function tends to be statistical in nature, usually producing outputs in the form of distributions rather than specific choices. But the certain types of errors that LLMs make, where they seem to “make up” facts, feel very different to most users when compared to more classical AI mistakes. Part of the difference is probably due to the anthropomorphisation of LLMs that seems to come naturally to us – the LLM talks to us like a human and we can’t help thinking about it in human terms.

A crucial point here is that, unlike an SLM’s, the “error space” of an LLM is not bounded.

When an SLM classifier makes a mistake, it means that it predicted class B instead of class A. But when an LLM classifier hallucinates, it might not output a class name at all, but a poem or, in the worst case, an SQL injection targeting your database. LLMs have been trained to understand any and all language, from email contents to cyberattacks. Any fine-tuning we do tends to only alter the superficial behaviour of the model, and the underlying capabilities (and thus dangers) remain.
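The bounded error space can be made concrete with a toy sketch. The “SLM” below is just a hypothetical keyword scorer standing in for a fine-tuned classifier, and the “LLM” is a stub returning free text – but the structural point holds: the SLM’s output is an argmax over a fixed label set, so even a wrong answer is still one of the known classes, while the LLM’s return value is unconstrained text.

```python
# Toy illustration of bounded vs unbounded error spaces.
# The scoring logic is hypothetical, not a real model.

LABELS = ["positive", "neutral", "negative"]

def slm_classify(text: str) -> str:
    # Stand-in for a fine-tuned SLM: score each label by naive
    # keyword overlap and pick the highest-scoring one.
    keywords = {
        "positive": {"great", "love", "excellent"},
        "neutral": {"okay", "fine", "average"},
        "negative": {"bad", "hate", "terrible"},
    }
    tokens = set(text.lower().split())
    scores = {label: len(tokens & kws) for label, kws in keywords.items()}
    return max(scores, key=scores.get)  # always one of LABELS, even when wrong

def llm_classify(text: str) -> str:
    # Stand-in for an LLM call: the output is unconstrained text
    # and may drift outside the label set entirely.
    return "I'd say this is mostly positive, perhaps a 7/10?"

print(slm_classify("I love this, it is excellent"))        # a label, guaranteed
print(llm_classify("I love this") in LABELS)               # no such guarantee
```

In production the same guarantee comes for free from a softmax over a fixed output layer; the LLM side needs output parsing and validation to achieve anything comparable.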

Pre-trained SLMs also come with a statistical model of human language, but they have a much more restricted output, and training affects their behaviour much more deeply. In an extreme case, one might even train a domain-specific SLM from scratch; in the context of e.g. medical texts this can be particularly effective, as the texts involved rarely contain fictional stories or lies that the model would learn to mimic.

When to use SLMs or LLMs?

On the one hand, anything we can do with SLMs can be done with a modern LLM.

While an LLM will be more expensive during inference, prompting it with “Classify the following based on its sentiment: positive/neutral/negative.” is much faster than training a custom SLM for the task. Especially since LLMs already have a good internal model of human language, we can get them to do tasks with minimal training data. Giving a few examples (known as few-shot prompting or in-context learning) is often enough to get the model to understand what the context is and thus how to do the task with at least some accuracy. Because of this, LLMs are excellent in discovery-style PoC projects. With an LLM-based approach we can cook up a quick estimate of how hard the problem seems to be.
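A minimal sketch of what few-shot prompting looks like in practice: instead of training anything, we assemble a prompt from a handful of labelled examples and let the model infer the task. The example texts and labels below are illustrative, not from any real dataset.

```python
# Build a few-shot sentiment-classification prompt from labelled examples.
# No training involved: the "learning" happens in-context at inference time.

def build_few_shot_prompt(examples, query):
    lines = ["Classify the following based on its sentiment: "
             "positive/neutral/negative.", ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Sentiment:")  # the LLM completes from here
    return "\n".join(lines)

examples = [
    ("The service was wonderful.", "positive"),
    ("It arrived on time, nothing special.", "neutral"),
    ("The product broke after one day.", "negative"),
]

prompt = build_few_shot_prompt(examples, "I would absolutely buy this again.")
print(prompt)
```

The resulting string would be sent to whichever LLM endpoint is in use; three examples are often enough for the model to lock onto the label format.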

LLMs also tend to be text-generating models. This gives rise to the chain-of-thought techniques for LLMs: LLM performance can be massively boosted by having the model “think out loud” before giving an answer. For SLMs this kind of technique is not possible, and we cannot improve their accuracy by giving them more computational power at inference time.

On the other hand, the simplest tool is often the best.

Where SLMs excel is when the task is simple enough for an SLM, you have sufficient training data and you can afford the up-front cost of training a custom model. SLMs can also provide a higher degree of transparency into their inner workings, and their outputs and thus error modes are more restricted. When explainability and trust are crucial, auditing an SLM can be much simpler than trying to extract reasons for an LLM’s behaviour. SLMs might also be the only choice if compute resources are an issue.

Finally, having to choose one or the other might be a false dichotomy. Many tasks require more or less thought depending on the data, and it would be natural to divide the work between various SLMs and LLMs. This kind of division can bring about the best of both worlds: we use simple, reliable tools whenever we can, but leverage the more powerful LLMs when needed. The tradeoff is that we are adding complexity to the whole system. An LLM is not simple, but neither is a system using several different SLMs and LLMs.
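One common shape for such a division of labour is a confidence-based cascade: the cheap SLM handles the easy cases, and only inputs it is unsure about are escalated to the LLM. The sketch below uses hypothetical stub functions and a made-up threshold to show the routing logic, not any particular vendor’s API.

```python
# Sketch of an SLM/LLM cascade. Both classifiers are hypothetical stubs;
# in practice the SLM would return a softmax confidence and the LLM call
# would hit a hosted endpoint.

def slm_classify_with_confidence(text):
    # Stand-in for a fine-tuned SLM returning (label, confidence).
    if "refund" in text.lower():
        return "billing", 0.95
    return "other", 0.40  # low confidence: the SLM is unsure

def llm_classify(text):
    # Stand-in for a slower, more expensive LLM call.
    return "technical-support"

def route(text, threshold=0.8):
    # Keep the cheap path whenever the SLM is confident enough;
    # otherwise escalate to the LLM.
    label, confidence = slm_classify_with_confidence(text)
    if confidence >= threshold:
        return label, "slm"
    return llm_classify(text), "llm"

print(route("I want a refund for last month"))  # handled by the SLM
print(route("My device keeps rebooting"))       # escalated to the LLM
```

Tuning the threshold is exactly the system-complexity tradeoff described above: a lower threshold means cheaper, faster answers, while a higher one sends more traffic to the LLM.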

So one should try to strike a balance between the system complexity and component complexities involved.
