This is a guest post for the Computer Weekly Developer Network written by Yaron Vaxman, senior director of deep learning at Cloudinary.
Vaxman writes in full as follows…
There are many complex, repetitive jobs in a visual media workflow, like intelligent cropping, moderating user-generated images and optimising content to look and perform well across different devices.
The industries we serve include retail, travel and media, all of which publish visual media at ‘hyperscale’ levels to drive online sales, engage customers and deliver support. At that scale, managing the workflow becomes impractically slow, inefficient, risky and expensive without automation and AI.
In visual media, small language models (SLMs) are growing in popularity as alternatives to large language models (LLMs). They are considerably more cost- and energy-efficient, with much lower latency, which translates into faster output. In retail, for example, they’re ideal for high-throughput tasks like real-time content moderation. They also help mitigate abandoned shopping carts, a problem associated with slow-loading visual media.
SLMs also train much faster on specialised datasets with deep domain knowledge rather than squandering time and resources training on vast amounts of general data.
SLM trade-offs
However, SLMs come with certain trade-offs. When applied to visual media, they tend to be significantly larger than task-specific models designed for image and video processing. This makes sense, as interpreting free text is inherently complex and demands greater model capacity. At the same time, SLMs are more susceptible to errors compared to LLMs due to their smaller size. With limited capacity, achieving the same level of quality as larger language models becomes a challenge.
As with any AI model, training on small datasets requires careful attention to accuracy. For example, consider an SLM designed to enforce brand guidelines for user-generated images. If it hasn’t been trained on a diverse range of data, it may struggle to generalise, increasing the risk of misclassifications. An eCommerce model trained exclusively on product images, without exposure to real-world photos, might learn to associate shoes only with store displays.
This could result in misleading contextual assumptions – much like the classic “QA Engineer walks into a bar” joke.
A hybrid API-first approach
The challenges with managing hyperscale visual media will continue growing thanks to trends like personalisation and augmented reality. In our experience, a hybrid approach that integrates SLMs with larger models offers the best of both worlds for managing this scale.
SLMs offer resource-efficient solutions for specific visual media tasks and are easy to integrate into workflows. The SLM is akin to a specialist doctor with deep knowledge, trained to solve specific problems efficiently, whereas the LLM is the general practitioner with broad but shallower knowledge.
We use SLMs for speed-critical tasks and larger models to manage complex, high-dimensional queries. In this setup, intelligent routing analyses each query and directs it to the most appropriate model, optimising performance, balancing costs and ensuring appropriate processing power.
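As a rough illustration of what such intelligent routing might look like, here is a minimal sketch in Python. The complexity heuristic, the threshold and the model labels are all illustrative assumptions, not Cloudinary’s actual implementation:

```python
# Hypothetical sketch of intelligent routing between an SLM and an LLM.
# The heuristic and threshold below are illustrative assumptions only.

def estimate_complexity(query: str) -> float:
    """Crude proxy for query complexity: longer, open-ended,
    multi-clause queries score higher (0.0 to 1.0)."""
    open_ended = {"why", "explain", "compare", "summarise", "describe"}
    words = query.lower().split()
    score = min(len(words) / 50, 1.0)               # length component
    score += 0.3 * sum(w in open_ended for w in words)
    return min(score, 1.0)

def route(query: str, threshold: float = 0.5) -> str:
    """Send simple, latency-sensitive queries to the SLM;
    send complex, high-dimensional queries to the larger model."""
    return "slm" if estimate_complexity(query) < threshold else "llm"

print(route("tag this image"))  # short, specific task -> slm
print(route("explain why this campaign image might violate our "
            "brand guidelines and compare it with the approved "
            "catalogue style"))  # open-ended reasoning -> llm
```

In practice the routing signal would come from a trained classifier or the models’ own confidence scores rather than a keyword heuristic, but the control flow is the same: score the query, then dispatch to the cheapest model that can handle it.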
To make this type of hybrid architecture scale, flex and interoperate well, an API-first approach is essential. This also helps maintain a consistent data flow and simplifies deployment and maintenance.
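To make the API-first idea concrete, a request/response contract for the hybrid pipeline might look like the sketch below. The field names and task list are hypothetical, invented for illustration rather than taken from any actual Cloudinary API:

```python
# Illustrative sketch of an API-first contract for a hybrid SLM/LLM
# media pipeline. All field names and tasks are assumptions.
import json
from dataclasses import dataclass, asdict

@dataclass
class MediaTaskRequest:
    asset_url: str   # where the image or video lives
    task: str        # e.g. "alt_text", "moderation", "smart_crop"
    params: dict     # task-specific options

@dataclass
class MediaTaskResponse:
    asset_url: str
    task: str
    model_used: str  # "slm" or "llm", surfaced for observability
    result: dict

def handle(request: MediaTaskRequest) -> MediaTaskResponse:
    # Stubbed handler: in a real service this is where routing
    # would pick an SLM or a larger model and run the task.
    model = "slm" if request.task in {"alt_text", "moderation"} else "llm"
    return MediaTaskResponse(request.asset_url, request.task, model,
                             result={"status": "ok"})

resp = handle(MediaTaskRequest("https://example.com/shoe.jpg",
                               "moderation", {}))
print(json.dumps(asdict(resp), indent=2))
```

Keeping one envelope for every task, regardless of which model serves it, is what lets the architecture swap models behind the interface without breaking callers.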
Use cases in visual media
There are countless ways to use SLMs to automate essential processes in the visual media workflow.
For example:
SLMs generate text descriptions and metadata for images and videos to improve accessibility, searchability and content organisation. They create alt text for product photos and complex scenes, enhancing SEO and usability and add relevant tags and metadata.
SLMs filter images and videos based on predefined brand guidelines. They analyse visual content for compliance and automatically decide whether an image should be accepted, rejected or flagged for manual review.
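The accept/reject/flag decision described above often reduces to thresholding a model’s confidence score. Here is a minimal sketch under assumed thresholds; a real pipeline would feed in the SLM’s actual compliance confidence rather than a hand-typed number:

```python
# Illustrative threshold-based moderation routing. The thresholds are
# assumptions; the compliance score would come from an SLM in practice.

def moderate(compliance_score: float,
             accept_above: float = 0.9,
             reject_below: float = 0.3) -> str:
    """Map a brand-compliance confidence to a decision:
    confident pass -> accept, confident fail -> reject,
    anything ambiguous -> flag for manual review."""
    if compliance_score >= accept_above:
        return "accept"
    if compliance_score <= reject_below:
        return "reject"
    return "flag_for_review"

print(moderate(0.95))  # accept
print(moderate(0.10))  # reject
print(moderate(0.60))  # flag_for_review
```

Tuning the two thresholds trades automation rate against the volume of human review, which is the knob most moderation teams actually care about.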
By simplifying generative AI-driven image and video manipulation, SLMs make it more intuitive and accessible for non-technical users to transform and personalise images and video.
A view on the future
SLMs are set to dramatically improve the visual media workflow by offering more dynamic, conversational and context-aware interactions. Rather than having to navigate complex functionalities, users will be able to accomplish tasks using natural language.
I predict custom SLMs will be much more prevalent than today’s centralised LLMs. They’re easy to embed in diverse applications like messaging apps, customer service interfaces or virtual shopping assistants. For example, an SLM-powered chatbot in an AR shopping experience could allow users to ask, “How would this sofa look in my living room?” and receive an AI-generated preview.
Food for thought…
Without question, SLMs are helping to improve the visual media workflow by working faster, with more domain expertise and at a lower cost financially and to the environment.
As long as you bear in mind the caveats mentioned earlier, deploying SLMs for the right specialised tasks in concert with LLMs can help you harness all the advantages of hyperscale visual media, today and in the future.