
Sesame AI recently open-sourced its Conversational Speech Model (CSM), a speech generation tool capable of producing authentic-sounding audio with trained or custom voices. One caveat: it excels at shorter audio snippets, sentences rather than lengthy paragraphs. And don't expect to test it conversationally out of the box.
As my demo shows, the setup process isn't exactly plug-and-play. You'll need a Hugging Face account, a decent GPU, and some patience with Python. But once everything's running, the results speak for themselves. While the first demo did sound a bit robotic, the later voice demos sounded quite natural, not the robotic monotone we've grown accustomed to from virtual assistants like the first (and often current) iterations of Siri and Alexa. To my ears, the open-source CSM sounds better than OpenAI's Advanced Voice Mode, too.
That said, the system really shines when you feed it proper reference recordings. I grabbed a few samples from Mozilla’s Common Voice dataset, and the similarity to the input voices was striking. The model effectively clones voice characteristics when you match the transcript exactly to what’s being said in the reference audio.
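To make the transcript-matching requirement concrete, here's a minimal sketch of how reference audio and its transcript might travel together as conditioning context. The `ReferenceClip` type and `context_for` helper are my own illustrative names, not CSM's actual API; the point is simply that each clip carries the exact words spoken in its recording.

```python
from dataclasses import dataclass

# Hypothetical types illustrating how reference audio is paired with its
# transcript for voice cloning. These names are illustrative, not CSM's API.
@dataclass
class ReferenceClip:
    speaker: int        # integer speaker ID
    transcript: str     # must match the spoken words in the audio exactly
    audio_path: str     # e.g. a WAV clip pulled from Common Voice

def context_for(clips: list[ReferenceClip], speaker: int) -> list[ReferenceClip]:
    """Collect the reference clips for one speaker, preserving order."""
    return [clip for clip in clips if clip.speaker == speaker]

clips = [
    ReferenceClip(0, "The quick brown fox.", "ref_0.wav"),
    ReferenceClip(1, "Jumped over the lazy dog.", "ref_1.wav"),
]
# Only speaker 0's clip conditions generation when cloning speaker 0's voice.
context = context_for(clips, speaker=0)
```

A mismatched transcript (say, paraphrased rather than verbatim) is exactly the failure mode the article warns against: the model conditions on the text-audio pairing, so the pairing has to be faithful.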
This raises obvious ethical questions. Sesame's terms of use explicitly prohibit impersonation, fraud, misinformation, deception, and illegal or harmful activities. Still, recalling stories of fraud committed with cloned voices, I worry that the barrier to causing harm here is virtually nonexistent. All you need is an MP3 of someone speaking, and you can generate new phrases in their voice.
For practical applications, CSM sits in an interesting middle ground. At only 1B parameters, it’s perhaps lightweight enough to run on a mid-range modern gaming GPU. In my test here, I used an A100.
As alluded to earlier, one tradeoff lies in context length. Forget generating a step-by-step guide on how to use an electron microscope for newbies; CSM works best with concise phrases and sentences.
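Given that preference for short inputs, one practical workaround is to split long text into sentence-sized pieces and synthesize each separately. Here's a minimal sketch; the 120-character limit is my own guess at a comfortable chunk size, not a documented CSM bound.

```python
import re

def chunk_text(text: str, max_chars: int = 120) -> list[str]:
    """Split text into sentence-sized chunks suitable for short-form TTS.

    Sentences are grouped greedily until the next one would exceed
    max_chars; a single oversized sentence is kept whole.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_text(
    "First sentence. Second sentence. Third sentence.", max_chars=20
)
```

Each chunk could then be fed to the model in turn and the resulting audio concatenated, trading one long generation for several short ones the model handles well.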
Since CSM doesn’t include conversational abilities, you’ll need to pair it with an LLM if you’re building interactive systems. The Python API is straightforward enough that connecting these components shouldn’t be too challenging for experienced developers. I’m not one, though, so I can’t comment on bespoke implementations of CSM.
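To sketch what that pairing might look like, here's the shape of the glue code: an LLM turns user text into a reply, and a speech model turns the reply into audio. Both functions below are stand-in stubs of my own, not CSM's or any LLM's real API.

```python
def llm_reply(prompt: str) -> str:
    # Stand-in for a call to any chat LLM (local model or hosted API).
    return f"You said: {prompt}"

def synthesize(text: str) -> list[float]:
    # Stand-in for a CSM generation call; returns placeholder audio samples.
    return [0.0] * len(text)

def voice_turn(user_text: str) -> tuple[str, list[float]]:
    """One conversational turn: LLM produces text, speech model produces audio.

    Keeping replies short plays to CSM's strength with sentences
    over long paragraphs.
    """
    reply = llm_reply(user_text)
    return reply, synthesize(reply)

reply, audio = voice_turn("hello")
```

Swapping the stubs for real calls (and adding speech-to-text on the input side) would yield a basic voice assistant loop, which is presumably what most interactive builds on top of CSM would look like.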
From an R&D perspective, the CSM release is noteworthy precisely because it is open source. It marks another step in dismantling the walled gardens that have dominated voice technology for over a decade. Think about it: first-generation voice assistants like Siri and Alexa were completely closed ecosystems. The voice was the product, the brand, the experience, and all of it was carefully controlled by Apple, Amazon, and other Big Tech companies.
Sesame is effectively democratizing high-quality voice synthesis, potentially inspiring a fresh wave of research on voice AI that can detect emotion and even sarcasm.
Smaller companies and independent developers who couldn’t afford to build proprietary voice systems might now incorporate natural-sounding speech into their products. We might soon see voice interfaces appearing in unexpected places, from new cars to next-gen IoT devices that move beyond the flat delivery and simplistic interactions revolving around, say, weather and timers.
The next wave of voice interfaces seems likely to be defined not by the tech giants but by creative implementations from a diverse ecosystem of developers. So my review of this demo? Let’s call it an 8/10. The use cases for it out of the box aren’t totally clear, and its quality is a bit variable at times. But CSM still sounds miles better than the many AI voice generators that have delivered varying degrees of flat intonation for, well, decades. It’s about time for something new.