Disruption Part 1 – Speaking Out

How often have you heard the phrase “disruptive technology”? I bet you’ve heard it a lot! We encounter it constantly in the media and in business presentations. But have you ever witnessed a disruptive technology as it actually emerged? I’ve been fortunate enough to experience a couple of these paradigm shifts during my professional career. This article is about one of them.

The time is the early 1990s, and the place is Stockholm, Sweden. I was working as a software development consultant, and one of my projects was for a company called Infovox. The company specialized in text-to-speech technology and was one of the market leaders in that segment. They had developed a number of text-to-speech products, primarily for people who were visually impaired. With these products, users could read documents and texts independently.

The main approach used by Infovox was formant synthesis, which can be described as a mathematical model of a person’s speech organs. This was the most common approach to speech synthesis at the time. The method uses a chain of rules: text is converted into phonemes, phonemes into articulatory targets, and the targets finally into parameters that drive the formant-based sound generation. (My project involved creating a lexical database that allowed rapid lookup of phonetic representations of words, including inflection rules.)
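To make the idea concrete, here is a minimal Python sketch of the last step in that chain: turning formant parameters into sound. It is not Infovox's implementation. It assumes a simple source-filter setup in which a pulse train (standing in for the vocal cords) is passed through a cascade of two-pole resonators, one per formant. The `FORMANT_TARGETS` table, the function names, and the bandwidth values are all hypothetical, and the frequencies are only rough textbook-style figures for three vowels.

```python
import math

# Hypothetical lookup: vowel symbol -> list of (formant frequency Hz, bandwidth Hz).
# Rough illustrative values only, not taken from any real product.
FORMANT_TARGETS = {
    "a": [(730.0, 80.0), (1090.0, 90.0), (2440.0, 120.0)],
    "i": [(270.0, 60.0), (2290.0, 100.0), (3010.0, 140.0)],
    "u": [(300.0, 60.0), (870.0, 90.0), (2240.0, 120.0)],
}


def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients of a two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    c = -r * r
    b = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a = 1.0 - b - c  # gives the filter unity gain at 0 Hz
    return a, b, c


def resonate(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter a list of samples through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(pitch_hz, duration_s, sample_rate):
    """Crude voice source: one unit pulse per pitch period."""
    period = round(sample_rate / pitch_hz)
    n = round(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(phoneme, pitch_hz=120.0, duration_s=0.3, sample_rate=16000):
    """Look up formant targets for a vowel and shape the source with them."""
    signal = impulse_train(pitch_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in FORMANT_TARGETS[phoneme]:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]  # normalize to the range -1.0..1.0


samples = synthesize_vowel("a")
```

The whole "voice" here is a handful of numbers per sound, which hints at why the approach was so light on computation, and the fixed, idealized pulse source hints at why the result sounded robotic.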

A challenge with formant synthesis was that the generated speech was quite robotic. Whilst it was intelligible and efficient, requiring relatively modest computational resources, its unnatural sound quality meant that it was acceptable for accessibility applications but not really appealing to the general public.

Around this time, a different approach to speech synthesis became more prevalent: concatenative synthesis. Instead of using mathematical models to generate speech, this approach took snippets of real speech from a real person and assembled them into full, flowing sentences. The result sounded far more natural than formant synthesis.

The challenge with concatenative synthesis was that it required a reasonably large database of voice snippets and substantial computational power to select and blend them together seamlessly. The computers of the 1980s weren’t powerful enough for this, but that changed in the early 1990s. The sound quality of concatenative synthesis was good enough for the general public, and by the late 1990s it had become the dominant approach for text-to-speech.
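The select-and-blend step can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions, not how any particular product worked: the snippet database `DEMO_DB`, the function names, and the selection rule (pick the candidate whose first sample is closest to the previous snippet's last sample) are all hypothetical, and the "recordings" are just short lists of made-up sample values.

```python
def pick_unit(candidates, previous_tail):
    """Choose the candidate snippet that starts closest to where the last one ended."""
    if previous_tail is None:
        return candidates[0]
    return min(candidates, key=lambda unit: abs(unit[0] - previous_tail))


def crossfade_concat(units, overlap):
    """Join snippets, linearly blending `overlap` samples at each boundary."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming snippet rises across the overlap
            out[start + i] = (1.0 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out


def synthesize(unit_names, database, overlap=4):
    """Select one recorded snippet per requested unit, then blend them together."""
    chosen = []
    tail = None
    for name in unit_names:
        unit = pick_unit(database[name], tail)
        chosen.append(unit)
        tail = unit[-1]
    return crossfade_concat(chosen, overlap)


# Hypothetical database: unit name -> list of candidate recordings (sample lists).
DEMO_DB = {
    "a": [[1.0] * 10],
    "b": [[0.9] * 10, [0.0] * 10],
}

speech = synthesize(["a", "b"], DEMO_DB)
```

Even in this toy form, the cost is visible: every unit needs several stored recordings, and every join needs a search over candidates plus a blend. Scaled up to a full voice, that is where the storage and computing demands came from.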

Having worked on solutions around formant synthesis, I could see new and better ways of approaching the same problem emerging before my very eyes—or ears. The concatenative synthesis approach was gradually refined to sound more and more natural over the following decade. Remember that this was over 30 years ago, when today’s generative AI approach to speech generation was just a dream!

In the next article, I’ll share another disruption story, one where I witnessed an even more dramatic paradigm shift unfold at a conference in Boston…

