Speech Synthesis: How It Works and Where to Get It

Ever wonder how Alexa reads you the weather every day? Learn the basics of speech synthesis—and meet the ReadSpeaker speech synthesis library.

April 26, 2023 by Gaea Vilage

Humans have been building talking machines for centuries—or at least trying to. Inventor Wolfgang von Kempelen nearly got there with bellows and tubes back in the 18th century. Bell Labs legend Homer Dudley succeeded in the 1930s. His “Voder” manipulated raw electronic sound to produce recognizable spoken language—but it required a highly trained operator and would have been useless for an iPhone.

When we talk about speech synthesis today, we usually mean one technology in particular: text to speech (TTS). This voice-modeling software translates written language into speech audio files, allowing Alexa to keep talking about new things. So how does speech synthesis work in the era of AI voice assistants and smart speakers?

A few technologies do the trick. One approach to TTS is called unit selection synthesis (USS). A USS engine sews chunks of recorded speech into new utterances. But in order to minimize audible pitch differences at the seams, the professional voice talent must record hours of speech in a fairly neutral and unvarying speaking style. As a result, USS voices sound less natural, and there is no flexibility to synthesize more expressive or emotional speaking styles without doubling or tripling the amount of recorded speech.
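To make the seam problem concrete, here is a minimal Python sketch of the selection step. It is an illustration under simplifying assumptions, not how any particular USS engine works: the `Unit` class, the pitch values, and the greedy strategy are all hypothetical. Production systems typically compare many more acoustic features and search over whole sequences rather than choosing one unit at a time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded chunk of speech (hypothetical, pitch only)."""
    phoneme: str
    start_pitch_hz: float
    end_pitch_hz: float


def select_units(phonemes, inventory):
    """Greedily pick, for each phoneme, the recorded unit whose starting
    pitch is closest to the ending pitch of the previously chosen unit."""
    chosen = []
    for phoneme in phonemes:
        candidates = [u for u in inventory if u.phoneme == phoneme]
        if not candidates:
            raise KeyError(f"no recorded unit for phoneme {phoneme!r}")
        if not chosen:
            # Nothing to join to yet, so take the first recording.
            chosen.append(candidates[0])
            continue
        previous_end = chosen[-1].end_pitch_hz
        chosen.append(
            min(candidates, key=lambda u: abs(u.start_pitch_hz - previous_end))
        )
    return chosen


# Made-up inventory: two recordings each of "ae" and "t" at different pitches.
INVENTORY = [
    Unit("k", 120.0, 122.0),
    Unit("ae", 180.0, 175.0),
    Unit("ae", 123.0, 121.0),
    Unit("t", 119.0, 115.0),
    Unit("t", 170.0, 160.0),
]

units = select_units(["k", "ae", "t"], INVENTORY)
print([(u.phoneme, u.start_pitch_hz) for u in units])
```

In this toy inventory, the selector skips the high-pitched recordings because they would create an audible jump at the seam. That is also why a neutral, unvarying recording style makes the job easier: more of the inventory is usable at any given join.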

Instead, let’s look at neural text to speech, a form of speech synthesis AI that uses machine learning to produce more lifelike results. In this article, we’ll describe how neural TTS works.

Of course, USS voices are still being used in low-power applications like automotive audio systems. That might not last long, as scientists are continually reducing the computational power required by neural TTS. Soon, neural TTS may simply create all the synthetic voices you interact with—making it all the more important to understand.

Adding Speech Synthesis to Your Business Project

If you’re interested in speech synthesis, maybe you need TTS for a product, website, or project. Before we get into the details of how it works, you should know where to access TTS technology. That answer is simple: ReadSpeaker.

The ReadSpeaker speech synthesis library is an ever-growing collection of lifelike TTS solutions, all ready to deploy in your voicebot, smart speaker application, or voice user interface. Fill out the form below to start exploring the contents of our ready-made TTS voice portfolio for your organization’s needs. If you need more details first, read the last section of this article to learn what sets ReadSpeaker apart from the crowd.

Request TTS Voice Samples

Listen to ReadSpeaker’s neural TTS voices in dozens of languages and personas—or inquire about your brand’s very own custom voice. Start the conversation today!

Now that you know where to get your TTS, here are the basic steps a neural TTS engine uses to speak:

1. The TTS Engine Learns to Pronounce the Text

The first step in neural speech synthesis is called linguistic pre-processing, in which the TTS software converts written language into a detailed pronunciation guide.

To speak a text, the TTS engine has to know how to pronounce it. That requires translating the text into a phonetic transcription, a pronunciation guide with words represented as phonemes. (Phonemes are the building blocks of spoken words. For instance, “cat” is made up of three phonemes: the /k/ sound represented by the letter “c,” the short vowel /æ/ represented by the letter “a,” and the plosive /t/ at the end.)

The TTS engine matches combinations of letters to corresponding phonemes to build this phonetic transcription. The system also consults pre-programmed rules. These rules are especially important for numerals and dates. For instance, the system needs to decide whether “1920” means “one thousand, nine hundred and twenty” or “nineteen-twenty” before it can break the text down into its constituent parts.
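Here is a rough Python sketch of what such a rule might look like for the “1920” example. Everything in it is an assumption made for illustration: the `YEAR_CUES` word list, the function names, and the rule itself (a number from 1000 to 9999 that follows a cue word is read as a year). Real engines use far richer context.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Hypothetical cue words: a number right after one of these is read as a year.
YEAR_CUES = {"in", "since", "year"}


def two_digits(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def cardinal(n):
    """Spell out 0..9999 as a quantity."""
    thousands, rest = divmod(n, 1000)
    hundreds, last = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if last or not parts:
        if parts:
            parts.append("and")
        parts.append(two_digits(last))
    return " ".join(parts)


def year(n):
    """Spell out 1000..9999 as a year. Years like 2005 fall back to the
    cardinal reading, because this sketch has no rule for them."""
    high, low = divmod(n, 100)
    if low < 10:
        return cardinal(n)
    return two_digits(high) + " " + two_digits(low)


def normalize(text):
    """Replace digit tokens below 10000 with words, choosing a reading by context."""
    out = []
    previous = ""
    for token in text.split():
        if token.isdecimal() and int(token) < 10000:
            n = int(token)
            is_year = n >= 1000 and previous.lower() in YEAR_CUES
            out.append(year(n) if is_year else cardinal(n))
        else:
            out.append(token)
        previous = token
    return " ".join(out)


print(normalize("born in 1920"))
print(normalize("we sold 1920 units"))
```

The same four digits come out as “nineteen twenty” in the first sentence and “one thousand nine hundred and twenty” in the second, purely because of the word that precedes them.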

In addition to phonemes, the TTS engine identifies stresses: syllables with a slightly raised pitch, some extra volume, and/or an incrementally longer duration, like the “but” in “butter.” At the end of linguistic pre-processing, the text is represented as a string of stressed and unstressed phonemes. That’s the input for the neural networks to come.
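As a small illustration of that output, the Python sketch below looks words up in a tiny hand-written lexicon. The notation is borrowed from the ARPAbet-style convention used by the CMU Pronouncing Dictionary, where a trailing 1 marks a stressed vowel and a trailing 0 an unstressed one. The two-word `LEXICON` and the function names are assumptions for this example; a real engine combines large lexicons with letter-to-sound rules.

```python
# Tiny hand-written lexicon (illustrative only).
# Vowels carry a stress digit: 1 = stressed, 0 = unstressed.
LEXICON = {
    "cat": ["K", "AE1", "T"],
    "butter": ["B", "AH1", "T", "ER0"],
}


def transcribe(text):
    """Turn text into a flat list of phonemes with stress marks."""
    phonemes = []
    for word in text.lower().split():
        if word not in LEXICON:
            raise KeyError(f"{word!r} is not in this toy lexicon")
        phonemes.extend(LEXICON[word])
    return phonemes


def stress_pattern(phonemes):
    """Collect the stress digits, one per vowel."""
    return "".join(p[-1] for p in phonemes if p[-1].isdigit())


phonemes = transcribe("butter cat")
print(phonemes)
print(stress_pattern(phonemes))
```

The stressed “but” in “butter” shows up as `AH1`, while the second syllable is the unstressed `ER0`.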

2. A DNN Translates Text Into Numbers

Next comes sequence-to-sequence processing, in which a deep neural network (DNN) translates text into numbers that represent sound.

The sequence-to-sequence network is software that translates your prepared script into a two-dimensional mathematical model of sound called a spectrogram. At its simplest, a spectrogram is a Cartesian plane in which the X axis represents time and the Y axis represents frequency. The value at each point tells you how strong the sound is at that time and frequency.
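If you want to see those numbers, the Python sketch below computes a spectrogram from an existing signal, the reverse of what a TTS engine does, using only the standard library. It is a deliberately slow, naive version for illustration. The frame size, hop, and test tone are arbitrary choices for this example, and real systems use optimized FFT routines and usually a perceptual (mel) frequency scale.

```python
import cmath
import math


def spectrogram(samples, frame_size, hop):
    """Return a list of frames; each frame is a list of magnitudes,
    one per frequency bin from 0 Hz up to half the sample rate."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window tapers the frame edges to reduce spectral leakage.
        windowed = [
            s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            # Naive discrete Fourier transform for bin k.
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames


SAMPLE_RATE = 8000  # samples per second
FRAME_SIZE = 64     # samples per frame, so bins are 125 Hz apart

# A pure 1,000 Hz test tone, 256 samples long.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(256)]
spec = spectrogram(tone, FRAME_SIZE, hop=32)

peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), "frames x", len(spec[0]), "frequency bins")
print("loudest frequency:", peak_bin * SAMPLE_RATE / FRAME_SIZE, "Hz")
```

Each inner list is one column of the plane described above: a snapshot in time, with one number per frequency. For the test tone, the largest number in every frame sits in the bin for 1,000 Hz.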

The system generates these spectrograms by consulting training data. The neural network has already processed recordings of a human speaker. It has broken down those recordings into phoneme models (plus lots of other parts, but let’s keep this simple). So it has an idea of what the spectrograms for a given speaker look like. When it encounters a new text, the network maps each speech element to a training-derived spectrogram.

Long story short: The sequence-to-sequence network matches phonetic transcriptions to spectrogram representations inferred from the original training data.
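To picture that mapping, here is a toy Python stand-in. It is not a neural network: it simply averages the spectrogram frames seen for each phoneme in some made-up training pairs, then repeats the average for each phoneme in a new transcription. The names `train_lookup` and `predict_frames`, the three-bin frames, and the fixed two frames per phoneme are all assumptions for illustration. A real sequence-to-sequence model also learns context, timing, and intonation.

```python
def train_lookup(examples):
    """examples: (phoneme, frame) pairs taken from aligned recordings.
    Returns the average frame observed for each phoneme."""
    sums, counts = {}, {}
    for phoneme, frame in examples:
        if phoneme not in sums:
            sums[phoneme] = [0.0] * len(frame)
            counts[phoneme] = 0
        sums[phoneme] = [a + b for a, b in zip(sums[phoneme], frame)]
        counts[phoneme] += 1
    return {
        phoneme: [value / counts[phoneme] for value in total]
        for phoneme, total in sums.items()
    }


def predict_frames(phonemes, model, frames_per_phoneme=2):
    """Build a spectrogram for unseen text from the learned averages."""
    return [
        list(model[phoneme])
        for phoneme in phonemes
        for _ in range(frames_per_phoneme)
    ]


# Made-up training data: each frame has three frequency bins.
TRAINING = [
    ("K", [0.0, 0.25, 0.75]),
    ("K", [0.0, 0.75, 0.25]),
    ("AE", [1.0, 0.5, 0.0]),
    ("T", [0.0, 0.0, 1.0]),
]

model = train_lookup(TRAINING)
frames = predict_frames(["K", "AE", "T"], model)
print(model["K"])
print(len(frames), "frames")
```

The point of the sketch is the direction of travel: recordings go in during training, and spectrogram frames for new text come out afterward.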

What does the spectrogram do?

The spectrogram contains numerical values for each frame, or temporal snapshot, of the represented sound, and the TTS engine needs these numbers to build a voice audio file. Essentially, the sequence-to-sequence model maps text onto a spectrogram, and the spectrogram expresses that text as numbers. Those numbers represent the precise acoustic characteristics of the voice in the training data recordings, as if that speaker were saying the words represented in the phonetic transcription.

The Role of Generative Neural Networks in Neural TTS

You’ve probably heard of generative AI models, like ChatGPT. Increasingly, we use similar types of generative neural networks to create lifelike synthetic voices.

A generative neural network creates novel speech samples in a random but controllable way. Such a model leads to more natural speech waveforms, especially when applied at the vocoder stage (see below).
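One way to picture “random but controllable” is a seeded random generator with an adjustable amount of variation. The Python sketch below is only an analogy, not a generative neural network: `sample_frame`, its `variation` parameter, and the example values are hypothetical.

```python
import random


def sample_frame(predicted, variation, seed):
    """Add seeded Gaussian noise to a predicted spectrogram frame.

    variation controls how far samples may stray from the prediction;
    the seed makes any particular sample reproducible."""
    rng = random.Random(seed)
    # Magnitudes cannot be negative, so clamp at zero.
    return [max(0.0, value + rng.gauss(0.0, variation)) for value in predicted]


predicted = [0.0, 0.5, 1.0]

print(sample_frame(predicted, variation=0.0, seed=7))  # no randomness at all
print(sample_frame(predicted, variation=0.1, seed=7))  # one novel sample
print(sample_frame(predicted, variation=0.1, seed=8))  # a different one
```

Setting `variation` to zero reproduces the prediction exactly, while a nonzero value yields a slightly different frame for each seed.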

Not all DNNs are generative, of course. Generative neural networks are just the latest in a rapidly developing series of AI technologies—and we use several of them to create neural TTS voices.

3. A Vocoder Produces Waveforms You Can Listen To

The final step in neural speech synthesis is waveform production, in which the spectrogram is converted into a waveform that can be played or streamed. These waveforms can be stored as audio files, which makes the completed neural TTS voice available in audio file production systems or as real-time streaming audio.

But first, we must convert the spectrogram into the speech waveform.

We’ve translated text into phonemes and phonemes into spectrograms, which are grids of numbers. How do you turn those numbers into sound? The answer is another type of neural network called a vocoder. Its job is to translate the numerical data from the spectrogram into a playable audio file.

The vocoder must be trained on the same audio data used to create the sequence-to-sequence model. That training data provides information that the vocoder uses to predict the best mapping of spectrogram data onto a digital audio sample. Once the vocoder has performed its translation, the system gives you its final output: an audio file, synthesized speech in its consumable form.
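To show the shape of that last conversion, here is a Python sketch that turns spectrogram frames into a WAV file in memory. It is emphatically not a neural vocoder. It uses plain additive synthesis, summing one sine wave per frequency bin, as a stand-in, and it assumes every sine starts at zero phase because a magnitude spectrogram carries no phase information. The function names and example values are assumptions for illustration.

```python
import io
import math
import struct
import wave


def frames_to_waveform(frames, sample_rate, frame_size, hop):
    """Sum one sine wave per frequency bin, weighted by the bin's magnitude.
    Each frame controls hop consecutive output samples."""
    samples = []
    for i in range(hop * len(frames)):
        frame = frames[i // hop]
        t = i / sample_rate
        value = 0.0
        for k, magnitude in enumerate(frame):
            frequency = k * sample_rate / frame_size
            value += magnitude * math.sin(2 * math.pi * frequency * t)
        samples.append(value)
    # Scale into the range -1..1 (leave silence untouched).
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]


def to_wav_bytes(samples, sample_rate):
    """Pack samples as 16-bit mono PCM in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(
            b"".join(struct.pack("<h", int(s * 32767)) for s in samples)
        )
    return buffer.getvalue()


# Four identical frames with energy only in bin 2 (2,000 Hz at these settings).
frames = [[0.0, 0.0, 1.0, 0.0, 0.0]] * 4
waveform = frames_to_waveform(frames, sample_rate=8000, frame_size=8, hop=4)
audio = to_wav_bytes(waveform, sample_rate=8000)
print(len(waveform), "samples,", len(audio), "bytes of WAV data")
```

A trained vocoder replaces the sine-summing step with a learned model, which is where the natural texture of a human voice comes from. The input and output stay the same: numbers in, playable audio out.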

That’s a highly simplified picture of how speech synthesis works, of course. Dig deeper by learning how ReadSpeaker creates bespoke neural voices for brands and application creators. There’s one more question we should address, though:

With dozens of TTS providers just a Google search away, why should you choose ReadSpeaker? Here are eight good reasons.

Why choose TTS voices from the ReadSpeaker speech synthesis library?

1. Every voice offers accurate pronunciation, lifelike expression, and AI-driven quality.

Text-to-speech voice quality is essential for providing outstanding customer experiences in voice-first environments. ReadSpeaker has been at the forefront of machine speech for more than 20 years, and we continually invest in R&D to push the technology forward. Our DNN technology allows us to synthesize human voices with remarkable—and constantly improving—accuracy.

As a result, the voices in our speech synthesis library sound as good as TTS can—and they always will, regardless of how the technology advances. As we develop new neural networks for modeling lifelike human speech, we’ll reprocess original source recordings to keep our TTS voices future-proof. This makes ReadSpeaker text to speech a unique solution.

2. The ReadSpeaker team offers personalized, ongoing customer support.

As a company solely focused on TTS, ReadSpeaker has a dedicated customer support team. We help you choose the right TTS product; we help you integrate it with your systems; and we remain available to resolve any issues that arise for the full duration of your TTS journey. Most providers of TTS simply sell a tech product, then leave you on your own. At ReadSpeaker, we partner with you over the long term to ensure success.

3. Custom pronunciation dictionaries ensure accurate speech.

The most advanced TTS engines in the world still run into some pronunciation problems. Acronyms, proper nouns, and industry jargon can throw them for a loop—and lead to inaccurate speech, adding to the user’s confusion instead of removing it.

At ReadSpeaker, we offer personalized pronunciation dictionaries. While our linguists work to ensure accurate speech from the start, your custom dictionary covers the unique language involved in your use case. That could be obscure scientific terms, niche industry buzzwords, names of people and places, or anything else. Customization ensures accurate speech, and ReadSpeaker’s pronunciation dictionaries put you in control.
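In general terms, a custom dictionary works as an override layer on top of a default lexicon. The Python sketch below illustrates that idea only. It does not show ReadSpeaker’s dictionary format or API, and the entries and function names are hypothetical.

```python
from collections import ChainMap

# Hypothetical default readings and customer-specific overrides.
DEFAULT_RESPELLINGS = {"sql": "sequel"}
CUSTOM_RESPELLINGS = {"sql": "S Q L", "nginx": "engine x"}


def build_lexicon(custom, default):
    """Custom entries are consulted first, then the defaults."""
    return ChainMap(custom, default)


def respell(text, lexicon):
    """Swap known words for their spoken form before synthesis."""
    return " ".join(lexicon.get(word.lower(), word) for word in text.split())


sentence = "Restart nginx after the SQL migration"
print(respell(sentence, DEFAULT_RESPELLINGS))
print(respell(sentence, build_lexicon(CUSTOM_RESPELLINGS, DEFAULT_RESPELLINGS)))
```

With only the defaults, “nginx” passes through untouched and “SQL” gets the generic reading. With the custom layer, both follow the pronunciation chosen for the use case.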

4. ReadSpeaker provides global reach with a local touch.

ReadSpeaker TTS voices are available in more than 35 languages and dialects, allowing you to reach a global audience while serving distinct communities—whether they speak the Dutch of Belgium or the Netherlands; Australian, British, or U.S. English; Mandarin or Cantonese; or many other languages and dialects. Our list is always growing to meet the needs of new customers.

Don’t see your customers’ language represented? Contact us to discuss your project’s language requirements. Meanwhile, with offices in over 10 countries, ReadSpeaker linguists are always close at hand to solve pronunciation challenges including industry- and brand-specific jargon.

5. ReadSpeaker is 100% focused on voice technology.

Many major providers of TTS voices do so as an adjunct to professional services; they create conversational AI solutions, devoting the bulk of their R&D efforts to related technologies like natural language understanding (NLU) or conversation management systems rather than TTS.

At ReadSpeaker, TTS is all we do—and all of our R&D concentrates on improving synthetic speech. This deep, narrow focus also gives us the flexibility to work hand-in-hand with customers, ensuring expectation-defying TTS experiences before, during, and after launch. As part of our continuous improvement policy, we take feedback from our users, constantly update our products, and maintain an international team of computational linguists who will help you update custom pronunciation dictionaries. This ensures ongoing perfect speech, even for changing industry jargon, acronyms, initialisms, and proper nouns—the very specifics other TTS engines struggle to express accurately.

6. Get TTS for all your voice channels in one place.

What are your TTS goals? You might need to dynamically voice-enable your website, voicebot, or device; produce video voiceovers; integrate voice into a learning management system (LMS); or embed runtime speech into an application or a video game. Maybe you need all of these options and then some. Many TTS providers can only help with one or two of these technologies. ReadSpeaker provides solutions for every voice scenario.

By choosing a single TTS provider for all your voice touchpoints, you’ll cut down on costs and vendor management challenges. And if you create a custom TTS voice, ReadSpeaker allows you to provide brand consistency everywhere you meet your customers: voice ads, automated customer experiences, brand identity, interactive voice response (IVR) systems, owned voice assistants, and more.

7. Our TTS engines ensure full privacy, every time and for everyone.

Currently, some of the leading providers of ReadSpeaker-quality TTS voices are Big Tech giants. Often, conversational AI providers simply rely on these industry behemoths to supply their speech synthesis libraries. That can create potential conflicts of interest, as vendors may access and analyze user data.

That’s not a risk with ReadSpeaker, and the reason is simple: ReadSpeaker TTS solutions never collect data, not from our customers, and certainly not from yours. As an added bonus, this assured privacy can help you comply with local privacy regulations, whether that’s GDPR in the EU, HIPAA in the U.S., or any other privacy law.

8. Choose licensing based on your business model, not ours.

Many TTS providers stick to rigid contracts, often with hefty minimum purchase volumes. At ReadSpeaker, we’ll work with you to create a contract that reflects your business model, whether that’s licensing the perfect voice for a certain duration or partnering with you to meet other pre-agreed goals.

For even greater branding gains, ask us about our custom branded voices—a one-of-a-kind TTS voice built to express your brand traits and establish you as a distinct voice in all your consumer-outreach channels.

Sound interesting? Explore the ReadSpeaker speech synthesis library to find your language and listen to TTS voice samples. Better yet, contact us to request a curated selection of neural TTS voices for your unique application—or to develop a custom voice as an audio brand signature.
