Principles of Speech Synthesis and Recognition

💡 Learning Guide: This chapter takes you deep into the underlying principles of AI audio. We'll explore not just the "dry" acoustic jargon (like STFT, flow matching, timbre embeddings), but also use intuitive analogies and interactive demos to help you thoroughly understand how AI "comprehends human speech" and "speaks aloud." Even if you're a complete beginner, you'll grasp these concepts with ease!

🎵 Three scenarios for experiencing AI audio:

💡 TTS (text-to-speech): letting AI read any text aloud
🎯 ASR (automatic speech recognition): converting speech into text
🎭 Voice cloning: copying a voice from only a few seconds of audio

0. Introduction: The "Digital Translation" of Physical Sound Waves

Human speech and the various sounds in our world are, at their core, continuous physical sound waves produced by air vibrations. But a computer's brain only knows 0 and 1 — it can't hear sound. Therefore, the first step in enabling AI to process sound is bridging the gap between the "physical world" and the "digital world."

This process is called Analog-to-Digital Conversion (A/D Conversion), and its core output is the Pulse-Code Modulation (PCM) waveform — the audio data we commonly encounter. It is defined by two key metrics:

Sample Rate: How many "snapshots" are taken of the sound wave per second. For example, 16kHz means 16,000 amplitude values are recorded every second.
Bit Depth: The precision of the "ruler" used for each snapshot. 16-bit means amplitude is distinguished across 65,536 levels.

But this introduces a problem: 16,000 numbers per second, hundreds of thousands of numbers for a single sentence — the information load is massive and redundant. Feeding this long, one-dimensional waveform directly into a neural network is like asking someone to judge whether a sweater's pattern looks good by examining the structure of each individual wool fiber up close — clearly an extremely difficult computational challenge.
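To make the two metrics concrete, here is a minimal sketch in plain Python that samples a sine tone at 16 kHz and quantizes each snapshot to 16 bits. The function names (`quantize`, `sample_sine`) and the 440 Hz test tone are illustrative assumptions, not part of any audio library.

```python
import math

SAMPLE_RATE = 16_000   # snapshots per second (16 kHz)
BIT_DEPTH = 16         # bits used to store each snapshot


def quantize(amplitude: float, bit_depth: int = BIT_DEPTH) -> int:
    """Map an amplitude in [-1.0, 1.0] to a signed integer PCM code."""
    max_code = 2 ** (bit_depth - 1) - 1          # 32767 for 16-bit audio
    clipped = max(-1.0, min(1.0, amplitude))     # out-of-range input is clipped
    return round(clipped * max_code)


def sample_sine(freq_hz: float, duration_s: float) -> list:
    """Take SAMPLE_RATE snapshots per second of a sine tone and quantize each."""
    n_samples = round(SAMPLE_RATE * duration_s)
    return [
        quantize(math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE))
        for n in range(n_samples)
    ]


pcm = sample_sine(440.0, 0.01)                    # 10 ms of a hypothetical 440 Hz tone
levels = 2 ** BIT_DEPTH                           # number of distinct amplitude levels
bytes_per_second = SAMPLE_RATE * BIT_DEPTH // 8   # storage cost of mono audio
```

Ten milliseconds already yields 160 integers, so a ten-second sentence yields 160,000 of them, which is the data volume the paragraph above refers to.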

1. Feature Engineering: Giving AI "Human Ears"

Since directly inspecting the one-dimensional waveform (time domain) doesn't work, scientists devised a change of representation: transforming the one-dimensional signal into a two-dimensional time-frequency map that shows which frequencies are present at each moment.

1.1 From a Line to a Picture: Short-Time Fourier Transform (STFT)

Imagine listening to a symphony. We rarely care about the total air displacement at any given instant — we care much more about which instruments are playing (different frequencies) and how loud they are (energy) during that stretch of time.

Through the mathematical magic of the Short-Time Fourier Transform (STFT), we can decompose a flat, linear sound wave into a two-dimensional matrix image containing "time, frequency, and energy (color intensity)" — this is called a Spectrogram. At this point, the problem of processing sound has been cleverly transformed into a "visual recognition" problem, which AI handles far more adeptly.
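The idea can be sketched in plain Python: cut the signal into short overlapping frames, apply a window, and take the magnitude of a Fourier transform of each frame. This is a teaching sketch under stated assumptions: it uses a slow naive DFT instead of an FFT, and the frame length, hop size, 1 kHz sample rate, and test tones are all made up for the demonstration.

```python
import cmath
import math


def hann(n: int) -> list:
    """Hann window, which tapers each frame toward zero at its edges."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def dft_magnitudes(frame: list) -> list:
    """Magnitudes of the non-negative frequency bins (naive O(N^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def stft(signal: list, frame_len: int = 64, hop: int = 32) -> list:
    """Return a matrix indexed as spectrogram[time_frame][frequency_bin]."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        frames.append(dft_magnitudes(frame))
    return frames


def peak_bin(magnitudes: list) -> int:
    """Index of the strongest frequency bin in one time frame."""
    return max(range(len(magnitudes)), key=magnitudes.__getitem__)


# Toy signal: a 125 Hz tone followed by a 250 Hz tone, sampled at 1 kHz.
SAMPLE_RATE = 1000
signal = [
    math.sin(2 * math.pi * (125 if n < 128 else 250) * n / SAMPLE_RATE)
    for n in range(256)
]
spectrogram = stft(signal)
first_peak_hz = peak_bin(spectrogram[0]) * SAMPLE_RATE / 64
last_peak_hz = peak_bin(spectrogram[-1]) * SAMPLE_RATE / 64
```

The first frame peaks at the bin for 125 Hz and the last frame at the bin for 250 Hz, so the matrix records both what was played and when, which the raw waveform does not make visible.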

1.2 Catering to Auditory Habits: The Mel Scale

In physics, the frequency axis is linear (the span from 0–100 Hz is the same width as 10,000–10,100 Hz). However, human ears are profoundly "biased": we are extremely sensitive to changes in low, deep sounds (low frequencies) but remarkably indifferent to subtle differences in sharp, high-pitched sounds (high frequencies).

To help AI, like humans, "focus its limited attention on what matters most," researchers introduced the nonlinear Mel Filterbanks. They partition low-frequency regions very finely while coarsely wrapping high-frequency regions. After a logarithmic transformation, we obtain the cornerstone of modern audio AI — the Mel-Spectrogram.
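A small sketch shows how the Mel scale allocates resolution. It assumes one commonly used conversion formula, m = 2595 · log10(1 + f / 700); other variants exist, and the helper names and the 0 to 8,000 Hz range are illustrative choices rather than a fixed standard.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to Mel (one common formula; variants exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_filters: int, f_min: float, f_max: float) -> list:
    """Edges of n_filters bands equally spaced on the Mel axis, given in Hz."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_filters + 2)]


# The same 100 Hz step covers far more Mel units at low frequencies.
low_step = hz_to_mel(200) - hz_to_mel(100)
high_step = hz_to_mel(10_100) - hz_to_mel(10_000)

# 80 filters over 0 to 8,000 Hz: narrow bands at the bottom, wide at the top.
edges = mel_band_edges(80, 0.0, 8000.0)
lowest_band_width_hz = edges[1] - edges[0]
highest_band_width_hz = edges[-1] - edges[-2]
```

Because the bands are equally spaced in Mel rather than in Hz, the lowest band is only a few tens of Hz wide while the highest spans a few hundred, which is the "fine at the bottom, coarse at the top" partition described above.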

👇 [Interactive demo: a one-dimensional waveform is transformed into a two-dimensional color map aligned with human perception. Panels: 🔊 Waveform (time domain; raw audio amplitude over time) → STFT transform → 📈 Linear spectrum → 🎯 Mel spectrogram (matches human hearing). Settings: FFT window 1024, Mel filters 80.]

🎧 Why use the Mel scale?

Human hearing: The step from 100 Hz to 200 Hz is a full octave and sounds like a large change, while the step from 10,000 Hz to 10,100 Hz is barely noticeable, even though both steps are 100 Hz wide.
Linear scale: Equal frequency intervals do not match human perception.

💡 Mel spectrogram principle: The Mel scale models the nonlinear way humans perceive frequency. We are more sensitive to low-frequency changes and less sensitive to high-frequency changes. Mel spectrograms map frequency to this scale so AI focuses on perceptually important regions.

2. Teaching Large Models a "Foreign Language": Two Mainstream Generation Paradigms

Once features are extracted, how do we teach AI to generate sound? Academia and industry currently pursue two parallel approaches.

2.1 Paradigm 1: Treating Sound as Text (Audio Tokenization)

Riding the wave of ChatGPT's popularity, scientists wondered: if we could turn sound into a sequence of "characters (Tokens)," could large language models (LLMs) directly sing and speak?

Compression & Quantization: Neural Codecs (e.g., EnCodec) built on VQ-VAE-style architectures compress an audio clip several megabytes in size into a short series of discrete codes drawn from a learned dictionary, the codebook (e.g., the sequence: [82, 105, 33...]).
Generative Next-Token Prediction: The AI model simply predicts what the next sound token should be, just like a text autocomplete game. Text and audio can then share the same underlying architecture, which greatly simplifies multimodal learning.
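The quantization step can be illustrated with a toy nearest-neighbor lookup: each frame vector produced by the encoder is replaced by the index of the closest codebook entry. This is only a sketch of the vector-quantization idea; the four-entry, two-dimensional codebook and the frame values are made up, whereas real codecs learn codebooks with around a thousand entries and stack several of them.

```python
def squared_distance(a: list, b: list) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_frames(frames: list, codebook: list) -> list:
    """Replace each frame vector by the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: squared_distance(frame, codebook[i]))
        for frame in frames
    ]


def dequantize(tokens: list, codebook: list) -> list:
    """Look the tokens back up; the result only approximates the input frames."""
    return [codebook[t] for t in tokens]


# Hypothetical 2-dimensional codebook and encoder outputs.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.95, 1.1]]

tokens = quantize_frames(frames, codebook)
reconstructed = dequantize(tokens, codebook)
```

The token list is what a language model would be trained to continue, one index at a time. The reconstruction is lossy: `reconstructed` holds codebook vectors, not the original frames.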

🔽 Encoder: Raw waveform (24kHz, 16-bit) → Conv 1 → Conv 2 → Conv 3 → Conv 4 (CNN downsampling, 320x dimension reduction) → VQ quantization → Discrete tokens (compressed: ~1.5 kbps)

🔼 Decoder: Discrete tokens (codebook indices) → ConvT 4 → ConvT 3 → ConvT 2 → ConvT 1 (transposed convolution, upsampling) → Reconstructed waveform (24kHz)

📊 Bitrate comparison

Model	Bitrate	Sample rate	Frame rate	Codebook size
EnCodec-24k	1.5 kbps	24 kHz	75 Hz	1024
EnCodec-48k	3.0 kbps	48 kHz	75 Hz	1024
SoundStream	6.0 kbps	16 kHz	50 Hz	1024
SNAC	0.98 kbps	24 kHz	43 Hz	4096
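The EnCodec-24k figures can be cross-checked with simple arithmetic: each token from a 1024-entry codebook carries log2(1024) = 10 bits, and tokens arrive 75 times per second. The sketch below assumes that two codebooks are used per frame to reach 1.5 kbps; that count is an assumption added for the calculation and is not stated in the figures above.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, codebook_size: int, n_codebooks: int) -> float:
    """Bits per second = frames per second x codebooks per frame x bits per token."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)


def pcm_bitrate_bps(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bit_depth * channels


# EnCodec-24k row: 75 Hz frame rate, codebook size 1024.
# n_codebooks=2 is an assumption chosen to match the 1.5 kbps figure.
token_bitrate = codec_bitrate_bps(75, 1024, n_codebooks=2)
raw_bitrate = pcm_bitrate_bps(24_000, 16)
compression_ratio = raw_bitrate / token_bitrate
downsampling_factor = 24_000 / 75   # waveform samples represented by one frame
```

Under these assumptions the numbers line up with the rest of the section: 1,500 bits per second for the tokens, 384,000 for raw 24 kHz 16-bit audio, a 256x reduction, and 320 waveform samples per token frame.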

🔢 Token sequence visualization: [figure showing the token sequence along a 0–2 s timeline in 0.1 s steps, color-coded as low-frequency, mid-frequency, and high-frequency components]

🎯 Why audio tokenization?

🚀 Efficient transfer: Compress audio to ~1.5 kbps, about 256x smaller than raw 24 kHz, 16-bit audio, making it suitable for network transfer.
🧠 Language-model friendly: Discrete tokens can be processed directly by LLMs, enabling unified text-to-audio modeling.
🎵 Music generation: Models such as MusicGen and AudioGen use audio tokens to generate music and sound effects.
🗣️ Speech synthesis: TTS models such as VALL-E and SoundStorm can generate audio tokens directly.

💡Neural audio codecs: Models such as EnCodec (Meta), SoundStream (Google), and SNAC use VQ-VAE style architectures to compress audio into discrete tokens. These tokens can be handled by language models for high-quality audio generation and compression.

2.2 Paradigm 2: Treating Sound as a Painting (Spectrogram Generation)

This is the foundational approach behind much of today's mature speech software, offering excellent controllability.

Spectrogram Generation: The AI model doesn't output the final audio waveform directly. Instead, it learns the mapping from "text" to a "two-dimensional Mel-Spectrogram," painting an acoustic feature map like an artist.
Waveform Reconstruction (Vocoder): Since spectrograms lose phase and other detail information and can't be played directly, we need a Vocoder (e.g., HiFi-GAN) to act as an interpreter, faithfully converting this image back into the one-dimensional waveform that drives speaker vibrations.
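Why a vocoder is needed at all can be seen on a single short frame: a spectrogram keeps only the magnitude of each frequency bin, and inverting the magnitudes without the phase does not give back the original samples. The sketch below uses a naive DFT on a made-up 32-sample frame. It only demonstrates the information loss; a neural vocoder such as HiFi-GAN works very differently, learning to produce a plausible waveform from the spectrogram.

```python
import cmath
import math


def dft(x: list) -> list:
    """Naive discrete Fourier transform (complex spectrum)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: list) -> list:
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def max_abs_error(a: list, b: list) -> float:
    """Largest sample-wise difference between two signals."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical frame: two sinusoids, one with a phase offset of 1 radian.
frame = [
    math.sin(2 * math.pi * 3 * t / 32 + 1.0) + 0.5 * math.sin(2 * math.pi * 7 * t / 32)
    for t in range(32)
]

spectrum = dft(frame)
magnitudes = [abs(c) for c in spectrum]   # all that a spectrogram column keeps

with_phase = idft(spectrum)               # magnitude and phase: exact recovery
without_phase = idft(magnitudes)          # magnitude only: phase set to zero

error_with_phase = max_abs_error(frame, with_phase)
error_without_phase = max_abs_error(frame, without_phase)
```

With the phase, the frame is recovered up to rounding error. Without it, the result has the same frequency content but a visibly different waveform, and it is this gap that the vocoder has to fill.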

3. Bidirectional Inversion: The Collaborative Translation of ASR and TTS

Giving machines "ears" and a "voice" essentially means performing two translations in opposite directions:

Automatic Speech Recognition (ASR): Translating sound into text. This is a many-to-one, convergent problem: many different recordings map to the same text. Models (like Whisper) must cope with noisy environments, accent variations, and homophones to pinpoint the single correct transcript.
Text-to-Speech (TTS): Translating text into sound. This is a one-to-many, divergent creative task. The same plain "Hello" can be spoken at countless different speeds and with different emotions, pauses, and vocal qualities. The model must infer these parameters, which the text does not specify.

📊 ASR vs TTS

Aspect	🎙️ ASR	🔊 TTS
Input	Audio waveform	Text sequence
Output	Text sequence	Audio waveform
Challenge	Noise, accents, homophones	Prosody, emotion, naturalness

🔀 Architecture comparison

ASR Pipeline: Audio → Features → Encoder → Decoder → Text
TTS Pipeline: Text → Encoder → Decoder → Vocoder → Audio

💡 Inverse relationship: ASR and TTS are two core directions in speech technology and inverse processes of each other. ASR converts continuous audio signals into discrete text, while TTS converts discrete text into continuous audio signals. Both rely on acoustic models and language models.

4. From "Squeezing Toothpaste" to "Express Lane": TTS Core Architecture Evolution

After understanding the basic pipeline, let's look at how TTS engines pursue extreme speed and coherence.

Sequential Brute Force (Autoregressive, AR): Older-generation models had to follow a strict time sequence, generating one short frame of audio and then using it as a reference to predict the next. This can sound very natural, but it is painfully slow and prone to stuttering, such as skipped or repeated words.
Divine Anticipation (Non-Autoregressive, NAR): Subsequent models introduced a Duration Predictor. Instead of generating frames one after another, the model predicts in one shot how long each phoneme should last, then outputs all frames of the sentence in parallel.
ODE Express Lane (Flow Matching): This is the current cutting-edge approach (e.g., F5-TTS). It builds on continuous normalizing flows and Ordinary Differential Equations (ODEs). The model learns a velocity field that carries samples along a nearly straight trajectory (probability flow) from pure Gaussian noise to a clean spectrogram. Because the path is nearly straight, sampling needs only a small number of ODE steps, which makes generation fast while keeping the voice smooth and natural.
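The flow matching idea can be sketched on a toy vector. The sketch assumes the simplest straight-line path between a noise sample and a data sample; the three-dimensional "spectrogram frame" and the `oracle_velocity` function, which stands in for a trained neural network, are hypothetical. It is not the F5-TTS implementation.

```python
import random


def interpolate(x0: list, x1: list, t: float) -> list:
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: list, x1: list) -> list:
    """Velocity along the straight path; this is what the network regresses."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted_velocity: list, x0: list, x1: list) -> float:
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(predicted_velocity, target)) / len(target)


def euler_sample(x0: list, velocity_fn, n_steps: int = 10) -> list:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        v = velocity_fn(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
data = [0.2, -0.5, 0.9]                        # hypothetical target frame
noise = [rng.gauss(0.0, 1.0) for _ in data]    # starting point: Gaussian noise


def oracle_velocity(x: list, t: float) -> list:
    """Stand-in for a trained network that has learned the velocity exactly."""
    return target_velocity(noise, data)


generated = euler_sample(noise, oracle_velocity, n_steps=10)
```

Training minimizes `flow_matching_loss` at random times t along the path, and generation is the Euler loop. With a perfectly straight path even a handful of steps lands on the target, which is the intuition behind the small step counts quoted for these models.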

Pipeline stages: 📝 Text processing (tokenize & phonemes) → 🔢 Text embedding (feature extraction) → 🌊 Flow matching (optimal transport) → 🔊 Vocoder (spectrum to waveform)

📝 Text processing: Convert input text into a phoneme sequence. Input: raw text. Output: phoneme sequence. Tech: G2P (grapheme-to-phoneme conversion).

📊 Architecture comparison

Feature	Autoregressive	Non-autoregressive	Flow matching
Generation speed	Slow	Fast	Very fast
Audio quality	High	Medium-high	High
Stability	Medium	High	
Controllability	Medium	High	

🏆 Representative models

Model	Type	Notes
Tacotron 2	AR	Classic AR model with excellent audio quality
FastSpeech 2	NAR	Parallel generation with high speed
F5-TTS	Flow	Recent SOTA, generated in 10 steps
CosyVoice	Flow	Alibaba open-source model with multilingual support

💡 TTS evolution trend: TTS has moved from early autoregressive models such as Tacotron, to non-autoregressive models such as FastSpeech, and now to flow matching models such as F5-TTS. The direction is faster, more stable, and higher-quality synthesis.
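The step that lets non-autoregressive models such as FastSpeech generate in parallel is often called a length regulator: once a duration (in frames) has been predicted for every phoneme, each phoneme's representation is simply repeated that many times. A minimal sketch follows, with phoneme labels standing in for embedding vectors; the phonemes and durations are made-up values.

```python
def length_regulate(phoneme_features: list, durations: list) -> list:
    """Repeat each phoneme's feature durations[i] times to get one entry per frame."""
    if len(phoneme_features) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for feature, n_frames in zip(phoneme_features, durations):
        frames.extend([feature] * n_frames)
    return frames


# Hypothetical phonemes for "hello" and hypothetical predicted durations.
phonemes = ["HH", "AH", "L", "OW"]
durations = [2, 3, 2, 5]

frame_sequence = length_regulate(phonemes, durations)
total_frames = len(frame_sequence)
```

Because the total number of frames is known up front, a decoder can fill in all frames at once instead of waiting for the previous frame, which is where the speed advantage over autoregressive generation comes from.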

5. Zero-Shot Voice Cloning

Just a few years ago, imitating someone's voice with AI required them to record tens of thousands of sentences in an extremely quiet studio and spend days training a model. Today, with just 3 seconds of audio, AI can produce a convincingly realistic clone.

This relies on a core technology: the Speaker Encoder and metric learning.

The speaker encoder is not merely a listener but a "genetic extractor." Its task is to strip away background noise and the specific words spoken (the text) from the audio, capturing only the stable traits of your voice: How long and thick are your vocal cords? How large is your resonant cavity? What are your articulation habits?
These features are ultimately compressed into a Speaker Embedding vector with a few hundred dimensions (e.g., an x-vector). This string of numbers, like a barcode, serves as a compact representation of your vocal identity. When the subsequent TTS model performs conditional generation "carrying this vector," whatever text it speaks, in any language, will carry the distinctive character of your voice.
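The "barcode" property can be illustrated with cosine similarity, a common way to compare speaker embeddings: vectors from the same speaker should point in nearly the same direction, while vectors from different speakers should not. The four-dimensional vectors and the 0.7 threshold below are invented for the example; real embeddings have hundreds of dimensions and thresholds are tuned on data.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(a: list, b: list, threshold: float = 0.7) -> bool:
    """Decide whether two embeddings likely come from the same voice."""
    return cosine_similarity(a, b) >= threshold


# Hypothetical embeddings: two clips of one speaker and one clip of another.
alice_clip_1 = [0.9, 0.1, 0.3, 0.2]
alice_clip_2 = [0.8, 0.2, 0.35, 0.15]
bob_clip = [0.1, 0.9, -0.2, 0.4]

alice_vs_alice = cosine_similarity(alice_clip_1, alice_clip_2)
alice_vs_bob = cosine_similarity(alice_clip_1, bob_clip)
```

Metric learning trains the encoder so that this holds in practice: clips of one speaker are pulled together and clips of different speakers are pushed apart, regardless of the words being spoken.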

1 Provide reference audio

👨 Male voice A: Low and magnetic
👩 Female voice B: Gentle and sweet
🧒 Child voice: Lively and cute
👴 Elder voice: Weathered and steady

2 AI learns voice features

📂 Load audio → 🔢 Encode features → 🎨 Extract timbre → 💎 Build embedding

3 Enter text to generate speech

💡 Voice cloning tips

⏱️ Reference duration: 3-10 seconds is enough; quality matters more than length.
🔇 Environment: Use a quiet environment and avoid background noise.
🗣️ Content choice: Audio with varied pitch and speaking speed works better.

🔬 Technical principle: Voice cloning extracts timbre, intonation, and speaking style from reference audio to build a speaker embedding. During generation, the TTS model combines text content with this speaker embedding to synthesize speech similar to the reference voice.

6. Breathing in a Soul: Emotional Rhythm and Fine-Grained Style Control

A phrase like "Really?" can express surprise or angry disbelief. Commercial-grade advanced AI must not only "read words correctly" but also "convey emotion."

Academia has proposed Global Style Tokens (GST) and feature bottleneck mechanisms. From massive corpora of human performance recordings, models can learn abstract soft style vectors that correspond to qualities such as "sadness," "excitement," or "laziness," without explicit labels. In engineering practice, we also expose intuitive tuning parameters like fundamental frequency (F0, controlling pitch rises and falls) and energy (controlling loudness and emphasis), giving creators the ability to finely sculpt "vocal emotion" much like molding a game character's facial features.
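The core of the style-token idea can be sketched as an attention-weighted mixture: a query derived from reference audio is scored against a small bank of learned token vectors, and the style embedding is their softmax-weighted sum. This is a simplified single-head sketch; the three two-dimensional tokens and the query are made up, and real systems learn the tokens during training and use a more elaborate attention module.

```python
import math


def softmax(scores: list) -> list:
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]   # subtract peak for stability
    total = sum(exps)
    return [e / total for e in exps]


def style_embedding(query: list, style_tokens: list):
    """Return attention weights over the tokens and their weighted sum."""
    scores = [sum(q * k for q, k in zip(query, token)) for token in style_tokens]
    weights = softmax(scores)
    dim = len(style_tokens[0])
    embedding = [
        sum(w * token[d] for w, token in zip(weights, style_tokens))
        for d in range(dim)
    ]
    return weights, embedding


# Hypothetical token bank; in a trained model each token might loosely
# correspond to a style such as "excited" or "sad".
style_tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [2.0, 0.0]   # hypothetical summary of a reference recording

weights, embedding = style_embedding(query, style_tokens)
```

At synthesis time the weights can also be set by hand instead of being computed from reference audio, which is one way a "more excited, less sad" style dial can be offered.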

Choose emotion style

😐 Neutral: Steady and natural
😊 Happy: Light and cheerful
😢 Sad: Low and slow
😠 Angry: Forceful and intense
🤩 Excited: Warm and energetic
😌 Calm: Relaxed and soothing

[Figure: Emotion Embedding Space, with points for Neutral, Happy, Sad, Angry, Excited, Calm]

🎚️ Fine-grained controls

Speed: 1x (Slow / Normal / Fast)
Pitch: 0 (Low / Normal / High)
Energy dynamics: 100% (Soft / Moderate / Intense)
Pause control: 150ms (Compact / Natural / Relaxed)

💡Emotion control: Modern TTS systems can synthesize natural speech and precisely control emotion, speed, pitch, and other style features. This lets AI voiceover adapt to different scenarios, from calm customer-service dialogs to energetic speeches.
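Controls for speed, pitch, and energy can be sketched as simple transformations of the prosody values a TTS model predicts before the spectrogram is generated. The function below is an illustrative assumption rather than any particular system's API: it takes per-phoneme F0 (in Hz, with 0 marking unvoiced sounds), energy, and durations in frames.

```python
def apply_style_controls(f0_hz: list, energy: list, durations: list,
                         pitch_semitones: float = 0.0,
                         energy_scale: float = 1.0,
                         speed: float = 1.0):
    """Shift pitch in semitones, scale energy, and change speaking rate."""
    pitch_factor = 2.0 ** (pitch_semitones / 12.0)     # +12 semitones doubles F0
    new_f0 = [f * pitch_factor if f > 0 else 0.0 for f in f0_hz]   # keep unvoiced at 0
    new_energy = [e * energy_scale for e in energy]
    new_durations = [max(1, round(d / speed)) for d in durations]  # at least 1 frame
    return new_f0, new_energy, new_durations


# Hypothetical predictions for three phonemes (the first is unvoiced).
f0 = [0.0, 110.0, 220.0]
energy = [0.1, 0.5, 0.4]
durations = [4, 6, 10]

higher_faster_softer = apply_style_controls(
    f0, energy, durations, pitch_semitones=12.0, energy_scale=0.5, speed=2.0
)
```

Working in semitones rather than Hz keeps the pitch control perceptually even, for the same reason the Mel scale is used for features: equal steps in Hz do not sound like equal steps to a listener.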

7. Conclusion

From basic digital signal conversion (PCM), to perceptually motivated features (Mel-Spectrogram), to the currently booming multimodal foundation models built on Flow Matching and Neural Codecs, audio AI is moving from mechanical simulation toward native understanding.

Future AI Agents will thoroughly bridge the high-dimensional links of human vision, hearing, and speech, responding to every interaction with genuine human-like intuition!

8. Core Terminology Glossary

Term	Full Name	Definition
PCM	Pulse-Code Modulation	The basic uncompressed way of recording a one-dimensional audio waveform: a sequence of amplitude samples taken at a fixed sample rate and bit depth.
STFT	Short-Time Fourier Transform	A mathematical analysis method that transforms sound from a sequence of amplitude values over time into a representation showing energy as a function of both time and frequency.
Mel-Spectrogram	Mel-Spectrogram	The foundational feature for large-model audio processing: a two-dimensional spectrogram whose frequency axis is warped to the nonlinear Mel scale of human hearing and whose energies are log-compressed.
Neural Codec	Neural Codec	An AI component that uses vector-quantized autoencoders, often with residual quantization, to compress continuous sound waves into discrete labels (Tokens).
Vocoder	Vocoder	The "reverse interpreter": responsible for rendering a two-dimensional Mel-Spectrogram back into a one-dimensional audio waveform that can drive speakers.
Speaker Embeddings	Speaker Embeddings	A fixed-length, high-dimensional vector (e.g., x-vector) that captures a specific person's vocal timbre.
Flow Matching	Flow Matching	A generative modeling method that trains a velocity field whose ordinary differential equation carries samples from a simple distribution (e.g., a standard normal) to the data distribution along nearly straight paths, without simulating a stochastic process during training.
