AI Voice

What is Text-to-Speech (TTS)?

Definition

Technology that converts written text into natural-sounding spoken audio. It is the speaking layer of AI voice agents, provided by services such as ElevenLabs and OpenAI TTS.

In more detail

Text-to-Speech (TTS) is the final stage of an AI voice agent pipeline: after the language model generates a text response, TTS converts that text into audio that is streamed back to the caller. Modern neural TTS models produce speech with natural pacing, intonation, and even emotional nuance, far beyond the robotic synthesisers of legacy IVR systems.

Key TTS providers include ElevenLabs (widely regarded as a leader in naturalness, supports voice cloning and custom voices, higher cost), OpenAI TTS (good quality, cost-effective, integrated with the OpenAI stack), Google Cloud TTS (reliable, extensive language support), and PlayHT (strong voice cloning capabilities). Voice selection significantly affects how callers perceive a brand's AI voice agent.

For real-time voice agents, streaming TTS is essential. The system should begin speaking the first sentence of its response while the rest is still being generated, rather than waiting for the full response before starting audio output. Reducing this 'first-word latency' is often the most impactful change for making a voice agent feel natural in conversation.
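
One way to picture sentence-level streaming is the minimal Python sketch below. It buffers tokens as they arrive from the language model and hands each sentence to TTS as soon as it is complete. The names sentences_from_tokens, stream_speech, and the synthesize callback are hypothetical and stand in for whatever your LLM and TTS provider actually expose; the punctuation-based sentence splitting is a simplifying assumption and will mis-split abbreviations such as "Dr. Smith".

```python
import re
from typing import Callable, Iterable, Iterator

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as the token stream completes it."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_BOUNDARY.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever remains once the model stops generating.
    if buffer.strip():
        yield buffer.strip()


def stream_speech(tokens: Iterable[str], synthesize: Callable[[str], None]) -> None:
    """Send sentences to a (hypothetical) TTS callback without waiting for the full reply."""
    for sentence in sentences_from_tokens(tokens):
        synthesize(sentence)


if __name__ == "__main__":
    # Simulated LLM token stream; print stands in for a real TTS call.
    demo_tokens = ["Hello", " there", ".", " How", " can", " I", " help", "?"]
    stream_speech(demo_tokens, print)
```

In this sketch the first sentence reaches the TTS callback as soon as the token after its closing punctuation arrives, so audio can start while later sentences are still being generated. A production system would typically also run synthesis and playback concurrently with generation, which is left out here for brevity.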

Why it matters

The voice is the brand in a voice AI system. Poor TTS makes agents sound robotic and erodes caller trust, while natural, well-chosen TTS creates a professional experience that reflects well on the business.
