What is the latency budget for conversational voice AI?

Human conversation has natural turn-taking pauses of about 200-300ms. A voice AI agent that takes longer than 500ms to start responding feels laggy. The budget breaks down roughly as: STT 50-150ms, LLM 200-500ms (time to first token), TTS 50-150ms, and network overhead 20-50ms.

What is the difference between STT and TTS in voice AI?

STT (Speech-to-Text) converts raw audio from the caller into a text transcript. Modern STT engines process audio as a stream, emitting partial transcripts as the caller speaks. TTS (Text-to-Speech) converts the LLM's text response into audio the caller hears. Modern TTS uses neural codec models that generate speech with natural prosody, breathing patterns, and emotional tone.

What makes a voice AI agent different from a chatbot?

Function calling is what separates a voice agent from a voice chatbot. When the LLM decides an action is needed, it emits a structured tool call (e.g., book_appointment). The orchestration layer executes the tool, feeds the result back to the LLM, and the LLM incorporates it into its spoken response. A voice agent takes real actions rather than only talking.

What is Cloudflare Voice in voice AI?

Cloudflare Voice provides the browser-to-Cloudflare realtime voice protocol for Agents. In this demo, Cloudflare Agents and Durable Objects hold the live session, Workers AI performs STT and TTS, and the Worker routes or escalates the result without depending on a separate voice SaaS provider.

Understanding Voice AI - From Speech to Action

TL;DR

A voice AI agent is a three-stage pipeline: Speech-to-Text (STT) converts the caller's voice into text, a Large Language Model (LLM) processes the text and decides what to do, and Text-to-Speech (TTS) converts the response back into natural-sounding audio. To feel conversational, the agent needs to start responding within about 500 milliseconds of the caller finishing their turn. Cloudflare Voice, Agents, and Workers AI keep this pipeline in one edge-native application, managing turn-taking, interruption handling, and tool execution so the voice agent can take real actions, not just talk.

The Three-Stage Pipeline

Every voice AI interaction follows the same fundamental architecture: the caller speaks, and the system listens, thinks, and responds. What makes modern voice AI different from the IVR phone trees of the past is that each stage is now powered by a different specialized model, and the orchestration between them happens in real time.

  [Caller Speaks]
       |
       v
  [STT: Speech-to-Text]
  Deepgram / Whisper / Google STT
  Audio stream --> text transcript
       |
       v
  [LLM: Language Model]
  GPT-4o / Claude / Gemma
  Text --> reasoning --> tool calls --> response text
       |
       v
  [TTS: Text-to-Speech]
  ElevenLabs / PlayHT / Cartesia
  Response text --> natural audio stream
       |
       v
  [Caller Hears Response]
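
Read top to bottom, the diagram amounts to three function calls chained once per caller turn. The Python sketch below is only an illustration of that shape: the stage functions are hypothetical stubs rather than any vendor's API, and a real system streams between stages instead of passing whole strings.

```python
from typing import Callable

# Hypothetical stage signatures; real engines stream audio and text incrementally.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


def run_turn(audio: bytes, stt: SpeechToText, llm: LanguageModel, tts: TextToSpeech) -> bytes:
    """Run one caller turn through STT -> LLM -> TTS and return response audio."""
    transcript = stt(audio)        # Stage 1: audio -> text transcript
    reply_text = llm(transcript)   # Stage 2: text -> response text (tool calls omitted)
    return tts(reply_text)         # Stage 3: response text -> audio


# Stub stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")   # pretend the audio bytes are already words


def fake_llm(transcript: str) -> str:
    return f"You said: {transcript}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")    # pretend the encoded text is audio


response_audio = run_turn(b"book a table for two", fake_stt, fake_llm, fake_tts)
```

Passing the stages in as plain callables keeps each one swappable, which is the property the rest of this article relies on when it compares engines per stage.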


Stage 1: Speech-to-Text (STT)

The STT model converts raw audio into a text transcript. Modern STT engines process audio as a stream, emitting partial transcripts as the caller speaks rather than waiting for silence. This is called streaming transcription, and it is essential for low-latency voice agents because the LLM can start processing before the caller finishes their sentence.
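
As a rough illustration of streaming transcription, the following Python sketch simulates an STT stream that emits partial results before each final one. The `TranscriptEvent` shape and the `simulated_stt_stream` generator are assumptions made for this example; real engines expose their own event formats.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class TranscriptEvent:
    text: str        # transcript of the current utterance so far
    is_final: bool   # True once the engine has settled on this utterance


def simulated_stt_stream() -> Iterator[TranscriptEvent]:
    """Stand-in for a streaming STT connection (hypothetical event shape)."""
    yield TranscriptEvent("I'd like", False)
    yield TranscriptEvent("I'd like to book", False)
    yield TranscriptEvent("I'd like to book an appointment", True)
    yield TranscriptEvent("for Tuesday", False)
    yield TranscriptEvent("for Tuesday at two", True)


def collect_transcript(events: Iterable[TranscriptEvent]) -> tuple[list[str], int]:
    """Return the finalized utterances and how many partial updates arrived early."""
    finals: list[str] = []
    partial_count = 0
    for event in events:
        if event.is_final:
            finals.append(event.text)   # settled text, safe to commit
        else:
            partial_count += 1          # early text the LLM could start working on
    return finals, partial_count


finals, partial_count = collect_transcript(simulated_stt_stream())
```

The partial events are what buy latency: each one is text that exists before the caller has stopped speaking.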

Deepgram is a common STT choice in production voice agents because of its streaming-first design and low latency (typically under 100ms for partial results). OpenAI Whisper offers higher accuracy on complex speech but at higher latency. Google Cloud STT provides strong multilingual support. The choice depends on the application's latency budget and language requirements.

Stage 2: The LLM Brain

Once the caller's speech is transcribed, the text goes to an LLM. This is where voice AI diverges from a simple voice-to-text tool. The LLM does not just formulate a reply. It reasons about the request, decides whether to call external tools (book an appointment, look up an order, transfer the call), and constructs a response that accounts for conversational context.

According to Flowful's 2026 analysis of AI voice agent architectures, the LLM stage is the primary bottleneck in the pipeline. STT and TTS each take 50-150ms. The LLM call takes 200-800ms depending on the model and prompt complexity. This is why voice AI platforms use streaming LLM responses: the TTS engine starts generating audio from the first sentence while the LLM is still producing the rest.
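
One way to picture the hand-off from a streaming LLM to TTS is a small buffer that releases text one sentence at a time. The sketch below is a simplified assumption of how that could work, not a description of any platform's implementation; the token list is made up, and the punctuation check is deliberately naive (it would split on abbreviations such as "Dr." or decimals such as "3.5").

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()   # hand this sentence to TTS immediately
            buffer = ""
    if buffer.strip():
        yield buffer.strip()       # flush trailing text with no closing punctuation


# Simulated LLM token stream (hypothetical tokenization).
llm_tokens = ["Sure", ",", " I", " can", " help", ".", " What", " day", " works", "?", " Thanks"]
chunks = list(sentences_from_tokens(llm_tokens))
```

Because the function is a generator, the first sentence is available to the TTS stage while later tokens are still arriving.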

Function calling is what separates a voice agent from a voice chatbot. When the LLM decides an action is needed, it emits a structured tool call (e.g., book_appointment({date: "2026-04-15", time: "2pm"})). The orchestration layer executes the tool, feeds the result back to the LLM, and the LLM incorporates it into its spoken response.
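
The execute-and-feed-back step can be sketched as a small dispatcher. In this Python example the JSON shape of the tool call (`name` plus `arguments`), the `TOOLS` registry, and the body of `book_appointment` are all assumptions for illustration; each LLM provider defines its own tool-call format.

```python
import json
from typing import Any, Callable


def book_appointment(date: str, time: str) -> dict[str, Any]:
    """Hypothetical tool; a real one would call a scheduling API."""
    return {"status": "confirmed", "date": date, "time": time}


# Registry mapping tool names the LLM may emit to local functions.
TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "book_appointment": book_appointment,
}


def execute_tool_call(raw_call: str) -> dict[str, Any]:
    """Parse a JSON tool call, run the matching tool, and return its result."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        # Report the problem back to the LLM instead of raising mid-call.
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name](**call["arguments"])


llm_output = '{"name": "book_appointment", "arguments": {"date": "2026-04-15", "time": "2pm"}}'
tool_result = execute_tool_call(llm_output)
unknown_result = execute_tool_call('{"name": "cancel_everything", "arguments": {}}')
```

The returned dictionary is what the orchestration layer would pass back to the LLM so it can phrase the spoken confirmation.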

Stage 3: Text-to-Speech (TTS)

The final stage converts the LLM's text response into audio the caller hears. Modern TTS has crossed the uncanny valley. ElevenLabs, according to their documentation, uses a neural codec model that generates speech with natural prosody, breathing patterns, and emotional tone from a text prompt alone. You can clone a specific voice from a short audio sample or use pre-built voices optimized for different use cases (warm and professional for customer service, energetic for sales).

Streaming TTS is non-negotiable for conversational voice. The TTS engine receives text token by token from the LLM and begins audio playback before the full response is generated. This pipelining is how the total round-trip stays under the 500ms threshold that Softcery's 2026 comparison of voice platforms identifies as the upper bound for natural-feeling conversation.
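
To see why pipelining matters, compare the time to first audio when TTS waits for the whole LLM reply against the time when it starts on the first sentence. The figures in this Python sketch are hypothetical values chosen for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageTimings:
    stt_ms: int                 # time to a usable transcript
    llm_first_sentence_ms: int  # time until the LLM has streamed one full sentence
    llm_full_response_ms: int   # time until the LLM has finished the whole reply
    tts_first_chunk_ms: int     # time for TTS to emit its first audio chunk
    network_ms: int             # transport overhead


def time_to_first_audio(t: StageTimings, pipelined: bool) -> int:
    """Milliseconds until the caller hears something."""
    llm_wait = t.llm_first_sentence_ms if pipelined else t.llm_full_response_ms
    return t.stt_ms + llm_wait + t.tts_first_chunk_ms + t.network_ms


# Hypothetical timings for illustration only.
timings = StageTimings(
    stt_ms=80,
    llm_first_sentence_ms=250,
    llm_full_response_ms=900,
    tts_first_chunk_ms=80,
    network_ms=30,
)
sequential_ms = time_to_first_audio(timings, pipelined=False)
pipelined_ms = time_to_first_audio(timings, pipelined=True)
```

With these assumed numbers the sequential path takes 1090ms and the pipelined path 440ms, so only the pipelined version lands under the 500ms threshold.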

Latency: The Defining Constraint

Human conversation has a natural turn-taking rhythm with pauses of about 200-300ms between speakers. A voice AI agent that takes longer than 500ms to start responding feels laggy. Over 1 second feels broken. This latency budget is the single most important architectural constraint in voice AI.

The budget breaks down roughly as follows:

STT: 50-150ms (streaming partial transcripts)
LLM: 200-500ms (time to first token, streaming)
TTS: 50-150ms (time to first audio chunk)
Network overhead: 20-50ms (WebRTC/SIP transport)
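
Adding up the ranges above shows how tight the budget is. The Python sketch below sums the best and worst cases and labels a delay using the 500ms and 1 second thresholds from this section; the dictionary keys and the `perceived_quality` helper are names invented for this example.

```python
# Per-stage ranges in milliseconds, copied from the list above.
BUDGET_MS = {
    "stt": (50, 150),
    "llm_first_token": (200, 500),
    "tts_first_chunk": (50, 150),
    "network": (20, 50),
}

best_case_ms = sum(low for low, _ in BUDGET_MS.values())
worst_case_ms = sum(high for _, high in BUDGET_MS.values())


def perceived_quality(delay_ms: int) -> str:
    """Map a response delay to how it feels to the caller."""
    if delay_ms <= 500:
        return "conversational"
    if delay_ms <= 1000:
        return "laggy"
    return "broken"


best_label = perceived_quality(best_case_ms)
worst_label = perceived_quality(worst_case_ms)
```

The totals come to 320ms in the best case and 850ms in the worst, so staying under 500ms means keeping every stage near the low end of its range.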

Production voice platforms optimize every stage. They use regional inference endpoints to minimize network hops. They pre-warm model connections. They implement speculative execution, where the TTS begins generating audio based on predicted LLM output before the model confirms its response. These optimizations are what make a voice agent sound like a human conversation rather than a phone menu.
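
Speculative execution can be pictured as pre-synthesizing a likely opening phrase and keeping it only if the LLM's confirmed text actually starts that way. The following Python sketch is a toy model under that assumption: `SpeculativeOpener` and `fake_tts` are hypothetical names, and a real platform would predict, stream, and cancel audio in ways this example does not attempt.

```python
def fake_tts(text: str) -> bytes:
    """Stand-in for a TTS call; returns bytes so the sketch is runnable."""
    return text.encode("utf-8")


class SpeculativeOpener:
    """Pre-synthesizes a predicted opening and reuses it only if the LLM agrees."""

    def __init__(self, predicted_opening: str) -> None:
        self.predicted_opening = predicted_opening
        # Synthesized up front, while the LLM is still producing its reply.
        self.cached_audio = fake_tts(predicted_opening)

    def resolve(self, actual_text: str) -> tuple[bytes, bool]:
        """Return (audio, speculation_hit) for the confirmed LLM text."""
        if actual_text.startswith(self.predicted_opening):
            remainder = actual_text[len(self.predicted_opening):]
            return self.cached_audio + fake_tts(remainder), True
        # Misprediction: discard the cached audio and synthesize from scratch.
        return fake_tts(actual_text), False


opener = SpeculativeOpener("Let me check that for you.")
hit_audio, hit = opener.resolve("Let me check that for you. Your order ships Friday.")
miss_audio, miss = opener.resolve("Sorry, could you repeat that?")
```

The trade-off is wasted synthesis on a miss in exchange for a faster first audio chunk on a hit.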

Cloudflare-Native Architecture

This demo now uses Cloudflare's native voice stack instead of a third-party voice orchestration provider. The browser connects to a Cloudflare Agent through the @cloudflare/voice client, the Durable Object keeps session state, Workers AI handles speech-to-text and text-to-speech, and the Worker runs the routing and safety logic.

Keeping the path Cloudflare-native matters for the demo: the same Worker serves the UI, static scenario audio, realtime voice route, text fallback endpoint, and documentation. That keeps the architecture easier to inspect, cheaper to operate, and better aligned with the production Cloudflare platform this site is meant to represent.

The important design choice is not a specific vendor name; it is owning the voice pipeline boundaries: audio transport, STT, LLM reasoning, TTS, transcript policy, rate limits, and escalation hooks. Cloudflare Voice and Agents let those boundaries live inside the same edge application as the rest of the demo.
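
As a loose illustration of owning those boundaries, the sketch below puts a rate limit, a transcript policy, and an escalation hook around a stubbed LLM call. It is written in Python for brevity and does not use the Cloudflare Agents or Voice APIs; `SessionPolicy`, `VoiceSession`, and the return markers are hypothetical names, and the actual demo may be structured differently.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SessionPolicy:
    max_turns_per_session: int = 20                          # rate-limit boundary
    escalation_phrases: tuple[str, ...] = ("human", "representative")
    store_transcript: bool = False                           # transcript policy boundary


@dataclass
class VoiceSession:
    policy: SessionPolicy
    reply_fn: Callable[[str], str]                           # LLM boundary (stubbed)
    turns: int = 0
    transcript: list[str] = field(default_factory=list)

    def handle_turn(self, caller_text: str) -> str:
        """Apply policy checks before the text ever reaches the LLM."""
        self.turns += 1
        if self.policy.store_transcript:
            self.transcript.append(caller_text)
        if self.turns > self.policy.max_turns_per_session:
            return "RATE_LIMITED"
        if any(p in caller_text.lower() for p in self.policy.escalation_phrases):
            return "ESCALATE"                                # escalation hook fires here
        return self.reply_fn(caller_text)


session = VoiceSession(SessionPolicy(max_turns_per_session=2), reply_fn=lambda text: f"Echo: {text}")
first = session.handle_turn("Where is my order?")
second = session.handle_turn("I want a human")
third = session.handle_turn("Hello?")
```

Because the policy sits outside the model call, changing the LLM or the speech engines does not change how limits and escalation behave.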

Real-World Use Cases

Voice AI agents are deployed in production across several industries where phone-based interaction is the norm:

Appointment scheduling: Medical offices, salons, and service businesses use voice agents to handle booking calls 24/7. The agent checks availability via API, confirms the appointment, and sends a confirmation SMS.
Order status and support: E-commerce companies deploy voice agents on their support lines to handle "where is my order" calls. The agent looks up the order by phone number, retrieves tracking info, and speaks the status.
Lead qualification: Inbound sales calls are answered by a voice agent that asks qualifying questions, scores the lead, and either books a meeting with a human rep or routes the call directly.
After-hours reception: Professional services firms (law offices, accounting firms) use voice agents as after-hours receptionists that can answer common questions, take messages, and schedule callbacks.

Flowful's 2026 analysis estimates that voice AI agents handle over 2 million business calls per day in North America, with the automotive, healthcare, and real estate verticals leading adoption.

See It in Action

The Voice AI demo plays pre-recorded calls through the full STT to LLM to TTS pipeline with audio, waveform visualization, and live transcript at each stage. You can hear the conversation, see the transcript generate, and follow the response flow in real time.

Voice AI Demo →

Sources & Further Reading

Cloudflare Agents Documentation. Durable agent sessions, routing, and Worker-native deployment patterns.
Cloudflare Workers AI Documentation. Edge model inference for text, speech-to-text, and text-to-speech workloads.
Cloudflare Workers Static Assets. Serving demo audio and other assets from the same Worker deployment.
Flowful: AI Voice Agents in 2026. Market adoption data, pipeline architecture analysis, and production deployment patterns.
Deepgram: Streaming Speech-to-Text. Real-time STT architecture, partial transcript delivery, and latency optimization.
