In voice AI, 500ms isn't a UX number — it's a pipeline budget. Daily/Pipecat's 2026 STT benchmark alone shows TTFS medians ranging from 247ms to 1,136ms. Add LLM inference, TTS, and network round-trips, and production response onset easily exceeds 1 second.
The 500ms Barrier Is Created by the Entire System, Not a Single Model
When WebRTC Hacks benchmarked OpenAI's Realtime API in 2025, the theoretical STUN RTT floor was 60–70ms, but actual response latency was approximately 1.66–1.86 seconds. The implication is straightforward: even with fast GPUs, when transport, endpointing, buffering, and connection setup stack up, users experience not a 'fast model' but a 'frozen system.' Latency budgets must be broken down into STT, first-token, TTS TTFB, and transport RTT.
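A minimal sketch of that decomposition, assuming a four-stage critical path; the stage names and example numbers are illustrative, not tied to any particular framework:

```python
from dataclasses import dataclass

@dataclass
class TurnLatency:
    """Per-turn latency budget broken down by pipeline stage (all values in ms)."""
    stt_final_ms: float        # end of speech -> final transcript
    llm_first_token_ms: float  # prompt sent -> first token received
    tts_ttfb_ms: float         # text sent -> first audio byte
    transport_rtt_ms: float    # network round trip (client <-> server)

    @property
    def response_onset_ms(self) -> float:
        # These stages sit sequentially on the critical path, so the
        # user-perceived response onset is roughly their sum.
        return (self.stt_final_ms + self.llm_first_token_ms
                + self.tts_ttfb_ms + self.transport_rtt_ms)

# Plausible mid-range numbers already land well past a 500 ms budget.
turn = TurnLatency(stt_final_ms=300, llm_first_token_ms=350,
                   tts_ttfb_ms=170, transport_rtt_ms=70)
print(turn.response_onset_ms)  # 890.0
```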
At the STT Stage, TTFS and EOT Come Before Accuracy
Daily's public benchmark recorded Deepgram at 247ms median / 326ms p99, Soniox at 249ms median / 310ms p99, and Speechmatics at 495ms median / 736ms p99 across 1,000 real speech samples. Meanwhile, AWS hit 1,136ms median and Azure 1,016ms median, both with long tails. This is why p95 and p99 matter more than averages in production voice agents. Deepgram defines its key voice agent metric as end-of-turn latency rather than transcript latency and recommends 20–100ms audio chunks with integrated turn detection; its Flux model, which builds turn detection into the recognizer, can cut 200–600ms compared with a traditional STT+VAD chain.
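A rough sketch of what that chunking recommendation looks like over a WebSocket streaming STT connection; the endpoint URL and the is_final event field are placeholders, not a specific vendor's API:

```python
import asyncio
import json
import time

import websockets  # pip install websockets

STT_WS_URL = "wss://stt.example.com/v1/stream"   # placeholder streaming endpoint
SAMPLE_RATE, BYTES_PER_SAMPLE, CHUNK_MS = 16_000, 2, 20
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000

async def end_of_turn_latency_ms(pcm_audio: bytes) -> float:
    """Stream one utterance in 20 ms chunks and return the time from the last
    audio byte sent until the provider reports a final transcript."""
    async with websockets.connect(STT_WS_URL) as ws:
        for offset in range(0, len(pcm_audio), CHUNK_BYTES):
            await ws.send(pcm_audio[offset:offset + CHUNK_BYTES])
            await asyncio.sleep(CHUNK_MS / 1000)          # pace at real time
        speech_end = time.monotonic()
        async for message in ws:
            event = json.loads(message)
            if event.get("is_final"):                     # field name is illustrative
                return (time.monotonic() - speech_end) * 1000
    raise RuntimeError("connection closed before a final transcript arrived")
```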
LLM Bottlenecks Split on First-Token Strategy, Not Model Size
At the LLM stage, first-token latency matters more than tokens per second. A 2023 staged speculative decoding paper achieved a 3.16x reduction in decoding latency in small-batch, on-device settings, a direction that remains central to production inference optimization. In practice, speculative decoding, prompt compression, tool prefetch, and response streaming must be used together. Context Injection, in particular, isn't about injecting more; it's about injecting the right information fast. Feeding only intent-classification fields rather than the entire consultation history accelerates the first response.
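As a sketch of both ideas, the snippet below injects only intent-classification fields and measures time to the first streamed token; it assumes the OpenAI Python SDK's streaming chat interface, and the model name, field names, and example values are illustrative:

```python
import time

from openai import OpenAI  # assumes the OpenAI Python SDK; any streaming client works

client = OpenAI()

def first_token_latency_ms(intent_fields: dict, user_turn: str) -> float:
    """Send a compact prompt (intent fields only, not the full consultation
    history) and measure time from request to the first streamed token."""
    # Selective Context Injection: a handful of classified fields, not the whole record.
    context = "\n".join(f"{key}: {value}" for key, value in intent_fields.items())
    start = time.monotonic()
    stream = client.chat.completions.create(
        model="gpt-4o-mini",          # illustrative model choice
        stream=True,
        messages=[
            {"role": "system", "content": f"Known caller context:\n{context}"},
            {"role": "user", "content": user_turn},
        ],
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return (time.monotonic() - start) * 1000      # first token arrived
    return float("nan")

print(first_token_latency_ms(
    {"intent": "billing_dispute", "plan": "pro", "last_invoice": "2026-01"},
    "Why was I charged twice this month?",
))
```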
TTS: Manage TTFB Before Audio Quality
Users perceive when the first audio byte arrives, not total synthesis time. Async's 2025 streaming TTS benchmark recorded AsyncFlow at ~20ms model inference latency and 166ms median TTFB, noting that humans perceive pauses beyond 250–300ms. Production systems should stream chunked audio for immediate playback rather than waiting for a completed WAV file. On the browser side, WebRTC is the advantageous transport; on the telephony side, reusing persistent WebSocket sessions avoids repeated connection setup.
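A sketch of chunked playout against a generic HTTP streaming TTS endpoint (the URL and request shape are placeholders): the first chunk goes to the playout path immediately, and TTFB is what gets measured rather than total synthesis time.

```python
import time

import requests  # any HTTP client with streaming responses works

TTS_URL = "https://tts.example.com/v1/stream"    # placeholder streaming TTS endpoint

def speak(text: str, play_chunk) -> float:
    """Stream synthesized audio and forward each chunk to the playout path as
    it arrives; return time-to-first-byte in ms."""
    start = time.monotonic()
    ttfb_ms = None
    with requests.post(TTS_URL, json={"text": text}, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=4096):  # audio frames, not a finished WAV
            if ttfb_ms is None:
                ttfb_ms = (time.monotonic() - start) * 1000
            play_chunk(chunk)   # e.g. push into the WebRTC or telephony send buffer
    return ttfb_ms if ttfb_ms is not None else float("nan")
```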
BringTalk Designs for p99 Budgets, Not Averages
BringTalk manages latency as a pipeline SLO, not single-model performance. STT finalization, LLM first token, TTS TTFB, and transport RTT are tracked independently. Regional proximity deployment and, where necessary, self-hosted or edge inference reduce round-trip time. Even in Zero Retention environments, connection reuse and selective Context Injection are designed together to avoid adding unnecessary relay hops. Ultimately, what matters isn't 'the best model' — it's the system that starts speaking first at p99.
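A minimal sketch of that kind of per-stage p99 tracking, with stage names mirroring the budget above; the thresholds are illustrative, not BringTalk's actual SLOs:

```python
import math
from collections import defaultdict

class PipelineSLO:
    """Track each pipeline stage independently and alert on p99, not the mean."""

    def __init__(self, p99_budget_ms: dict[str, float]):
        self.budget = p99_budget_ms
        self.samples: dict[str, list[float]] = defaultdict(list)

    def record(self, stage: str, latency_ms: float) -> None:
        self.samples[stage].append(latency_ms)

    def p99(self, stage: str) -> float:
        values = sorted(self.samples[stage])
        index = min(len(values) - 1, math.ceil(0.99 * len(values)) - 1)
        return values[index]

    def violations(self) -> dict[str, float]:
        """Stages whose observed p99 exceeds their slice of the budget."""
        return {stage: self.p99(stage)
                for stage, limit in self.budget.items()
                if self.samples[stage] and self.p99(stage) > limit}

slo = PipelineSLO({"stt_final": 400, "llm_first_token": 500,
                   "tts_ttfb": 250, "transport_rtt": 100})
```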

