Voice-First Interfaces with Realtime APIs: Barge-In, Turn Detection, and Fallback
The first version of our voice assistant was technically impressive and practically unusable. It listened politely while users finished speaking, waited for the LLM to respond, then played back audio — a full conversational round trip. The problem was that users did not know when to start speaking, so they talked over the audio response. The assistant did not notice. It finished its sentence. Users gave up.
Natural conversation is full-duplex. Both parties can speak and be heard simultaneously. Building a voice interface that feels natural means solving barge-in, turn detection, latency, and graceful degradation — not just plugging a TTS library into a chat API.
The Latency Budget
Before writing any code, define your latency budget. Users perceive latency differently in voice than in text. The threshold for "this feels broken" is around 800ms total round-trip for a voice response. Budget it explicitly.
```
User stops speaking
  → VAD end-of-turn detection:   50–150ms
  → Audio transcription (STT):  100–300ms  (streaming reduces this)
  → LLM first token:            300–600ms  (depends on model, prompt length)
  → TTS first audio chunk:      100–200ms
  → Audio playback start:        20–50ms
  ─────────────────────────────────────────
  Total perceived latency:      570–1300ms
```
The 800ms budget is tight. You will not hit it consistently without streaming at every stage. Non-streaming STT + non-streaming LLM + non-streaming TTS adds up to 2–4 seconds. That is a chat interface, not a voice interface.
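One way to keep the budget honest is to time each stage explicitly rather than only the end-to-end total. A minimal sketch (the stage names and `time.sleep` stand-ins are illustrative):

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Records wall-clock duration of each pipeline stage in milliseconds."""
    def __init__(self):
        self.stages: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.monotonic()
        try:
            yield
        finally:
            self.stages[name] = (time.monotonic() - start) * 1000.0

    def total_ms(self) -> float:
        return sum(self.stages.values())

timer = StageTimer()
with timer.stage("vad_end_of_turn"):
    time.sleep(0.01)  # stand-in for VAD end-of-turn detection
with timer.stage("stt"):
    time.sleep(0.01)  # stand-in for streaming transcription
print(f"total: {timer.total_ms():.0f}ms")
```

Per-stage numbers tell you which component blew the budget; a single round-trip number does not.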
Voice Activity Detection
VAD is the component that decides when the user has finished speaking. Getting it wrong in either direction is painful: too aggressive and you cut off users mid-sentence; too conservative and there is a dead pause before the assistant responds.
Two common approaches:
Energy-based VAD is fast and cheap — measure the RMS amplitude of audio frames and apply a threshold. It fails in noisy environments and does not distinguish speech from background noise.
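The energy-based approach fits in a few lines, which is exactly why it is tempting. A sketch; the threshold value is illustrative and would need tuning per microphone and environment:

```python
import numpy as np

def rms_vad(frame: bytes, threshold: float = 500.0) -> bool:
    """Energy-based VAD: classify a 16-bit PCM frame as speech
    if its RMS amplitude exceeds a fixed threshold."""
    samples = np.frombuffer(frame, dtype=np.int16).astype(np.float64)
    rms = np.sqrt(np.mean(samples ** 2)) if samples.size else 0.0
    return rms > threshold
```

Any steady background noise above the threshold (fan, traffic, keyboard) reads as speech, which is the failure mode described above.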
Model-based VAD (Silero VAD, WebRTC VAD) uses a neural network trained to detect speech. It handles noise better and produces confidence scores rather than binary decisions. Use this in production.
```python
import numpy as np
import torch

class SileroVAD:
    """Thin wrapper around Silero VAD for streaming audio."""

    def __init__(self, threshold: float = 0.5, sampling_rate: int = 16000):
        self.model, _ = torch.hub.load(
            repo_or_dir='snakers4/silero-vad',
            model='silero_vad',
            force_reload=False,
        )
        self.threshold = threshold
        self.sampling_rate = sampling_rate
        self._speech_started = False
        self._silence_frames = 0
        self.SILENCE_FRAMES_TO_END = 8  # ~500ms of trailing silence, assuming 64ms frames

    def process_chunk(self, audio_chunk: bytes) -> dict:
        # Convert 16-bit PCM to float32 in [-1, 1], as the model expects
        audio_array = np.frombuffer(audio_chunk, dtype=np.int16).astype(np.float32) / 32768.0
        tensor = torch.FloatTensor(audio_array)
        confidence = self.model(tensor, self.sampling_rate).item()
        is_speech = confidence > self.threshold

        if is_speech:
            self._silence_frames = 0
            if not self._speech_started:
                self._speech_started = True
                return {"event": "speech_start", "confidence": confidence}
            return {"event": "speech_ongoing", "confidence": confidence}

        if self._speech_started:
            self._silence_frames += 1
            if self._silence_frames >= self.SILENCE_FRAMES_TO_END:
                self._speech_started = False
                self._silence_frames = 0
                return {"event": "speech_end", "confidence": confidence}
        return {"event": "silence", "confidence": confidence}
```
Barge-In: Interrupting the Assistant
Barge-in is the ability to interrupt the assistant while it is speaking. Without it, users feel trapped waiting for a long response to finish before they can correct a misunderstanding.
The implementation requires three things working together:
- Continue running VAD during assistant playback — most implementations stop VAD while the assistant is speaking. Do not.
- An interrupt signal that stops audio playback and cancels in-flight generation — this must be fast, under 100ms from detection to silence.
- Context preservation — the partial response that was interrupted needs to be logged (with an `[INTERRUPTED]` marker) so the conversation history stays coherent.
```python
import asyncio
from dataclasses import dataclass

@dataclass
class ConversationTurn:
    role: str  # "user" or "assistant"
    content: str
    interrupted: bool = False

class VoiceConversationManager:
    def __init__(self, vad, stt_client, llm_client, tts_client):
        self.vad = vad
        self.stt = stt_client
        self.llm = llm_client
        self.tts = tts_client
        self.history: list[ConversationTurn] = []
        self._playback_task: asyncio.Task | None = None
        self._generation_task: asyncio.Task | None = None
        self._partial_response = ""

    async def handle_audio_chunk(self, chunk: bytes):
        event = self.vad.process_chunk(chunk)
        if event["event"] == "speech_start":
            # User started speaking — interrupt assistant if active
            await self._interrupt_assistant()
        elif event["event"] == "speech_end":
            # User finished utterance — transcribe and respond
            transcription = await self.stt.transcribe_buffered()
            if transcription.strip():
                self.history.append(ConversationTurn(role="user", content=transcription))
                await self._generate_and_play_response()

    async def _interrupt_assistant(self):
        if self._playback_task and not self._playback_task.done():
            self._playback_task.cancel()
            await self.tts.stop_playback()
        if self._generation_task and not self._generation_task.done():
            self._generation_task.cancel()
        # Save partial response to history with interrupted marker
        if self._partial_response:
            self.history.append(ConversationTurn(
                role="assistant",
                content=self._partial_response,
                interrupted=True,
            ))
            self._partial_response = ""

    async def _generate_and_play_response(self):
        self._partial_response = ""

        async def generate():
            messages = [
                {"role": t.role, "content": t.content + (" [INTERRUPTED]" if t.interrupted else "")}
                for t in self.history
            ]
            async for chunk in self.llm.stream(messages):
                self._partial_response += chunk
                await self.tts.stream_text(chunk)

        self._generation_task = asyncio.create_task(generate())
        self._playback_task = asyncio.create_task(self.tts.play_stream())
```
Turn Detection Beyond Silence
Silence-based turn detection fails in two common cases: when users pause mid-sentence to think, and when users are waiting for a response and are silent expectantly. Both look like end-of-turn on audio alone.
Augment with semantic turn detection: after a configurable minimum silence (300ms), send the transcription-so-far to a fast, small model to predict whether the utterance is complete.
```python
async def is_turn_complete(partial_transcript: str, llm_client) -> bool:
    """Quick heuristic check — use a small, fast model."""
    response = await llm_client.complete(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Is this a complete thought or sentence that expects a response? "
                "Answer only YES or NO.\n"
                f'Transcript: "{partial_transcript}"'
            ),
        }],
        max_tokens=5,
        temperature=0,
    )
    # startswith tolerates trailing punctuation like "YES."
    return response.strip().upper().startswith("YES")
```
Call this only when VAD detects 300–400ms of silence after speech. If the answer is NO, wait another 500ms before checking again.
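That debounce logic can be wrapped in one loop. A sketch, assuming three injected helpers (`silence_ms`, `transcript`, `is_complete` are hypothetical names, not from any library) and a hard silence cap so a stubborn "NO" cannot stall the turn forever:

```python
import asyncio

async def wait_for_turn_end(silence_ms, transcript, is_complete,
                            min_silence_ms=300, recheck_delay_ms=500,
                            hard_silence_ms=2000):
    """Debounced turn detection: silence gate first, then a semantic check.

    silence_ms()   -> trailing silence so far, in ms (from VAD)
    transcript()   -> transcription-so-far (from streaming STT)
    is_complete(t) -> async completeness check (e.g. a small-model call)
    """
    while True:
        s = silence_ms()
        if s >= hard_silence_ms:
            # Very long silence: end the turn even if the model says "incomplete"
            return transcript()
        if s >= min_silence_ms:
            if await is_complete(transcript()):
                return transcript()
            # Model said NO: back off before asking again
            await asyncio.sleep(recheck_delay_ms / 1000)
        else:
            await asyncio.sleep(0.02)
```

The hard cap matters: a user who trails off mid-sentence and walks away should still get a response eventually.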
Text Fallback Path
Audio pipelines fail. Microphone permissions get denied. Background noise makes VAD useless. Network jitter breaks WebSocket streams. You need a text fallback that users can trigger without the interface feeling broken.
Design the fallback as a first-class mode, not an error state:
The fallback trigger conditions to handle automatically:
- Microphone permission denied → show text input immediately, no error message
- Three consecutive STT failures → offer "switch to typing" button
- Noise level too high (measured during first 2 seconds of session) → suggest text mode
- Mobile device with poor connectivity (measured RTT > 2s) → default to text mode
Observability
Voice interfaces have failure modes that are invisible to standard metrics. Log these explicitly:
```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class VoiceTurnMetrics:
    session_id: str
    turn_id: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    # Latency breakdown
    vad_end_of_turn_ms: float = 0
    stt_latency_ms: float = 0
    llm_first_token_ms: float = 0
    tts_first_audio_ms: float = 0
    total_perceived_latency_ms: float = 0

    # Quality signals
    stt_confidence: float = 0
    was_interrupted: bool = False
    barge_in_at_pct: float = 0  # How far through the response when interrupted
    fallback_to_text: bool = False
    fallback_reason: str = ""

    # Cost
    stt_seconds: float = 0
    llm_input_tokens: int = 0
    llm_output_tokens: int = 0
    tts_characters: int = 0
```
Track P50, P90, and P99 of `total_perceived_latency_ms`. Anything above 1200ms at P90 means you have a pipeline problem. Track the `was_interrupted` rate — high barge-in rates often signal that the LLM is giving overly verbose responses.
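The percentile check is a one-liner over collected samples. A sketch using the 1200ms P90 budget from above:

```python
import numpy as np

def latency_report(latencies_ms: list[float], p90_budget_ms: float = 1200.0) -> dict:
    """Summarize perceived-latency samples; flag a pipeline problem above the P90 budget."""
    p50, p90, p99 = np.percentile(latencies_ms, [50, 90, 99])
    return {"p50": p50, "p90": p90, "p99": p99, "over_budget": p90 > p90_budget_ms}
```

Run it per rolling window (e.g. the last hour of turns) rather than over all history, so regressions surface quickly.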
Key Takeaways
- Natural voice conversation requires streaming at every stage (STT, LLM, TTS) — non-streaming pipelines produce chat-like latency that feels broken in voice UX.
- Silero VAD or WebRTC VAD consistently outperforms energy-based VAD in real-world noise conditions; model-based VAD is not optional for production.
- Continue running VAD during assistant playback — that is the prerequisite for barge-in support, and without barge-in, users feel trapped.
- Save interrupted partial responses to conversation history with a marker so the LLM has accurate context about what was communicated.
- Semantic turn detection (predicting utterance completeness via a fast small model) reduces false end-of-turn triggers from thinking pauses.
- Text fallback is a first-class mode, not a fallback error screen — design, instrument, and test it with the same care as the primary voice path.