Voice-First Interfaces with Realtime APIs: Barge-In, Turn Detection, and Fallback

Ravinder · 8 min read

The first version of our voice assistant was technically impressive and practically unusable. It listened politely while users finished speaking, waited for the LLM to respond, then played back audio — a full conversational round trip. The problem was that users did not know when to start speaking, so they talked over the audio response. The assistant did not notice. It finished its sentence. Users gave up.

Natural conversation is full-duplex. Both parties can speak and be heard simultaneously. Building a voice interface that feels natural means solving barge-in, turn detection, latency, and graceful degradation — not just plugging a TTS library into a chat API.

The Latency Budget

Before writing any code, define your latency budget. Users perceive latency differently in voice than in text. The threshold for "this feels broken" is around 800ms total round-trip for a voice response. Budget it explicitly.

User stops speaking
  → VAD end-of-turn detection:       50–150ms
  → Audio transcription (STT):       100–300ms  (streaming reduces this)
  → LLM first token:                 300–600ms  (depends on model, prompt length)
  → TTS first audio chunk:           100–200ms
  → Audio playback start:            20–50ms
─────────────────────────────────────────────
Total perceived latency:             570–1300ms

The 800ms budget is tight. You will not hit it consistently without streaming at every stage. Non-streaming STT + non-streaming LLM + non-streaming TTS adds up to 2–4 seconds. That is a chat interface, not a voice interface.
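
One way to keep the budget honest is to time each stage explicitly. A minimal sketch, assuming the stage names and budget values from the table above (the numbers are illustrative and not tied to any particular SDK):

import time
from contextlib import contextmanager

# Per-stage budgets in ms, mirroring the table above (illustrative assumptions).
STAGE_BUDGET_MS = {
    "vad_end_of_turn": 150,
    "stt": 300,
    "llm_first_token": 600,
    "tts_first_audio": 200,
    "playback_start": 50,
}

@contextmanager
def stage_timer(stage: str, timings: dict[str, float]):
    """Record wall-clock time for one pipeline stage and flag budget overruns."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        timings[stage] = elapsed_ms
        if elapsed_ms > STAGE_BUDGET_MS.get(stage, float("inf")):
            print(f"[latency] {stage} over budget: {elapsed_ms:.0f}ms")

# Usage inside a turn handler: wrap each stage, then sum for perceived latency.
# timings: dict[str, float] = {}
# with stage_timer("stt", timings):
#     transcript = await stt_client.transcribe_buffered()
# total_ms = sum(timings.values())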

Voice Activity Detection

VAD is the component that decides when the user has finished speaking. Getting it wrong in either direction is painful: too aggressive and you cut off users mid-sentence; too conservative and there is a dead pause before the assistant responds.

stateDiagram-v2
    [*] --> Silence
    Silence --> Speech: audio energy > threshold
    Speech --> Silence: energy < threshold for 300ms
    Speech --> Speaking: confirmed speech (>200ms duration)
    Speaking --> EndOfUtterance: silence > 500ms after speech
    Speaking --> BargeinDetected: new audio during assistant playback
    EndOfUtterance --> [*]: trigger STT + LLM pipeline
    BargeinDetected --> [*]: interrupt assistant, trigger new pipeline

Two common approaches:

Energy-based VAD is fast and cheap — measure the RMS amplitude of audio frames and apply a threshold. It fails in noisy environments and does not distinguish speech from background noise.
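
A minimal sketch of the energy-based approach, assuming 16-bit mono PCM frames; fine for prototyping, not for production:

import numpy as np

def is_speech_energy(frame: bytes, threshold_db: float = -40.0) -> bool:
    """Crude energy-based VAD: compare the frame's RMS level (dBFS) to a fixed threshold."""
    samples = np.frombuffer(frame, dtype=np.int16).astype(np.float32) / 32768.0
    if samples.size == 0:
        return False
    rms = np.sqrt(np.mean(samples ** 2)) + 1e-10   # avoid log(0) on silent frames
    return 20 * np.log10(rms) > threshold_db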

Model-based VAD (Silero VAD, WebRTC VAD) uses a model trained on speech data rather than a raw energy threshold. It handles noise far better, and Silero in particular produces confidence scores rather than binary decisions. Use this in production.

import numpy as np
import torch

class SileroVAD:
    """Thin wrapper around Silero VAD for streaming audio."""

    def __init__(self, threshold: float = 0.5, sampling_rate: int = 16000):
        self.model, _ = torch.hub.load(
            repo_or_dir='snakers4/silero-vad',
            model='silero_vad',
            force_reload=False
        )
        self.threshold = threshold
        self.sampling_rate = sampling_rate
        self._speech_started = False
        self._silence_frames = 0
        self.SILENCE_FRAMES_TO_END = 8  # 500ms at 16kHz, 64ms frames
 
    def process_chunk(self, audio_chunk: bytes) -> dict:
        audio_array = np.frombuffer(audio_chunk, dtype=np.int16).astype(np.float32) / 32768.0
        tensor = torch.FloatTensor(audio_array)
        confidence = self.model(tensor, self.sampling_rate).item()
        is_speech = confidence > self.threshold
 
        if is_speech:
            self._silence_frames = 0
            if not self._speech_started:
                self._speech_started = True
                return {"event": "speech_start", "confidence": confidence}
            return {"event": "speech_ongoing", "confidence": confidence}
        else:
            if self._speech_started:
                self._silence_frames += 1
                if self._silence_frames >= self.SILENCE_FRAMES_TO_END:
                    self._speech_started = False
                    self._silence_frames = 0
                    return {"event": "speech_end", "confidence": confidence}
            return {"event": "silence", "confidence": confidence}

Barge-In: Interrupting the Assistant

Barge-in is the ability to interrupt the assistant while it is speaking. Without it, users feel trapped waiting for a long response to finish before they can correct a misunderstanding.

The implementation requires three things working together:

  1. Continue running VAD during assistant playback — most implementations stop VAD while the assistant is speaking. Do not.
  2. An interrupt signal that stops audio playback and cancels in-flight generation — this must be fast, under 100ms from detection to silence.
  3. Context preservation — the partial response that was interrupted needs to be logged (with an [INTERRUPTED] marker) so the conversation history stays coherent.

import asyncio
from dataclasses import dataclass
 
@dataclass
class ConversationTurn:
    role: str           # "user" or "assistant"
    content: str
    interrupted: bool = False
 
class VoiceConversationManager:
    def __init__(self, vad, stt_client, llm_client, tts_client):
        self.vad = vad
        self.stt = stt_client
        self.llm = llm_client
        self.tts = tts_client
        self.history: list[ConversationTurn] = []
        self._playback_task: asyncio.Task | None = None
        self._generation_task: asyncio.Task | None = None
        self._partial_response = ""
 
    async def handle_audio_chunk(self, chunk: bytes):
        event = self.vad.process_chunk(chunk)
 
        if event["event"] == "speech_start":
            # User started speaking — interrupt assistant if active
            await self._interrupt_assistant()
 
        elif event["event"] == "speech_end":
            # User finished utterance — transcribe and respond
            transcription = await self.stt.transcribe_buffered()
            if transcription.strip():
                self.history.append(ConversationTurn(role="user", content=transcription))
                await self._generate_and_play_response()
 
    async def _interrupt_assistant(self):
        if self._playback_task and not self._playback_task.done():
            self._playback_task.cancel()
            await self.tts.stop_playback()
 
        if self._generation_task and not self._generation_task.done():
            self._generation_task.cancel()
 
        # Save partial response to history with interrupted marker
        if self._partial_response:
            self.history.append(ConversationTurn(
                role="assistant",
                content=self._partial_response,
                interrupted=True
            ))
            self._partial_response = ""
 
    async def _generate_and_play_response(self):
        self._partial_response = ""

        async def generate():
            messages = [
                {"role": t.role, "content": t.content + (" [INTERRUPTED]" if t.interrupted else "")}
                for t in self.history
            ]
            async for chunk in self.llm.stream(messages):
                self._partial_response += chunk
                await self.tts.stream_text(chunk)
            # Stream finished without interruption: record the full assistant turn
            self.history.append(ConversationTurn(role="assistant", content=self._partial_response))
            self._partial_response = ""

        self._generation_task = asyncio.create_task(generate())
        self._playback_task = asyncio.create_task(self.tts.play_stream())
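
Driving the manager from a realtime transport is mostly a loop that feeds audio into handle_audio_chunk. A sketch, where websocket and the four client objects are placeholders for whatever transport and STT/LLM/TTS SDKs you actually use:

async def run_session(websocket, vad, stt_client, llm_client, tts_client):
    # Hypothetical wiring: each incoming message is assumed to be a raw PCM chunk.
    manager = VoiceConversationManager(vad, stt_client, llm_client, tts_client)
    async for chunk in websocket:
        await manager.handle_audio_chunk(chunk)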

Turn Detection Beyond Silence

Silence-based turn detection fails in two common cases: users who pause mid-sentence to think get cut off, because the pause looks like end-of-turn on audio alone; and users who have genuinely finished still sit through the full silence timeout before the assistant responds, because silence is the only signal available.

Augment with semantic turn detection: after a configurable minimum silence (300ms), send the transcription-so-far to a fast, small model to predict whether the utterance is complete.

async def is_turn_complete(partial_transcript: str, llm_client) -> bool:
    """Quick heuristic check — use a small, fast model."""
    response = await llm_client.complete(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Is this a complete thought or sentence that expects a response?
Answer only YES or NO.
 
Transcript: "{partial_transcript}"
"""
        }],
        max_tokens=5,
        temperature=0
    )
    return response.strip().upper() == "YES"

Call this only when VAD detects 300–400ms of silence after speech. If the answer is NO, wait another 500ms before checking again.
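
Putting the two signals together, the check loop might look like the sketch below; get_transcript_so_far and still_silent are hypothetical hooks into your STT buffer and VAD state:

import asyncio

async def wait_for_end_of_turn(get_transcript_so_far, still_silent, llm_client,
                               recheck_s: float = 0.5) -> bool:
    """Called once VAD has seen ~300-400ms of silence after speech (sketch)."""
    while still_silent():                      # bail out if the user resumes speaking
        if await is_turn_complete(get_transcript_so_far(), llm_client):
            return True                        # end of turn: trigger the STT + LLM pipeline
        await asyncio.sleep(recheck_s)         # incomplete thought: wait another 500ms
    return False                               # user kept talking; no turn boundary yet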

Text Fallback Path

Audio pipelines fail. Microphone permissions get denied. Background noise makes VAD useless. Network jitter breaks WebSocket streams. You need a text fallback that users can trigger without the interface feeling broken.

Design the fallback as a first-class mode, not an error state:

flowchart LR
    A[User] --> B{Input Mode}
    B -- Audio available --> C[Voice Pipeline]
    B -- Text fallback --> D[Text Input]
    C --> E[VAD + STT]
    D --> F[Direct Text]
    E --> G[LLM]
    F --> G
    G --> H{Output Mode}
    H -- Audio available --> I[TTS + Playback]
    H -- Text fallback --> J[Text Display]
    I --> A
    J --> A

The fallback trigger conditions to handle automatically (a mode-selection sketch follows the list):

  • Microphone permission denied → show text input immediately, no error message
  • Three consecutive STT failures → offer "switch to typing" button
  • Noise level too high (measured during first 2 seconds of session) → suggest text mode
  • Mobile device with poor connectivity (measured RTT > 2s) → default to text mode
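
A minimal sketch of that mode selection, assuming you already collect these signals; the field names and thresholds mirror the bullets above and are not tied to any particular SDK:

from dataclasses import dataclass

@dataclass
class SessionSignals:
    mic_permission_granted: bool
    consecutive_stt_failures: int
    noise_level_db: float          # measured during the first 2 seconds of the session
    rtt_seconds: float             # measured round-trip time to the backend

def choose_input_mode(s: SessionSignals, noise_threshold_db: float = -20.0) -> str:
    """Return 'text', 'voice', or 'voice_with_text_offer' based on session signals."""
    if not s.mic_permission_granted:
        return "text"                        # no error message, just show the text input
    if s.rtt_seconds > 2.0:
        return "text"                        # poor connectivity: default to text mode
    if s.consecutive_stt_failures >= 3:
        return "voice_with_text_offer"       # offer a "switch to typing" button
    if s.noise_level_db > noise_threshold_db:
        return "voice_with_text_offer"       # suggest text mode in noisy environments
    return "voice"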

Observability

Voice interfaces have failure modes that are invisible to standard metrics. Log these explicitly:

from dataclasses import dataclass, field
from datetime import datetime, timezone
 
@dataclass
class VoiceTurnMetrics:
    session_id: str
    turn_id: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
 
    # Latency breakdown
    vad_end_of_turn_ms: float = 0
    stt_latency_ms: float = 0
    llm_first_token_ms: float = 0
    tts_first_audio_ms: float = 0
    total_perceived_latency_ms: float = 0
 
    # Quality signals
    stt_confidence: float = 0
    was_interrupted: bool = False
    barge_in_at_pct: float = 0    # How far through response when interrupted
    fallback_to_text: bool = False
    fallback_reason: str = ""
 
    # Cost
    stt_seconds: float = 0
    llm_input_tokens: int = 0
    llm_output_tokens: int = 0
    tts_characters: int = 0

Track P50, P90, and P99 of total_perceived_latency_ms. Anything above 1200ms at P90 means you have a pipeline problem. Track was_interrupted rate — high barge-in rates often signal that the LLM is giving overly verbose responses.
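
Computed offline over a batch of exported turns, the summary is only a few lines; this sketch assumes the metrics objects above are collected into a list:

import numpy as np

def latency_summary(turns: list[VoiceTurnMetrics]) -> dict:
    """Percentiles of perceived latency plus barge-in rate over logged turns (sketch)."""
    if not turns:
        return {}
    latencies = [t.total_perceived_latency_ms for t in turns]
    return {
        "p50_ms": float(np.percentile(latencies, 50)),
        "p90_ms": float(np.percentile(latencies, 90)),
        "p99_ms": float(np.percentile(latencies, 99)),
        "barge_in_rate": sum(t.was_interrupted for t in turns) / len(turns),
    }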

Key Takeaways

  • Natural voice conversation requires streaming at every stage (STT, LLM, TTS) — non-streaming pipelines produce chat-like latency that feels broken in voice UX.
  • Silero VAD or WebRTC VAD consistently outperforms energy-based VAD in real-world noise conditions; model-based VAD is not optional for production.
  • Continue running VAD during assistant playback — that is the prerequisite for barge-in support, and without barge-in, users feel trapped.
  • Save interrupted partial responses to conversation history with a marker so the LLM has accurate context about what was communicated.
  • Semantic turn detection (predicting utterance completeness via a fast small model) reduces false end-of-turn triggers from thinking pauses.
  • Text fallback is a first-class mode, not a fallback error screen — design, instrument, and test it with the same care as the primary voice path.