From antigravity-awesome-skills
Guides building low-latency voice agents using speech-to-speech (OpenAI Realtime API) and STT→LLM→TTS pipelines with tools like Deepgram, ElevenLabs, Pipecat for natural conversations.
Voice agents represent the frontier of AI interaction: humans speaking naturally with AI systems. The challenge isn't just speech recognition and synthesis; it's achieving natural conversation flow with sub-800ms latency while handling interruptions, background noise, and emotional nuance.
This skill covers two architectures: speech-to-speech (OpenAI Realtime API; lowest latency, most natural) and pipeline (STT→LLM→TTS; more control, easier to debug). Key insight: latency is the constraint. Humans expect responses within about 500ms. Every millisecond matters.
84% of organizations are increasing voice AI budgets in 2025. This is the year voice agents go mainstream.
Direct audio-to-audio processing for lowest latency
When to use: Maximum naturalness, emotional preservation, real-time conversation
""" [User Audio] → [S2S Model] → [Agent Audio]
Advantages:
Disadvantages:
""" import { RealtimeClient } from '@openai/realtime-api-beta';
const client = new RealtimeClient({ apiKey: process.env.OPENAI_API_KEY, });
// Configure for voice conversation
client.updateSession({
modalities: ['text', 'audio'],
voice: 'alloy',
input_audio_format: 'pcm16',
output_audio_format: 'pcm16',
instructions: You are a helpful customer service agent. Be concise and friendly. If you don't know something, say so rather than making things up.,
turn_detection: {
type: 'server_vad', // or 'semantic_vad'
threshold: 0.5,
prefix_padding_ms: 300,
silence_duration_ms: 500,
},
});
// Handle audio streams client.on('conversation.item.input_audio_transcription', (event) => { console.log('User said:', event.transcript); });
client.on('response.audio.delta', (event) => { // Stream audio to speaker audioPlayer.write(Buffer.from(event.delta, 'base64')); });
// Send user audio client.appendInputAudio(audioBuffer); """
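The session above expects base64-encoded 16-bit PCM. A minimal sketch of the microphone-side conversion; the helper names are illustrative, not part of the OpenAI SDK:

```javascript
// Convert Float32 microphone samples ([-1, 1]) to signed 16-bit PCM,
// then base64-encode for transport. Assumes Node's Buffer is available.
function floatTo16BitPCM(float32Array) {
  const buffer = Buffer.alloc(float32Array.length * 2);
  for (let i = 0; i < float32Array.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range
    const s = Math.max(-1, Math.min(1, float32Array[i]));
    buffer.writeInt16LE(Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), i * 2);
  }
  return buffer;
}

function encodePCM16Base64(float32Array) {
  return floatTo16BitPCM(float32Array).toString('base64');
}
```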
Separate STT → LLM → TTS for maximum control
When to use: Need to know/control exactly what's said, debugging, compliance
""" [Audio] → [STT] → [Text] → [LLM] → [Text] → [TTS] → [Audio]
Advantages:
Disadvantages:
""" import { Deepgram } from '@deepgram/sdk'; import { ElevenLabsClient } from 'elevenlabs'; import OpenAI from 'openai';
// Initialize clients const deepgram = new Deepgram(process.env.DEEPGRAM_API_KEY); const elevenlabs = new ElevenLabsClient(); const openai = new OpenAI();
async function processVoiceInput(audioStream) { // 1. Speech-to-Text (Deepgram Nova-3) const transcription = await deepgram.transcription.live({ model: 'nova-3', punctuate: true, endpointing: 300, // ms of silence before end });
transcription.on('transcript', async (data) => { if (data.is_final && data.speech_final) { const userText = data.channel.alternatives[0].transcript; console.log('User:', userText);
// 2. LLM Processing
const completion = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [
{ role: 'system', content: 'You are a concise voice assistant.' },
{ role: 'user', content: userText }
],
max_tokens: 150, // Keep responses short for voice
});
const agentText = completion.choices[0].message.content;
console.log('Agent:', agentText);
// 3. Text-to-Speech (ElevenLabs)
const audioStream = await elevenlabs.textToSpeech.stream({
voice_id: 'voice_id_here',
text: agentText,
model_id: 'eleven_flash_v2_5', // Lowest latency
});
// Stream to user
playAudioStream(audioStream);
}
});
// Pipe audio to transcription audioStream.pipe(transcription); } """
Detect when user starts/stops speaking
When to use: All voice agents need VAD for turn-taking
""" VAD Types:
""" import { SileroVAD } from '@pipecat-ai/silero-vad';
const vad = new SileroVAD({ threshold: 0.5, // Speech probability threshold min_speech_duration: 250, // ms before speech confirmed min_silence_duration: 500, // ms of silence = end of turn });
vad.on('speech_start', () => { console.log('User started speaking'); // Stop any playing TTS (barge-in) audioPlayer.stop(); });
vad.on('speech_end', () => { console.log('User finished speaking'); // Trigger response generation processTranscript(); });
// Feed audio to VAD audioStream.on('data', (chunk) => { vad.process(chunk); }); """
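For intuition, a minimal energy-based VAD can be sketched in a few lines. This is illustrative only; model-based VADs like Silero are far more robust in noise:

```javascript
// Naive energy-based VAD: a frame counts as speech when its RMS energy
// exceeds a threshold; a run of quiet frames ends the turn.
function rmsEnergy(samples) {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

class EnergyVAD {
  constructor({ threshold = 0.05, silenceFrames = 25 } = {}) {
    this.threshold = threshold;         // RMS level counted as speech
    this.silenceFrames = silenceFrames; // consecutive quiet frames = end of turn
    this.quiet = 0;
    this.speaking = false;
  }

  // Returns 'speech_start', 'speech_end', or null for each frame
  process(frame) {
    const isSpeech = rmsEnergy(frame) > this.threshold;
    if (isSpeech) {
      this.quiet = 0;
      if (!this.speaking) {
        this.speaking = true;
        return 'speech_start';
      }
    } else if (this.speaking && ++this.quiet >= this.silenceFrames) {
      this.speaking = false;
      return 'speech_end';
    }
    return null;
  }
}
```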
""" // In Realtime API session config client.updateSession({ turn_detection: { type: 'semantic_vad', // Uses meaning, not just silence // Model waits longer after "ummm..." // Responds faster after "Yes, that's correct." }, }); """
""" // When user interrupts: function handleBargeIn() { // 1. Stop TTS immediately audioPlayer.stop();
// 2. Cancel pending LLM generation llmController.abort();
// 3. Reset state conversationState.checkpoint();
// 4. Listen to new input startListening(); }
// VAD triggers barge-in vad.on('speech_start', () => { if (audioPlayer.isPlaying) { handleBargeIn(); } }); """
Achieving <800ms end-to-end response time
When to use: Production voice agents
""" Target Metrics:
""" Typical breakdown:
Total: 425-900ms """
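The breakdown above can be sanity-checked with a tiny budget calculator. The stage numbers here are illustrative, matching the rough figures quoted in this guide:

```javascript
// Hedged sketch of a latency budget check against the <800ms target.
const stages = {
  vad_ms: 100, // detect end of user speech
  stt_ms: 200, // final transcript
  llm_ms: 300, // first usable sentence
  tts_ms: 200, // time to first audio byte
};

function totalLatency(budget) {
  // Sum every stage in the budget object
  return Object.values(budget).reduce((sum, ms) => sum + ms, 0);
}

const TARGET_MS = 800;
const total = totalLatency(stages);
console.log(`end-to-end: ${total}ms (target <${TARGET_MS}ms)`);
```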
""" // Stream STT results as they come stt.on('partial_transcript', (text) => { // Start processing before final transcript llmPreprocessor.prepare(text); });
// Stream LLM output to TTS const llmStream = await openai.chat.completions.create({ stream: true, // ... });
for await (const chunk of llmStream) { tts.appendText(chunk.choices[0].delta.content); } """
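Sentence-level chunking is the usual glue between a streaming LLM and TTS: buffer token deltas and flush whole sentences so speech can start before the full response exists. A self-contained sketch; the naive regex split is an assumption, and real systems use smarter segmentation:

```javascript
// makeSentenceChunker buffers streaming deltas and calls onSentence
// (a stand-in for your TTS call) once per complete sentence.
function makeSentenceChunker(onSentence) {
  let buffer = '';
  return {
    push(delta) {
      if (!delta) return; // streaming deltas can be undefined
      buffer += delta;
      // Flush every complete sentence currently in the buffer
      let m;
      while ((m = buffer.match(/^(.*?[.!?])\s+/))) {
        onSentence(m[1]);
        buffer = buffer.slice(m[0].length);
      }
    },
    flush() {
      // Emit any trailing partial sentence at end of stream
      if (buffer.trim()) onSentence(buffer.trim());
      buffer = '';
    },
  };
}
```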
""" // While user is speaking, predict and prepare stt.on('partial_transcript', async (text) => { // Pre-fetch relevant context const context = await retrieveContext(text);
// Pre-compute likely first sentence const firstSentence = await generateOpener(context); }); """
""" // STT: Deepgram Nova-3 (150ms TTFT) // LLM: gpt-4o-mini (fastest GPT-4 class) // TTS: ElevenLabs Flash (75ms) or Deepgram Aura-2 (184ms) """
""" // Run inference closer to user // - Cloud regions near user // - Edge computing for VAD/STT // - WebSocket over HTTP for lower overhead """
Designing natural voice conversations
When to use: Building voice UX
""" Voice is different from text:
"""
Bad: "I found several options. The first is... second is..." Good: "I found 3 options. Want me to go through them?"
Bad: "I'll transfer $500 to John." Good: "So that's $500 to John Smith. Should I proceed?" """
""" system_prompt = ''' You are a voice assistant. Follow these rules:
Good: "Got it. I'll set that reminder for three pm. Anything else?" Bad: "I have set a reminder for 3:00 PM. Is there anything else I can assist you with today?" ''' """
""" // Handle recognition errors gracefully const errorResponses = { no_speech: "I didn't catch that. Could you say it again?", unclear: "Sorry, I'm not sure I understood. You said [repeat]. Is that right?", timeout: "Still there? I'm here when you're ready.", };
// Always offer human fallback for complex issues if (confidenceScore < 0.6) { response = "I want to make sure I get this right. Would you like to speak with a human agent?"; } """
Severity: CRITICAL
Situation: Building a voice agent pipeline
Symptoms: Conversations feel awkward. Users repeat themselves. "Are you there?" questions. Users hang up or give up. Low satisfaction scores despite correct answers.
Why this breaks: In human conversation, responses typically arrive within 500ms. Anything over 800ms feels like the agent is slow or confused. Users lose confidence and patience. Every component adds latency: VAD (100ms) + STT (200ms) + LLM (300ms) + TTS (200ms) = 800ms.
Recommended fix:
Use low-latency models: Deepgram Nova-3 (STT), gpt-4o-mini (LLM), ElevenLabs Flash or Deepgram Aura-2 (TTS)
Stream everything: partial STT transcripts, streaming LLM tokens, streaming TTS audio
Pre-compute: pre-fetch context and likely openers while the user is still speaking
Edge deployment: run inference in regions near users; prefer WebSockets over request/response HTTP
Log timestamps at each stage, track P50/P95 latency
Severity: HIGH
Situation: Voice agent with inconsistent response times
Symptoms: Conversations feel unpredictable. User doesn't know when to speak. Sometimes agent responds immediately, sometimes after long pause. Users talk over agent. Agent talks over users.
Why this breaks: Jitter (variance in response time) disrupts conversational rhythm more than absolute latency. Consistent 800ms feels better than alternating 400ms and 1200ms. Users can't adapt to unpredictable timing.
Recommended fix:
Consistent model loading: keep models warm so cold starts don't make some turns far slower than others
Buffer audio output: smooth playback so network hiccups don't add audible gaps mid-response
Handle LLM variance: cap max_tokens and enforce a minimum response time so turns land in a narrow window
Monitor and alert: track jitter (P95 minus P50), not just average latency
const MIN_RESPONSE_TIME = 400; // ms

async function respondWithConsistentTiming(text) {
  const startTime = Date.now();
  const audio = await generateSpeech(text);

  const elapsed = Date.now() - startTime;
  if (elapsed < MIN_RESPONSE_TIME) {
    await delay(MIN_RESPONSE_TIME - elapsed);
  }

  playAudio(audio);
}
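To catch jitter in production, track P50/P95 per turn; a wide gap between them is the signal. A small sketch of the percentile math:

```javascript
// Nearest-rank percentile over a sample of per-turn latencies (ms).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

function latencyReport(samples) {
  const p50 = percentile(samples, 50);
  const p95 = percentile(samples, 95);
  // Jitter here = spread between typical and worst-case turns
  return { p50, p95, jitter: p95 - p50 };
}
```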
Severity: HIGH
Situation: Detecting when user finishes speaking
Symptoms: Agent interrupts user mid-thought. Or waits too long after user finishes. "Let me think..." triggers premature response. Short answers have awkward pause before response.
Why this breaks: Simple silence detection (e.g., "end turn after 500ms silence") doesn't understand conversation. Humans pause mid-sentence. "Yes." needs fast response, "Well, let me think about that..." needs patience. Fixed timeout fits neither.
Recommended fix:
client.updateSession({
  turn_detection: {
    type: 'semantic_vad',
    // Waits longer after "umm..."
    // Responds faster after "Yes, that's correct."
  },
});
const pipeline = new Pipeline({
  vad: new SileroVAD(),
  turnDetection: new SmartTurn(),
});

// SmartTurn considers:
// - Speech content (complete sentence?)
// - Prosody (falling intonation?)
// - Context (question asked?)
function calculateSilenceThreshold(transcript) {
  const endsWithComplete = transcript.match(/[.!?]$/);
  // Word boundaries so "column" doesn't match "um"
  const hasFillers = transcript.match(/\b(um|uh|like|well)\b/i);

  if (endsWithComplete && !hasFillers) {
    return 300; // Fast response
  } else if (hasFillers) {
    return 1500; // Wait for continuation
  }
  return 700; // Default
}
Severity: HIGH
Situation: User tries to interrupt agent mid-sentence
Symptoms: Agent talks over user. User has to wait for agent to finish. Frustrating experience. Users give up and abandon call. "STOP! STOP!" doesn't work.
Why this breaks: Without barge-in handling, the TTS plays to completion regardless of user input. This violates basic conversational norms - in human conversation, we stop when interrupted.
Recommended fix:
vad.on('speech_start', () => {
  if (ttsPlayer.isPlaying) {
    // 1. Stop audio immediately
    ttsPlayer.stop();

    // 2. Cancel pending TTS generation
    ttsController.abort();

    // 3. Checkpoint conversation state
    conversationState.save();

    // 4. Listen to new input
    startTranscription();
  }
});
vad.on('speech_start', async () => {
  if (!ttsPlayer.isPlaying) return;

  // Wait 200ms to get first words
  await delay(200);
  const firstWords = getTranscriptSoFar();

  if (isBackchannel(firstWords)) {
    // "uh-huh", "yeah" - don't interrupt
    return;
  }

  if (isClarification(firstWords)) {
    // "What?", "Sorry?" - repeat last sentence
    repeatLastSentence();
  } else {
    // Real interruption - stop and listen
    handleFullInterruption();
  }
});
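The isBackchannel and isClarification checks can be as simple as word lists. A sketch; the lists are assumptions to tune per locale and domain:

```javascript
// Illustrative word-list classifiers for the interruption logic.
const BACKCHANNELS = ['uh-huh', 'mhm', 'yeah', 'right', 'ok', 'okay'];
const CLARIFICATIONS = ['what', 'sorry', 'pardon', 'huh', 'come again'];

function normalize(text) {
  // Lowercase, trim, strip trailing punctuation ("Sorry?" -> "sorry")
  return text.toLowerCase().trim().replace(/[.?!,]+$/, '');
}

function isBackchannel(text) {
  return BACKCHANNELS.includes(normalize(text));
}

function isClarification(text) {
  return CLARIFICATIONS.includes(normalize(text));
}
```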
Severity: MEDIUM
Situation: Prompting LLM for voice agent responses
Symptoms: Agent rambles. Users lose track of information. "Can you repeat that?" requests. Users interrupt to ask for shorter version. Low comprehension of conveyed information.
Why this breaks: Text can be scanned and re-read. Voice is linear and ephemeral. A 3-paragraph response that works in chat is overwhelming in voice. Users can only hold ~7 items in working memory.
Recommended fix:
system_prompt = ''' You are a voice assistant. Keep responses UNDER 30 WORDS. For complex information, break into chunks and confirm understanding between each.
Instead of: "Here are the three options. First, you could... Second... Third..."
Say: "I found 3 options. Want me to go through them?"
Never list more than 3 items without pausing for confirmation. '''
const response = await openai.chat.completions.create({
  max_tokens: 100, // Hard limit
  // ...
});
if (information.length > 3) {
  response = `I have ${information.length} items. Let's go through them one at a time. First: ${information[0]}. Ready for the next?`;
}
"I found your account. Want the balance, recent transactions, or something else?" // Don't dump all info at once
Severity: MEDIUM
Situation: Formatting LLM output for voice
Symptoms: "First bullet point: item one" read aloud. Numbers read as "one two three" instead of "one, two, three." Markdown artifacts in speech. Robotic, unnatural delivery.
Why this breaks: TTS models read what they're given. Text formatting intended for visual display sounds robotic when read aloud. Users can't "see" structure in audio.
Recommended fix:
system_prompt = ''' Format responses for SPOKEN delivery:
function prepareForSpeech(text) {
  return text
    // Remove markdown
    .replace(/[*_#`]/g, '')
    // Convert digits to words (numToWords is an external helper)
    .replace(/\d+/g, numToWords)
    // Expand abbreviations
    .replace(/\betc\b/gi, 'et cetera')
    .replace(/\be\.g\./gi, 'for example')
    // Add pauses (note the escaped dot: /. / would match any character)
    .replace(/\. /g, '... ')
    .replace(/, /g, '... ');
}
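numToWords is an assumed helper, not a library function. A minimal sketch covering 0-99; real deployments should use a full number-to-words library that handles ordinals, currency, and years:

```javascript
// Minimal English number-to-words for 0-99. The replace() callback
// passes the matched digits as a string, so coerce to Number first.
const ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
  'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
  'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen'];
const TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty',
  'seventy', 'eighty', 'ninety'];

function numToWords(n) {
  n = Number(n);
  if (n < 20) return ONES[n];
  const tens = TENS[Math.floor(n / 10)];
  return n % 10 ? `${tens}-${ONES[n % 10]}` : tens;
}
```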
Severity: MEDIUM
Situation: Users in cars, cafes, outdoors
Symptoms: "I didn't catch that" frequently. Background noise triggers false starts. Fan/AC causes continuous listening. Car engine noise confuses STT.
Why this breaks: Default VAD thresholds work for quiet environments. Real-world usage includes background noise that triggers false positives or masks speech, causing false negatives.
Recommended fix:
const transcription = await deepgram.transcription.live({
  model: 'nova-3',
  noise_reduction: true,
  // or smart_format: true,
});
// Measure ambient noise level
const ambientLevel = measureAmbientNoise(5000); // 5 sec sample
vad.setThreshold(ambientLevel * 1.5); // Above ambient
stt.on('transcript', (data) => {
  if (data.confidence < 0.7) {
    // Low confidence - probably noise
    askForRepeat();
    return;
  }
  processTranscript(data.transcript);
});
// Prevent agent's voice from being transcribed
const echoCanceller = new EchoCanceller();
echoCanceller.reference(ttsOutput);
const cleanedAudio = echoCanceller.process(userAudio);
Severity: MEDIUM
Situation: Processing unclear or accented speech
Symptoms: Agent responds to something user didn't say. Names consistently wrong. Technical terms misheard. "I said X, not Y" frustration.
Why this breaks: STT models can hallucinate, especially on proper nouns, technical terms, or accented speech. These errors propagate through the pipeline and produce nonsensical responses.
Recommended fix:
const transcription = await deepgram.transcription.live({
  keywords: ['Acme Corp', 'ProductName', 'John Smith'],
  keyword_boost: 'high',
});
if (containsNameOrNumber(transcript)) {
  response = `I heard "${name}". Is that correct?`;
}
if (confidence < 0.8) {
  response = `I think you said "${transcript}". Did I get that right?`;
}
// Some STT APIs return an n-best list
const alternatives = transcription.alternatives;
if (alternatives[0].confidence - alternatives[1].confidence < 0.1) {
  // Ambiguous - ask for clarification
}
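The n-best check can be wrapped into a small decision helper. A sketch; the alternatives shape mirrors what many STT APIs return, and the 0.1 margin is an assumption to tune:

```javascript
// Decide whether to accept the top hypothesis, ask the user to choose
// between the top two, or ask for a repeat when nothing was heard.
function pickOrClarify(alternatives, margin = 0.1) {
  if (!alternatives || alternatives.length === 0) return { action: 'repeat' };
  if (alternatives.length === 1 ||
      alternatives[0].confidence - alternatives[1].confidence >= margin) {
    return { action: 'accept', transcript: alternatives[0].transcript };
  }
  return {
    action: 'clarify',
    prompt: `Did you say "${alternatives[0].transcript}" or "${alternatives[1].transcript}"?`,
  };
}
```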
promptPattern = `User may correct previous mistakes. If they say "no, I said X" or "not Y, Z", update your understanding accordingly.`;
Severity: ERROR
Voice agents must track latency at each stage
Message: Voice pipeline without latency tracking. Add timestamps at each stage to measure performance.
Severity: WARNING
Streaming STT reduces latency significantly
Message: Using batch transcription. Consider streaming for lower latency in voice agents.
Severity: WARNING
Streaming TTS reduces time to first audio
Message: TTS without streaming. Stream audio to reduce time to first audio.
Severity: WARNING
Fixed silence thresholds don't adapt to conversation
Message: Fixed silence threshold. Consider semantic VAD or adaptive thresholds for better turn-taking.
Severity: WARNING
Voice agents should stop when user interrupts
Message: VAD without barge-in handling. Stop TTS when user starts speaking.
Severity: WARNING
Voice prompts should constrain response length
Message: Voice prompt without length constraints. Add 'Keep responses under 30 words' to system prompt.
Severity: WARNING
Markdown will be read literally by TTS
Message: Check for markdown in TTS input. Strip formatting before sending to TTS.
Severity: WARNING
STT can fail or return low confidence
Message: STT without error handling. Check confidence scores and handle failures.
Severity: WARNING
Realtime APIs need reconnection handling
Message: Realtime connection without reconnection logic. Handle disconnects gracefully.
Severity: INFO
Real-world audio includes background noise
Message: Consider adding noise handling for real-world audio quality.
Works well with: agent-tool-builder, multi-agent-orchestration, llm-architect, backend