Skill

ava

From ava

Voice interaction for AI coding assistants. Provides natural voice conversations using ElevenLabs TTS and STT. Use when users mention ava, speak, talk, converse, voice status, or voice troubleshooting. ElevenLabs-only: eleven_v3 TTS model, Scribe v2 Realtime STT with local Silero VAD.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/ava:ava

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Natural voice conversations with AI coding assistants using ElevenLabs text-to-speech (TTS) and speech-to-text (STT).

SKILL.md

136 lines · ~1.7k tokens

Stats

Language: Python

Stars: 0

Maintenance: Excellent

Last Commit: Jun 3, 2026

Ava

Natural voice conversations with AI coding assistants using ElevenLabs text-to-speech (TTS) and speech-to-text (STT).

The Jarvis Goal

Ava aims to create a Jarvis-like voice assistant experience. The AI speaks to you and listens, like a real conversation. In voice-primary mode, substantive responses must go through converse; if voice fails, stop and restore MCP instead of continuing chat-only.

Setup

1. Configure MCP Server

Ava runs as an HTTP server on port 8765. Add this MCP configuration to Cursor, Claude Code, Factory, or another MCP-capable host:

{
  "mcpServers": {
    "ava": {
      "type": "http",
      "url": "http://127.0.0.1:8765/mcp"
    }
  }
}
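If the host already has other MCP servers configured, the Ava entry can be merged in rather than pasted over the file. The following is a minimal sketch, assuming the host keeps its servers under a top-level `mcpServers` key as in the snippet above. The helper name `add_ava_server` is hypothetical and not part of Ava.

```python
import json
from copy import deepcopy

# The Ava entry exactly as shown in the configuration snippet above.
AVA_ENTRY = {"type": "http", "url": "http://127.0.0.1:8765/mcp"}


def add_ava_server(config: dict) -> dict:
    """Return a copy of `config` with the Ava entry added under mcpServers.

    Hypothetical helper: other servers are preserved, and any existing
    "ava" entry is overwritten with the documented values.
    """
    merged = deepcopy(config)
    servers = merged.setdefault("mcpServers", {})
    servers["ava"] = dict(AVA_ENTRY)
    return merged


if __name__ == "__main__":
    # Example host config with one unrelated (made-up) server.
    existing = {"mcpServers": {"other": {"type": "http", "url": "http://127.0.0.1:9000/mcp"}}}
    print(json.dumps(add_ava_server(existing), indent=2))
```

The input dictionary is copied rather than mutated, so the original host configuration stays unchanged until the result is written back deliberately.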

2. Configure ElevenLabs

Set your ElevenLabs API key:

# In ~/.ava/ava.env
ELEVENLABS_API_KEY=your-key-here

ElevenLabs provides:

TTS: eleven_v3 model with Donna voice (cloned)
STT: Scribe v2 Realtime (streaming WebSocket with manual commit mode)
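Before starting a session it can help to confirm that the key in ~/.ava/ava.env is set and is not still the placeholder. The sketch below assumes the file uses plain KEY=VALUE lines with `#` comments, as in the example above. How Ava itself parses the file is not documented here, and `parse_env` and `has_api_key` are hypothetical helper names.

```python
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Assumption: optional surrounding quotes are not part of the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_api_key(text: str) -> bool:
    """True when ELEVENLABS_API_KEY is present and not the placeholder."""
    key = parse_env(text).get("ELEVENLABS_API_KEY", "")
    return bool(key) and key != "your-key-here"


if __name__ == "__main__":
    env_path = Path.home() / ".ava" / "ava.env"
    text = env_path.read_text() if env_path.exists() else ""
    print("API key configured:", has_api_key(text))
```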

Usage

Use the converse MCP tool. Trust server defaults unless changing behavior for the current turn.

Do not bypass the native MCP tool with curl, raw HTTP requests, or direct /mcp JSON-RPC calls when the host already exposes Ava as an MCP tool. If Cursor, Claude Code, Factory, or another MCP-capable host has surfaced converse, use that native tool path rather than manually posting to http://127.0.0.1:8765/mcp.

# Speak and listen: the default call, trusting server defaults
converse(message="Hello! What would you like to work on?")

# Speak without waiting (narration while working)
converse(message="Searching the codebase now...", wait_for_response=false)

# User wants to say something long without silence cutoff
converse(message="Go ahead, I'm listening.", disable_silence_detection=true)

Parameter rules (read before every call)

The server defaults are tuned and locked. Do not pass tuning parameters — passing them yourself is the #1 cause of inconsistent behavior across agents. The vast majority of calls are exactly:

converse(message="...")

Only three parameters are ever situational: wait_for_response, disable_silence_detection, and (Claude Desktop only) skip_tts. Everything else is a server default you must NOT send.

Parameter	Default	Pass it?	When to pass
`message`	required	always	The text to speak (use `""` only with `skip_tts=true`)
`wait_for_response`	`true`	only to disable	Pass `false` for a fire-and-forget announcement/narration where you will NOT listen
`disable_silence_detection`	`false`	only to enable	Pass `true` when the user explicitly needs to talk at length with long pauses (dictation, reading aloud)
`wait_for_conch`	`true`	never (already default)	Leave as default; it auto-queues behind another speaker
`speed`	`1.2`	never unless user asks	Only if the user says "talk faster/slower" mid-session. Range 0.7–1.2; eleven_v3 rejects >1.2
`vad_aggressiveness`	`2`	never unless troubleshooting	Server-tuned. Only change if the user reports background bleed (raise) or being cut off (lower)
`listen_duration_min`	`1`	never	Server-tuned
`listen_duration_max`	`600`	never	Server-tuned (10 min)
`timeout`	`900`	never	Server-tuned (15 min)
`metrics_level`	`summary`	never	Server-tuned

Hard rule: if you find yourself passing speed, vad_aggressiveness, listen_duration_*, timeout, or metrics_level, stop — you almost certainly should not be. The user tunes those via ~/.ava/ava.env, not per-call.
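The rules above can be expressed as a small pre-call check. This is only an illustrative sketch: `check_converse_args` is a hypothetical helper, not part of Ava, and it is deliberately strict. It flags `speed` and `vad_aggressiveness` even though the table allows them when the user asks or when troubleshooting.

```python
# Parameters the rules above describe as situational.
SITUATIONAL = {"wait_for_response", "disable_silence_detection", "skip_tts"}

# Parameters the rules above say to leave to the server.
SERVER_TUNED = {
    "wait_for_conch", "speed", "vad_aggressiveness",
    "listen_duration_min", "listen_duration_max", "timeout", "metrics_level",
}


def check_converse_args(**kwargs) -> list[str]:
    """Return warnings for a planned converse call; an empty list means none."""
    warnings = []
    if "message" not in kwargs:
        warnings.append("message is required")
    for name in sorted(kwargs):
        if name == "message" or name in SITUATIONAL:
            continue
        if name in SERVER_TUNED:
            warnings.append(f"{name} is a server default; do not pass it")
        else:
            warnings.append(f"{name} is not a documented parameter")
    # An empty message is only meaningful when speech output is skipped.
    if kwargs.get("message") == "" and kwargs.get("skip_tts") is not True:
        warnings.append('message="" is only valid with skip_tts=true')
    return warnings


if __name__ == "__main__":
    print(check_converse_args(message="Hello"))
    print(check_converse_args(message="Hello", speed=1.0, timeout=30))
```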

Best Practices

Voice-primary communication -- substantive responses go through converse; if voice fails, stop and restore MCP instead of continuing chat-only
Trust server defaults -- speed defaults to 1.2 (the eleven_v3 max); pass optional parameters only when changing behavior
Narrate without waiting sparingly -- use wait_for_response=false only for short acknowledgements before starting work
One question at a time -- Don't bundle multiple questions
Parallel calls -- Combine a rare short converse(..., wait_for_response=false) acknowledgement with other tools in one turn for zero dead air
Long input -- Set disable_silence_detection=true when the user needs to speak at length; server defaults already allow a 10-minute listen window
Long spoken output -- A long single-line paragraph is valid in one converse call; normalize literal newlines out of message, but do not shorten or split substantive content merely to work around reliability issues
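The newline normalization in the last item can be sketched as follows. `normalize_message` is a hypothetical helper name. It collapses line breaks into single spaces and leaves the content otherwise whole, in line with the advice not to shorten or split the message.

```python
import re


def normalize_message(text: str) -> str:
    """Collapse newlines and the whitespace around them into single spaces.

    Nothing is truncated or split; only line breaks are removed.
    """
    return re.sub(r"\s*[\r\n]+\s*", " ", text).strip()


if __name__ == "__main__":
    draft = "First line.\nSecond line.\r\n\r\nThird."
    print(normalize_message(draft))
```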

Configuration

Config file: ~/.ava/ava.env

ElevenLabs Settings

Variable	Default	Description
`ELEVENLABS_API_KEY`	(none)	API key -- required
`AVA_ELEVENLABS_TTS_MODEL`	`eleven_v3`	TTS model
`AVA_ELEVENLABS_TTS_VOICE`	`k4hP4cQadSZQc0Oar2Ld`	Voice ID (Donna)
`AVA_ELEVENLABS_STT_MODEL`	`scribe_v2_realtime`	STT model
`AVA_ELEVENLABS_REALTIME_STT`	`true`	Use realtime streaming STT
`AVA_SILENCE_THRESHOLD_MS`	`2000`	Silence threshold in ms (2.0s default)
`AVA_VAD_AGGRESSIVENESS`	`2`	VAD strictness (0-3); higher rejects more background audio
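As a rough illustration of how the defaults in this table combine with user overrides, the sketch below loads the variables from any string mapping, such as `os.environ` or a parsed ava.env. `ElevenLabsSettings` and `load_settings` are hypothetical names. The boolean parsing (`"true"`, case-insensitive) is an assumption, since the source does not say how Ava reads it.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ElevenLabsSettings:
    """Values from the table above; defaults mirror the documented ones."""

    api_key: str
    tts_model: str = "eleven_v3"
    tts_voice: str = "k4hP4cQadSZQc0Oar2Ld"
    stt_model: str = "scribe_v2_realtime"
    realtime_stt: bool = True
    silence_threshold_ms: int = 2000
    vad_aggressiveness: int = 2


def load_settings(env: Mapping[str, str]) -> ElevenLabsSettings:
    """Build settings from a mapping, falling back to the documented defaults."""
    api_key = env.get("ELEVENLABS_API_KEY", "")
    if not api_key:
        raise ValueError("ELEVENLABS_API_KEY is required")
    vad = int(env.get("AVA_VAD_AGGRESSIVENESS", "2"))
    if not 0 <= vad <= 3:
        raise ValueError("AVA_VAD_AGGRESSIVENESS must be between 0 and 3")
    # Assumption: the flag is enabled only by the literal string "true".
    realtime = env.get("AVA_ELEVENLABS_REALTIME_STT", "true").strip().lower() == "true"
    return ElevenLabsSettings(
        api_key=api_key,
        tts_model=env.get("AVA_ELEVENLABS_TTS_MODEL", "eleven_v3"),
        tts_voice=env.get("AVA_ELEVENLABS_TTS_VOICE", "k4hP4cQadSZQc0Oar2Ld"),
        stt_model=env.get("AVA_ELEVENLABS_STT_MODEL", "scribe_v2_realtime"),
        realtime_stt=realtime,
        silence_threshold_ms=int(env.get("AVA_SILENCE_THRESHOLD_MS", "2000")),
        vad_aggressiveness=vad,
    )


if __name__ == "__main__":
    print(load_settings({"ELEVENLABS_API_KEY": "example", "AVA_VAD_AGGRESSIVENESS": "3"}))
```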

Architecture

Server: Single HTTP MCP server on http://127.0.0.1:8765/mcp
Auto-start: Managed by launchd (macOS) via scripts/ava-server.sh
TTS: ElevenLabs eleven_v3 with convert() + play() via ffplay
STT: ElevenLabs Scribe v2 Realtime (WebSocket streaming) with manual commit mode
VAD: Local Silero VAD (ONNX, no PyTorch) for silence detection -- sends manual commit when silence exceeds 2.0s threshold
Audio caching: Recordings cached in memory for crash resilience -- if ElevenLabs disconnects mid-stream, cached audio is batch-transcribed
Audio I/O: Direct mic/speaker access on the host machine
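The VAD-driven manual commit described above can be pictured as a small state machine that counts trailing silence. This is a sketch under stated assumptions, not Ava's implementation. The class name `SilenceTracker`, the 32 ms frame length, and the choice to ignore silence before any speech has been heard are all assumptions. The per-frame speech decision would come from the VAD and is passed in here as a boolean.

```python
from dataclasses import dataclass


@dataclass
class SilenceTracker:
    """Tracks trailing silence and signals when a manual commit is due."""

    threshold_ms: int = 2000  # mirrors the 2.0s default described above
    frame_ms: int = 32        # assumed frame length; not stated in the source
    heard_speech: bool = False
    silence_ms: int = 0

    def update(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True when a commit should be sent."""
        if is_speech:
            self.heard_speech = True
            self.silence_ms = 0
            return False
        if not self.heard_speech:
            # Assumption: silence before any speech does not trigger a commit.
            return False
        self.silence_ms += self.frame_ms
        return self.silence_ms >= self.threshold_ms


if __name__ == "__main__":
    tracker = SilenceTracker(threshold_ms=100, frame_ms=50)
    frames = [False, True, True, False, False]
    print([tracker.update(f) for f in frames])
```

Resetting the counter on every speech frame means a brief pause mid-sentence does not end the turn. Only an uninterrupted run of silence that reaches the threshold does.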

Server Management

# Via script (manages launchd plist)
scripts/ava-server.sh setup    # Create launchd plist + start
scripts/ava-server.sh start    # Start server
scripts/ava-server.sh stop     # Stop server
scripts/ava-server.sh restart  # Restart server
scripts/ava-server.sh status   # Check status
scripts/ava-server.sh logs     # Tail server logs
