Jarvis Voice Plugin

A voice-driven interaction plugin for Claude Code. Local speech-to-text, text-to-speech, and speaker verification — fully offline, fully hands-free.

What It Does

Jarvis turns Claude Code into a voice assistant. You speak, it listens. It speaks back. No cloud APIs, no latency penalties — everything runs on-device.

Pipeline

Mic (SoX rec) → 16kHz Downsample → VAD (Silero) → Denoiser (GTCRN)
    → Speaker Verification (WeSpeaker) → STT (Whisper tiny.en)
    → Transcription Queue → Claude Code

Claude Code → TTS (Kokoro 24kHz) → Temp File → SoX play → Speakers

Features

Kokoro TTS — High-quality neural text-to-speech at 24kHz with 11 voice options
Whisper STT — Local speech-to-text using Whisper tiny.en
Speaker Verification — WeSpeaker embeddings ensure only the enrolled user's voice is processed
Speech Denoiser — GTCRN model provides ~34dB noise reduction before processing
Voice Activity Detection — Silero VAD detects speech segments in real-time
Speech Accumulation — Consecutive speech segments are merged with 3-second silence gap detection, delivering complete sentences instead of fragments
Pause/Resume — Say "Jarvis pause" to mute listening, "Jarvis resume" to restart
Wake Word (optional) — Queue filtering mode that only passes messages starting with "Jarvis"
Listening Indicator — Short tone plays when the assistant is ready for input
Concurrent Capture — Mic stays active during TTS playback so user speech queues up
Acoustic Self-Test — Loopback test that speaks a phrase and verifies transcription through the full pipeline

Voice Enrollment

On first run, Jarvis guides you through 5 phrases to build a speaker profile. This profile is used for speaker verification — filtering out background voices and ambient noise.

Architecture

src/
  main.ts                  — Entry point, MCP server + orchestrator
  mcp/
    entry.ts               — JSON-RPC stdio server with lifecycle handlers
    tools.ts               — MCP tool definitions and handlers
  pipeline/
    orchestrator.ts        — Coordinates all pipeline components
    vad.ts                 — Voice Activity Detection (Silero)
    stt.ts                 — Speech-to-Text (Whisper)
    tts.ts                 — Text-to-Speech (Kokoro / Piper VITS fallback)
    queue.ts               — Transcription queue with wake word filtering,
                             pause/resume, and speech accumulation
    embedding-extractor.ts — Speaker embedding extraction
    speaker-verify.ts      — Cosine similarity speaker verification
  audio/
    capture.ts             — Mic capture via SoX rec with rate downsampling
    playback.ts            — Audio playback via SoX play with temp files
  profile/
    enrollment.ts          — Voice enrollment session management
    storage.ts             — Profile persistence
    passive-refine.ts      — Background profile refinement
  models/
    registry.ts            — Model registry with optional voice downloads
    downloader.ts          — Model download and extraction
  config.ts                — Runtime configuration
  logging/
    logger.ts              — Ring buffer logger with scoped contexts

Models

Model	Purpose	Size
Silero VAD	Voice activity detection	2 MB
Whisper tiny.en	Speech-to-text	150 MB
WeSpeaker ResNet34	Speaker verification	20 MB
Kokoro v0.19	Text-to-speech (24kHz)	330 MB
Kokoro voices	Voice embeddings	5.5 MB
GTCRN	Speech denoising	0.5 MB

MCP Tools

Tool	Description
`GetVoiceStatus`	Pipeline state, VAD activity, listening mode, pause state
`ListenForResponse`	Block until speech detected (with accumulation)
`SpeakText`	TTS synthesis + playback, supports `expect_response` for Q&A flow
`StartEnrollment`	Begin/advance voice enrollment session
`TestEnrollment`	Verify enrollment quality
`SaveProfile`	Persist voice profile
`ResetProfile`	Delete voice profile
`SetMode`	Change capture mode
`SetThreshold`	Adjust VAD sensitivity or speaker confidence at runtime
`DownloadModels`	Download missing ML models
`GetDebugLog`	Ring buffer log entries
`GetSessionStats`	Utterance counts, verification rates, latency stats

Installation

Prerequisites

macOS (CoreAudio required for mic capture)
Node.js 18+
SoX — audio capture and playback
```
brew install sox
```

Install as Claude Code Plugin

# Add the Jarvis marketplace
/plugin marketplace add civitas-cerebrum/jarvis-plugin

# Install the plugin
/plugin install jarvis-voice@jarvis-marketplace

Dependencies install automatically on first session start. Voice models (~300MB) download on first use.

Activate

Start a Claude Code session and say:

/jarvis-voice:jarvis-voice

jarvis-voice

Popularity

Health & Quality

What's Inside

README