This skill should be used when the user asks to "transcribe this audio", "transcribe this recording", "convert speech to text", "transcribe voice memo", "transcribe this file", "dictation", "speech recognition", "speech-to-text", "STT", or needs to transcribe audio files, voice memos, interviews, or recordings. Provides a resilient default trio transcription pipeline (AssemblyAI Universal-3 + ElevenLabs Scribe v2 + Cohere local) with Claude-powered merge, manual-merge fallback, resumable runs, and a learning correction dictionary.
Install via:

```bash
npx claudepluginhub oliverames/ames-claude --plugin ames-standalone-skills
```

This skill is limited to using the following tools:
Dual-engine audio transcription with Claude-powered interactive merge and dictionary learning. Optionally scales to an eight-model ensemble pipeline with Opus 4.6 consensus merge and structural review for maximum accuracy.
Invoke by saying "transcribe this audio", "transcribe [file]", "fix this transcript", or "batch transcribe [folder]". For setup, run `bash ${CLAUDE_PLUGIN_ROOT}/skills/smart-transcribe/scripts/setup.sh`.
| Flag | Description |
|---|---|
| `--fix-transcript FILE` | Correct an existing transcript (`.srt`, `.vtt`, `.txt`, `.md`); skips audio engines entirely |
| `--context NAME` | Load a named per-project context overlay (e.g. `--context bcbs-vt`). First use triggers a short interview. Omit NAME to list saved contexts. |
| `--review` | Interactively review applied corrections before saving; dispute false positives to log them for dictionary cleanup |
| `--engines E1,E2,...` | Choose transcription engines (default: `assemblyai-u3-pro,scribe-v2,cohere-transcribe`). Run `--list-engines` for all IDs. |
| `--list-engines` | Print all available engine IDs and aliases, then exit |
| `--speakers "A,B"` | Comma-separated speaker names to help identification |
| `--no-diarization` | Disable speaker diarization (faster; for single-speaker recordings) |
| `--doctor` | Verify Python runtime, API key resolution, ffmpeg/ffprobe, SDK imports, HF token presence, and Claude merge availability |
| `--check-engine scribe-v2` | Run an engine startup self-test without transcribing |
| `--merge-mode manual` | Skip Claude, save normalized per-engine outputs, and generate a comparison bundle with a recommended base transcript |
| `--resume` | Reuse completed per-engine outputs from the run directory |
| `--rerun-engine ENGINE_ID` | Re-run just one engine while resuming |
| `--use-system-python` / `--engine-python scribe-v2=/path/to/python` | Escape hatches for runtime selection |
**Fix-transcript mode** (`--fix-transcript FILE`): accepts an existing transcript file (`.srt`, `.vtt`, `.txt`, `.md`) and runs it through the dictionary + LLM review pipeline, with no audio re-transcription. Useful for correcting outputs from Whisper, YouTube, Riverside, Descript, or any other STT tool.
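The dictionary pass can be sketched as a simple whole-word substitution over the transcript text. This is a minimal illustration, assuming a flat "wrong → right" mapping; the real schema in `transcription-dictionary.json` may differ.

```python
import re

# Hypothetical correction entries for illustration only; the real
# dictionary ships with the skill and evolves via the learning loop.
corrections = {"open a i": "OpenAI", "assembly ai": "AssemblyAI"}

def apply_dictionary(text: str, corrections: dict[str, str]) -> str:
    """Apply whole-word, case-insensitive corrections to transcript text."""
    for wrong, right in corrections.items():
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        text = pattern.sub(right, text)
    return text

print(apply_dictionary("We used open a i and assembly ai today.", corrections))
# We used OpenAI and AssemblyAI today.
```

The LLM review pass then runs on top of this mechanically corrected text, so the dictionary handles the deterministic fixes and the model handles the judgment calls.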
Workflow:
1. Loads the dictionary, plus the named `--context` overlay if provided
2. Applies corrections and the LLM review pass
3. Saves the corrected `.md` plus a copy of the original alongside it (never modifies the source)

**Three-engine pipeline.** Engines run in parallel (cloud) or sequentially (local), then Claude Code headless merges all transcripts with per-engine weighting: AssemblyAI for speaker structure, ElevenLabs for word accuracy, Cohere as tiebreaker. If headless Claude is unavailable or rate-limited, the script falls back to headless Codex when it is installed.
Default engines: AssemblyAI Universal-3 Pro (SLAM-1; diarization + chapters + entities) + ElevenLabs Scribe v2 (best cloud accuracy) + Cohere Transcribe (local, #1 HF-WER, free). If one engine fails, the run continues, marks the failure clearly, and emits a `default trio degraded` notice. It can optionally retry with Voxtral Small as a fallback.
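The failure-tolerant parallel run can be sketched as follows. This is an illustrative sketch, not the actual `smart-transcribe.py` implementation; `demo_engine` and the simulated outage are stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def demo_engine(name: str) -> str:
    # Stand-in for a real engine call; "scribe-v2" simulates a cloud outage.
    if name == "scribe-v2":
        raise RuntimeError("503 from ElevenLabs")
    return f"transcript from {name}"

def run_trio(engines, run_engine):
    """Run engines in parallel; a single failure does not abort the run."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {pool.submit(run_engine, e): e for e in engines}
        for future, engine in futures.items():
            try:
                results[engine] = future.result()
            except Exception as exc:
                failures[engine] = str(exc)
    # A non-empty failure set is what gets surfaced as "default trio degraded".
    degraded = bool(failures)
    return results, failures, degraded

results, failures, degraded = run_trio(
    ["assemblyai-u3-pro", "scribe-v2", "cohere-transcribe"], demo_engine)
# degraded is True; the two surviving transcripts still go to the merge step.
```

The key design point is that a degraded run still produces a merged transcript from the surviving engines, with the failure clearly labeled rather than silently dropped.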
An eight-model pipeline for maximum accuracy. Invoked via the transcribe command when the user requests "ensemble", "maximum accuracy", "full pipeline", or selects a multi-model preset.
Phases:
Available Models:
Benchmarks use two different methodologies (AA-WER and HF-WER), so results are not directly comparable across systems:
| # | Model | Type | Cost/1K min | Notes |
|---|---|---|---|---|
| 1 | ElevenLabs Scribe v2 | Cloud | $6.67 | 2.3% AA-WER (#1), 5.83% HF-WER (#6); best overall for cloud accuracy |
| 2 | Mistral Voxtral Small | Cloud | $4.00 | 2.9% AA-WER; context biasing via prompt |
| 3 | Google Gemini 3 Pro | Cloud | $18.40 | 2.9% AA-WER; multimodal, most expensive cloud option |
| 4 | AssemblyAI Universal-3 Pro | Cloud | $3.50 | 3.2% AA-WER; best speaker diarization (used for speaker scaffolding) |
| 5 | OpenAI GPT-4o Transcribe | Cloud | ~$6.00 | ~2.46% WER (OpenAI self-reported); RL-trained ASR |
| 6 | OpenAI GPT-4o Mini Transcribe | Cloud | ~$3.00 | Decorrelated errors from GPT-4o full; budget option |
| 7 | Voxtral Mini Realtime | Local | Free | 4B params, mlx-audio on Apple Silicon; 7.68% HF-WER |
| 8 | Cohere Transcribe | Local | Free | Apache 2.0; 5.42% HF-WER (#1 on leaderboard); 524x RTFx, 3x faster than Whisper; 14 languages |
Presets:
Detailed strengths and weaknesses for each engine. Use these to choose the right engines for your recording type.
1. ElevenLabs Scribe v2 (scribe-v2)
2. Mistral Voxtral Small (voxtral-small)
3. Google Gemini 3 Pro (gemini-pro)
4. AssemblyAI Universal-3 Pro (assemblyai-u3-pro)
5. OpenAI GPT-4o Transcribe (gpt4o-transcribe)
6. OpenAI GPT-4o Mini Transcribe (gpt4o-mini-transcribe)
7. Voxtral Mini Realtime (voxtral-mini-local) — local
8. Cohere Transcribe (cohere-transcribe) — local
Every LLM merge/review pass now produces a structured transparency report appended to the output `.md`.
The report is also printed to the terminal after every run. With --review, the APPLIED section becomes interactive: accept, dispute (logs to suggestions file for later dict cleanup), or skip each item.
Named contexts (e.g. `bcbs-vt`) live in `~/.config/smart-transcribe/contexts/<name>.json`. They use the same schema as the main dictionary and deep-merge on top of it:
Terms flow into each engine's `keyterms_prompt` / `keyterms` fields.

**Claude merge.** `claude -p --model opus --effort medium` reads per-engine capability profiles and produces structured 4-section output (metadata, transcript, transparency report, suggestions). If Claude is rate-limited or unavailable, the fallback is `codex exec` in headless mode.

**Ensemble mode.** All 8 engines are available via `--engines`. Cloud engines run in parallel; local engines run sequentially. The Claude Code headless merge step is the same.
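The context deep-merge described above can be sketched as a recursive dictionary merge where the overlay wins on conflicts. The schema fragments below are hypothetical illustrations, not the real dictionary format:

```python
def deep_merge(base: dict, overlay: dict) -> dict:
    """Recursively merge overlay onto base; overlay values win on conflicts."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical schema fragments for illustration only.
main_dict = {"corrections": {"bcbs": "BCBS"}, "speakers": ["Oliver"]}
context = {"corrections": {"vt": "Vermont"}, "speakers": ["Kate"]}
merged = deep_merge(main_dict, context)
# merged["corrections"] now holds both entries; non-dict values like
# "speakers" are replaced wholesale rather than concatenated.
```

Because nested dicts merge while scalars and lists replace, a context can add project-specific corrections without losing the main dictionary's entries.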
Scripts and data files:

- `${CLAUDE_PLUGIN_ROOT}/skills/smart-transcribe/scripts/smart-transcribe.py`
- `${CLAUDE_PLUGIN_ROOT}/skills/smart-transcribe/scripts/ensemble.py`
- `${CLAUDE_PLUGIN_ROOT}/skills/smart-transcribe/scripts/batch-transcribe-folder.py`
- `${CLAUDE_PLUGIN_ROOT}/skills/smart-transcribe/scripts/setup.sh`
- `${CLAUDE_PLUGIN_ROOT}/skills/smart-transcribe/data/transcription-dictionary.json`
- `~/.config/smart-transcribe/suggested-additions.jsonl`

Environment variables are checked first; otherwise all keys are resolved at runtime from 1Password (Development vault) via `op item get`.
Required for standard 3-engine mode:
- `ASSEMBLYAI_API_KEY`: AssemblyAI Universal-3 / SLAM-1
- `ELEVENLABS_API_KEY`: Scribe v2

Optional (additional engines):
- `MISTRAL_API_KEY`: Voxtral Small
- `GOOGLE_API_KEY`: Gemini 3 Pro
- `OPENAI_API_KEY`: GPT-4o Transcribe + Mini
- `HF_TOKEN`: required for Cohere Transcribe (gated HuggingFace repo)

Merge (always required):
- Claude Code CLI (`claude`) should be authenticated; the merge uses `claude -p --model opus --effort medium`
- Codex CLI (`codex`) is the automatic fallback merge runner when Claude headless is unavailable or rate-limited
- `ffmpeg`: audio format conversion (installed via Homebrew); handles .qta, .m4a, .mp3, etc.
- Python dependencies (`scripts/requirements.txt`): `setup.sh` auto-detects the highest available Python >= 3.13; re-run it after upgrading Python to rebuild venvs
- `torch`, `transformers`, `soundfile`, `librosa`: for Cohere Transcribe (local); model cached at `~/.cache/huggingface/hub/`

Run `bash ${CLAUDE_PLUGIN_ROOT}/skills/smart-transcribe/scripts/setup.sh` to create dedicated Python 3.13 engine runtimes and install Python dependencies.
API keys are resolved from 1Password at runtime — no keys.env configuration needed.
The plugin uses a seed + user dictionary architecture:
- Seed dictionary (`${CLAUDE_PLUGIN_ROOT}/skills/smart-transcribe/data/transcription-dictionary.json`): read-only reference that ships with the skill.
- User dictionary (`~/.config/smart-transcribe/transcription-dictionary.json` by default): the personal, evolving copy that the learning loop writes to.

The dictionary contains:
After each transcription, new terms are identified and presented to the user for approval before being added to the user dictionary.
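The approve-then-write learning loop can be sketched as folding approved entries from the suggestions JSONL into the user dictionary. The `{"wrong": ..., "right": ...}` record shape is an assumption for illustration; the demo uses temp files in place of the real `~/.config/smart-transcribe/` paths.

```python
import json
import tempfile
from pathlib import Path

def approve_suggestions(suggestions_path: Path, user_dict_path: Path,
                        approved: set[str]) -> dict:
    """Fold user-approved suggestions from the JSONL log into the user dictionary."""
    user_dict = (json.loads(user_dict_path.read_text())
                 if user_dict_path.exists() else {"corrections": {}})
    for line in suggestions_path.read_text().splitlines():
        item = json.loads(line)  # assumed shape: {"wrong": ..., "right": ...}
        if item["wrong"] in approved:
            user_dict["corrections"][item["wrong"]] = item["right"]
    user_dict_path.write_text(json.dumps(user_dict, indent=2))
    return user_dict

# Demo with temporary files standing in for the real config paths.
tmp = Path(tempfile.mkdtemp())
(tmp / "suggested-additions.jsonl").write_text(
    '{"wrong": "cohere", "right": "Cohere"}\n'
    '{"wrong": "ktor", "right": "Ktor"}\n')
result = approve_suggestions(tmp / "suggested-additions.jsonl",
                             tmp / "transcription-dictionary.json",
                             approved={"cohere"})
# Only the approved entry lands in the dictionary; the rest stay in the log.
```

Keeping unapproved suggestions in the JSONL log means nothing is silently discarded: a later `--review` session can still promote or dispute them.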
Scripts at `${CLAUDE_PLUGIN_ROOT}/skills/smart-transcribe/scripts/`. Skill source: `plugins/ames-standalone-skills/skills/smart-transcribe/` in the ames-claude repo.