From claude-transcription
Prepare audio for transcription — format normalise (mono/16kHz/opus), loudness normalise (EBU R128), and collapse long silences (silero-vad). Optional denoise pass. Use before sending to AssemblyAI or any ASR for cleaner results, smaller uploads, and lower cost. Use when the user asks to "preprocess audio", "prep for transcription", "clean up a recording before sending it", or just hands over a raw voice memo to be transcribed.
npx claudepluginhub danielrosehill/claude-code-plugins --plugin claude-transcription

This skill uses the workspace's default tool permissions.
Single orchestrator that takes a raw recording and emits a transcription-ready file. Source is never modified.
| Pass | Default | Notes |
|---|---|---|
| 1. Denoise | off (flag --denoise) | rarely worth it for transcription — modern ASR is robust to moderate noise. See "Should I denoise?" below. |
| 2. Format normalise | on | mono, 16 kHz, opus 24k |
| 3. Loudness normalise | on | EBU R128 via ffmpeg loudnorm |
| 4. Silence cap | on | silero-vad detects speech, every gap capped at 0.4s max (no speech ever clipped). Sub-0.4s natural pauses pass through untouched. |
Passes 2+3 run in a single ffmpeg invocation (one decode/encode). Pass 4 runs after.
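Assembled as a command line, that single decode/encode invocation might look like the following sketch. The filter and codec values are the ones quoted in the internals section below; the filenames are placeholders, not the script's actual paths.

```python
# Sketch: one ffmpeg invocation covering passes 2+3
# (format normalise + loudness normalise in a single decode/encode).
# Filter/codec values come from this document; filenames are hypothetical.

def build_norm_cmd(src: str, dst_wav: str) -> list[str]:
    return [
        "ffmpeg", "-i", src,
        "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",  # EBU R128 loudness normalise
        "-ac", "1",                              # mono
        "-ar", "16000",                          # 16 kHz
        "-c:a", "pcm_s16le",                     # WAV intermediate for the VAD step
        dst_wav,
    ]

print(" ".join(build_norm_cmd("recording.m4a", "recording.norm.wav")))
```

The VAD pass then reads the WAV intermediate, and the final opus encode happens afterwards.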
If invoked inside an existing project structure with audio/ or context/ already present, respect it. Otherwise:
- audio/raw/<basename> if the user is OK with that
- audio/processed/<stem>.preprocessed.opus
- audio/processed/<stem>.<ext>

Never overwrite — if <stem>.preprocessed.opus exists, suffix .v2, .v3, etc.
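The never-overwrite rule can be sketched as a small helper. Note the exact placement of the `.v2` suffix is an assumption (the rule only says "suffix .v2, .v3"); this sketch inserts it before the extension.

```python
# Sketch of the never-overwrite rule: find the first free versioned name.
# Suffix placement (before the extension) is an assumption.
from pathlib import Path

def next_free(out_dir: Path, name: str) -> Path:
    candidate = out_dir / name
    if not candidate.exists():
        return candidate
    base, ext = name.rsplit(".", 1)   # "rec.preprocessed.opus" -> base + "opus"
    n = 2
    while (out_dir / f"{base}.v{n}.{ext}").exists():
        n += 1
    return out_dir / f"{base}.v{n}.{ext}"
```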
The full pipeline is shipped as a reusable script. Invoke it directly rather than reimplementing:
scripts/preprocess-audio.sh [--denoise] [--no-silence-trim] [--max-gap SEC] [--out-dir DIR] INPUT
The script handles all four passes and writes <stem>.preprocessed.opus to the output dir.
Internals (for reference):
- Denoise: ffmpeg -af "afftdn=nf=-25". Off by default; enable with --denoise for noisy field recordings.
- Format + loudness: ffmpeg -af "loudnorm=I=-16:TP=-1.5:LRA=11" -ac 1 -ar 16000 -c:a pcm_s16le (WAV intermediate so the Python step doesn't need codec deps).
- Silence cap: scripts/silence-collapse.py — silero-vad (ML-based, used inside Whisper / NeMo) detects speech segments, then every gap between segments is capped at --max-gap (default 0.4s). Speech is never clipped; only the dead air between segments is truncated. Sub-max-gap natural pauses pass through untouched.
- Stats: output stats (size, duration, removed seconds, removal %, speech segment count) are written to <stem>.vad-stats.json.
Why a --max-gap cap and not a >min-gap collapse?

An earlier iteration used "collapse silences over 2.5s." That preserved natural cadence but felt loose, because medium-length pauses (1–2.5s) survived intact. The cap-gap approach uniformly truncates every gap longer than the threshold, eliminating all dead air without ever cutting speech.
Validated on a 41-minute voice memo: cap@0.4s removes ~14% (vs ~3.6% with the old collapse>2.5s approach), no perceptible word clipping. Auto-editor was tried and rejected — its energy-threshold approach chops mid-syllable on quiet speech.
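The cap logic itself fits in a few lines. This is a sketch, not the contents of silence-collapse.py, and the segment times below are illustrative, not taken from the validation memo:

```python
# Sketch of the --max-gap cap: speech segments are kept whole,
# and only the silence between them is shortened. Times are made up.

def removed_by_cap(segments, max_gap=0.4):
    """Seconds of silence removed when each inter-speech gap is capped at max_gap."""
    removed = 0.0
    for (_, prev_end), (start, _) in zip(segments, segments[1:]):
        gap = start - prev_end
        if gap > max_gap:
            removed += gap - max_gap  # speech itself is never touched
    return removed

segments = [(0.0, 3.0), (3.3, 7.0), (8.5, 12.0)]  # gaps of 0.3s and 1.5s
print(removed_by_cap(segments))  # the 0.3s pause survives; the 1.5s gap shrinks to 0.4s
```

A >2.5s collapse would have left both of these gaps untouched, which is exactly why the cap approach removes more dead air.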
After running, print:
Input: recording.m4a (12m 04s, stereo 48kHz, 14 MB)
Output: audio/processed/recording.preprocessed.opus (9m 18s, mono 16kHz, 1.6 MB)
Removed: 2m 46s of silence (23%)
Loudness: normalised to -16 LUFS
Denoise: skipped (use --denoise to enable)
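The summary numbers follow from simple arithmetic on the input and output durations; here is a sketch that reproduces the "Removed" line from the example above (12m 04s in, 9m 18s out):

```python
# Sketch: derive the "Removed" summary line from raw durations in seconds.

def fmt(sec: float) -> str:
    m, s = divmod(round(sec), 60)
    return f"{m}m {s:02d}s"

in_dur, out_dur = 724, 558            # 12m 04s -> 9m 18s, as in the example
removed = in_dur - out_dur            # 166 s
pct = round(100 * removed / in_dur)
print(f"Removed: {fmt(removed)} of silence ({pct}%)")  # prints "Removed: 2m 46s of silence (23%)"
```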
- --denoise — enable pass 1
- --no-silence-trim — skip pass 4 (keep all silence)
- --max-gap <seconds> — override the default 0.4s gap cap (lower = tighter)
- --out-dir <path> — override the default audio/processed/

If the input is already mono 16kHz opus and was recorded in a controlled environment → just transcribe directly.

Should I denoise? Probably not, for transcription.
POC at https://github.com/danielrosehill/Crying-Baby-Audio-Scrub tested DeepFilterNet vs raw audio (recording with background baby crying) into Whisper. The cleaned and uncleaned transcripts differed by only 5–6 word choices across 119 seconds. Modern ASR (Whisper, AssemblyAI, Gemini) already handles moderate background noise.
Reach for --denoise (or the standalone denoise skill, which has stronger backends) only when the noise is severe enough to actually degrade the transcript — e.g. loud, continuous background noise in a field recording.
The --denoise flag here uses ffmpeg afftdn (basic, fast, ~no quality cost). For higher-quality denoise (DeepFilterNet, Auphonic, ElevenLabs Voice Isolator), use the standalone denoise skill before this preprocessor.
Validated 2026-04-28 against a 41-minute combined voice memo (Job-Search-Planning-0426), run with --max-gap 0.4.