Generate audio content — text-to-speech, podcasts, voice cloning, sound effects, speech-to-speech, dubbing, and audio isolation. Currently powered by ElevenLabs. Works with both the Python SDK and the ElevenLabs CLI. Includes ready-to-run generator scripts that Claude writes to a temp file and executes directly. Triggers: audio, elevenlabs, text-to-speech, TTS, podcast, voice, voiceover, narration, voice clone, sound effects, dubbing, speech-to-speech, audio isolation.
npx claudepluginhub ondrej-svec/heart-of-gold-toolkit --plugin babel-fish
Generate audio — from single-line TTS to multi-voice podcasts. Currently powered by ElevenLabs. Includes ready-to-run Python scripts. Claude writes them to a temp file and executes directly.
This skill MAY: install the SDK, generate audio files, list voices, create voice clones, write and execute generation scripts, play audio.
This skill MAY NOT: store API keys in code (use env vars or ~/.elevenlabs/api_key), commit audio files to git, generate audio without user approval of the script first.
| Shortcut | Why It Fails | The Cost |
|---|---|---|
| Hardcode API key in script | Leaks credentials to git history | Security incident |
| Skip voice selection | Default voice may not match content tone | Wasted credits on re-gen |
| Generate full podcast without preview | Long audio = expensive; mistakes compound | Non-refundable credits |
| Import pydub for concatenation | Broken on Python 3.13+ (audioop removed) | Runtime crash |
| Use VoiceSettings with cloned voices | Custom settings destabilize cloned voices | Garbled/robotic audio |
| Use ... for pauses | Causes hesitation/nervousness artifacts | Unnatural stuttering |
| Use large chunks for long content | Quality degrades in second half | Robotic pacing |
| Skip language_code with accented speakers | Model guesses language from accent | Chinese/French mid-narration |
Entry: User wants audio content.
Check in order — use the first one found:
# 1. Check CLI auth (preferred — already logged in)
elevenlabs auth whoami --no-ui 2>/dev/null
# 2. Check env var
echo "${ELEVENLABS_API_KEY:+set}"
# 3. Check stored key file
cat ~/.elevenlabs/api_key 2>/dev/null | head -c 10
API key resolution in Python (use this in ALL scripts):
import os
def get_api_key() -> str:
    """Resolve ElevenLabs API key from CLI store, env var, or fail."""
    key_file = os.path.expanduser("~/.elevenlabs/api_key")
    if os.path.exists(key_file):
        return open(key_file).read().strip()
    key = os.environ.get("ELEVENLABS_API_KEY", "")
    if key:
        return key
    raise RuntimeError(
        "No ElevenLabs API key found. Run `elevenlabs auth login` or "
        "set ELEVENLABS_API_KEY environment variable."
    )
python3 -c "import elevenlabs" 2>/dev/null || uv pip install --system --break-system-packages elevenlabs
IMPORTANT: Do NOT install pydub. It's broken on Python 3.13+ (audioop removed). The scripts below use raw MP3 byte concatenation — MP3 is a frame-based format and files can be concatenated directly.
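For reference, a minimal sketch of that concatenation (file names are placeholders):

# Concatenate MP3 files by appending raw bytes — no pydub, no re-encoding.
parts = ["intro.mp3", "body.mp3", "outro.mp3"]
with open("combined.mp3", "wb") as out:
    for path in parts:
        with open(path, "rb") as src:
            out.write(src.read())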
IMPORTANT: On Python 3.14+, client.text_to_speech.convert() returns a generator, not bytes. All scripts below use a to_bytes() helper to normalize this.
python3 -c "
from elevenlabs.client import ElevenLabs
import os
key_file = os.path.expanduser('~/.elevenlabs/api_key')
api_key = open(key_file).read().strip() if os.path.exists(key_file) else os.environ.get('ELEVENLABS_API_KEY', '')
client = ElevenLabs(api_key=api_key)
voices = client.voices.get_all()
print(f'Connected. {len(voices.voices)} voices available.')
for v in voices.voices[:10]:
    labels = dict(v.labels) if v.labels else {}
    print(f' {v.voice_id} | {v.name:25s} | {labels.get(\"accent\", \"\")} {labels.get(\"gender\", \"\")}')
"
CRITICAL: Voice IDs are account-specific. Never hardcode voice IDs from examples or documentation — always run Step 3 first to discover the actual IDs available.
Exit: Auth verified, SDK installed, voices listed.
Entry: User wants a single audio file from text (< 5,000 chars).
#!/usr/bin/env python3
"""ElevenLabs TTS generator."""
import os
from elevenlabs.client import ElevenLabs
TEXT = """Your text here."""
VOICE_ID = "FILL_FROM_VOICE_LIST"
MODEL_ID = "eleven_multilingual_v2"
OUTPUT_PATH = "output.mp3"
def to_bytes(audio) -> bytes:
    return audio if isinstance(audio, bytes) else b"".join(audio)

key_file = os.path.expanduser("~/.elevenlabs/api_key")
api_key = open(key_file).read().strip() if os.path.exists(key_file) else os.environ["ELEVENLABS_API_KEY"]
client = ElevenLabs(api_key=api_key)
print(f"Generating {len(TEXT)} chars with {MODEL_ID}...")
audio = to_bytes(client.text_to_speech.convert(
    text=TEXT,
    voice_id=VOICE_ID,
    model_id=MODEL_ID,
    output_format="mp3_44100_128",
    language_code="en",  # ALWAYS set for cloned/accented voices
))
with open(OUTPUT_PATH, "wb") as f:
    f.write(audio)
print(f"Saved to {OUTPUT_PATH} ({os.path.getsize(OUTPUT_PATH) / 1024:.0f} KB)")
Exit: Audio file saved.
Entry: User wants narration of long-form content (> 5,000 chars).
THIS IS THE CRITICAL PHASE. Long-form audio requires special handling to maintain quality throughout. The approach below was battle-tested and is the only one that produces consistent quality across 10+ minute narrations.
Create a separate speech-text.md adapted for listening:
| Written form | Speech form | Why |
|---|---|---|
| 90% | ninety percent | TTS mispronounces digits |
| 1.7 times | one point seven times | Same |
| 2 AM | two in the morning | Natural speech |
| Kačka | Kachka | Phonetic for TTS |
| Žaneta | Zhaneta | Phonetic for TTS |
| Aibility | Eigh-bility | Phonetic — write directly in text |
| **bold text** | bold text | Strip all markdown |
| --- | (remove) | Strip section breaks |
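A minimal sketch of the mechanical part of these conversions (markdown stripping plus a couple of the number patterns above) — spelled-out numbers and phonetic names still need a manual pass or a library such as num2words:

import re

def to_speech_text(markdown: str) -> str:
    """Rough first pass from written form to speech form — always review by ear."""
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", markdown)       # strip bold markers
    text = re.sub(r"^\s*---+\s*$", "", text, flags=re.M)   # drop section breaks
    text = re.sub(r"(\d+)\s*%", r"\1 percent", text)        # 90% -> 90 percent (digits still need spelling out)
    return text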
Pause control:
- <break time="0.7s" /> — sub-section pause (v2 supports SSML break tags)
- <break time="1.0s" /> — major section transition
- <break time="1.2s" /> — thesis/key moment (max recommended)
- ... — causes hesitation/nervousness artifacts

What NOT to do:

- <lexeme> tags — they get read aloud as text

Why this approach: Large chunks (4000+ chars) degrade in quality — the model loses emotional range and natural pacing in the second half. Small chunks (800-1200 chars) stay high quality. Request stitching chains them together for continuity.
CRITICAL for cloned voices:
language_code="en" is mandatory — without it, the model guesses language from accent and can switch to Chinese/French mid-narration.

#!/usr/bin/env python3
"""ElevenLabs long-form narration with request stitching.
Splits text into small chunks, chains via previous_request_ids for
continuity, uses httpx directly to access request-id headers.
"""
import os
import httpx
# --- CONFIG ---
SPEECH_TEXT_PATH = "speech-text.md"
VOICE_ID = "FILL_FROM_VOICE_LIST"
OUTPUT_PATH = "speech.mp3"
CHUNK_SIZE = 1000 # chars per chunk — keep 800-1200 for quality
LANGUAGE_CODE = "en" # ALWAYS set for cloned/accented voices
# --- END CONFIG ---
api_key_file = os.path.expanduser("~/.elevenlabs/api_key")
api_key = open(api_key_file).read().strip() if os.path.exists(api_key_file) else os.environ["ELEVENLABS_API_KEY"]
with open(SPEECH_TEXT_PATH, "r") as f:
    text = f.read()

# Split into small chunks at paragraph boundaries
paragraphs = text.split("\n\n")
chunks, current = [], ""
for p in paragraphs:
    if len(current) + len(p) + 2 > CHUNK_SIZE and current.strip():
        chunks.append(current.strip())
        current = p
    else:
        current = f"{current}\n\n{p}" if current else p
if current.strip():
    chunks.append(current.strip())

print(f"Script: {len(text)} chars -> {len(chunks)} chunks")
for i, c in enumerate(chunks):
    print(f" Chunk {i+1}: {len(c)} chars")
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"
headers = {"xi-api-key": api_key, "Content-Type": "application/json"}
all_audio = b""
prev_request_id = None
for i, chunk in enumerate(chunks):
    data = {
        "text": chunk,
        "model_id": "eleven_multilingual_v2",
        "output_format": "mp3_44100_128",
        "language_code": LANGUAGE_CODE,
    }
    # Chain to previous chunk for prosody continuity
    if prev_request_id:
        data["previous_request_ids"] = [prev_request_id]
    # Give forward context from next chunk
    if i + 1 < len(chunks):
        data["next_text"] = chunks[i + 1][:500]
    print(f" [{i+1}/{len(chunks)}] {len(chunk)} chars...", end=" ", flush=True)
    resp = httpx.post(url, json=data, headers=headers, timeout=60)
    if resp.status_code != 200:
        print(f"ERROR {resp.status_code}: {resp.text[:200]}")
        break
    prev_request_id = resp.headers.get("request-id")
    all_audio += resp.content
    print(f"done ({len(resp.content)//1024} KB)")

with open(OUTPUT_PATH, "wb") as f:
    f.write(all_audio)
size_mb = os.path.getsize(OUTPUT_PATH) / (1024 * 1024)
print(f"\nSaved to {OUTPUT_PATH} ({size_mb:.1f} MB)")
Always test first: Generate chunks 1-2 as a preview clip before committing to the full generation. Credits are non-refundable.
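One way to do that with the script above is to cap the chunk list before the generation loop — a sketch, with PREVIEW as a name introduced here, not part of the original script:

# Preview mode: generate only the first two chunks into a separate file,
# listen, then disable the cap for the full run.
PREVIEW = True
if PREVIEW:
    chunks = chunks[:2]
    OUTPUT_PATH = "preview.mp3"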
Listen to the full audio. If specific sections sound off, regenerate only those chunks, passing previous_request_ids (from the preceding chunk) and next_request_ids (from the following chunk) to maintain flow.
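A sketch of regenerating a single chunk in isolation. It assumes you collected each chunk's request-id into a request_ids list during the main loop (a hypothetical addition — the original script only keeps the previous one):

# Re-generate chunk i only, chained to both neighbours for prosody continuity.
i = 4  # index of the chunk that sounds off (example)
data = {
    "text": chunks[i],
    "model_id": "eleven_multilingual_v2",
    "language_code": LANGUAGE_CODE,
}
if i > 0:
    data["previous_request_ids"] = [request_ids[i - 1]]
if i + 1 < len(request_ids):
    data["next_request_ids"] = [request_ids[i + 1]]
resp = httpx.post(url, json=data, headers=headers, timeout=60)
with open(f"chunk-{i+1}-fixed.mp3", "wb") as f:
    f.write(resp.content)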
Exit: Long-form narration audio saved.
Entry: User wants a custom voice from their audio.
| Requirement | Details |
|---|---|
| Duration | 1-2 minutes (more than 3 min can be detrimental) |
| Content | Read your own writing — natural intonation matches best |
| Quality | Quiet room, no background noise, consistent distance from mic |
| Format | MP3 128kbps or higher, mono or stereo |
| Style | Consistent pace and tone — the clone replicates EVERYTHING |
| Avoid | Stumbles, "uhm"s, long pauses, whispers, shouting, music |
CRITICAL: Do NOT pre-process the recording with ffmpeg filters (silenceremove, loudnorm, etc.). These strip voice characteristics the clone needs. The only acceptable preprocessing is trimming to length.
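If the recording needs trimming, a stream-copy cut keeps the audio untouched. A sketch calling ffmpeg from Python (assumes ffmpeg is on PATH; the 120-second cut point is only an example):

import subprocess

# -c copy trims without re-encoding or filtering, so voice characteristics are preserved.
subprocess.run(
    ["ffmpeg", "-y", "-i", "recording.mp3", "-t", "120", "-c", "copy", "trimmed.mp3"],
    check=True,
)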
from elevenlabs import ElevenLabs
client = ElevenLabs(api_key=get_api_key())
voice = client.voices.ivc.create(
    name="User Voice",
    description="Natural speaking voice for narration",
    files=[open("recording.mp3", "rb")],
    remove_background_noise=False,  # Preserve voice characteristics
)
print(f"Voice ID: {voice.voice_id}")
After cloning, ALWAYS test with a short clip before generating long content:
audio_gen = client.text_to_speech.convert(
    text="A short test sentence to verify the voice sounds right.",
    voice_id=voice.voice_id,
    model_id="eleven_multilingual_v2",
    output_format="mp3_44100_128",
    language_code="en",
    # DO NOT pass voice_settings — defaults are best for clones
)
| Model | Works with clones? | Notes |
|---|---|---|
| eleven_multilingual_v2 | YES — use this | Best voice fidelity with clones |
| eleven_v3 | NO | Smooth output but voice identity completely lost |
| eleven_flash_v2_5 | Untested | May work, lower quality expected |
| eleven_turbo_v2_5 | Untested | May work |
Do NOT override VoiceSettings for cloned voices. Default settings produce the best results. Every combination tested (stability 0.3-0.8, similarity 0.5-1.0, style 0.3-0.7, speaker boost on/off) made the output worse — garbled, robotic, or unnatural pacing.
If you must tweak, test with a single sentence first and compare to the no-settings version before committing to a full generation.
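A sketch of that A/B check, reusing the client, the cloned voice, and the to_bytes helper from above; the VoiceSettings values are illustrative, not recommendations:

from elevenlabs import VoiceSettings

SENTENCE = "A short test sentence to compare settings."

# Baseline: SDK defaults (what this skill recommends for clones)
baseline = to_bytes(client.text_to_speech.convert(
    text=SENTENCE, voice_id=voice.voice_id,
    model_id="eleven_multilingual_v2", language_code="en",
))
# Candidate: custom settings under test — example values only
candidate = to_bytes(client.text_to_speech.convert(
    text=SENTENCE, voice_id=voice.voice_id,
    model_id="eleven_multilingual_v2", language_code="en",
    voice_settings=VoiceSettings(stability=0.5, similarity_boost=0.75),
))
for name, clip in (("baseline.mp3", baseline), ("candidate.mp3", candidate)):
    with open(name, "wb") as f:
        f.write(clip)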
Exit: Custom voice created and tested.
Entry: User wants podcast-style audio (single voice, long content).
Use the Phase 2 Long-Form Narration approach with request stitching.
The old approach (4500-char chunks with previous_text) produces lower quality than small chunks with previous_request_ids.
#!/usr/bin/env python3
"""ElevenLabs multi-voice podcast generator."""
import os
from elevenlabs.client import ElevenLabs
SEGMENTS = [
    ("VOICE_ID_HOST", "Welcome to the show..."),
    ("VOICE_ID_GUEST", "Thanks for having me..."),
]
MODEL_ID = "eleven_multilingual_v2"
OUTPUT_PATH = "dialogue-podcast.mp3"
def to_bytes(audio) -> bytes:
    return audio if isinstance(audio, bytes) else b"".join(audio)

# get_api_key() is the shared resolver defined in the setup section
client = ElevenLabs(api_key=get_api_key())
audio_parts = []
for i, (voice_id, text) in enumerate(SEGMENTS):
    print(f" [{i+1}/{len(SEGMENTS)}] {text[:50]}...")
    audio_bytes = to_bytes(client.text_to_speech.convert(
        text=text,
        voice_id=voice_id,
        model_id=MODEL_ID,
        output_format="mp3_44100_128",
        language_code="en",
    ))
    audio_parts.append(audio_bytes)
with open(OUTPUT_PATH, "wb") as f:
    for part in audio_parts:
        f.write(part)
print(f"Saved to {OUTPUT_PATH}")
audio = client.text_to_sound_effects.convert(
    text="Heavy rain on a tin roof with distant thunder",
    duration_seconds=10.0,
)
with open("rain.mp3", "wb") as f:
f.write(to_bytes(audio))
Tips: be specific ("footsteps on gravel" > "walking sounds"), include environment, specify duration.
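A short sketch applying those tips — each prompt names the action, the surface, and the environment, and sets an explicit duration (prompts and durations are just examples; reuses client and to_bytes from above):

EFFECTS = {
    "footsteps.mp3": ("Slow footsteps on loose gravel in a quiet forest", 6.0),
    "rain.mp3": ("Heavy rain on a tin roof with distant thunder", 10.0),
}
for filename, (prompt, seconds) in EFFECTS.items():
    audio = client.text_to_sound_effects.convert(text=prompt, duration_seconds=seconds)
    with open(filename, "wb") as f:
        f.write(to_bytes(audio))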
with open("input.mp3", "rb") as f:
input_audio = f.read()
transformed = to_bytes(client.speech_to_speech.convert(
audio=input_audio,
voice_id="target_voice_id",
model_id="eleven_english_sts_v2",
))
with open("transformed.mp3", "wb") as f:
f.write(transformed)
with open("noisy.mp3", "rb") as f:
clean = to_bytes(client.audio_isolation.audio_isolation(audio=f.read()))
with open("clean.mp3", "wb") as f:
f.write(clean)
elevenlabs auth login # Interactive API key setup
elevenlabs auth whoami --no-ui # Check status
elevenlabs auth logout # Remove stored key
The CLI is focused on agent management, NOT TTS. For TTS, use the Python SDK.
| Model ID | Best For | Char Limit | Latency | Clone Support |
|---|---|---|---|---|
| eleven_multilingual_v2 | Long-form, cloned voices | 10,000 | Standard | YES |
| eleven_v3 | Dramatic, expressive (stock voices) | 5,000 | ~300ms | NO — loses identity |
| eleven_flash_v2_5 | Ultra-low latency | 40,000 | ~75ms | Untested |
| eleven_turbo_v2_5 | Quality + speed | 40,000 | ~250ms | Untested |
Using a cloned voice?
├─ Yes → eleven_multilingual_v2 (only reliable option)
└─ No → Content > 5,000 chars?
    ├─ Yes → eleven_multilingual_v2
    └─ No → Need dramatic delivery?
        ├─ Yes → eleven_v3
        └─ No → Need low latency?
            ├─ Yes → eleven_flash_v2_5
            └─ No → eleven_turbo_v2_5
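The same logic as a small helper, if the choice needs to be scripted (a sketch; the model IDs come from the table above):

def pick_model(cloned_voice: bool, char_count: int,
               dramatic: bool = False, low_latency: bool = False) -> str:
    """Mirror of the decision tree above; returns an ElevenLabs model ID."""
    if cloned_voice:
        return "eleven_multilingual_v2"  # only reliable option for clones
    if char_count > 5000:
        return "eleven_multilingual_v2"
    if dramatic:
        return "eleven_v3"
    if low_latency:
        return "eleven_flash_v2_5"
    return "eleven_turbo_v2_5"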
| Method | Works? | Notes |
|---|---|---|
| <break time="0.7s" /> | YES (v2 only) | SSML break tag, up to 3s. Use sparingly (max 5-6 per generation) |
| Paragraph breaks | YES | Natural, reliable, no cost |
| Short sentences | YES | Best method — rhythm from writing |
| ... ellipsis | NO | Causes hesitation/nervousness artifacts |
| Multiple dashes -- -- | Somewhat | Inconsistent |
| Method | Works? | Notes |
|---|---|---|
| Phonetic spelling in text | YES | Most reliable: "Eigh-bility" instead of "Aibility" |
| Pronunciation dictionary API | UNRELIABLE | Silently ignored with some model/voice combos |
| <lexeme> tags in text | NO | Read aloud as text |
| <phoneme> SSML tags | v2: NO, Flash v2: YES | Only works with specific models |
Rule: Always use phonetic spelling directly in the speech text. Don't rely on dictionaries or SSML phoneme tags.
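A minimal sketch of applying phonetic spellings before generation — the map entries come from the conversion table earlier; extend it per project:

# Written form -> phonetic spelling the TTS model pronounces correctly.
PHONETIC = {
    "Aibility": "Eigh-bility",
    "Kačka": "Kachka",
    "Žaneta": "Zhaneta",
}

def apply_phonetics(text: str) -> str:
    for written, spoken in PHONETIC.items():
        text = text.replace(written, spoken)
    return text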
| Format | Quality | Use Case |
|---|---|---|
| mp3_44100_128 | High | Default, general purpose |
| mp3_44100_192 | Highest MP3 | Archival |
| pcm_44100 | Lossless | Post-processing |
eleven_flash_v2_5 is 50% cheaper than other models.

# When using httpx directly (for request stitching):
resp = httpx.post(url, json=data, headers=headers, timeout=60)
if resp.status_code == 401:
    print("Bad API key.")
elif resp.status_code == 400 and "quota_exceeded" in resp.text:
    print("Out of credits.")
elif resp.status_code != 200:
    print(f"Error {resp.status_code}: {resp.text[:200]}")
- API key from ~/.elevenlabs/api_key or env var, never hardcoded
- language_code set for cloned or accented voices
- No ... ellipses in speech text