Three CLI commands that produce assets for compositions: `tts` (speech), `transcribe` (timestamps), and `remove-background` (transparent video). Each downloads a model on first run and caches it under `~/.cache/hyperframes/`. Drop the output into the project, then reference it from the composition HTML — see the `hyperframes` skill for the audio/video element conventions.
## Speech (`tts`)

Generate speech audio locally with Kokoro-82M. No API key needed.
npx hyperframes tts "Text here" --voice af_nova --output narration.wav
npx hyperframes tts script.txt --voice bf_emma --output narration.wav
npx hyperframes tts --list # all 54 voices
Match voice to content. Default is af_heart.
| Content type | Voice | Why |
|---|---|---|
| Product demo | af_heart/af_nova | Warm, professional |
| Tutorial / how-to | am_adam/bf_emma | Neutral, easy to follow |
| Marketing / promo | af_sky/am_michael | Energetic or authoritative |
| Documentation | bf_emma/bm_george | Clear British English, formal |
| Casual / social | af_heart/af_sky | Approachable, natural |
Voice IDs encode language in the first letter: a=American English, b=British English, e=Spanish, f=French, h=Hindi, i=Italian, j=Japanese, p=Brazilian Portuguese, z=Mandarin. The CLI auto-detects the phonemizer locale from the prefix — no --lang needed when the voice matches the text.
npx hyperframes tts "La reunión empieza a las nueve" --voice ef_dora --output es.wav
npx hyperframes tts "今日はいい天気ですね" --voice jf_alpha --output ja.wav
Use --lang only to override auto-detection (stylized accents). Valid codes: en-us, en-gb, es, fr-fr, hi, it, pt-br, ja, zh. Non-English phonemization requires espeak-ng system-wide (brew install espeak-ng / apt-get install espeak-ng).
Speech speed:

- 0.7-0.8 — tutorial, complex content, accessibility
- 1.0 — natural pace (default)
- 1.1-1.2 — intros, transitions, upbeat content
- 1.5+ — rarely appropriate; test carefully

For more than a few paragraphs, write to a .txt file and pass the path. Inputs over ~5 minutes of speech may benefit from splitting into segments.
Requires Python 3.8+ with `kokoro-onnx` and `soundfile` (`pip install kokoro-onnx soundfile`). The model downloads on first use (~311 MB + ~27 MB voices, cached in `~/.cache/hyperframes/tts/`).
## Timestamps (`transcribe`)

Produce a normalized `transcript.json` with word-level timestamps.
```bash
npx hyperframes transcribe audio.mp3
npx hyperframes transcribe video.mp4 --model small --language es
npx hyperframes transcribe subtitles.srt    # import existing
npx hyperframes transcribe subtitles.vtt
npx hyperframes transcribe openai-response.json
```
Never use .en models unless the user explicitly states the audio is English. .en models (small.en, medium.en) translate non-English audio into English instead of transcribing it. This silently destroys the original language.
- User states the audio is another language: `--model small --language <code>` (no `.en` suffix)
- User explicitly states the audio is English: `--model small.en`
- Language unknown: `--model small` (no `.en`, no `--language`) — Whisper auto-detects

Default model is `small`, not `small.en`.
| Model | Size | Speed | When to use |
|---|---|---|---|
| tiny | 75 MB | Fastest | Quick previews, testing pipeline |
| base | 142 MB | Fast | Short clips, clear audio |
| small | 466 MB | Moderate | Default — most content |
| medium | 1.5 GB | Slow | Important content, noisy audio, music |
| large-v3 | 3.1 GB | Slowest | Production quality |
Music with vocals: start at `medium` at minimum; produced tracks often need manual SRT/VTT import. For caption-quality checks (mandatory after every transcription), the cleaning JS, retry rules, and the OpenAI/Groq API import path, see `hyperframes/references/transcript-guide.md`.
Compositions consume a flat array of word objects. The id field (w0, w1, ...) is added during normalization for stable references in caption overrides; it's optional for backwards compatibility.
```json
[
  { "id": "w0", "text": "Hello", "start": 0.0, "end": 0.5 },
  { "id": "w1", "text": "world.", "start": 0.6, "end": 1.2 }
]
```
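To illustrate how the flat array drives timing (a minimal sketch, not the hyperframes caption convention), a few lines of plain JavaScript can group the words into caption lines, each spanning from its first word's start to its last word's end:

```js
// Minimal sketch, not the hyperframes caption API: chunk transcript words
// into caption lines and keep each line's time span from its words.
const fs = require("node:fs");

const words = JSON.parse(fs.readFileSync("transcript.json", "utf8"));

function toCaptionLines(words, maxWords = 6) {
  const lines = [];
  for (let i = 0; i < words.length; i += maxWords) {
    const chunk = words.slice(i, i + maxWords);
    lines.push({
      text: chunk.map((w) => w.text).join(" "),
      start: chunk[0].start,               // line appears with its first word
      end: chunk[chunk.length - 1].end,    // and disappears after its last
    });
  }
  return lines;
}

console.log(toCaptionLines(words, 6));
```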
## Transparent video (`remove-background`)

Remove the background from a video or image so the subject (typically a person — avatar, presenter, talking head) sits as a transparent overlay in a composition.
```bash
npx hyperframes remove-background subject.mp4 -o transparent.webm   # default: VP9 alpha WebM
npx hyperframes remove-background subject.mp4 -o transparent.mov    # ProRes 4444 (editing)
npx hyperframes remove-background portrait.jpg -o cutout.png        # single-image cutout
npx hyperframes remove-background subject.mp4 -o transparent.webm --device cpu
npx hyperframes remove-background --info                            # detected providers
```
Uses u2net_human_seg (MIT). First run downloads ~168 MB of weights to ~/.cache/hyperframes/background-removal/models/.
| Format | When |
|---|---|
| `.webm` (VP9 + alpha) | Default. Compositions play this directly via `<video>`. |
| `.mov` (ProRes 4444) | Editing in DaVinci/Premiere/FCP. Large files. |
| `.png` | Single-image cutout (still subject, layered over a backdrop). |
Chrome decodes VP9 alpha natively, so the `.webm` plugs into a composition like any other muted-autoplay video — see the `hyperframes` skill for the `<video>` track conventions.
--quality fast|balanced|best controls only the VP9 encoder's CRF — segmentation quality is fixed.
| Preset | CRF | When |
|---|---|---|
| fast | 30 | Iterating, smaller file, looser color match |
| balanced | 18 | Default. Visually identical for most uses |
| best | 12 | Master / final delivery. Largest file, tightest match |
The cutout webm is a re-encoded copy of the source mp4's RGB. That choice has consequences depending on what you put behind it:
| Pattern | What's behind the cutout | Result |
|---|---|---|
| Cutout over a different scene (most common) | Static image, gradient, or unrelated video | Looks great. The cutout's RGB is the only source of the subject — no doubling, no edge halo. This is what remove-background is built for. |
| Cutout over its own source mp4 (text-behind-subject) | Same mp4 the cutout was generated from | Two RGB sources for the same person. At default --quality balanced (crf 18) the doubling is barely visible; at --quality fast (crf 30) you'll see a faint color shift / edge halo. Use --quality best (crf 12) for masters. |
| Cutout over a different take of the same person | Footage of the same subject | Will look like two separate people overlapping. Don't do this. |
Text-behind-subject (headline behind a presenter):
```html
<video
  src="presenter.mp4"
  id="bg"
  data-start="0"
  data-duration="6"
  data-track-index="0"
  muted
  playsinline
></video>
<h1 id="headline" style="z-index:2; ...">MAKE IT IN HYPERFRAMES</h1>
<div class="cutout-wrap" style="position:absolute;inset:0;z-index:3;opacity:0">
  <video
    src="presenter.webm"
    data-start="0"
    data-duration="6"
    data-track-index="1"
    muted
    playsinline
  ></video>
</div>
```
Two key rules:
1. Wrap the cutout video in a plain `<div>` and animate the wrapper's opacity, not the video element's. The framework forces `opacity:1` on active clips (any element with `data-start`/`data-duration`), so animating the video's opacity directly is silently overridden. The wrapper has no `data-*` attributes, so it's owned by your CSS/GSAP.
2. Give both videos `data-start="0"` and `data-media-start="0"` so the framework decodes them in sync from t=0. Late-mounting the cutout (`data-start=3.3`) introduces a seek + warm-up that lands a frame off the base mp4 — visible as one frame of misalignment at the cut.

Then GSAP-flip the wrapper opacity at the cut: `tl.set(cutoutWrap, { opacity: 1 }, 3.3)`.
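A minimal sketch of that flip, assuming GSAP is already loaded by the composition and the markup from the example above (the 3.3 s cut time is illustrative):

```js
// Minimal sketch: flip the cutout wrapper on at the cut.
const cutoutWrap = document.querySelector(".cutout-wrap");

const tl = gsap.timeline();
// Target the wrapper, never the <video> itself: the framework forces
// opacity:1 on any element carrying data-start/data-duration.
tl.set(cutoutWrap, { opacity: 1 }, 3.3);
```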
When there's no pre-recorded voiceover, generate one and transcribe it back to get word-level timestamps for captions:
```bash
npx hyperframes tts script.txt --voice af_heart --output narration.wav
npx hyperframes transcribe narration.wav   # → transcript.json
```
Whisper extracts precise word boundaries from the generated audio, so caption timing matches delivery without hand-tuning.
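If you script this step from Node rather than running the commands by hand, a minimal sketch (file names are illustrative) is to shell out to the same two commands and read back the normalized transcript:

```js
// Minimal sketch: drive the two documented CLI commands from Node and
// load the resulting transcript.json. File names are illustrative.
const { execSync } = require("node:child_process");
const fs = require("node:fs");

execSync('npx hyperframes tts script.txt --voice af_heart --output narration.wav', { stdio: "inherit" });
execSync("npx hyperframes transcribe narration.wav", { stdio: "inherit" });

const words = JSON.parse(fs.readFileSync("transcript.json", "utf8"));
console.log(`${words.length} words; narration ends at ${words.at(-1).end}s`);
```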