Three CLI commands that produce assets for compositions: `tts` (speech), `transcribe` (timestamps), and `remove-background` (transparent video). Each downloads a model on first run and caches it under `~/.cache/hyperframes/`. Drop the output into the project, then reference it from the composition HTML — see the `hyperframes` skill for the audio/video element conventions.
## Speech (`tts`)

Generate speech audio locally with Kokoro-82M. No API key needed.
npx hyperframes tts "Text here" --voice af_nova --output narration.wav
npx hyperframes tts script.txt --voice bf_emma --output narration.wav
npx hyperframes tts --list # all 54 voices
Match voice to content. Default is af_heart.
| Content type | Voice | Why |
|---|---|---|
| Product demo | af_heart/af_nova | Warm, professional |
| Tutorial / how-to | am_adam/bf_emma | Neutral, easy to follow |
| Marketing / promo | af_sky/am_michael | Energetic or authoritative |
| Documentation | bf_emma/bm_george | Clear British English, formal |
| Casual / social | af_heart/af_sky | Approachable, natural |
Voice IDs encode language in the first letter: a=American English, b=British English, e=Spanish, f=French, h=Hindi, i=Italian, j=Japanese, p=Brazilian Portuguese, z=Mandarin. The CLI auto-detects the phonemizer locale from the prefix — no --lang needed when the voice matches the text.
npx hyperframes tts "La reunión empieza a las nueve" --voice ef_dora --output es.wav
npx hyperframes tts "今日はいい天気ですね" --voice jf_alpha --output ja.wav
Use --lang only to override auto-detection (stylized accents). Valid codes: en-us, en-gb, es, fr-fr, hi, it, pt-br, ja, zh. Non-English phonemization requires espeak-ng system-wide (brew install espeak-ng / apt-get install espeak-ng).
Speech speed:

- 0.7-0.8 — tutorial, complex content, accessibility
- 1.0 — natural pace (default)
- 1.1-1.2 — intros, transitions, upbeat content
- 1.5+ — rarely appropriate; test carefully

For more than a few paragraphs, write to a .txt file and pass the path. Inputs over ~5 minutes of speech may benefit from splitting into segments.
Requires Python 3.8+ with `kokoro-onnx` and `soundfile` (`pip install kokoro-onnx soundfile`). The model downloads on first use (~311 MB + ~27 MB voices, cached in `~/.cache/hyperframes/tts/`).
## Timestamps (`transcribe`)

Produce a normalized `transcript.json` with word-level timestamps.
```bash
npx hyperframes transcribe audio.mp3
npx hyperframes transcribe video.mp4 --model small --language es
npx hyperframes transcribe subtitles.srt    # import existing
npx hyperframes transcribe subtitles.vtt
npx hyperframes transcribe openai-response.json
```
Never use .en models unless the user explicitly states the audio is English. .en models (small.en, medium.en) translate non-English audio into English instead of transcribing it. This silently destroys the original language.
- User states the audio is another language: `--model small --language <code>` (no `.en` suffix)
- User explicitly states the audio is English: `--model small.en`
- Language unknown: `--model small` (no `.en`, no `--language`) — Whisper auto-detects

Default model is `small`, not `small.en`.
| Model | Size | Speed | When to use |
|---|---|---|---|
| tiny | 75 MB | Fastest | Quick previews, testing pipeline |
| base | 142 MB | Fast | Short clips, clear audio |
| small | 466 MB | Moderate | Default — most content |
| medium | 1.5 GB | Slow | Important content, noisy audio, music |
| large-v3 | 3.1 GB | Slowest | Production quality |
Music with vocals: start at `medium` at minimum; produced tracks often need manual SRT/VTT import. For caption-quality checks (mandatory after every transcription), the cleaning JS, retry rules, and the OpenAI/Groq API import path, see `hyperframes/references/transcript-guide.md`.
Compositions consume a flat array of word objects. The id field (w0, w1, ...) is added during normalization for stable references in caption overrides; it's optional for backwards compatibility.
```json
[
  { "id": "w0", "text": "Hello", "start": 0.0, "end": 0.5 },
  { "id": "w1", "text": "world.", "start": 0.6, "end": 1.2 }
]
```
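To illustrate how the flat array drives timing (a minimal sketch, not the hyperframes caption convention), a few lines of plain JavaScript can group the words into caption lines, each spanning from its first word's start to its last word's end:

```js
// Minimal sketch, not the hyperframes caption API: chunk transcript words
// into caption lines and keep each line's time span from its words.
const fs = require("node:fs");

const words = JSON.parse(fs.readFileSync("transcript.json", "utf8"));

function toCaptionLines(words, maxWords = 6) {
  const lines = [];
  for (let i = 0; i < words.length; i += maxWords) {
    const chunk = words.slice(i, i + maxWords);
    lines.push({
      text: chunk.map((w) => w.text).join(" "),
      start: chunk[0].start,               // line appears with its first word
      end: chunk[chunk.length - 1].end,    // and disappears after its last
    });
  }
  return lines;
}

console.log(toCaptionLines(words, 6));
```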
## Transparent video (`remove-background`)

Remove the background from a video or image so the subject (typically a person — avatar, presenter, talking head) sits as a transparent overlay in a composition.
```bash
npx hyperframes remove-background subject.mp4 -o transparent.webm   # default: VP9 alpha WebM
npx hyperframes remove-background subject.mp4 -o transparent.mov    # ProRes 4444 (editing)
npx hyperframes remove-background portrait.jpg -o cutout.png        # single-image cutout
npx hyperframes remove-background subject.mp4 -o transparent.webm --device cpu
npx hyperframes remove-background --info                            # detected providers
```
Uses u2net_human_seg (MIT). First run downloads ~168 MB of weights to ~/.cache/hyperframes/background-removal/models/.
| Format | When |
|---|---|
| `.webm` (VP9 + alpha) | Default. Compositions play this directly via `<video>`. |
| `.mov` (ProRes 4444) | Editing in DaVinci/Premiere/FCP. Large files. |
| `.png` | Single-image cutout (still subject, layered over a backdrop). |
Chrome decodes VP9 alpha natively, so the `.webm` plugs into a composition like any other muted-autoplay video — see the `hyperframes` skill for the `<video>` track conventions.
--quality fast|balanced|best controls only the VP9 encoder's CRF — segmentation quality is fixed.
| Preset | CRF | When |
|---|---|---|
| fast | 30 | Iterating, smaller file, looser color match |
| balanced | 18 | Default. Visually identical for most uses |
| best | 12 | Master / final delivery. Largest file, tightest match |
The cutout webm is a re-encoded copy of the source mp4's RGB. That choice has consequences depending on what you put behind it:
| Pattern | What's behind the cutout | Result |
|---|---|---|
| Cutout over a different scene (most common) | Static image, gradient, or unrelated video | Looks great. The cutout's RGB is the only source of the subject — no doubling, no edge halo. This is what remove-background is built for. |
| Cutout over its own source mp4 (text-behind-subject) | Same mp4 the cutout was generated from | Two RGB sources for the same person. At default --quality balanced (crf 18) the doubling is barely visible; at --quality fast (crf 30) you'll see a faint color shift / edge halo. Use --quality best (crf 12) for masters. |
| Cutout over a different take of the same person | Footage of the same subject | Will look like two separate people overlapping. Don't do this. |
Text-behind-subject (headline behind a presenter):
```html
<video
  src="presenter.mp4"
  id="bg"
  data-start="0"
  data-duration="6"
  data-track-index="0"
  muted
  playsinline
></video>
<h1 id="headline" style="z-index:2; ...">MAKE IT IN HYPERFRAMES</h1>
<div class="cutout-wrap" style="position:absolute;inset:0;z-index:3;opacity:0">
  <video
    src="presenter.webm"
    data-start="0"
    data-duration="6"
    data-track-index="1"
    muted
    playsinline
  ></video>
</div>
```
Two key rules:
1. Wrap the cutout video in a plain `<div>` and animate the wrapper's opacity, not the video element's. The framework forces `opacity:1` on active clips (any element with `data-start`/`data-duration`), so animating the video's opacity directly is silently overridden. The wrapper has no `data-*` attributes, so it's owned by your CSS/GSAP.
2. Give both videos `data-start="0"` and `data-media-start="0"` so the framework decodes them in sync from t=0. Late-mounting the cutout (`data-start=3.3`) introduces a seek + warm-up that lands a frame off the base mp4 — visible as one frame of misalignment at the cut.

Then GSAP-flip the wrapper opacity at the cut: `tl.set(cutoutWrap, { opacity: 1 }, 3.3)`.
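A minimal sketch of that flip, assuming GSAP is already loaded by the composition and the markup from the example above (the 3.3 s cut time is illustrative):

```js
// Minimal sketch: flip the cutout wrapper on at the cut.
const cutoutWrap = document.querySelector(".cutout-wrap");

const tl = gsap.timeline();
// Target the wrapper, never the <video> itself: the framework forces
// opacity:1 on any element carrying data-start/data-duration.
tl.set(cutoutWrap, { opacity: 1 }, 3.3);
```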
When there's no pre-recorded voiceover, generate one and transcribe it back to get word-level timestamps for captions:
```bash
npx hyperframes tts script.txt --voice af_heart --output narration.wav
npx hyperframes transcribe narration.wav   # → transcript.json
```
Whisper extracts precise word boundaries from the generated audio, so caption timing matches delivery without hand-tuning.
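If you script this step from Node rather than running the commands by hand, a minimal sketch (file names are illustrative) is to shell out to the same two commands and read back the normalized transcript:

```js
// Minimal sketch: drive the two documented CLI commands from Node and
// load the resulting transcript.json. File names are illustrative.
const { execSync } = require("node:child_process");
const fs = require("node:fs");

execSync('npx hyperframes tts script.txt --voice af_heart --output narration.wav', { stdio: "inherit" });
execSync("npx hyperframes transcribe narration.wav", { stdio: "inherit" });

const words = JSON.parse(fs.readFileSync("transcript.json", "utf8"));
console.log(`${words.length} words; narration ends at ${words.at(-1).end}s`);
```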