Help us improve
Share bugs, ideas, or general feedback.
From lattifai-skills
Generate viral-ready karaoke / lyric-video subtitles with per-word highlighting in one command. Adaptive font size from media resolution. Ready-made style presets for TikTok, Reels, Shorts, 抖音, 小红书, YouTube, and cinematic long-form. Trigger on "karaoke 字幕", "卡拉OK", "lyric video", "逐词歌词", "按字高亮", "TikTok 字幕", "抖音字幕", "Reels caption", "Shorts caption", "小红书歌词", "viral subtitles", "爆款字幕", or when a song / podcast / video clip is given and per-word highlighted captions are asked for. For general caption format conversion without karaoke styling, use /lai-caption.
npx claudepluginhub lattifai/lattifai-skills --plugin lattifai-skillsHow this skill is triggered — by the user, by Claude, or both
Slash command
/lattifai-skills:lai-karaokeThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
One command → per-word highlighted ASS ready for TikTok, Reels, Shorts, 抖音, 小红书, or cinematic YouTube. Adaptive font size keeps captions sharp across 1080p, 4K, portrait, square, and cinematic aspect ratios without manual tuning.
Generates CapCut-style animated word-level captions for viral videos using FFmpeg and ASS subtitles from Whisper word timestamps. For TikTok, YouTube Shorts, Instagram Reels.
Convert between 30+ caption/subtitle formats (SRT, VTT, ASS, JSON, TextGrid, LRC, FCPXML, Premiere, …) and shift timing. Trigger on "convert captions", "SRT to VTT", "转换字幕格式", "shift timing", "ASS styling", "karaoke effect", "导入Premiere", or any caption-format question. Do NOT trigger to fix timing accuracy (`/lai-align`) or translate (`/lai-translate`).
Generates AI music, songs with vocals (Suno/Udio), and instrumentals (Google Lyria). Guides API selection, prompt crafting, and workflow for genres, moods, BPM.
Share bugs, ideas, or general feedback.
One command → per-word highlighted ASS ready for TikTok, Reels, Shorts, 抖音, 小红书, or cinematic YouTube. Adaptive font size keeps captions sharp across 1080p, 4K, portrait, square, and cinematic aspect ratios without manual tuning.
🚫 Karaoke prerequisite — language must match the audio. Karaoke is an audio-sync visual: the highlight follows the actual word being spoken. So the caption text must be in the same language as the audio. If the user wants a different language on screen (e.g. English audio + Chinese captions), DO NOT make a karaoke in the target language — the timestamps were aligned to source-language words and the highlight will desync nonsensically. Instead:
- Bilingual (most common) → keep source-language karaoke on top, add a static translated reading line below. See
/lai-translate§Bilingual Delivery Guide → Recipe R3.- Dubbing (rare) → regenerate target-language audio, then re-align and karaoke against the new audio.
Quick rule:
audio_lang != caption_lang⇒ bilingual, never single-line karaoke.
Invoke through your agent (in Claude Code: /lai-karaoke <input>; in Codex CLI / Gemini CLI: ask the agent to run lai-karaoke and pass the input). The agent then automates the full chain:
/lai-karaoke https://youtu.be/VIDEO_ID
/lai-karaoke /path/to/video.mp4
/lai-karaoke /path/to/aligned.json # when alignment already exists
Default behavior (agent executes):
/lai-youtube with caption.input.split_sentence=false caption.render.word_level=true (per-word alignment, MP4 for resolution probe; flip to split_sentence=true if auto-captions break mid-sentence)scripts/probe_media.py reads the downloaded mp4 and computes font_size, play_res_x, play_res_ytiktok, landscape → cinematic, square → minimal) — the agent announces the choice and accepts an overridelaicap-convert writes <base>.karaoke.<preset>.ass in the current directory (<base> = video ID for YouTube input, media stem for local files)/lai-translate), custom tuning (/lai-caption), or speaker-colored (/lai-diarize)Each preset = karaoke_effect × karaoke_color_scheme × kinetic_style, battle-tested for a specific distribution channel.
The highlight direction column tells you which color the un-sung text fades to when the highlight passes — this is the visual-anchor color the viewer's eye latches onto. KTV convention: highlight = saturated (gold/red/cyan), un-sung = neutral (white). cinematic intentionally inverts this for a "fading out" feel; if you want canonical KTV direction, see §Custom color override below.
| Preset | Effect | Color scheme | Kinetic | Highlight direction (un-sung → sung) | Built for |
|---|---|---|---|---|---|
tiktok | sweep | sunset-warm | pop | white → orange-pink | TikTok / Reels / Shorts — warm punchy pop |
douyin | instant | navy-pink | shake | navy → pink | 抖音潮流感、快切节奏 |
cinematic | sweep | prussian-elegant | fade | gold → white (inverted; KTV-canonical wants white → gold) | YouTube 长视频 / 播客 / 电影感 |
neon | outline | mint-ocean | neon | mint → ocean blue | 赛博 / 电子乐 / game clip |
minimal | instant | azure-gold | (none) | azure → gold | 商务 / 教程 / 播客精华 |
Ask in plain language:
"用 tiktok 风格做个 karaoke" / "cinematic 风" / "给我个极简商务款"
The agent maps your phrasing to the preset table above (or asks if ambiguous).
For fully custom tuning (12 color schemes × 15 kinetic styles × 3 effects = 540 combos), see /lai-caption §Karaoke.
The agent should drive this skill — do not require a wrapper CLI. Follow this decision tree:
| User provides | Next |
|---|---|
| YouTube URL | /lai-youtube align with caption.input.split_sentence=false caption.render.word_level=true, MP4 output |
| media + transcript (SRT/VTT/JSON) | /lai-align with caption.render.word_level=true, JSON output |
| aligned.json only | Ask for original media path (needed to probe resolution) |
| diarized.json | Use it directly; plan on ass.speaker_color=auto or a CSV |
bilingual JSON (text + translation) | Use it directly; plan on ass.translation_color=... (renders source-language karaoke + static translation line) |
translated-only JSON (e.g. English audio aligned, then translated to Chinese — text already replaced with the target language) | STOP — language guard violated. The word timestamps belong to the source language; rendering them as karaoke in the target language will desync. Re-route to /lai-translate Recipe R3 (bilingual reading line) instead. |
Confirm with the user before starting if their intent is ambiguous (e.g., they only said "做个 karaoke" without specifying input).
python <skill>/scripts/probe_media.py <media_path> [--target-platform tiktok|...]
Returns JSON: {width, height, aspect, font_size, play_res_x, play_res_y, source}. source is probe (from ffprobe) or platform-default:<name> (audio-only fallback).
If the input is audio-only (mp3/wav/m4a, width ≈ height = 0), ask the user for the target platform and pass --target-platform to drive the fallback.
Auto-select if the user didn't specify one:
aspect == "portrait" → tiktokaspect == "landscape" → cinematicaspect == "square" → minimalAlways tell the user which preset you picked; they can override.
Map preset → ASS config, then call laicap-convert:
# Example: preset=tiktok, probe={font_size:86, play_res_x:1080, play_res_y:1920}
# render.word_level is default word-scope — karaoke_effect alone triggers per-word \k.
# Only pass render.word_level=false when you want line-scope (single override per line).
laicap-convert --direct -Y \
la0CaZ2R8EY.aligned.json la0CaZ2R8EY.karaoke.tiktok.ass \
render.include_speaker_in_text=true \
ass.karaoke_effect=sweep \
ass.karaoke_color_scheme=sunset-warm \
ass.kinetic_style=pop \
ass.font_size=86 \
ass.play_res_x=1080 \
ass.play_res_y=1920 \
standardization.start_margin=0.05 \
standardization.end_margin=0.15
Input JSON must carry words arrays (upstream /lai-align or /lai-youtube with caption.render.word_level=true); without that the effect falls back silently to line-scope.
Speaker-aware variant (when input is diarized.json):
ass.speaker_color="#658AE4,#FFC209,#F7C3D9,#CC5D84" # per-speaker palette
Bilingual variant (input has translation populated — see /lai-translate §Bilingual Delivery Guide for format selection, typography rules, and dual-track vs dual-line trade-offs):
ass.translation_color="#00FFFF" # accent on the translation line
To swap any preset to KTV-canonical direction (un-sung neutral → sung accent), do NOT pass ass.karaoke_color_scheme=... because the scheme overrides individual color fields (see lattifai/caption/config.py:apply_karaoke_color_scheme). Pass colors directly:
laicap-convert --direct -Y la0CaZ2R8EY.aligned.json la0CaZ2R8EY.karaoke.swapped.ass \
ass.karaoke_effect=sweep \
ass.kinetic_style=fade \
ass.primary_color="#FFD700" \ # sung = gold (visual anchor)
ass.secondary_color="#FFFFFF" \ # un-sung = neutral white
ass.outline_color="#1A2B4D" \ # keep prussian-elegant outline
ass.font_size=151 \
ass.play_res_x=3840 ass.play_res_y=2160
Precedence rule (commit to memory): ass.karaoke_color_scheme overrides ass.primary_color / ass.secondary_color / ass.outline_color. Setting both is a footgun — pick one path.
After successful render, always surface one of:
/lai-translate on aligned.json, then re-render."/lai-caption §Karaoke — 12 color schemes × 15 kinetic styles available."/lai-diarize first, then re-render with ass.speaker_color=auto."Font size is picked from the shorter edge of the media, tuned per aspect:
| Aspect | Formula | Example input | Recommended font_size |
|---|---|---|---|
| portrait (w/h < 0.85) | w × 0.08 | 1080×1920 (TikTok) | 86 |
| square (0.85–1.4) | h × 0.075 | 1080×1080 (IG feed) | 81 |
| landscape (w/h > 1.4) | h × 0.07 | 1920×1080 (YT) | 76 |
| 4K landscape | h × 0.07 | 3840×2160 | 151 |
| 2.35:1 cinematic | h × 0.07 | 3840×1634 | 114 |
Also set ass.play_res_x / ass.play_res_y to the actual media width/height so ASS renderers scale all metrics (outline width, shadow depth, margins) correctly for the downstream player. probe_media.py emits these values directly.
Override knobs for the agent:
font_size by 1.2–1.4ass.margin_l / ass.margin_rDefault output path: <base>.karaoke[.<style>].ass in the current directory (<base> = video ID or media stem). Multi-style exports coexist, e.g.:
la0CaZ2R8EY.mp4
la0CaZ2R8EY.karaoke.tiktok.ass
la0CaZ2R8EY.karaoke.cinematic.ass
For the exhaustive matrix (3 × 12 × 15 × 2 = 1080 combinations) used in visual tuning, see /lai-caption §Karaoke — this skill intentionally ships only 5 curated presets to keep the decision light.
| Problem | Fix |
|---|---|
ffprobe: command not found | brew install ffmpeg (macOS) or your distro equivalent |
| Audio-only input, no resolution | Pass --target-platform tiktok|youtube|... for a fallback |
| Karaoke renders without per-word highlighting | Input JSON has no words arrays — re-run /lai-align with caption.render.word_level=true |
| Text too small on 4K YouTube | Media probed correctly? Confirm font_size >= 140 for 2160p landscape |
| Text overflows on portrait | Reduce font_size by 10% or set ass.wrap_style=2 (smart wrap) |
Highlight color looks wrong (e.g. cinematic shows gold→white instead of white→gold) | Use the Custom color override recipe above — pass ass.primary_color / ass.secondary_color directly without ass.karaoke_color_scheme (scheme overrides single fields) |
| Subtitles in different language than the audio | You're trying to karaoke a translation. Stop and switch to /lai-translate Recipe R3 (bilingual reading line). Karaoke requires audio_lang == caption_lang. |
/lai-youtube — URL → aligned captions (what this skill calls under the hood for URL input)/lai-align — local media + transcript → aligned JSON/lai-caption — full 540-combo karaoke matrix + broadcast subtitle profiles/lai-translate — produce bilingual input for multilingual lyric videos/lai-diarize — per-speaker colored karaoke