From lattifai-skills
Transcribe audio/video to timestamped captions with Gemini (100+ languages) or local Parakeet / SenseVoice models. Trigger on "transcribe", "speech to text", "转录", "语音转文字", "generate captions from audio", or when the user provides an audio/video file with no text. If the YouTube video already has captions, prefer `/lai-youtube`.
npx claudepluginhub lattifai/lattifai-skills --plugin lattifai-skills
How this skill is triggered: by the user, by Claude, or both
Slash command
/lattifai-skills:lai-transcribe
This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Generates timestamped text from audio/video. Default is Gemini (fast, broad language coverage); local models run offline on GPU.
Gemini needs an API key (free at https://aistudio.google.com/apikey):
lai config set GEMINI_API_KEY <your-key>
Pick a `<base>` (the media file's stem or the YouTube ID) and reuse it for the rest of the pipeline; outputs land in the current directory:
# <base> = podcast (from podcast.mp3)
lai transcribe run podcast.mp3 podcast.transcript.json
# shortcut:
lai-transcribe podcast.mp3 podcast.transcript.json
Gemini accepts YouTube URLs directly — no download needed:
# <base> = la0CaZ2R8EY (the YouTube video ID)
lai transcribe run "https://youtu.be/la0CaZ2R8EY" la0CaZ2R8EY.transcript.json
Output naming: prefer `<base>.transcript.json` so the result pipes cleanly into `/lai-align` (which writes `<base>.aligned.json`). Use `<base>.srt` or another caption extension when the transcript itself is the final deliverable and no alignment step follows.
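The naming convention above can be sketched as a small shell helper. This is only an illustration: `base_of`, `transcript_name`, and `aligned_name` are hypothetical names and not part of the `lai` CLI, and only the `youtu.be/<id>` URL shape is handled.

```shell
# Derive <base> from a local media path or a youtu.be short link.
# Assumption: other YouTube URL forms are not handled by this sketch.
base_of() {
  src="$1"
  case "$src" in
    https://youtu.be/*)
      id="${src#https://youtu.be/}"   # drop the scheme and host
      printf '%s\n' "${id%%\?*}"      # drop any ?query suffix
      ;;
    *)
      name="${src##*/}"               # drop leading directories
      printf '%s\n' "${name%.*}"      # drop the last extension
      ;;
  esac
}

# Build the output names described in the naming convention.
transcript_name() { printf '%s.transcript.json\n' "$(base_of "$1")"; }
aligned_name()    { printf '%s.aligned.json\n' "$(base_of "$1")"; }

transcript_name podcast.mp3
transcript_name "https://youtu.be/la0CaZ2R8EY"
```

The printed names can then be passed as the second argument to `lai transcribe run`, so every later step in the pipeline shares the same `<base>`.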
| Model | Languages | Requires |
|---|---|---|
| `gemini-3-flash-preview` (default) | 100+ | Gemini API key |
| `gemini-3.1-pro-preview` | 100+, highest quality | Gemini API key |
| `nvidia/parakeet-tdt-0.6b-v3` | 24, offline | GPU + `nemo_toolkit` |
| `FunAudioLLM/SenseVoiceSmall` | zh / en / ja / ko / Cantonese, offline | GPU |
Switch model:
lai transcribe run audio.mp4 output.srt transcription.model_name=gemini-3.1-pro-preview
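When experimenting with overrides, it can help to preview the full command before running it. The helper below is a dry-run sketch: `lai_cmd` is a hypothetical name, it only prints the command, and it assumes that `key=value` overrides are appended after the output path as in the example above.

```shell
# Dry-run helper: prints the lai command instead of running it,
# so it works even where lai is not installed.
# Note: arguments are printed as-is, without shell quoting.
lai_cmd() {
  input="$1"; output="$2"; shift 2
  printf 'lai transcribe run %s %s' "$input" "$output"
  for override in "$@"; do
    printf ' %s' "$override"          # append each key=value override
  done
  printf '\n'
}

lai_cmd audio.mp4 output.srt transcription.model_name=gemini-3.1-pro-preview
```

Once the printed line looks right, copy it into the terminal to run the real transcription.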
- `transcription.language=zh`: force the language (otherwise auto-detected)
- `media.streaming_chunk_secs=300`: chunk long audio
- Output formats: `.srt` / `.vtt` / `.ass` / `.json` / `.txt`. Use `.json` when you plan to follow up with `/lai-align`.

| Problem | Fix |
|---|---|
| `GEMINI_API_KEY` not set | `lai config set GEMINI_API_KEY <your-key>` |
| Upload timeout / file >2 GB | Split the audio or switch to a local model |
| Wrong language detected | Force with `transcription.language=en` |
| Timestamps are coarse | Follow up with `/lai-align` |
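The file-size row can be checked before uploading. The following is a minimal sketch, assuming "2 GB" means 2 × 1024³ bytes (the exact cutoff is not stated here); `upload_check` is a hypothetical helper, not a `lai` command.

```shell
# Assumption: the upload limit is 2 * 1024^3 bytes.
LIMIT_BYTES=$((2 * 1024 * 1024 * 1024))

# Prints "split" when the file is larger than the limit, "ok" otherwise.
# An optional second argument overrides the limit (in bytes).
upload_check() {
  file="$1"
  limit="${2:-$LIMIT_BYTES}"
  size=$(wc -c < "$file" | tr -d ' ')   # file size in bytes
  if [ "$size" -gt "$limit" ]; then
    echo "split"
  else
    echo "ok"
  fi
}
```

A result of `split` suggests splitting the audio or switching to one of the local models before calling `lai transcribe run`.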
- `/lai-align`: sharpen timestamps after transcription
- `/lai-diarize`: add speaker labels
- `/lai-translate`: translate the transcript
- `/lai-youtube`: YouTube end-to-end (download + caption + align)
- `/lai-caption`: convert output format