From lattifai-skills
Identify speakers ("who said what") in aligned captions via pyannote.audio. Real speaker names come from the agent's own reasoning over transcript + context (default), with a CLI-LLM fallback for headless runs. Trigger on multi-speaker content (podcasts, interviews, meetings) or phrases like "diarize", "speaker detection", "说话人识别", "区分说话人", "label the speakers". Requires aligned captions — run `/lai-align` first.
npx claudepluginhub lattifai/lattifai-skills --plugin lattifai-skills
Slash command: /lattifai-skills:lai-diarize
> **Preferred model: Claude Sonnet** (cost-efficient for agent-driven naming). This skill runs on whatever model is active in the parent session — any Claude model works; no hard switch. Sonnet has no 1M-context variant, so if the parent session is Opus[1M], continuing on Opus is normal (avoids a no-op model swap).
Adds speaker labels to aligned captions. Speaker detection (who speaks when) is always CLI-based via pyannote.audio; speaker naming (who each one is) is agent-driven by default.
<base> = source media stem (e.g. podcast from podcast.mp3) or YouTube ID. Files all land in the current directory:
lai diarize run podcast.mp3 podcast.aligned.json podcast.diarized.json
# shortcut:
lai-diarize podcast.mp3 podcast.aligned.json podcast.diarized.json
Output labels detected speakers as SPEAKER_00, SPEAKER_01, …
Speaker count is auto-detected. Override only when auto-detection is clearly wrong:
- diarization.num_speakers=3: exact count (when known)
- diarization.min_speakers=N / diarization.max_speakers=N: bound the search

After the basic command finishes, the agent reads the diarized output (the file you wrote with output_caption=…) together with any available context, and writes the named result. You may write the named version back into the same path (in-place edit) or to a separate path, depending on your project's convention.
Two-file convention (preferred when state matters, e.g. CI pipelines): emit the acoustic-only output as diarized.raw.json and let the agent write the named result to diarized.json. This keeps "acoustic segmentation done" distinct from "named, ready to publish," and lets downstream stages hard-fail when the agent hasn't run yet. The ai-podcast-pipeline repo follows this convention (see its CLAUDE.md).
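The hard-fail behaviour of the two-file convention could be enforced by a small guard in a downstream stage. This is a minimal sketch, not part of the lai CLI: the function name `require_named` and the `workdir` layout are hypothetical, and only the two file names come from the convention above.

```python
from pathlib import Path


def require_named(workdir, base="diarized"):
    """Return the named diarization file, or fail loudly if naming has not run.

    Hypothetical helper: assumes <base>.raw.json (acoustic-only) and
    <base>.json (named) sit side by side in `workdir`.
    """
    raw = Path(workdir) / f"{base}.raw.json"
    named = Path(workdir) / f"{base}.json"
    if not named.exists():
        if raw.exists():
            stage = "acoustic diarization done, naming pending"
        else:
            stage = "diarization has not run"
        raise FileNotFoundError(f"{named} missing: {stage}")
    return named
```

Calling `require_named(".")` at the top of a publish step makes the distinction between the two states explicit in the error message.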
Signals the agent uses:
- meta.md beside the source (YAML frontmatter, format below)
- Speaker labels already present in the captions ([Alice], >> Bob:, SPEAKER_01:), preserved by the CLI and matched by the agent
- Speaker-change markers in supervision.custom (see Forward Search below)

Process:
1. Read diarized.json: collect unique SPEAKER_XX ids and sample 3–5 segments per speaker
2. Map each SPEAKER_XX → real name with a confidence note. If unsure, keep SPEAKER_XX rather than guessing
3. Rewrite only the speaker field across all segments; do not touch text, start, end, or segment order

Forward Search for >> / speaker-change markers
VTT and SRT broadcast captions encode speaker turns with markers like >>
(usually escaped as &gt;&gt; in raw VTT), <v Speaker>, [Speaker], or all-caps
lead-ins. LattifAI preserves whatever marker it found in supervision.custom:
"custom": {
"original_speaker": ">>",
"speaker_change": true
}
Key insight: >> alone (no trailing name) is still a strong signal — the
captioner asserts a new speaker starts here. When the resolved speaker for
such a segment is still SPEAKER_XX / Unknown / empty (typically a 1–3 segment
"ghost tier" that pyannote couldn't merge into a main cluster), don't leave it
unnamed. Run forward search:
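1. Start with the segment's SPEAKER_XX neighbors and continue
   onwards through later segments after >> boundaries.
2. Look for a self-introduction or name mention (match it against meta.md affiliation fields)
3. Look for a topic-ownership cue that matches meta.md (e.g. "in my RNA work…" → host with affiliation: "Atomic AI")
4. Assign the matched name to the >> segment.
5. If no anchor turns up within the speaker turn (before the next >> or end of file), keep the segment as SPEAKER_XX rather than guessing.

Dominant-neighbor merge (when >> is absent): tiers with ≤3 segments and no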
speaker-change marker are usually pyannote boundary artifacts. If such a
segment is sandwiched between two segments of the same real speaker, attribute
it to that speaker — short interjections ("Yes.", "Yeah.", "Right.") don't
carry identity, and the acoustic edge is more likely segmentation noise than a
third party.
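The dominant-neighbor merge can be sketched in a few lines. This is an illustration under stated assumptions, not the CLI's implementation: segments are taken to be a list of dicts with a `speaker` field and an optional `custom` dict as shown above, a "real" speaker is approximated as any tier with more than `max_tier_size` segments, and only single sandwiched segments are handled. The function name is hypothetical.

```python
from collections import Counter


def dominant_neighbor_merge(supervisions, max_tier_size=3):
    """Reassign tiny unmarked tiers sandwiched between one dominant speaker.

    Returns new dicts; only the `speaker` field may differ from the input.
    """
    counts = Counter(seg.get("speaker") or "" for seg in supervisions)
    merged = [dict(seg) for seg in supervisions]
    for i in range(1, len(supervisions) - 1):
        spk = supervisions[i].get("speaker") or ""
        if counts[spk] > max_tier_size:
            continue  # a main cluster, not a ghost tier
        if (supervisions[i].get("custom") or {}).get("speaker_change"):
            continue  # captioner marked a turn: leave it for forward search
        prev_spk = supervisions[i - 1].get("speaker") or ""
        next_spk = supervisions[i + 1].get("speaker") or ""
        sandwiched = prev_spk == next_spk and prev_spk != spk
        if sandwiched and counts[prev_spk] > max_tier_size:
            merged[i]["speaker"] = prev_spk
    return merged


segments = [
    {"text": "So the model scales.", "speaker": "Alice Chen"},
    {"text": "Yeah.", "speaker": "SPEAKER_02"},
    {"text": "And that matters.", "speaker": "Alice Chen"},
    {"text": "We measured it twice.", "speaker": "Alice Chen"},
    {"text": "Then we tested it.", "speaker": "Alice Chen"},
    {"text": "Hello.", "speaker": "SPEAKER_03",
     "custom": {"original_speaker": ">>", "speaker_change": True}},
    {"text": "Welcome back.", "speaker": "Alice Chen"},
]
merged = dominant_neighbor_merge(segments)
```

In this sample the "Yeah." segment is folded into the surrounding speaker, while the `>>`-marked segment keeps its placeholder so that forward search can handle it.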
meta.md (optional but strong signal — drives both num_speakers and forward-search topic anchors). All fields below are parsed by both the agent-driven path and the CLI-LLM fallback (lai diarize naming / diarize run):
---
title: "Deep Dive into LLMs"
speakers:
- name: Alice Chen
  role: host
  affiliation: "Anthropic (research engineer)" # self-introduction & topic-ownership anchor
  aliases: ["Alice"] # short forms LLM should fold back to full name
  bio: "Host of the show. Background in distributed systems."
- name: Bob Smith
  role: guest
  affiliation: "Stanford AI Lab"
  aliases: ["Bob", "Bobby"]
  bio: "PhD candidate working on RLHF and scaling laws."
topics: ["RLHF", "scaling laws", "alignment"] # episode-level keyword hints
prior_episodes:
- "Episode 42: pretraining — same guest, covers scaling laws"
---
Keep name clean (no ", OpenAI" suffix) — put organizations in affiliation so
the agent can match self-introductions ("I'm a researcher at Stanford" → Bob) to
exactly one speaker, and downstream slug resolvers don't break on commas. aliases
let the LLM map cross-references like "thanks, Swyx" back to the full legal name
instead of inventing a third speaker; bio and topics give the LLM
episode-specific expertise to anchor topical references against.
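To make the roles of name, affiliation and aliases concrete, here is a rough sketch of how a consumer of meta.md might fold aliases back to full names and split an organization out of a polluted name. It assumes the frontmatter has already been parsed into a list of dicts; the helper names `split_name`, `build_alias_index` and `resolve` are hypothetical and not part of LattifAI.

```python
def split_name(raw):
    """Split 'Name, Org' into (name, affiliation); affiliation is None if absent."""
    name, sep, org = raw.partition(",")
    affiliation = org.strip() or None if sep else None
    return name.strip(), affiliation


def build_alias_index(speakers):
    """Map each lowercase name or alias to the speaker's full name."""
    index = {}
    for spk in speakers:
        for label in [spk["name"], *spk.get("aliases", [])]:
            index[label.lower()] = spk["name"]
    return index


def resolve(mention, index):
    """Return the full name for a mention, or None when it matches nobody."""
    return index.get(mention.strip().lower())


speakers = [
    {"name": "Alice Chen", "aliases": ["Alice"]},
    {"name": "Bob Smith", "aliases": ["Bob", "Bobby"]},
]
index = build_alias_index(speakers)
```

A mention that resolves to `None` is a cue to keep the placeholder rather than invent a new speaker.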
When the agent is not in the loop (batch pipelines, CI, unattended scripts), let the CLI do name inference with its own LLM backend:
lai config set diarization.llm.model_name gemini-3-flash-preview # one-time
# Gemini key: see /lai-transcribe
lai diarize run --direct -Y \
podcast.mp3 podcast.aligned.json podcast.diarized.json \
diarization.infer_speakers=true \
diarization.llm.reasoning=true
- diarization.infer_speakers=true: enable CLI-side name inference (requires LLM config above)
- diarization.llm.reasoning=true: ask the LLM to show its reasoning before committing to a name; trades latency for accuracy on ambiguous speakers

You can also pass hints at invocation time without any LLM:
lai diarize run podcast.mp3 podcast.aligned.json podcast.diarized.json \
context="Host: Alice Chen (tech journalist), Guest: Bob Smith (AI researcher)"
# or point at a meta.md (first positional `context` arg also accepts a file path):
lai diarize run podcast.mp3 podcast.aligned.json podcast.diarized.json context=podcast.meta.md
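When a script assembles these invocations, building the argument list in code avoids shell-quoting problems with a free-text context string. The sketch below only reuses the argument forms shown in the commands above (`diarization.num_speakers=N`, `context=…`); the helper name `diarize_argv` is hypothetical.

```python
def diarize_argv(media, aligned, output, num_speakers=None, context=None):
    """Build the argument list for `lai diarize run` with optional hints."""
    argv = ["lai", "diarize", "run", media, aligned, output]
    if num_speakers is not None:
        # Exact speaker count, e.g. the number of entries under speakers in meta.md
        argv.append(f"diarization.num_speakers={num_speakers}")
    if context:
        # Free-text hint or a path to a meta.md file
        argv.append(f"context={context}")
    return argv


argv = diarize_argv(
    "podcast.mp3", "podcast.aligned.json", "podcast.diarized.json",
    num_speakers=2, context="podcast.meta.md",
)
# To execute: subprocess.run(argv, check=True)
```

Passing the list to `subprocess.run` without a shell keeps a context string with spaces or parentheses as a single argument.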
Each supervision gains a speaker field:
{ "text": "Welcome to the show.", "start": 0.0, "end": 2.5, "speaker": "Alice Chen" }
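The agent-driven renaming step (collect placeholder ids, sample a few segments per speaker, then rewrite only the speaker field) can be approximated as follows. This is a sketch that assumes the supervisions are available as a list of dicts shaped like the example above; how they are nested inside the diarized file is not specified here, and the function names are hypothetical.

```python
import re
from collections import defaultdict

PLACEHOLDER = re.compile(r"^SPEAKER_\d+$")


def sample_segments(supervisions, per_speaker=5):
    """Collect up to `per_speaker` text samples for each SPEAKER_XX id."""
    samples = defaultdict(list)
    for seg in supervisions:
        spk = seg.get("speaker") or ""
        if PLACEHOLDER.match(spk) and len(samples[spk]) < per_speaker:
            samples[spk].append(seg["text"])
    return dict(samples)


def apply_names(supervisions, mapping):
    """Return copies with only `speaker` rewritten; unmapped ids stay as they are."""
    named = []
    for seg in supervisions:
        seg = dict(seg)  # copy, so text, start, end and order stay untouched
        if seg.get("speaker") in mapping:
            seg["speaker"] = mapping[seg["speaker"]]
        named.append(seg)
    return named


supervisions = [
    {"text": "Welcome to the show.", "start": 0.0, "end": 2.5, "speaker": "SPEAKER_00"},
    {"text": "Thanks for having me.", "start": 2.5, "end": 4.0, "speaker": "SPEAKER_01"},
    {"text": "Yeah.", "start": 4.0, "end": 4.3, "speaker": "SPEAKER_02"},
]
# The mapping is what the agent decides; SPEAKER_02 is left out because it is unsure.
named = apply_names(supervisions, {"SPEAKER_00": "Alice Chen", "SPEAKER_01": "Bob Smith"})
```

Leaving an id out of the mapping is the code-level equivalent of keeping SPEAKER_XX rather than guessing.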
| Problem | Fix |
|---|---|
| No aligned segments | Run /lai-align first |
| Too many speakers detected (ghost tiers) | Pre-empt: pass diarization.num_speakers=N from meta.md. Post-hoc: dominant-neighbor merge (see Forward Search section) |
| Tiny tier (1–3 segments) of short interjections | Pyannote boundary noise — dominant-neighbor merge into the surrounding speaker, don't treat as a real third party |
| >> segment left as SPEAKER_XX | Run forward search (see above); only keep SPEAKER_XX if no anchor exists within the speaker turn |
| Agent can't confidently name a speaker | Keep SPEAKER_XX and ask the user — don't guess |
| name field contains org (e.g. "Alex Lupsasca, OpenAI") | Split into name: "Alex Lupsasca" + affiliation: "OpenAI"; a comma in name breaks slug resolution downstream |
| Headless run, no LLM configured | lai config set diarization.llm.model_name gemini-3-flash-preview |
- /lai-align: produce the aligned input (required)
- /lai-transcribe: transcript from scratch
- /lai-translate, /lai-summarize: run on diarized output for speaker-aware results