Downloads videos from YouTube, Instagram, X/Twitter, Vimeo, TikTok, or local paths, extracts frames and transcripts (via captions or on-device mlx-whisper), and lets Claude answer questions about the video content.
npx claudepluginhub mathiaschu/watch --plugin watch
Slash command
/watch:watch <video-url-or-path> [question]
You don't have a video input; this skill gives you one. A Python script downloads the video, extracts frames as JPEGs, gets a timestamped transcript (native captions first, then **local mlx-whisper** as fallback — runs on-device, no API and no key), and prints frame paths. You then `Read` each frame path to see the images and combine them with the transcript to answer the user.
Step 0 — setup preflight (runs on every /watch invocation, silent on success)
Python interpreter: every python3 ... command in this skill is for macOS/Linux. On Windows, substitute python — the python3 command on Windows is the Microsoft Store stub and will not run the script.
Before every /watch run, verify that dependencies are in place:
python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py" --check
This is a <100ms lookup. On exit 0, the script emits nothing — proceed to Step 1 without comment. Do NOT announce "setup is complete" to the user — they don't need a status message on every turn. The only acceptable user-visible output from Step 0 is when remediation is required.
On non-zero exit, follow the table:
| Exit | Meaning | Action |
|---|---|---|
| 2 | Missing binaries (ffmpeg / ffprobe / yt-dlp) | Run installer |
| 3 | No local whisper engine (mlx-whisper / openai-whisper) | Run installer, then tell user the pip3 command it prints |
| 4 | Both missing | Run installer |
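The table above can be read as a simple lookup. Here is a minimal sketch, assuming only the exit codes listed in the table; the helper name `action_for_exit` and the returned strings are illustrative and are not part of the skill's scripts.

```python
# Hypothetical helper: maps the documented exit codes of `setup.py --check`
# to the action from the table. Names and strings are illustrative only.
ACTIONS = {
    0: "proceed",                            # silent success, say nothing
    2: "run installer",                      # missing ffmpeg / ffprobe / yt-dlp
    3: "run installer, relay pip3 command",  # no local whisper engine
    4: "run installer",                      # both missing
}


def action_for_exit(code: int) -> str:
    """Return the documented action, or flag a code the table does not cover."""
    return ACTIONS.get(code, f"unexpected exit code {code}: report to user")


print(action_for_exit(3))
```

Treating unknown codes as something to report, rather than as success, keeps an undocumented failure from being mistaken for a ready environment.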
The installer is idempotent — safe to re-run:
python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py"
On macOS with Homebrew, it auto-installs ffmpeg and yt-dlp. On Linux/Windows, it prints the exact install commands for the user to run. For transcription it checks for a local whisper engine (mlx-whisper preferred on Apple Silicon, openai-whisper as a CPU fallback) and prints the pip3 install command if neither is present. No API key, no config file, no .env — transcription runs entirely on-device.
If no whisper engine is installed: run the installer and relay the exact pip3 install … command it prints (mlx-whisper on Apple Silicon, openai-whisper on Windows/Linux/Intel Macs — do not assume mlx, it only installs on Apple Silicon). If the user doesn't want to install it, proceed with --no-whisper and tell them that videos without native captions will come back frames-only.
Structured mode (optional): python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py" --json emits {status, missing_binaries, whisper_backend, has_whisper, platform} where status is one of ready | needs_install | needs_whisper | needs_install_and_whisper.
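As an illustration of consuming the structured output, the sketch below branches on the `status` field. The field names come from the text; the value types in the sample payload (list, null, boolean, string) and the helper name `next_step` are assumptions.

```python
import json

# Sample payload shaped like the fields listed in the text.
# The concrete values and their types are made up for illustration.
sample = (
    '{"status": "needs_whisper", "missing_binaries": [], '
    '"whisper_backend": null, "has_whisper": false, "platform": "darwin"}'
)


def next_step(payload: str) -> str:
    """Pick an action from the `status` field of `setup.py --json` output."""
    status = json.loads(payload)["status"]
    if status == "ready":
        return "proceed"
    if status in ("needs_install", "needs_install_and_whisper"):
        return "run installer"
    if status == "needs_whisper":
        return "run installer, relay pip3 command"
    raise ValueError(f"unknown status: {status!r}")


print(next_step(sample))
```

Raising on an unrecognized status makes a future change to the status vocabulary visible instead of silently falling through.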
Within a single session, you can skip Step 0 on follow-up /watch calls — once --check returned 0, nothing about the environment changes between turns.
Use this skill when the user shares a video URL or a local video file (.mp4, .mov, .mkv, .webm, etc.) and asks about it, or types /watch <url-or-path> [question].
Step 1 — parse the user input. Separate the video source (URL or path) from any question the user asked. Example: /watch https://youtu.be/abc what language is this in? → source = https://youtu.be/abc, question = what language is this in?.
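Step 1 amounts to splitting the argument string once. The following sketch assumes the source contains no whitespace; a local path with spaces would need quoting rules that this does not handle. The function name `split_input` is hypothetical.

```python
from typing import Optional, Tuple


def split_input(args: str) -> Tuple[str, Optional[str]]:
    """Split the text after /watch into (source, question).

    Assumes the source has no whitespace; everything after the first
    run of whitespace is treated as the question.
    """
    parts = args.strip().split(maxsplit=1)
    if not parts:
        raise ValueError("no video source given")
    source = parts[0]
    question = parts[1].strip() if len(parts) > 1 else None
    return source, question


print(split_input("https://youtu.be/abc what language is this in?"))
# ('https://youtu.be/abc', 'what language is this in?')
```

When no question is present the second element is `None`, which corresponds to the "summarize the video" case in Step 4.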
Step 2 — run the watch script. Pass the source verbatim. Do not shell-escape it yourself beyond normal quoting:
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "<source>"
Optional flags:
- `--start T` / `--end T` — focus on a section. Accepts SS, MM:SS, or HH:MM:SS. When either is set, fps auto-scales denser (see "Focusing on a section" below).
- `--max-frames N` — lower the cap for tighter token budget (e.g. --max-frames 40)
- `--resolution W` — change frame width in px (default 512; bump to 1024 only if the user needs to read on-screen text)
- `--fps F` — override auto-fps (clamped to 2 fps max)
- `--out-dir DIR` — keep working files somewhere specific (default: an auto-generated tmp dir)
- `--cookies-from-browser B` — read cookies from a local browser (chrome, firefox, safari, edge, brave, …) for login-gated sources
- `--cookies FILE` — path to a Netscape-format cookies.txt (alternative to --cookies-from-browser)
- `--whisper mlx|openai-whisper` — force a specific local Whisper engine (default: prefer mlx-whisper, fall back to openai-whisper)
- `--no-whisper` — disable the local Whisper fallback entirely (frames-only if no captions)

Public videos (most of YouTube, Vimeo, TikTok, Loom, etc.) download with no auth. But some sources gate the download behind a login: Instagram, X/Twitter, age-restricted or private/unlisted YouTube, members-only content. Those need the user's own cookies.
Do NOT pass cookies pre-emptively. Always try the plain download first. Only reach for cookies when it fails with a login / private / 403 / "login required" / "rate-limit" error. The user never types the flag themselves — you add it and re-run. When that happens, walk the user through it (these are sub-steps of the main Step 2, not the main flow):
(a) Ask which browser they're logged into. "To grab this Instagram video I need to borrow the cookies from a browser where you're logged into Instagram. Which one are you logged in on — Chrome, Safari, Firefox, Edge, or Brave?" Supported values: chrome, firefox, safari, edge, brave, chromium, opera, vivaldi.
(b) Re-run with that browser (on Windows use python, not python3 — see Step 0):
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "https://www.instagram.com/reel/XXXX/" --cookies-from-browser chrome
(c) Handle the common per-browser snags (tell the user the specific fix, don't just retry):
(d) Manual fallback if browser extraction just won't cooperate (most reliable, works on macOS / Windows / Linux): guide the user to export a cookies.txt and pass it with --cookies:
Export the cookies for the gated site (e.g. instagram.com) to a file (e.g. ~/Downloads/cookies.txt), then re-run:
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "<url>" --cookies ~/Downloads/cookies.txt
Privacy note to reassure the user: cookies are read live from their own machine and piped straight into the yt-dlp subprocess. The skill never copies, stores, logs, or transmits them anywhere. The cookies.txt file (if they used the manual fallback) stays on their disk — they can delete it after.
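The "plain download first, cookies only on a login-type failure" rule can be sketched as two small helpers. The markers below are taken from the error kinds named in the text; the exact wording of real yt-dlp error messages is an assumption and may differ, and both function names are hypothetical.

```python
from typing import List

# Substrings based on the error kinds named in the text. Real yt-dlp
# messages may be worded differently; treat these as placeholders.
LOGIN_MARKERS = ("login", "private", "403", "rate-limit")


def needs_cookies(error_text: str) -> bool:
    """True when a failed plain download looks login-gated."""
    lowered = error_text.lower()
    return any(marker in lowered for marker in LOGIN_MARKERS)


def retry_args(base_args: List[str], browser: str) -> List[str]:
    """Build the argument list for the retry; never used on the first attempt."""
    return [*base_args, "--cookies-from-browser", browser]


if needs_cookies("ERROR: Login required to view this reel"):
    print(retry_args(["watch.py", "https://www.instagram.com/reel/XXXX/"], "chrome"))
```

Errors that match none of the markers, such as a region lock, fall outside the retry path, in line with the instruction not to keep retrying unavailable videos.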
When the user asks about a specific moment — "what happens at the 2 minute mark?", "zoom into 0:45 to 1:00", "the first 10 seconds" — pass --start and/or --end. The script switches to focused-mode budgets, which are denser than full-video budgets (still capped at 2 fps):
Focused mode is the right call for:
The transcript is auto-filtered to the same range. Frame timestamps are absolute (positions on the real video timeline, not offsets from --start).
Examples:
# Last 10 seconds of a 1-minute video
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" video.mp4 --start 50 --end 60
# Zoom into 2:15 → 2:45 at 2 fps, the maximum (60 frames)
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "$URL" --start 2:15 --end 2:45 --fps 2
# From 1h12m to the end of the video
python3 "${CLAUDE_SKILL_DIR}/scripts/watch.py" "$URL" --start 1:12:00
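To make the timestamp formats and the fps cap concrete, here is a rough sketch of how SS, MM:SS, and HH:MM:SS values map to seconds and how many frames a focused range yields at a clamped rate. It ignores --max-frames and whatever rounding the real script applies, and it accepts whole seconds only; the function names are hypothetical.

```python
MAX_FPS = 2.0  # cap stated in the text


def parse_timestamp(value: str) -> int:
    """Convert SS, MM:SS, or HH:MM:SS into whole seconds."""
    parts = value.split(":")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"bad timestamp: {value!r}")
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + int(part)
    return seconds


def frame_count(start: str, end: str, fps: float) -> int:
    """Frames sampled between start and end, with fps clamped to the cap."""
    duration = parse_timestamp(end) - parse_timestamp(start)
    if duration <= 0:
        raise ValueError("end must be after start")
    return int(duration * min(fps, MAX_FPS))


print(parse_timestamp("1:12:00"))      # 4320
print(frame_count("2:15", "2:45", 2))  # 60
```

Under this model, a requested rate above the cap produces the same frame count as the cap itself.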
Step 3 — Read every frame path the script lists. The Read tool renders JPEGs directly as images for you. Read all frames in a single message (parallel tool calls) so you see them together. The frames are in chronological order with a t=MM:SS timestamp so you can align them to the transcript.
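Aligning a frame to the transcript means finding the segment that was being spoken at the frame's timestamp. The sketch below assumes a hypothetical transcript shape, a sorted list of `(start_seconds, text)` pairs; the script's real output format is not specified here.

```python
from bisect import bisect_right


def mmss_to_seconds(label: str) -> int:
    """Convert a frame label such as '00:30' into seconds."""
    minutes, seconds = label.split(":")
    return int(minutes) * 60 + int(seconds)


def segment_at(frame_label, segments):
    """Return the text of the last segment starting at or before the frame.

    `segments` is an assumed shape: (start_seconds, text) pairs sorted by start.
    """
    t = mmss_to_seconds(frame_label)
    starts = [start for start, _ in segments]
    index = bisect_right(starts, t) - 1
    return segments[index][1] if index >= 0 else None


segments = [(0, "Hi everyone."), (12, "Today we look at ffmpeg."), (47, "Let's run it.")]
print(segment_at("00:30", segments))  # Today we look at ffmpeg.
```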
Step 4 — answer the user. You now have two streams of evidence:
- Frames: what appears on screen, in chronological order.
- Transcript: what is said, with its source labeled (captions = yt-dlp pulled native subs; whisper (mlx) or whisper (openai-whisper) = transcribed locally on-device).

If the user asked a specific question, answer it directly, citing timestamps. If they didn't ask anything, summarize what happens in the video: structure, key moments, notable visuals, spoken content.
Step 5 — clean up. The script prints a working directory at the end. If the user isn't going to ask follow-ups about this video, delete it with rm -rf <dir>. If they might, leave it in place.
The script gets a timestamped transcript in one of two ways:
1. Native captions: yt-dlp pulls the source's own subtitles when they exist. This is tried first.
2. Local Whisper fallback: when there are no captions, the script extracts a mono 16 kHz audio clip (ffmpeg -vn -ac 1 -ar 16000 -b:a 64k, ~0.5 MB/min) and transcribes it locally:
   - mlx-whisper, using mlx-community/whisper-large-v3-turbo. Preferred on Apple Silicon: fast, runs on the GPU. Same engine ig-scraper uses. Install: pip3 install mlx-whisper.
   - openai-whisper, using the base model on CPU. Cross-platform fallback when mlx isn't available. Install: pip3 install openai-whisper.

The audio never leaves the machine. The script prefers mlx-whisper; override with --whisper openai-whisper. Language is auto-detected. Use --no-whisper to skip the fallback entirely.
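The "~0.5 MB/min" figure follows from the 64 kbit/s audio bitrate. A small sketch of that arithmetic, together with an argument list using the ffmpeg flags quoted above, is shown below. The `-y` and `-i` placement and the output path handling are assumptions; the real script may build the command differently.

```python
def audio_mb_per_minute(bitrate_kbps: float) -> float:
    """Approximate size of one minute of constant-bitrate audio, in MB (10^6 bytes)."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return bytes_per_second * 60 / 1_000_000


def audio_extract_args(video_path: str, audio_path: str) -> list:
    """Build an ffmpeg argv with the flags quoted in the text.

    -vn drops video, -ac 1 is mono, -ar 16000 is 16 kHz, -b:a 64k is the bitrate.
    The -y / -i arguments are assumptions added to make the command complete.
    """
    return ["ffmpeg", "-y", "-i", video_path,
            "-vn", "-ac", "1", "-ar", "16000", "-b:a", "64k", audio_path]


print(audio_mb_per_minute(64))  # 0.48
```

At that rate a one-hour video produces roughly 29 MB of audio for the local transcription step.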
- Missing dependencies: run python3 "${CLAUDE_SKILL_DIR}/scripts/setup.py" (auto-installs ffmpeg/yt-dlp via brew on macOS; prints exact commands on Windows/Linux). If it reports no whisper engine, relay the exact pip3 install … command it printed (mlx-whisper only on Apple Silicon, openai-whisper elsewhere).
- No transcript (no native captions, and --no-whisper set OR transcription failed): the script prints a hint pointing to setup. Proceed frames-only and tell the user.
- Long video: prefer a focused pass with --start/--end rather than a sparse full-video scan.
- Download fails on a login-gated source: retry with --cookies-from-browser <browser> using a browser the user is logged into (see "Login-gated sources" above). If it's region-locked or genuinely unavailable, tell the user plainly; do not keep retrying.

This skill burns tokens primarily on frames. Order of magnitude:
- Raising --resolution to 1024 roughly quadruples the image tokens per frame. Only do it when necessary.

If you already watched a video this session and the user asks a follow-up, do not re-run the script — you already have the frames and transcript in context. Just answer from what you have.
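The "roughly quadruples" figure for --resolution 1024 is what you get if image tokens scale with pixel count and the aspect ratio is fixed, so that doubling the width also doubles the height. That scaling model is an assumption; the sketch only shows the arithmetic.

```python
def token_scale(new_width: int, base_width: int = 512) -> float:
    """Relative per-frame cost, assuming cost grows with pixel area at a fixed aspect ratio."""
    return (new_width / base_width) ** 2


print(token_scale(1024))  # 4.0
```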
What this skill does:
- Runs yt-dlp locally to download the video and pull native captions when the source supports them (public data; the request goes directly to whatever host the URL points at)
- Runs ffmpeg / ffprobe locally to extract frames as JPEGs and, when Whisper is needed, a mono 16 kHz audio clip
- Writes frames to a working directory (an auto-generated tmp dir, or --out-dir if specified) so Claude can Read them
- Keeps downloaded Whisper model files under ~/.cache/huggingface

What this skill does NOT do:
- Read browser cookies, unless you pass --cookies-from-browser / --cookies for a login-gated source, and only to authenticate the yt-dlp download. Those cookies are read live and never copied, stored, logged, or transmitted by the skill.
- Require credentials: no API key, no .env, no config file, no secrets

Bundled scripts: scripts/watch.py (entry point), scripts/download.py (yt-dlp wrapper), scripts/frames.py (ffmpeg frame extraction), scripts/transcribe.py (caption parsing), scripts/whisper.py (local mlx/openai-whisper transcription), scripts/setup.py (preflight + installer)
Review scripts before first use to verify behavior.