watch-cli

Watch any social video → get an architecture diagram, working component, runnable notebook, or step-by-step cheat sheet — automatically.
Eyes and ears for your AI agent. watch-cli composes yt-dlp + ffmpeg + a Whisper-class ASR into a single command that hands an agent the raw materials to "watch" any video: VIDEO + FRAMES + TRANSCRIPT, ready for an LLM to read frames as images and transcript as text.
watch https://twitter.com/anyone/status/12345
Works on YouTube, X, LinkedIn, TikTok, Reddit, Vimeo, and Facebook. Login-walled posts (LinkedIn, private X, FB) fall back to your browser cookies automatically.
What you can build
Hand the watch output to your agent with one of five prompts in prompts/:
The prompt library is what turns "video → frames + transcript" into "video → working artifact". The full Prompt library section below has copy-paste templates.
Why this exists
Large language models can't watch video natively — they read text and
look at still images. You can hand a video to a multimodal API and get
back a chat-style summary, but for an agent workflow that's the wrong
artifact: the agent wants the raw frames and the full transcript so it
can reason for itself, not someone else's pre-digested recap.
A video is just frames + audio, and each piece already has a fast,
near-free primitive:
yt-dlp downloads from any social platform
ffmpeg extracts evenly-spaced frames
- An ASR model transcribes the audio
- A multimodal LLM hears tone, music, SFX, language, mood
Compose them and your agent has the materials to watch any social video.
What it looks like
$ watch https://www.linkedin.com/posts/some-talk_activity-12345
VIDEO: /tmp/dl-video/abc123.mp4
DURATION: 218
FRAMES:
/tmp/frames_abc123/frame_01.jpg
/tmp/frames_abc123/frame_02.jpg
…
TRANSCRIPT:
Today I want to talk about how decomposition unlocks 10× cost reduction in
multimodal pipelines …
Your agent reads the JPGs and the transcript. That's the whole watch.
Why pay-per-use, not subscription
Most subscription summary tools start around $15/month and deliver a
polished, human-readable summary. If you're feeding an AI agent, that's the
wrong artifact — agents need raw frames and the full transcript to reason
for themselves, not someone else's pre-digested recap.
A typical research session is 1–3 videos, not 100. Through Kyma — the default
backend — a 1-hour video costs ~$0.05 (transcribe is the only paid step;
frame extraction is local ffmpeg).
| This month you watch | You pay |
|---|
| 0 videos | $0 |
| 1 one-hour video | ~$0.05 |
| 100 one-hour videos | ~$5 |
No monthly minimum, no seat license, no lock-in. The free credit at Kyma
signup is enough to run the full pipeline end-to-end before you spend a cent.
Install
# macOS — Homebrew (recommended)
brew tap sonpiaz/tap
brew install watch-cli
# Any OS — curl
curl -fsSL https://github.com/sonpiaz/watch-cli/releases/latest/download/install.sh | bash
The curl one-liner auto-falls back to git clone of main if no
published release tarball is reachable.
Claude Code (skill marketplace)
If you use Claude Code, install watch-cli as a skill:
/plugin marketplace add sonpiaz/watch-cli
/plugin install watch-cli@watch-cli
The agent then picks up watch <url> as a first-class command.
Pin a specific version:
WATCH_CLI_VERSION=0.3.0 curl -fsSL \
https://github.com/sonpiaz/watch-cli/releases/download/v0.3.0/install.sh | bash
Or from a clone:
git clone https://github.com/sonpiaz/watch-cli ~/.watch-cli
cd ~/.watch-cli && ./install.sh
The installer checks for yt-dlp, ffmpeg, jq, curl, python3 and
symlinks the commands into ~/.local/bin. On macOS:
brew install yt-dlp ffmpeg jq
On Debian/Ubuntu:
sudo apt install yt-dlp ffmpeg jq python3 curl
Optional install flags
./install.sh --with-skill # also drop SKILL.md into ~/.claude/skills/watch-cli/
./install.sh --with-mcp # print the npm install hint for the MCP stdio server