YouTube Channel Corpus Analyzer

A Python tool that analyzes an entire YouTube channel catalog: it extracts transcripts and metadata, runs structured analysis through the Anthropic API, and generates a pattern-based constitution document.

Features

Data Acquisition: Pull complete channel video catalogs with transcripts using yt-dlp
Whisper Fallback: Automatic audio transcription via OpenAI Whisper when VTT captions are unavailable
Block Detection: Detects monetized/members-only videos and retries with browser cookies automatically
Corpus Indexing: Store and query video data using DuckDB
Feature Extraction: Analyze transcripts using Claude to extract structured patterns
Visual Analysis: Analyze thumbnails for body language, facial expressions, and visual hooks
Retry Queue: Failed extractions are queued and retried automatically on subsequent runs
Constitution Synthesis: Generate comprehensive markdown reports of channel patterns
MCP Integration: Use as an MCP server in Claude Desktop for interactive analysis
Claude Code Plugin: /analyze slash command and autopilot skill for one-command analysis

Requirements

Python 3.11+
yt-dlp
ffmpeg (required for Whisper audio extraction)
Anthropic API key

Installation

See INSTALL.md for full instructions covering Claude Code (plugin) and Claude Desktop (MCP server).

Quick install:

git clone https://github.com/crichalchemist/yt-analysis.git
cd yt-analysis
pip install -r requirements.txt
export ANTHROPIC_API_KEY="sk-ant-..."

Usage

Claude Code (plugin)

/analyze https://www.youtube.com/@channelname ./output --min-views 100000

Or invoke the autopilot skill directly:

Use the yt-analysis skill to analyze https://www.youtube.com/@channelname

CLI

python main.py "https://www.youtube.com/@channelname" ./output --min-views 100000

MCP Server (Claude Desktop)

Configure claude_desktop_config.json (see INSTALL.md Section B), then use the MCP tools:

pull_and_index — pull channel and create database
run_extraction — extract features from videos
generate_constitution — create constitution markdown
query_corpus — run SQL queries on the corpus

Output Files

output/
  corpus.db              # DuckDB database
  features.jsonl         # append-only extraction log
  retry_queue.jsonl      # videos pending retry (removed when all recovered)
  constitution.md        # synthesized pattern analysis
  constitution_raw.json  # raw features used for synthesis
  errors.log             # WARNING+ level logs
  YYYYMMDD_VIDEOID/      # per-video raw files from yt-dlp
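
Because features.jsonl is append-only, the same video may appear more than once (for example after a successful retry). The sketch below shows one way to read the log and keep only the latest record per video. It assumes each line is a JSON object carrying the video id under an "id" key; that key name and the function name load_latest_features are illustrative, not confirmed by the tool.

```python
import json
from pathlib import Path


def load_latest_features(path):
    """Return {video_id: record}, keeping the last record seen for each id."""
    latest = {}
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # An interrupted run can leave a partially written last line.
                print(f"skipping malformed line {line_no}")
                continue
            if not isinstance(record, dict):
                continue
            video_id = record.get("id")  # assumed key name
            if video_id is not None:
                latest[video_id] = record  # later lines overwrite earlier ones
    return latest
```

Calling load_latest_features("output/features.jsonl") would then give a dictionary that can be inspected without opening the database.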

Architecture

Pipeline

acquire.py   → pull_channel(): yt-dlp subprocess, block detection + cookie retry
               build_record(): *.info.json + *.en.vtt → Dict
                               falls back to Whisper audio transcription if no VTT

index.py     → DuckDB single table: videos(id, title, date, views, duration_seconds,
               transcript, thumbnail_path, channel_url, features_json)

extract.py   → Per-video: Claude with transcript + optional base64 thumbnail
               → structured JSON stored in features_json
               Failed videos → retry_queue.jsonl, retried at end of each run

synthesize.py → Top-performers (>= min_views) → constitution.md + constitution_raw.json
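
The retry behaviour described for extract.py (failures written to retry_queue.jsonl, retried at the end of a run, file removed once everything has recovered) could be sketched roughly as follows. This is not the tool's actual code: the function names, the extract_one callable, and the one-object-per-line queue format with an "id" key are all assumptions made for illustration.

```python
import json
from pathlib import Path


def read_queue(queue_path):
    """Return the queued video ids (empty list if the file is absent)."""
    path = Path(queue_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line)["id"] for line in fh if line.strip()]


def write_queue(queue_path, video_ids):
    """Persist pending ids, or delete the file when nothing is pending."""
    path = Path(queue_path)
    if not video_ids:
        path.unlink(missing_ok=True)
        return
    with path.open("w", encoding="utf-8") as fh:
        for video_id in video_ids:
            fh.write(json.dumps({"id": video_id}) + "\n")


def run_with_retry_queue(video_ids, extract_one, queue_path):
    """Process new videos, then retry earlier and new failures once.

    extract_one(video_id) is a hypothetical callable that raises on failure.
    Returns the ids that are still pending after this run.
    """
    previously_failed = read_queue(queue_path)
    failed = []
    for video_id in video_ids:
        try:
            extract_one(video_id)
        except Exception:
            failed.append(video_id)

    # Retry pass at the end of the run; dict.fromkeys de-duplicates in order.
    still_failing = []
    for video_id in dict.fromkeys(previously_failed + failed):
        try:
            extract_one(video_id)
        except Exception:
            still_failing.append(video_id)

    write_queue(queue_path, still_failing)
    return still_failing
```

Deleting the file when the pending list is empty mirrors the "removed when all recovered" note in the output listing, so the presence of retry_queue.jsonl doubles as a signal that work remains.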

Extracted Features

Text analysis (from transcript):

hook_type: question, stat, story, contrast, or void
hook_text: first two sentences of transcript
structure: array of section labels
cta_present: boolean
emotional_arc: flat, rise, fall, or rise-fall
key_claims: list of 3–5 main claims
topic_category: primary topic
estimated_retention_signal: low, medium, or high

Visual analysis (from thumbnail, when available):

visual_hook_elements, body_language, visual_context, thumbnail_effectiveness

Extraction falls back to text-only analysis when the thumbnail is missing or unreadable.
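
The text-analysis fields above read like a small schema, so model output can be checked against it before it is stored. The following is a minimal validation sketch using only the allowed values listed above; the function name validate_features and the decision to treat the visual fields as optional (and leave them unchecked) are assumptions, not part of the tool.

```python
HOOK_TYPES = {"question", "stat", "story", "contrast", "void"}
EMOTIONAL_ARCS = {"flat", "rise", "fall", "rise-fall"}
RETENTION_SIGNALS = {"low", "medium", "high"}


def validate_features(features):
    """Return a list of problems found in one extracted-features dict."""
    problems = []

    def check_enum(key, allowed):
        value = features.get(key)
        if not isinstance(value, str) or value not in allowed:
            problems.append(f"{key}: expected one of {sorted(allowed)}, got {value!r}")

    check_enum("hook_type", HOOK_TYPES)
    check_enum("emotional_arc", EMOTIONAL_ARCS)
    check_enum("estimated_retention_signal", RETENTION_SIGNALS)

    for key in ("hook_text", "topic_category"):
        if not isinstance(features.get(key), str):
            problems.append(f"{key}: expected a string")

    if not isinstance(features.get("cta_present"), bool):
        problems.append("cta_present: expected a boolean")

    structure = features.get("structure")
    if not (isinstance(structure, list) and all(isinstance(s, str) for s in structure)):
        problems.append("structure: expected a list of section labels")

    claims = features.get("key_claims")
    if not (isinstance(claims, list) and 3 <= len(claims) <= 5):
        problems.append("key_claims: expected a list of 3 to 5 claims")

    return problems
```

An empty list means the record matches the documented shape; a non-empty list could be logged or used as a reason to queue the video for retry.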

Database Schema

CREATE TABLE videos (
    id TEXT PRIMARY KEY,
    title TEXT,
    date TEXT,
    views INTEGER,
    duration_seconds INTEGER,
    transcript TEXT,
    thumbnail_path TEXT,
    channel_url TEXT,
    features_json TEXT
)
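
As an illustration of querying this table from Python, the sketch below averages views per hook_type. It assumes features_json holds a JSON object with a hook_type key, as described under Extracted Features. The helper only relies on con.execute(sql, params).fetchall(), which both DuckDB and sqlite3 connections provide; the demo uses an in-memory sqlite3 database purely as a stand-in so the example runs without extra dependencies. The function names and sample rows are made up.

```python
import json
import sqlite3
from collections import defaultdict


def views_by_hook_type(con, min_views=0):
    """Average views per hook_type for videos that have extracted features."""
    rows = con.execute(
        "SELECT views, features_json FROM videos "
        "WHERE features_json IS NOT NULL AND views >= ?",
        [min_views],
    ).fetchall()
    grouped = defaultdict(list)
    for views, features_json in rows:
        # Parse JSON in Python to stay independent of SQL JSON dialects.
        hook = json.loads(features_json).get("hook_type", "unknown")
        grouped[hook].append(views)
    return {hook: sum(v) / len(v) for hook, v in grouped.items()}


def demo_connection():
    """Build an in-memory stand-in database with the documented schema."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE videos (id TEXT PRIMARY KEY, title TEXT, date TEXT, "
        "views INTEGER, duration_seconds INTEGER, transcript TEXT, "
        "thumbnail_path TEXT, channel_url TEXT, features_json TEXT)"
    )
    sample = [  # fabricated rows, for illustration only
        ("v1", 200000, json.dumps({"hook_type": "question"})),
        ("v2", 100000, json.dumps({"hook_type": "question"})),
        ("v3", 50000, json.dumps({"hook_type": "stat"})),
        ("v4", 900000, None),  # features not extracted yet
    ]
    for video_id, views, features in sample:
        con.execute(
            "INSERT INTO videos (id, views, features_json) VALUES (?, ?, ?)",
            [video_id, views, features],
        )
    return con
```

Against a real corpus, the same helper should work with a connection opened via duckdb.connect("output/corpus.db"), provided the duckdb package is installed.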

Constitution Sections

Hook Pattern Index
Structural Templates
Topic-Performance Matrix
Anomalies (high-performers that break patterns)
High-Signal Phrasing Patterns
Null Hypotheses (frequent patterns with no view correlation)

Cost Estimates

For a typical channel with ~170 videos:

Text-only: ~1.04M input tokens, ~173K output tokens
With visual analysis: ~1.2M input tokens, ~260K output tokens
Synthesis: ~50K input tokens, ~4K output tokens
Estimated: $8–12 USD using Claude Sonnet with visual analysis

For large channels, consider Anthropic's Message Batches API, which is billed at a 50% discount in exchange for asynchronous processing.
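
The arithmetic behind an estimate like this can be reproduced with a few lines. The token counts below come from the figures above; the per-million-token prices (3.00 USD input, 15.00 USD output) are placeholder assumptions and should be replaced with the currently published rates for whichever model is used. The function name estimate_cost is illustrative.

```python
def estimate_cost(input_tokens, output_tokens, input_price, output_price, discount=0.0):
    """Cost in USD given token counts and prices per million tokens.

    discount is a fraction, e.g. 0.5 for a 50% batch discount.
    """
    cost = input_tokens / 1_000_000 * input_price
    cost += output_tokens / 1_000_000 * output_price
    return cost * (1.0 - discount)


# Assumed prices in USD per million tokens; check current pricing.
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

# Token counts from the ~170-video estimate, with visual analysis.
extraction = estimate_cost(1_200_000, 260_000, INPUT_PRICE, OUTPUT_PRICE)
synthesis = estimate_cost(50_000, 4_000, INPUT_PRICE, OUTPUT_PRICE)
total = extraction + synthesis

# The same extraction workload with a 50% batch discount applied.
extraction_batched = estimate_cost(
    1_200_000, 260_000, INPUT_PRICE, OUTPUT_PRICE, discount=0.5
)
```

At these placeholder rates the total comes to about 7.71 USD, and batching halves the extraction portion. Actual spend depends on the model, current prices, and transcript lengths.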
