From samhvw8-dot-claude
Multimodal AI processing via Google Gemini API (2M tokens context). Capabilities: audio (transcription, 9.5hr max, summarization, music analysis), images (captioning, OCR, object detection, segmentation, visual Q&A), video (scene detection, 6hr max, YouTube URLs, temporal analysis), documents (PDF extraction, tables, forms, charts), image generation (text-to-image, editing). Actions: transcribe, analyze, extract, caption, detect, segment, generate from media. Keywords: Gemini API, audio transcription, image captioning, OCR, object detection, video analysis, PDF extraction, text-to-image, multimodal, speech recognition, visual Q&A, scene detection, YouTube transcription, table extraction, form processing, image generation, Imagen. Use when: transcribing audio/video, analyzing images/screenshots, extracting data from PDFs, processing YouTube videos, generating images from text, implementing multimodal AI features.
npx claudepluginhub joshuarweaver/cascade-code-languages-misc-1 --plugin samhvw8-dot-claudeThis skill is limited to using the following tools:
Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.
references/audio-processing.mdreferences/image-generation.mdreferences/video-analysis.mdreferences/vision-understanding.mdscripts/document_converter.pyscripts/gemini_batch_process.pyscripts/media_optimizer.pyscripts/requirements.txtscripts/tests/requirements.txtscripts/tests/test_document_converter.pyscripts/tests/test_failures.logscripts/tests/test_gemini_batch_process.pyscripts/tests/test_media_optimizer.pySearches, retrieves, and installs Agent Skills from prompts.chat registry using MCP tools like search_skills and get_skill. Activates for finding skills, browsing catalogs, or extending Claude.
Searches prompts.chat for AI prompt templates by keyword or category, retrieves by ID with variable handling, and improves prompts via AI. Use for discovering or enhancing prompts.
Checks Next.js compilation errors using a running Turbopack dev server after code edits. Fixes actionable issues before reporting complete. Replaces `next build`.
Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.
| Task | Audio | Image | Video | Document | Generation |
|---|---|---|---|---|---|
| Transcription | ✓ | - | ✓ | - | - |
| Summarization | ✓ | ✓ | ✓ | ✓ | - |
| Q&A | ✓ | ✓ | ✓ | ✓ | - |
| Object Detection | - | ✓ | ✓ | - | - |
| Text Extraction | - | ✓ | - | ✓ | - |
| Structured Output | ✓ | ✓ | ✓ | ✓ | - |
| Creation | TTS | - | - | - | ✓ |
| Timestamps | ✓ | - | ✓ | - | - |
| Segmentation | - | ✓ | - | - | - |
API Key Setup: Supports both Google AI Studio and Vertex AI.
The skill checks for GEMINI_API_KEY in this order:
export GEMINI_API_KEY="your-key".env.claude/.env.claude/skills/.env.claude/skills/ai-multimodal/.envGet API key: https://aistudio.google.com/apikey
For Vertex AI:
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1 # Optional
Install SDK:
pip install google-genai python-dotenv pillow
Transcribe Audio:
python scripts/gemini_batch_process.py \
--files audio.mp3 \
--task transcribe \
--model gemini-2.5-flash
Analyze Image:
python scripts/gemini_batch_process.py \
--files image.jpg \
--task analyze \
--prompt "Describe this image" \
--output docs/assets/<output-name>.md \
--model gemini-2.5-flash
Process Video:
python scripts/gemini_batch_process.py \
--files video.mp4 \
--task analyze \
--prompt "Summarize key points with timestamps" \
--output docs/assets/<output-name>.md \
--model gemini-2.5-flash
Extract from PDF:
python scripts/gemini_batch_process.py \
--files document.pdf \
--task extract \
--prompt "Extract table data as JSON" \
--output docs/assets/<output-name>.md \
--format json
Generate Image:
python scripts/gemini_batch_process.py \
--task generate \
--prompt "A futuristic city at sunset" \
--output docs/assets/<output-file-name> \
--model gemini-2.5-flash-image \
--aspect-ratio 16:9
Optimize Media:
# Prepare large video for processing
python scripts/media_optimizer.py \
--input large-video.mp4 \
--output docs/assets/<output-file-name> \
--target-size 100MB
# Batch optimize multiple files
python scripts/media_optimizer.py \
--input-dir ./videos \
--output-dir docs/assets/optimized \
--quality 85
Convert Documents to Markdown:
# Convert to PDF
python scripts/document_converter.py \
--input document.docx \
--output docs/assets/document.md
# Extract pages
python scripts/document_converter.py \
--input large.pdf \
--output docs/assets/chapter1.md \
--pages 1-20
For detailed implementation guidance, see:
references/audio-processing.md - Transcription, analysis, TTS
references/vision-understanding.md - Captioning, detection, OCR
references/video-analysis.md - Scene detection, temporal understanding
references/document-extraction.md - PDF processing, structured output
references/image-generation.md - Text-to-image, editing
Input Pricing:
Token Rates:
TTS Pricing:
gemini-2.5-flash for most tasks (best price/performance)media_optimizer.py)Free Tier:
YouTube Limits:
Storage Limits:
Common errors and solutions:
All scripts support unified API key detection and error handling:
gemini_batch_process.py: Batch process multiple media files
media_optimizer.py: Prepare media for Gemini API
document_converter.py: Convert documents to PDF
Run any script with --help for detailed usage.