Process and generate multimedia content using the Google Gemini API. Capabilities include analyzing audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis, up to 9.5 hours), understanding images (captioning, object detection, OCR, visual Q&A, segmentation), processing videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extracting from documents (PDF tables, forms, charts, diagrams, multi-page), and generating images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.
Processes audio, images, videos, and documents using Google Gemini's multimodal API.
Install via the plugin marketplace:

/plugin marketplace add zircote/agents
/plugin install zircote-zircote@zircote/agents

This skill is limited to using the following bundled files:

- references/audio-processing.md
- references/image-generation.md
- references/video-analysis.md
- references/vision-understanding.md
- scripts/document_converter.py
- scripts/gemini_batch_process.py
- scripts/media_optimizer.py
- scripts/requirements.txt
- scripts/tests/requirements.txt
- scripts/tests/test_document_converter.py
- scripts/tests/test_gemini_batch_process.py
- scripts/tests/test_media_optimizer.py

Process audio, images, videos, and documents, and generate images, using Google Gemini's multimodal API: a unified interface for multimedia content understanding and generation.
| Task | Audio | Image | Video | Document | Generation |
|---|---|---|---|---|---|
| Transcription | Y | - | Y | - | - |
| Summarization | Y | Y | Y | Y | - |
| Q&A | Y | Y | Y | Y | - |
| Object Detection | - | Y | Y | - | - |
| Text Extraction | - | Y | - | Y | - |
| Structured Output | Y | Y | Y | Y | - |
| Creation | TTS | - | - | - | Y |
| Timestamps | Y | - | Y | - | - |
| Segmentation | - | Y | - | - | - |
API Key Setup: Supports both Google AI Studio and Vertex AI.
The skill checks for GEMINI_API_KEY in this order:
1. `GEMINI_API_KEY` environment variable (`export GEMINI_API_KEY="your-key"`)
2. `.env`
3. `.claude/.env`
4. `.claude/skills/.env`
5. `.claude/skills/ai-multimodal/.env`

Get an API key: https://aistudio.google.com/apikey
For Vertex AI:
<example type="usage">
<code language="bash">
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional
</code>
</example>

Install SDK:
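A script can select the backend from these environment variables. This is a minimal sketch under the assumption that the variables above are set as documented; the returned kwargs are shaped for google-genai's `Client` constructor (e.g. `genai.Client(**gemini_client_kwargs())`), but `gemini_client_kwargs` itself is a hypothetical helper.

```python
import os

def gemini_client_kwargs() -> dict:
    """Choose the Gemini backend from the env vars documented above.

    GEMINI_USE_VERTEX=true selects Vertex AI; otherwise the
    Google AI Studio API key is used.
    """
    if os.environ.get("GEMINI_USE_VERTEX", "").lower() == "true":
        return {
            "vertexai": True,
            "project": os.environ["VERTEX_PROJECT_ID"],
            # Falls back to the default region when unset.
            "location": os.environ.get("VERTEX_LOCATION", "us-central1"),
        }
    return {"api_key": os.environ["GEMINI_API_KEY"]}
```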
<example type="usage">
<code language="bash">
pip install google-genai python-dotenv pillow
</code>
</example>

Transcribe Audio:
<example type="usage">
<code language="bash">
python scripts/gemini_batch_process.py \
  --files audio.mp3 \
  --task transcribe \
  --model gemini-2.5-flash
</code>
</example>

Analyze Image:
<example type="usage">
<code language="bash">
python scripts/gemini_batch_process.py \
  --files image.jpg \
  --task analyze \
  --prompt "Describe this image" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
</code>
</example>

Process Video:
<example type="usage">
<code language="bash">
python scripts/gemini_batch_process.py \
  --files video.mp4 \
  --task analyze \
  --prompt "Summarize key points with timestamps" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
</code>
</example>

Extract from PDF:
<example type="usage">
<code language="bash">
python scripts/gemini_batch_process.py \
  --files document.pdf \
  --task extract \
  --prompt "Extract table data as JSON" \
  --output docs/assets/<output-name>.md \
  --format json
</code>
</example>

Generate Image:
<example type="usage">
<code language="bash">
python scripts/gemini_batch_process.py \
  --task generate \
  --prompt "A futuristic city at sunset" \
  --output docs/assets/<output-file-name> \
  --model gemini-2.5-flash-image \
  --aspect-ratio 16:9
</code>
</example>

Optimize Media:
<example type="usage">
<code language="bash">
# Prepare large video for processing
python scripts/media_optimizer.py \
  --input large-video.mp4 \
  --output docs/assets/<output-file-name> \
  --target-size 100MB

# Batch-optimize a directory
python scripts/media_optimizer.py \
  --input-dir ./videos \
  --output-dir docs/assets/optimized \
  --quality 85
</code>
</example>
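Before invoking the optimizer, a script can decide how to submit a file at all. The sketch below is illustrative only: `upload_strategy` is a hypothetical helper, and the thresholds are assumptions to verify against current Gemini documentation (inline request payloads are capped around 20 MB; larger files go through the Files API; very large files are best compressed first).

```python
from pathlib import Path

# Assumed thresholds; confirm against current Gemini API docs.
INLINE_LIMIT_MB = 20     # max size for inline request data
OPTIMIZE_LIMIT_MB = 100  # above this, compress with media_optimizer.py first

def upload_strategy(path: Path) -> str:
    """Pick a submission route based on file size on disk."""
    size_mb = path.stat().st_size / (1024 * 1024)
    if size_mb <= INLINE_LIMIT_MB:
        return "inline"
    if size_mb <= OPTIMIZE_LIMIT_MB:
        return "files-api"
    return "optimize-first"
```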
Convert Documents to Markdown:
<example type="usage">
<code language="bash">
# Convert a DOCX document to Markdown
python scripts/document_converter.py \
  --input document.docx \
  --output docs/assets/document.md

# Convert a page range from a large PDF
python scripts/document_converter.py \
  --input large.pdf \
  --output docs/assets/chapter1.md \
  --pages 1-20
</code>
</example>
For detailed implementation guidance, see:
- `references/audio-processing.md` - Transcription, analysis, TTS
- `references/vision-understanding.md` - Captioning, detection, OCR
- `references/video-analysis.md` - Scene detection, temporal understanding
- `references/document-extraction.md` - PDF processing, structured output
- `references/image-generation.md` - Text-to-image, editing
Pricing and limits: input pricing, token rates, and TTS pricing vary by model and modality, and free-tier quotas, YouTube URL limits, and storage limits also apply; check the current Google AI pricing documentation before running large batch jobs. Use `gemini-2.5-flash` for most tasks (best price/performance), and compress oversized media first with `media_optimizer.py`.
Common errors and solutions:
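Rate limiting is among the most common failures in batch jobs. The following is a generic retry sketch, not part of the bundled scripts: `with_retries` is a hypothetical helper, and the `retryable` tuple should be narrowed to the SDK's actual transient error types (e.g. HTTP 429/503 responses) in real code.

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0, retryable=(Exception,)):
    """Retry fn() with exponential backoff plus jitter.

    Raises the last error if all attempts fail.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise
            # 1s, 2s, 4s, ... plus up to 0.5s of jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```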
All scripts support unified API key detection and error handling:
- `gemini_batch_process.py`: Batch process multiple media files
- `media_optimizer.py`: Prepare media for the Gemini API
- `document_converter.py`: Convert documents to Markdown
Run any script with --help for detailed usage.