From gemini
This skill should be used when the user asks to analyze a video, process images, transcribe audio, read or summarize a PDF, extract text from a screenshot, convert a diagram to code, or perform any visual analysis. Relevant when the user says "transcribe this audio file," "what's in this video," or "turn this diagram into code."
Slash command
/gemini:gemini-multimodal
**Invoke using `/gemini-media` or `mcp__gemini__gemini_execute` with file paths in the prompt.**
Gemini can directly process:
Before sending files to Gemini:
| Media | Model | Timeout (ms) | Notes |
|---|---|---|---|
| Video (long) | pro | 2400000 | Complex temporal analysis |
| Video (short) | flash | 300000 | Quick extraction |
| Audio (long) | pro | 2400000 | Full transcription |
| Audio (short) | flash | 300000 | Quick transcription |
| Images | flash | 300000 | Most image tasks are fast |
| Complex diagrams | pro | 300000 | Architecture, flowcharts |
| PDFs (long) | pro | 2400000 | Multi-page analysis |
| PDFs (short) | flash | 300000 | Quick extraction |
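The table above can be read as a simple lookup from media type to model and timeout. The following is a minimal sketch of that lookup, not part of the plugin itself: `Settings`, `pick_settings`, and the `long` flag are hypothetical names introduced for illustration, and the timeout values are copied from the table as-is (treating them as milliseconds is an assumption).

```python
from typing import NamedTuple


class Settings(NamedTuple):
    """Model and timeout for one row of the table (hypothetical helper type)."""
    model: str
    timeout: int  # copied verbatim from the table; assumed to be milliseconds


# Rows that distinguish long and short inputs: (media, is_long) -> settings.
_BY_LENGTH = {
    ("video", True): Settings("pro", 2_400_000),
    ("video", False): Settings("flash", 300_000),
    ("audio", True): Settings("pro", 2_400_000),
    ("audio", False): Settings("flash", 300_000),
    ("pdf", True): Settings("pro", 2_400_000),
    ("pdf", False): Settings("flash", 300_000),
}

# Rows with a single entry regardless of input length.
_FIXED = {
    "image": Settings("flash", 300_000),
    "complex_diagram": Settings("pro", 300_000),
}


def pick_settings(media: str, long: bool = False) -> Settings:
    """Return the table's model/timeout pair for a media kind."""
    kind = media.lower()
    if kind in _FIXED:
        return _FIXED[kind]
    try:
        return _BY_LENGTH[(kind, long)]
    except KeyError:
        raise ValueError(f"no table row for media kind: {media!r}") from None


if __name__ == "__main__":
    print(pick_settings("video", long=True))
    print(pick_settings("image"))
```

Keeping the rows in plain dictionaries means the sketch stays in step with the table: adding or changing a row is a one-line edit, and an unknown media kind fails loudly instead of silently falling back to a default.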
Multimodal analysis often feeds into code work:
npx claudepluginhub naluforge/geminicli-cc-plugin --plugin gemini
Analyzes media files (PDFs, images, diagrams, screenshots) using a vision backend to extract structured data, descriptions, or summaries instead of literal file reading.
Processes audio, images, videos, and PDFs, and generates images/videos using Google Gemini, Imagen, and Veo models. Useful for transcription, OCR, visual Q&A, document extraction, and media generation.
Analyzes images with MiniMax vision tool for description, OCR, text extraction, UI mockup review, chart data parsing, diagrams. Auto-triggers on image shares or analysis requests.