Skill

gemini-multimodal

This skill should be used when the user asks to analyze a video, process images, transcribe audio, read or summarize a PDF, extract text from a screenshot, convert a diagram to code, or perform any visual analysis. Relevant when the user says "transcribe this audio file," "what's in this video," or "turn this diagram into code."

Invocation

Slash command

/gemini:gemini-multimodal

User invocable

Model invocable

Inline context

Default effort

SKILL.md

72 lines · ~701 tokens

Stats

Language: TypeScript

Stars: 2

Maintenance: Good

Last Commit: Apr 2, 2026

Multimodal Processing with Gemini

Invoke using `/gemini-media` or `mcp__gemini__gemini_execute` with file paths in the prompt.

Supported Media Types

Gemini can directly process:

Video: MP4, WebM, MOV — analysis, summarization, scene detection
Audio: MP3, WAV, FLAC — transcription, speaker detection, analysis
Images: PNG, JPG, WebP, GIF — OCR, analysis, diagram interpretation
PDFs: Multi-page document analysis, extraction, summarization
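
As a rough illustration of the list above, a helper could map a file path to one of the four media categories by its extension. This is only a sketch: `MediaKind`, `EXTENSIONS`, and `classifyMedia` are hypothetical names, not part of the skill, and treating `jpeg` as an alias for JPG is an assumption.

```typescript
// Hypothetical helper: names and structure are illustrative, not part of the skill.
type MediaKind = "video" | "audio" | "image" | "pdf";

// Extensions taken from the "Supported Media Types" list.
const EXTENSIONS: Record<MediaKind, readonly string[]> = {
  video: ["mp4", "webm", "mov"],
  audio: ["mp3", "wav", "flac"],
  image: ["png", "jpg", "jpeg", "webp", "gif"], // "jpeg" is an assumed alias for JPG
  pdf: ["pdf"],
};

// Returns the media category for a path, or undefined if the extension is not listed.
function classifyMedia(path: string): MediaKind | undefined {
  const dot = path.lastIndexOf(".");
  if (dot < 0 || dot === path.length - 1) return undefined;
  const ext = path.slice(dot + 1).toLowerCase();
  for (const kind of Object.keys(EXTENSIONS) as MediaKind[]) {
    if (EXTENSIONS[kind].includes(ext)) return kind;
  }
  return undefined;
}
```

The lookup is case-insensitive, so `clip.MP4` and `clip.mp4` are classified the same way. A path with an unlisted extension yields `undefined` rather than a guess.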

Pre-Flight Validation

Before sending files to Gemini:

Verify files exist: Use Glob or Read to confirm all paths are valid
Check file sizes: Very large files (>1GB video) may need segmenting
Confirm file type: Verify the extension matches expected content
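
The three checks above could be sketched as a standalone Node.js function. The skill itself relies on the Glob and Read tools, so this is a stand-in rather than the skill's own mechanism. `preflight` and `PreflightResult` are hypothetical names, and reading "1GB" as 1024³ bytes is an assumption. The extension check only confirms the extension is one of the listed types and does not inspect file contents.

```typescript
import { existsSync, statSync } from "node:fs";
import { extname } from "node:path";

// Assumption: "1GB" is read as 1024^3 bytes.
const ONE_GB = 1024 ** 3;

const VIDEO_EXTENSIONS = new Set([".mp4", ".webm", ".mov"]);
const OTHER_EXTENSIONS = new Set([
  ".mp3", ".wav", ".flac",         // audio
  ".png", ".jpg", ".webp", ".gif", // images
  ".pdf",
]);

// Hypothetical result shape for illustration.
interface PreflightResult {
  path: string;
  ok: boolean;
  problems: string[];
}

function preflight(path: string): PreflightResult {
  // 1. Verify the file exists.
  if (!existsSync(path)) {
    return { path, ok: false, problems: ["file not found"] };
  }
  const problems: string[] = [];
  const stats = statSync(path);
  if (!stats.isFile()) problems.push("not a regular file");

  // 2. Confirm the extension is one of the supported types.
  const ext = extname(path).toLowerCase();
  const isVideo = VIDEO_EXTENSIONS.has(ext);
  if (!isVideo && !OTHER_EXTENSIONS.has(ext)) {
    problems.push(`unrecognized extension "${ext}"`);
  }

  // 3. Flag very large videos that may need segmenting.
  if (isVideo && stats.size > ONE_GB) {
    problems.push("video larger than 1 GB; consider segmenting");
  }
  return { path, ok: problems.length === 0, problems };
}
```

Collecting problems in a list, instead of throwing on the first one, lets a caller report every issue for every file before anything is sent.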

Parameter Selection by Media Type

Media	Model	Timeout	Notes
Video (long)	pro	2400000	Complex temporal analysis
Video (short)	flash	300000	Quick extraction
Audio (long)	pro	2400000	Full transcription
Audio (short)	flash	300000	Quick transcription
Images	flash	300000	Most image tasks are fast
Complex diagrams	pro	300000	Architecture, flowcharts
PDFs (long)	pro	2400000	Multi-page analysis
PDFs (short)	flash	300000	Quick extraction
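
The table can be encoded as a small lookup function. In this minimal sketch, `selectParams` and its types are hypothetical names. The source does not define where "short" ends and "long" begins, so the caller passes that decision in. The timeout values are copied from the table; their unit is presumably milliseconds, which the source does not state.

```typescript
// Hypothetical names for illustration; values are copied from the table above.
type Media = "video" | "audio" | "image" | "diagram" | "pdf";
type Model = "pro" | "flash";

interface GeminiParams {
  model: Model;
  timeout: number; // presumably milliseconds (assumption; the table gives no unit)
}

const LONG_TIMEOUT = 2400000;
const SHORT_TIMEOUT = 300000;

// isLong is supplied by the caller because the table does not define a threshold.
function selectParams(media: Media, isLong: boolean = false): GeminiParams {
  if (media === "image") {
    return { model: "flash", timeout: SHORT_TIMEOUT };
  }
  if (media === "diagram") {
    // Complex diagrams use the stronger model but keep the short timeout.
    return { model: "pro", timeout: SHORT_TIMEOUT };
  }
  // Video, audio, and PDFs share the same long/short split.
  return isLong
    ? { model: "pro", timeout: LONG_TIMEOUT }
    : { model: "flash", timeout: SHORT_TIMEOUT };
}
```

For example, `selectParams("video", true)` yields the `pro` model with the 2400000 timeout, while `selectParams("image")` yields `flash` with 300000.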

Output Structure by Media Type

Video

Include timestamps: "At 2:34, the speaker discusses..."
Reference visual elements: "The diagram shown at 5:12 illustrates..."
For long videos, provide a timeline summary first, then details

Audio

Include timestamps for key moments
Attribute speakers when possible: "Speaker A (likely the interviewer)..."
Note audio quality issues that may affect accuracy

Images

Use spatial references: "In the top-right corner...", "The second row..."
For diagrams, describe the structure before details
For screenshots, identify UI elements and their state

PDFs

Reference page numbers: "On page 3, section 2.1..."
For tables, describe structure and key data points
For forms, list fields and their values

Combining with Code Context

Multimodal analysis often feeds into code work:

Screenshot → identify UI components → generate code
Architecture diagram → map to file structure → verify alignment
Error screenshot → identify the error → find relevant code
PDF spec → extract requirements → plan implementation
