GPT-5/4o, Claude 4.5, Gemini 2.5/3, Grok 4 vision patterns for image analysis, document understanding, and visual QA. Use when implementing image captioning, document/chart analysis, or multi-image comparison.
Analyzes images, documents, and charts using leading vision-language models for visual QA.
```
/plugin marketplace add yonatangross/skillforge-claude-plugin
/plugin install skillforge-complete@skillforge
```

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Bundled resources: checklists/implementation.md, references/cost-optimization.md, references/document-vision.md, references/image-captioning.md

Integrate vision capabilities from leading multimodal models for image understanding, document analysis, and visual reasoning.
| Model | Context | Strengths | Vision Input |
|---|---|---|---|
| GPT-5.2 | 128K | Best general reasoning, multimodal | Up to 10 images |
| Claude Opus 4.5 | 200K | Best coding, sustained agent tasks | Up to 100 images |
| Gemini 2.5 Pro | 1M+ | Longest context, video analysis | 3,600 images max |
| Gemini 3 Pro | 1M | Deep Think, 100% AIME 2025 | Enhanced segmentation |
| Grok 4 | 2M | Real-time X integration, DeepSearch | Images + upcoming video |
```python
import base64
import mimetypes

def encode_image_base64(image_path: str) -> tuple[str, str]:
    """Encode local image to base64 with MIME type."""
    mime_type, _ = mimetypes.guess_type(image_path)
    mime_type = mime_type or "image/png"
    with open(image_path, "rb") as f:
        base64_data = base64.standard_b64encode(f.read()).decode("utf-8")
    return base64_data, mime_type
```
```python
from openai import OpenAI

client = OpenAI()

def analyze_image_openai(image_path: str, prompt: str) -> str:
    """Analyze image using GPT-5 or GPT-4o."""
    base64_data, mime_type = encode_image_base64(image_path)
    response = client.chat.completions.create(
        model="gpt-5",  # or "gpt-4o", "gpt-4.1"
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:{mime_type};base64,{base64_data}",
                    "detail": "high"  # low, high, or auto
                }}
            ]
        }],
        max_tokens=4096  # set generously; vision responses truncate at this limit
    )
    return response.choices[0].message.content
```
```python
import anthropic

client = anthropic.Anthropic()

def analyze_image_claude(image_path: str, prompt: str) -> str:
    """Analyze image using Claude Opus 4.5 or Sonnet 4.5."""
    base64_data, media_type = encode_image_base64(image_path)
    response = client.messages.create(
        model="claude-opus-4-5-20251124",  # or claude-sonnet-4-5
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": base64_data
                    }
                },
                {"type": "text", "text": prompt}
            ]
        }]
    )
    return response.content[0].text
```
```python
import time
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")

def analyze_image_gemini(image_path: str, prompt: str) -> str:
    """Analyze image using Gemini 2.5 Pro or Gemini 3."""
    model = genai.GenerativeModel("gemini-2.5-pro")  # or gemini-3-pro
    image = Image.open(image_path)
    response = model.generate_content([prompt, image])
    return response.text

# For video analysis (Gemini excels here)
def analyze_video_gemini(video_path: str, prompt: str) -> str:
    """Analyze video using Gemini's native video support."""
    model = genai.GenerativeModel("gemini-2.5-pro")
    video_file = genai.upload_file(video_path)
    while video_file.state.name == "PROCESSING":  # video uploads need processing time
        time.sleep(2)
        video_file = genai.get_file(video_file.name)
    response = model.generate_content([prompt, video_file])
    return response.text
```
```python
from openai import OpenAI  # Grok uses an OpenAI-compatible API

client = OpenAI(
    api_key="YOUR_XAI_API_KEY",
    base_url="https://api.x.ai/v1"
)

def analyze_image_grok(image_path: str, prompt: str) -> str:
    """Analyze image using Grok 4 with real-time capabilities."""
    base64_data, mime_type = encode_image_base64(image_path)
    response = client.chat.completions.create(
        model="grok-4",  # or grok-2-vision-1212
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:{mime_type};base64,{base64_data}"
                }}
            ]
        }]
    )
    return response.choices[0].message.content
```
```python
def compare_images(images: list[str], prompt: str) -> str:
    """Compare multiple images (Claude supports up to 100)."""
    content = []
    for img_path in images:
        base64_data, media_type = encode_image_base64(img_path)
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64_data
            }
        })
    content.append({"type": "text", "text": prompt})
    response = client.messages.create(  # client = anthropic.Anthropic()
        model="claude-opus-4-5-20251124",
        max_tokens=8192,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text
```
```python
import json

def detect_objects_gemini(image_path: str) -> list[dict]:
    """Detect objects with bounding boxes using Gemini 2.5+."""
    model = genai.GenerativeModel("gemini-2.5-pro")
    image = Image.open(image_path)
    response = model.generate_content(
        [
            "Detect all objects in this image. Return bounding boxes "
            "as JSON with format: {objects: [{label, box: [x1,y1,x2,y2]}]}",
            image,
        ],
        generation_config={"response_mime_type": "application/json"},  # raw JSON, no markdown fences
    )
    return json.loads(response.text)["objects"]
```
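To eyeball the results, the returned boxes can be drawn back onto the image with Pillow's ImageDraw. A minimal sketch, assuming pixel-space [x1, y1, x2, y2] boxes as prompted above (Gemini's native detection convention normalizes coordinates to a 0-1000 grid, so rescale first if the model follows that instead); the function name and file names are illustrative:

```python
from PIL import Image, ImageDraw

def draw_detections(image_path: str, objects: list[dict], out_path: str) -> None:
    """Draw labeled bounding boxes onto a copy of the image."""
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for obj in objects:
        x1, y1, x2, y2 = obj["box"]  # assumes pixel coordinates; rescale if normalized
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1, max(0, y1 - 12)), obj["label"], fill="red")
    image.save(out_path)

# Usage (hypothetical file names):
# draw_detections("scene.jpg", detect_objects_gemini("scene.jpg"), "scene_boxes.jpg")
```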
| Provider | Detail / Token Cost | Guidance |
|---|---|---|
| OpenAI | low (65 tokens) | Use for classification |
| OpenAI | high (129+ tokens/tile) | Use for OCR/charts |
| Gemini | 258 tokens base | Scales with resolution |
| Claude | Per-image pricing | Batch for efficiency |
```python
# Cost-optimized simple classification (client = OpenAI())
image_url = "https://example.com/photo.jpg"  # any https or data: URL

response = client.chat.completions.create(
    model="gpt-4o-mini",  # cheaper for simple tasks
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Is there a person? Reply: yes/no"},
            {"type": "image_url", "image_url": {
                "url": image_url,
                "detail": "low"  # minimal tokens
            }}
        ]
    }]
)
```
| Provider | Max Size | Max Images | Notes |
|---|---|---|---|
| OpenAI | 20MB | 10/request | GPT-5 series |
| Claude | 8000x8000 px | 100/request | 2000px if >20 images |
| Gemini | 20MB | 3,600/request | Best for batch |
| Grok | 20MB | Limited | Grok 5 expands this |
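To stay inside these caps, downscale and re-encode before upload. A minimal sketch with Pillow, assuming the 20MB / 8000px limits above (drop max_px to 2000 when sending Claude more than 20 images); the function name and defaults are illustrative, not from any provider SDK:

```python
import io
from PIL import Image

def preflight_image(image_path: str, max_px: int = 8000,
                    max_bytes: int = 20 * 1024 * 1024) -> bytes:
    """Downscale and re-encode an image so it fits provider limits."""
    image = Image.open(image_path)
    image.thumbnail((max_px, max_px))  # shrinks in place, preserves aspect ratio
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG", quality=85)
    data = buf.getvalue()
    if len(data) > max_bytes:
        raise ValueError(f"still {len(data)} bytes after resize; lower quality or max_px")
    return data
```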
| Decision | Recommendation |
|---|---|
| High accuracy | Claude Opus 4.5 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost efficiency | Gemini 2.5 Flash ($0.15/M tokens) |
| Real-time/X data | Grok 4 with DeepSearch |
| Video analysis | Gemini 2.5/3 Pro (native) |
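These rules can be wired into a small dispatcher over the provider functions defined earlier. A minimal sketch, where the task labels are an assumed taxonomy, not part of any API:

```python
def analyze(image_path: str, prompt: str, task: str = "general") -> str:
    """Route to a provider per the decision table above."""
    if task in ("high_accuracy", "ocr", "chart"):
        return analyze_image_claude(image_path, prompt)   # Claude Opus 4.5
    if task in ("long_document", "batch", "cheap"):
        return analyze_image_gemini(image_path, prompt)   # Gemini 2.5 Pro/Flash
    if task == "realtime":
        return analyze_image_grok(image_path, prompt)     # Grok 4 + DeepSearch
    return analyze_image_openai(image_path, prompt)       # GPT-5 default
```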
Common pitfalls:
- Forgetting max_tokens (responses truncated)
- Using high detail for yes/no questions

Related skills:
- audio-language-models - Audio/speech processing
- multimodal-rag - Image + text retrieval
- llm-streaming - Streaming vision responses

Trigger keywords by use case:
- Captioning: caption, describe, image description, alt text, accessibility
- Visual QA: VQA, visual question, image question, analyze image
- Document analysis: document, PDF, chart, diagram, OCR, extract, table
- Multi-image: compare images, multiple images, image comparison, batch
- Object detection: bounding box, detect objects, locate, segmentation