npx claudepluginhub yonatangross/orchestkit --plugin ork
Vision, audio, video generation, and multimodal LLM integration patterns. Use when processing images, transcribing audio, generating speech, generating AI video (Kling, Sora, Veo, Runway), or building multimodal AI pipelines.
This skill is limited to using the following tools:
Skill files:
- rules/_sections.md
- rules/_template.md
- rules/audio-models.md
- rules/audio-speech-to-text.md
- rules/audio-text-to-speech.md
- rules/video-generation-models.md
- rules/video-generation-patterns.md
- rules/video-multi-shot.md
- rules/vision-document.md
- rules/vision-image-analysis.md
- rules/vision-models.md
- test-cases.json

Multimodal LLM Patterns
Integrate vision, audio, and video generation capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, text-to-speech, and AI video generation (Kling 3.0, Sora 2, Veo 3.1, Runway Gen-4.5).
Quick Reference
| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Vision: Image Analysis | 1 | HIGH | Image captioning, VQA, multi-image comparison, object detection |
| Vision: Document Understanding | 1 | HIGH | OCR, chart/diagram analysis, PDF processing, table extraction |
| Vision: Model Selection | 1 | MEDIUM | Choosing provider, cost optimization, image size limits |
| Audio: Speech-to-Text | 1 | HIGH | Transcription, speaker diarization, long-form audio |
| Audio: Text-to-Speech | 1 | MEDIUM | Voice synthesis, expressive TTS, multi-speaker dialogue |
| Audio: Model Selection | 1 | MEDIUM | Real-time voice agents, provider comparison, pricing |
| Video: Model Selection | 1 | HIGH | Choosing video gen provider (Kling, Sora, Veo, Runway) |
| Video: API Patterns | 1 | HIGH | Async task polling, SDK integration, webhook callbacks |
| Video: Multi-Shot | 1 | HIGH | Storyboarding, character elements, scene consistency |
Total: 9 rules across 3 categories (Vision, Audio, Video Generation)
Vision: Image Analysis
Send images to multimodal LLMs for captioning, visual QA, and object detection. Always set max_tokens and resize images before encoding.
| Rule | File | Key Pattern |
|---|---|---|
| Image Analysis | rules/vision-image-analysis.md | Base64 encoding, multi-image, bounding boxes |
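The advice above (cap the response with max_tokens, shrink the image before encoding) can be sketched with the standard library alone. The helper names `fit_within` and `build_image_request` are hypothetical, the 2048px default comes from the Common Mistakes list in this document, and the actual pixel resize is left to whatever imaging library you already use.

```python
import base64


def fit_within(width: int, height: int, max_side: int = 2048) -> tuple[int, int]:
    """Return dimensions scaled down so the longer side is at most max_side.

    Aspect ratio is preserved; images that are already small enough are unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Round to whole pixels, but never collapse a side to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


def build_image_request(image_bytes: bytes, prompt: str,
                        media_type: str = "image/png",
                        max_tokens: int = 1024) -> dict:
    """Build keyword arguments for a messages-style vision call.

    The payload shape mirrors the Example section of this document.
    max_tokens is always present so the response budget is explicit.
    """
    b64 = base64.standard_b64encode(image_bytes).decode("utf-8")
    return {
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": [
            {"type": "image", "source": {"type": "base64",
                                         "media_type": media_type,
                                         "data": b64}},
            {"type": "text", "text": prompt},
        ]}],
    }
```

Compute the target size with `fit_within` first, resize the image to that size, and only then pass the resized bytes to `build_image_request`, so the base64 payload is built from the smaller file.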
Vision: Document Understanding
Extract structured data from documents, charts, and PDFs using vision models.
| Rule | File | Key Pattern |
|---|---|---|
| Document Vision | rules/vision-document.md | PDF page ranges, detail levels, OCR strategies |
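One way to apply detail levels is to choose them per task instead of defaulting to the most expensive setting. The sketch below assumes an OpenAI-style `image_url` content part with a `detail` field; the task names and the `choose_detail` helper are hypothetical, and other providers expose this setting differently or not at all.

```python
# Hypothetical task categories; adjust these sets to your own workload.
LOW_DETAIL_TASKS = {"yes_no_classification", "thumbnail_tagging"}
HIGH_DETAIL_TASKS = {"ocr", "table_extraction", "chart_analysis"}


def choose_detail(task: str) -> str:
    """Map a task name to a detail level, falling back to 'auto'."""
    if task in LOW_DETAIL_TASKS:
        return "low"
    if task in HIGH_DETAIL_TASKS:
        return "high"
    return "auto"


def image_part(data_url: str, task: str) -> dict:
    """Build an OpenAI-style image content part with an explicit detail level."""
    return {
        "type": "image_url",
        "image_url": {"url": data_url, "detail": choose_detail(task)},
    }
```

Keeping the mapping in one place makes it easy to audit which tasks pay for high detail.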
Vision: Model Selection
Choose the right vision provider based on accuracy, cost, and context window needs.
| Rule | File | Key Pattern |
|---|---|---|
| Vision Models | rules/vision-models.md | Provider comparison, token costs, image limits |
Audio: Speech-to-Text
Convert audio to text with speaker diarization, timestamps, and sentiment analysis.
| Rule | File | Key Pattern |
|---|---|---|
| Speech-to-Text | rules/audio-speech-to-text.md | Gemini long-form, GPT-4o-Transcribe, AssemblyAI features |
Audio: Text-to-Speech
Generate natural speech from text with voice selection and expressive cues.
| Rule | File | Key Pattern |
|---|---|---|
| Text-to-Speech | rules/audio-text-to-speech.md | Gemini TTS, voice config, auditory cues |
Audio: Model Selection
Select the right audio/voice provider for real-time, transcription, or TTS use cases.
| Rule | File | Key Pattern |
|---|---|---|
| Audio Models | rules/audio-models.md | Real-time voice comparison, STT benchmarks, pricing |
Video: Model Selection
Choose the right video generation provider based on use case, duration, and budget.
| Rule | File | Key Pattern |
|---|---|---|
| Video Models | rules/video-generation-models.md | Kling vs Sora vs Veo vs Runway, pricing, capabilities |
Video: API Patterns
Integrate video generation APIs with proper async polling, SDKs, and webhook callbacks.
| Rule | File | Key Pattern |
|---|---|---|
| API Integration | rules/video-generation-patterns.md | Kling REST, fal.ai SDK, Vercel AI SDK, task polling |
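The async polling pattern can be sketched independently of any provider. In this minimal version, `fetch_status` is a caller-supplied function (hypothetical) that returns a dict with a `status` key, and the status names `succeeded` and `failed` are assumptions; substitute the values your provider actually returns.

```python
import time
from typing import Callable


def poll_until_done(fetch_status: Callable[[], dict],
                    interval: float = 5.0,
                    max_interval: float = 60.0,
                    timeout: float = 900.0,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll an async task until it reaches a terminal state.

    Returns the final task dict on success, raises RuntimeError on failure,
    and raises TimeoutError once the accumulated wait reaches the timeout.
    """
    waited = 0.0
    while True:
        task = fetch_status()
        if task["status"] == "succeeded":
            return task
        if task["status"] == "failed":
            raise RuntimeError(f"video task failed: {task.get('error')}")
        if waited >= timeout:
            raise TimeoutError(f"task still {task['status']} after {waited:.0f}s")
        sleep(interval)
        waited += interval
        # Back off gradually so long renders do not hammer the API.
        interval = min(interval * 1.5, max_interval)
```

The `sleep` parameter is injectable so the loop can be tested without waiting. Where the provider supports webhook callbacks, those avoid polling altogether.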
Video: Multi-Shot
Generate multi-scene videos with consistent characters using storyboarding and character elements.
| Rule | File | Key Pattern |
|---|---|---|
| Multi-Shot | rules/video-multi-shot.md | Kling 3.0 character elements, 6-shot storyboards, identity binding |
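Before submitting a multi-shot job, it can help to validate the storyboard locally. The sketch below is a plain data structure, not any provider's request schema: the class names, field names, and validation rules are all hypothetical, and only the 6-shot limit comes from the table above.

```python
from dataclasses import dataclass, field

MAX_SHOTS = 6  # matches the 6-shot storyboards mentioned above


@dataclass
class Shot:
    prompt: str
    character_ids: list[str] = field(default_factory=list)


@dataclass
class Storyboard:
    characters: dict[str, str]  # character id -> reference image path or URL
    shots: list[Shot]

    def validate(self) -> None:
        """Raise ValueError if the shot count or a character reference is invalid."""
        if not 1 <= len(self.shots) <= MAX_SHOTS:
            raise ValueError(f"expected 1-{MAX_SHOTS} shots, got {len(self.shots)}")
        for index, shot in enumerate(self.shots, start=1):
            unknown = [c for c in shot.character_ids if c not in self.characters]
            if unknown:
                raise ValueError(
                    f"shot {index} references undefined characters: {unknown}")
```

Declaring each character once and referencing it by id in every shot is what keeps identity consistent across scenes; the check catches a shot that names a character with no reference attached.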
Key Decisions
| Decision | Recommendation |
|---|---|
| High accuracy vision | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost-efficient vision | Gemini 2.5 Flash ($0.15/M tokens) |
| Video analysis | Gemini 2.5/3 Pro (native video) |
| Voice assistant | Grok Voice Agent (fastest, <1s) |
| Emotional voice AI | Gemini Live API |
| Long audio transcription | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Self-hosted STT | Whisper Large V3 |
| Character-consistent video | Kling 3.0 (Character Elements 3.0) |
| Narrative video / storytelling | Sora 2 (best cause-and-effect coherence) |
| Cinematic B-roll | Veo 3.1 (camera control + polished motion) |
| Professional VFX | Runway Gen-4.5 (Act-Two motion transfer) |
| High-volume social video | Kling 3.0 Standard ($0.20/video) |
| Open-source video gen | Wan 2.6 or LTX-2 |
| Lip-sync / avatar video | Kling 3.0 (native lip-sync API) |
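A table like this can be carried into code as a small lookup, so routing decisions live in one place. The sketch below copies a few rows from the table; the dictionary keys and the `recommend` helper are hypothetical, and the values are display names, not API model identifiers.

```python
# A subset of the Key Decisions table; keys are hypothetical labels.
RECOMMENDED = {
    "long_documents": "Gemini 2.5 Pro",
    "cost_efficient_vision": "Gemini 2.5 Flash",
    "self_hosted_stt": "Whisper Large V3",
    "character_consistent_video": "Kling 3.0",
    "narrative_video": "Sora 2",
    "cinematic_b_roll": "Veo 3.1",
    "professional_vfx": "Runway Gen-4.5",
}


def recommend(use_case: str) -> str:
    """Return the recommended model name for a use case, or raise ValueError."""
    try:
        return RECOMMENDED[use_case]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDED))
        raise ValueError(f"unknown use case {use_case!r}; known: {known}") from None
```

Failing loudly on an unknown use case is a deliberate choice here: a silent default would hide typos in the routing key.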
Example
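```python
import base64

import anthropic

# Reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

# Base64-encode the image for an inline image content block.
with open("image.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,  # always cap the response length on vision requests
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        {"type": "text", "text": "Describe this image"},
    ]}],
)
print(response.content[0].text)
```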
Common Mistakes
- Not setting max_tokens on vision requests (responses truncated)
- Sending oversized images without resizing (>2048px)
- Using high detail level for simple yes/no classification
- Using STT+LLM+TTS pipeline instead of native speech-to-speech
- Not leveraging barge-in support for natural voice conversations
- Using deprecated models (GPT-4V, Whisper-1)
- Ignoring rate limits on vision and audio endpoints
- Calling video generation APIs synchronously (they're async — poll or use callbacks)
- Generating separate clips without character elements (characters look different each time)
- Using Sora for high-volume social content (expensive, slow — use Kling Standard instead)
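For the rate-limit item in the list above, a common remedy is retrying with exponential backoff and jitter. This is a minimal sketch: `RateLimited` is a hypothetical stand-in for whatever rate-limit exception your provider SDK raises, and the delay values are illustrative defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for a provider SDK's rate-limit error."""


def with_backoff(call: Callable[[], T], max_attempts: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry call() on RateLimited with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # 1x, 2x, 4x ... the base delay, plus up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

Wrap the vision or audio request in a zero-argument function or lambda and pass it as `call`; if the provider returns a retry-after hint, prefer that over the computed delay.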
Related Skills
- ork:rag-retrieval - Multimodal RAG with image + text retrieval
- ork:llm-integration - General LLM function calling patterns
- streaming-api-patterns - WebSocket patterns for real-time audio
- ork:demo-producer - Terminal demo videos (VHS, asciinema), not AI video gen