Skill

multimodal-llm

Install

Install the plugin:

npx claudepluginhub yonatangross/orchestkit --plugin ork

Description

Vision, audio, video generation, and multimodal LLM integration patterns. Use when processing images, transcribing audio, generating speech, generating AI video (Kling, Sora, Veo, Runway), or building multimodal AI pipelines.

Tool Access

This skill is limited to using the following tools:

Read, Glob, Grep, WebFetch, WebSearch

Supporting Assets
rules/_sections.md
rules/_template.md
rules/audio-models.md
rules/audio-speech-to-text.md
rules/audio-text-to-speech.md
rules/video-generation-models.md
rules/video-generation-patterns.md
rules/video-multi-shot.md
rules/vision-document.md
rules/vision-image-analysis.md
rules/vision-models.md
test-cases.json
Skill Content

Multimodal LLM Patterns

Integrate vision, audio, and video generation capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, text-to-speech, and AI video generation (Kling 3.0, Sora 2, Veo 3.1, Runway Gen-4.5).

Quick Reference

| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Vision: Image Analysis | 1 | HIGH | Image captioning, VQA, multi-image comparison, object detection |
| Vision: Document Understanding | 1 | HIGH | OCR, chart/diagram analysis, PDF processing, table extraction |
| Vision: Model Selection | 1 | MEDIUM | Choosing provider, cost optimization, image size limits |
| Audio: Speech-to-Text | 1 | HIGH | Transcription, speaker diarization, long-form audio |
| Audio: Text-to-Speech | 1 | MEDIUM | Voice synthesis, expressive TTS, multi-speaker dialogue |
| Audio: Model Selection | 1 | MEDIUM | Real-time voice agents, provider comparison, pricing |
| Video: Model Selection | 1 | HIGH | Choosing video gen provider (Kling, Sora, Veo, Runway) |
| Video: API Patterns | 1 | HIGH | Async task polling, SDK integration, webhook callbacks |
| Video: Multi-Shot | 1 | HIGH | Storyboarding, character elements, scene consistency |

Total: 9 rules across 3 categories (Vision, Audio, Video Generation)

Vision: Image Analysis

Send images to multimodal LLMs for captioning, visual QA, and object detection. Always set max_tokens and resize images before encoding.

| Rule | File | Key Pattern |
|---|---|---|
| Image Analysis | rules/vision-image-analysis.md | Base64 encoding, multi-image, bounding boxes |
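
The advice to resize before encoding can be sketched as a small helper that computes target dimensions. This is a minimal sketch, not the rule file's implementation: `fit_within` is a hypothetical name, and the 2048 px default is taken from the Common Mistakes list rather than from any provider's documented limit.

```python
def fit_within(width: int, height: int, max_side: int = 2048) -> tuple[int, int]:
    """Return (width, height) scaled down so the longer side is <= max_side.

    Aspect ratio is preserved and images are never upscaled.
    The 2048 default is an assumption, not a provider-documented limit.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough
    scale = max_side / longest
    # Clamp to 1 so extreme aspect ratios never produce a zero dimension.
    return max(1, round(width * scale)), max(1, round(height * scale))


new_size = fit_within(4096, 2048)
```

The resulting tuple can be passed to an image library's resize call (for example Pillow's `Image.resize`) before the bytes are base64-encoded.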

Vision: Document Understanding

Extract structured data from documents, charts, and PDFs using vision models.

| Rule | File | Key Pattern |
|---|---|---|
| Document Vision | rules/vision-document.md | PDF page ranges, detail levels, OCR strategies |
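
One way to apply the "PDF page ranges" pattern is to split a long document into inclusive page ranges and send each range as a separate request. The sketch below is illustrative only: `page_ranges` and the batch size are hypothetical, and the rule file may describe a different strategy.

```python
def page_ranges(total_pages: int, batch_size: int) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (start, end) ranges."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        (start, min(start + batch_size - 1, total_pages))
        for start in range(1, total_pages + 1, batch_size)
    ]


# A 45-page PDF processed 20 pages at a time (batch size is an assumption).
batches = page_ranges(45, 20)
```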

Vision: Model Selection

Choose the right vision provider based on accuracy, cost, and context window needs.

| Rule | File | Key Pattern |
|---|---|---|
| Vision Models | rules/vision-models.md | Provider comparison, token costs, image limits |

Audio: Speech-to-Text

Convert audio to text with speaker diarization, timestamps, and sentiment analysis.

| Rule | File | Key Pattern |
|---|---|---|
| Speech-to-Text | rules/audio-speech-to-text.md | Gemini long-form, GPT-4o-Transcribe, AssemblyAI features |

Audio: Text-to-Speech

Generate natural speech from text with voice selection and expressive cues.

| Rule | File | Key Pattern |
|---|---|---|
| Text-to-Speech | rules/audio-text-to-speech.md | Gemini TTS, voice config, auditory cues |

Audio: Model Selection

Select the right audio/voice provider for real-time, transcription, or TTS use cases.

| Rule | File | Key Pattern |
|---|---|---|
| Audio Models | rules/audio-models.md | Real-time voice comparison, STT benchmarks, pricing |

Video: Model Selection

Choose the right video generation provider based on use case, duration, and budget.

| Rule | File | Key Pattern |
|---|---|---|
| Video Models | rules/video-generation-models.md | Kling vs Sora vs Veo vs Runway, pricing, capabilities |

Video: API Patterns

Integrate video generation APIs with proper async polling, SDKs, and webhook callbacks.

| Rule | File | Key Pattern |
|---|---|---|
| API Integration | rules/video-generation-patterns.md | Kling REST, fal.ai SDK, Vercel AI SDK, task polling |
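
The async polling pattern can be sketched independently of any provider. In this sketch everything is an assumption: `poll_until_done`, the `status` field, and the `"succeeded"` / `"failed"` values are hypothetical and do not reflect the Kling, fal.ai, or Vercel AI SDK response shapes. The caller supplies `get_status`, which would wrap the real status request.

```python
import time
from typing import Callable


def poll_until_done(
    get_status: Callable[[], dict],
    interval_s: float = 5.0,
    max_interval_s: float = 30.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll a task until it succeeds, fails, or the wait budget runs out.

    The status values below are hypothetical; map them to the real
    provider's task states. The timeout counts time spent sleeping only.
    """
    waited = 0.0
    wait = interval_s
    while True:
        task = get_status()
        state = task.get("status")
        if state == "succeeded":
            return task
        if state == "failed":
            raise RuntimeError(f"video task failed: {task.get('error', 'unknown error')}")
        if waited >= timeout_s:
            raise TimeoutError(f"task still '{state}' after {waited:.0f}s of waiting")
        sleep(wait)
        waited += wait
        wait = min(wait * 2, max_interval_s)  # exponential backoff with a cap
```

Injecting `sleep` keeps the loop testable without real delays. Where the provider supports webhook callbacks, those avoid polling entirely.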

Video: Multi-Shot

Generate multi-scene videos with consistent characters using storyboarding and character elements.

| Rule | File | Key Pattern |
|---|---|---|
| Multi-Shot | rules/video-multi-shot.md | Kling 3.0 character elements, 6-shot storyboards, identity binding |
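
A storyboard with shared character references can be modeled as plain data and validated before any generation request is made. This is a hypothetical sketch: `Shot`, `Storyboard`, and the field names are invented for illustration and are not the Kling API schema, and treating six shots as an upper bound is an assumption based on the "6-shot storyboards" note above.

```python
from dataclasses import dataclass, field

MAX_SHOTS = 6  # assumed upper bound, from the "6-shot storyboards" note


@dataclass(frozen=True)
class Shot:
    prompt: str
    character_ids: tuple[str, ...] = ()  # characters that appear in this shot


@dataclass
class Storyboard:
    # character id -> reference image path (hypothetical shape)
    characters: dict[str, str]
    shots: list[Shot] = field(default_factory=list)

    def validate(self) -> None:
        """Check the shot count and that every shot uses defined characters."""
        if not 1 <= len(self.shots) <= MAX_SHOTS:
            raise ValueError(f"expected 1 to {MAX_SHOTS} shots, got {len(self.shots)}")
        for i, shot in enumerate(self.shots, start=1):
            unknown = [c for c in shot.character_ids if c not in self.characters]
            if unknown:
                raise ValueError(f"shot {i} references undefined characters: {unknown}")


board = Storyboard(
    characters={"hero": "refs/hero.png"},
    shots=[Shot("Hero enters the room", ("hero",)), Shot("Wide shot of the city")],
)
board.validate()
```

Defining each character once and referencing it by id in every shot is what keeps the identity consistent across clips.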

Key Decisions

| Decision | Recommendation |
|---|---|
| High accuracy vision | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost-efficient vision | Gemini 2.5 Flash ($0.15/M tokens) |
| Video analysis | Gemini 2.5/3 Pro (native video) |
| Voice assistant | Grok Voice Agent (fastest, <1s) |
| Emotional voice AI | Gemini Live API |
| Long audio transcription | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Self-hosted STT | Whisper Large V3 |
| Character-consistent video | Kling 3.0 (Character Elements 3.0) |
| Narrative video / storytelling | Sora 2 (best cause-and-effect coherence) |
| Cinematic B-roll | Veo 3.1 (camera control + polished motion) |
| Professional VFX | Runway Gen-4.5 (Act-Two motion transfer) |
| High-volume social video | Kling 3.0 Standard ($0.20/video) |
| Open-source video gen | Wan 2.6 or LTX-2 |
| Lip-sync / avatar video | Kling 3.0 (native lip-sync API) |
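
The two prices quoted in the table allow a rough budget estimate. The sketch below uses only those figures and treats them as flat rates, which is an assumption: the table does not say whether $0.15/M applies to input tokens, output tokens, or both, and actual provider pricing may differ. The function names are hypothetical.

```python
# Figures copied from the Key Decisions table; treat them as illustrative.
FLASH_USD_PER_M_TOKENS = 0.15
KLING_STANDARD_USD_PER_VIDEO = 0.20


def vision_cost_usd(tokens: int, usd_per_m_tokens: float = FLASH_USD_PER_M_TOKENS) -> float:
    """Estimated cost of processing `tokens` tokens at a flat per-million rate."""
    return tokens / 1_000_000 * usd_per_m_tokens


def video_batch_cost_usd(videos: int, usd_per_video: float = KLING_STANDARD_USD_PER_VIDEO) -> float:
    """Estimated cost of generating `videos` clips at a flat per-video rate."""
    return videos * usd_per_video


monthly = vision_cost_usd(2_000_000) + video_batch_cost_usd(50)
print(f"estimated spend: ${monthly:.2f}")
```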

Example

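```python
import base64

import anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

# Base64-encode the image bytes for the request body.
with open("image.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,  # always set an explicit cap on vision requests
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        {"type": "text", "text": "Describe this image"}
    ]}]
)

# The reply text is in the first content block.
print(response.content[0].text)
```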

Common Mistakes

  1. Not setting max_tokens on vision requests (responses truncated)
  2. Sending oversized images without resizing (>2048px)
  3. Using high detail level for simple yes/no classification
  4. Using STT+LLM+TTS pipeline instead of native speech-to-speech
  5. Not leveraging barge-in support for natural voice conversations
  6. Using deprecated models (GPT-4V, Whisper-1)
  7. Ignoring rate limits on vision and audio endpoints
  8. Calling video generation APIs synchronously (they are asynchronous, so poll or use callbacks)
  9. Generating separate clips without character elements (characters look different each time)
  10. Using Sora for high-volume social content (expensive and slow; use Kling Standard instead)
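
For mistake 7, a common way to respect rate limits is to retry with exponential backoff and a little jitter. This is a generic sketch under stated assumptions: `RateLimited` is a hypothetical stand-in for whatever HTTP 429 error the provider SDK raises, and `with_backoff` is not part of any library named above.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's rate-limit (HTTP 429) error."""


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on RateLimited with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            delay = base_delay_s * 2 ** attempt
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
    raise AssertionError("unreachable")
```

In real use, `call` would wrap the vision or audio request and translate the SDK's rate-limit exception into `RateLimited`, or the `except` clause would name the SDK's own exception type.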

Related Skills

  • ork:rag-retrieval - Multimodal RAG with image + text retrieval
  • ork:llm-integration - General LLM function calling patterns
  • streaming-api-patterns - WebSocket patterns for real-time audio
  • ork:demo-producer - Terminal demo videos (VHS, asciinema), not AI video generation

Stats

Stars: 128
Forks: 14
Last Commit: Mar 20, 2026