Multi-Modal Prompting Skill

Bonded to: advanced-techniques-agent

Quick Start

Skill("custom-plugin-prompt-engineering:multi-modal")

Parameter Schema

parameters:
  modality:
    type: enum
    values: [vision, document, audio, video]
    required: true

  task_type:
    type: enum
    values: [analysis, extraction, generation, qa]
    default: analysis

  detail_level:
    type: enum
    values: [low, medium, high]
    default: medium

Vision Prompting

Image Analysis Template

Analyze this image and provide:
1. Main subjects and objects
2. Actions or activities
3. Setting and context
4. Notable details
5. Overall interpretation

Be specific and descriptive.

Visual Q&A Pattern

Look at the image carefully.

Question: {question}

Provide a detailed answer based only on what you can see in the image.

Chart/Graph Analysis

Analyze this chart/graph:
1. Type of visualization
2. Axes and labels
3. Key data points
4. Trends or patterns
5. Main insights
6. Limitations or caveats

Document Processing

PDF Extraction

Extract the following from this document:
- Title and headers
- Key information: {fields}
- Tables (if any)
- Important dates/numbers

Output as structured JSON.

Form/Invoice Processing

extraction_schema:
  document_type: "invoice|form|contract"
  fields:
    - name: vendor
      type: string
    - name: date
      type: date
    - name: total
      type: currency
    - name: line_items
      type: array

Audio Integration

Transcription Enhancement

Transcribe and enhance:
1. Accurate transcription
2. Speaker identification
3. Timestamps for key points
4. Summary of main topics
5. Action items (if applicable)

Best Practices

best_practices:
  image_prompts:
    - Be specific about what to look for
    - Request structured output
    - Ask for confidence levels

  document_prompts:
    - Define extraction schema
    - Handle multi-page documents
    - Validate extracted data

  audio_prompts:
    - Specify language if known
    - Request speaker diarization
    - Ask for timestamps

Troubleshooting

Issue	Cause	Solution
Hallucinated details	Over-interpretation	Ask for visible-only info
Missed text in images	Low resolution	Request higher detail
Wrong document parsing	Complex layout	Break into sections
Inaccurate transcription	Audio quality	Acknowledge limitations

References

See: GPT-4V documentation, Claude Vision Guide

multi-modal

Multi-Modal Prompting Skill

Quick Start

Parameter Schema

Vision Prompting

Image Analysis Template

Visual Q&A Pattern

Chart/Graph Analysis

Document Processing

PDF Extraction

Form/Invoice Processing

Audio Integration

Transcription Enhancement

Best Practices

Troubleshooting

References

Similar Skills