Generate synthetic PDF documents for RAG and unstructured data use cases. Use when creating test PDFs, demo documents, or evaluation datasets for retrieval systems.
Generates synthetic PDF documents with evaluation metadata for testing RAG systems and unstructured data pipelines.
/plugin marketplace add https://www.claudepluginhub.com/api/plugins/databricks-solutions-databricks-ai-dev-kit/marketplace.json

/plugin install databricks-solutions-databricks-ai-dev-kit@cpd-databricks-solutions-databricks-ai-dev-kit

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Generate realistic synthetic PDF documents using an LLM for RAG (Retrieval-Augmented Generation) and unstructured data use cases.
This skill uses the generate_pdf_documents MCP tool to create professional PDF documents, each paired with JSON metadata for RAG evaluation.
Use the generate_pdf_documents MCP tool:
catalog: "my_catalog"
schema: "my_schema"
description: "Technical documentation for a cloud infrastructure platform including setup guides, troubleshooting procedures, and API references."
count: 10

This generates 10 PDF documents and saves them to /Volumes/my_catalog/my_schema/raw_data/pdf_documents/ (using the default volume and folder).
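Assuming the run succeeded and the Volume is reachable as a filesystem path (as it is from a Databricks notebook), a quick sanity check is that every PDF has a JSON sidecar. The sketch below is illustrative only; `count_outputs` and the temp-dir demo are not part of the tool:

```python
import pathlib
import tempfile

def count_outputs(folder: str) -> tuple:
    """Count generated PDFs and JSON sidecars in an output folder.
    On Databricks, folder would be e.g. /Volumes/my_catalog/my_schema/raw_data/pdf_documents/."""
    p = pathlib.Path(folder)
    return (len(list(p.glob("*.pdf"))), len(list(p.glob("*.json"))))

# Demo against a temp dir standing in for the Volume path
with tempfile.TemporaryDirectory() as d:
    for i in range(10):
        (pathlib.Path(d) / f"doc_{i:03d}.pdf").touch()
        (pathlib.Path(d) / f"doc_{i:03d}.json").touch()
    print(count_outputs(d))  # (10, 10)
```

For a count of 10, both numbers should be 10; a mismatch means a document was generated without its evaluation metadata.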
Use the generate_pdf_documents MCP tool:
catalog: "my_catalog"
schema: "my_schema"
description: "HR policy documents..."
count: 10
volume: "custom_volume"
folder: "hr_policies"
overwrite_folder: true

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| catalog | string | Yes | - | Unity Catalog name |
| schema | string | Yes | - | Schema name |
| description | string | Yes | - | Detailed description of what the PDFs should contain |
| count | int | Yes | - | Number of PDFs to generate |
| volume | string | No | raw_data | Volume name (created if it does not exist) |
| folder | string | No | pdf_documents | Folder within the volume for output files |
| doc_size | string | No | MEDIUM | Document size: SMALL (~1 page), MEDIUM (~5 pages), LARGE (~10+ pages) |
| overwrite_folder | bool | No | false | If true, deletes existing folder contents first |
For each document, the tool creates two files:
- `<model_id>.pdf`: the generated document
- `<model_id>.json`: metadata for RAG evaluation

Example metadata:

```json
{
  "title": "API Authentication Guide",
  "category": "Technical",
  "pdf_path": "/Volumes/catalog/schema/volume/folder/doc_001.pdf",
  "question": "What authentication methods are supported by the API?",
  "guideline": "Answer should mention OAuth 2.0, API keys, and JWT tokens with their use cases."
}
```
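Before running an evaluation, it can help to verify that each metadata file carries the fields the workflow depends on. A minimal sketch (the `validate_metadata` helper and `REQUIRED_FIELDS` set are assumptions for illustration, not part of the tool):

```python
import json

# Fields the RAG evaluation workflow relies on, per the metadata example above
REQUIRED_FIELDS = {"title", "category", "pdf_path", "question", "guideline"}

def validate_metadata(raw: str) -> dict:
    """Parse one metadata JSON string and check the required fields are present."""
    record = json.loads(raw)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"metadata missing fields: {sorted(missing)}")
    return record

# Example using the metadata shown above
sample = """{
  "title": "API Authentication Guide",
  "category": "Technical",
  "pdf_path": "/Volumes/catalog/schema/volume/folder/doc_001.pdf",
  "question": "What authentication methods are supported by the API?",
  "guideline": "Answer should mention OAuth 2.0, API keys, and JWT tokens with their use cases."
}"""
record = validate_metadata(sample)
print(record["question"])
```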
Use the generate_pdf_documents MCP tool:
catalog: "ai_dev_kit"
schema: "hr_demo"
description: "HR policy documents for a technology company including employee handbook, leave policies, performance review procedures, benefits guide, and workplace conduct guidelines."
count: 15
folder: "hr_policies"
overwrite_folder: true

Use the generate_pdf_documents MCP tool:

catalog: "ai_dev_kit"
schema: "tech_docs"
description: "Technical documentation for a SaaS analytics platform including installation guides, API references, troubleshooting procedures, security best practices, and integration tutorials."
count: 20
folder: "product_docs"
overwrite_folder: true

Use the generate_pdf_documents MCP tool:

catalog: "ai_dev_kit"
schema: "finance_demo"
description: "Financial documents for a retail company including quarterly reports, expense policies, budget guidelines, and audit procedures."
count: 12
folder: "reports"
overwrite_folder: true

Use the generate_pdf_documents MCP tool:

catalog: "ai_dev_kit"
schema: "training"
description: "Training materials for new software developers including onboarding guides, coding standards, code review procedures, and deployment workflows."
count: 8
folder: "courses"
overwrite_folder: true

If the user does not specify a catalog, use the ai_dev_kit catalog and ask the user for a schema name, then call the generate_pdf_documents MCP tool with appropriate parameters.

Best practices:

- **Detailed descriptions**: The more specific your description, the better the generated content
- **Appropriate count**: Choose a count sized to your use case
- **Folder organization**: Use descriptive folder names that indicate content type, e.g. hr_policies/, technical_docs/, training_materials/
- **Use overwrite_folder**: Set to true when regenerating to ensure a clean state
The generated JSON files are designed for RAG evaluation:
- Use the question field to query your RAG system
- Use the guideline field to assess whether the RAG response is correct

Example evaluation workflow:
```python
import glob
import json

# Load question/guideline metadata from the generated JSON files
questions = []
for path in glob.glob(f"/Volumes/{catalog}/{schema}/{volume}/{folder}/*.json"):
    with open(path) as f:
        questions.append(json.load(f))

for q in questions:
    # Query the RAG system under test
    response = rag_system.query(q["question"])
    # Evaluate the response against the guideline
    is_correct = evaluate_response(response, q["guideline"])
```
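The workflow above leaves `evaluate_response` undefined. In practice the guideline is meant for an LLM judge; as a placeholder, a crude keyword check can stand in. Everything in this sketch (the function, its `keywords` parameter, the sample strings) is a hypothetical illustration:

```python
def evaluate_response(response: str, guideline: str, keywords=None) -> bool:
    """Placeholder grader: pass if the response mentions every expected keyword.
    A real setup would hand `guideline` to an LLM judge instead."""
    if keywords is None:
        # Crude fallback: treat capitalized guideline terms as required mentions
        keywords = [w.strip(".,") for w in guideline.split() if w[:1].isupper()]
    text = response.lower()
    return all(k.lower() in text for k in keywords)

ok = evaluate_response(
    "The API supports OAuth 2.0, static API keys, and JWT bearer tokens.",
    "Answer should mention OAuth 2.0, API keys, and JWT tokens with their use cases.",
    keywords=["OAuth", "API keys", "JWT"],
)
print(ok)  # True
```

Keyword matching is only a smoke test; it cannot judge whether the "use cases" part of a guideline was actually addressed.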
The tool requires LLM configuration via environment variables:
```shell
# Databricks Foundation Models (default)
LLM_PROVIDER=DATABRICKS
DATABRICKS_MODEL=databricks-meta-llama-3-3-70b-instruct

# Or Azure OpenAI
LLM_PROVIDER=AZURE
AZURE_OPENAI_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_DEPLOYMENT=gpt-4o
```
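A minimal sketch of how a caller might resolve these variables before invoking the tool. The `resolve_llm_target` function and its behavior are assumptions based on the variables documented above; the tool's actual internal resolution logic may differ:

```python
def resolve_llm_target(env: dict) -> str:
    """Pick the model/deployment name based on LLM_PROVIDER (sketch, not the tool's code)."""
    provider = env.get("LLM_PROVIDER", "DATABRICKS").upper()
    if provider == "DATABRICKS":
        target = env.get("DATABRICKS_MODEL")
    elif provider == "AZURE":
        target = env.get("AZURE_OPENAI_DEPLOYMENT")
    else:
        raise ValueError(f"unknown LLM_PROVIDER: {provider}")
    if not target:
        # Mirrors the "No LLM endpoint configured" error in the troubleshooting table
        raise RuntimeError("No LLM endpoint configured")
    return target

print(resolve_llm_target({
    "LLM_PROVIDER": "DATABRICKS",
    "DATABRICKS_MODEL": "databricks-meta-llama-3-3-70b-instruct",
}))  # databricks-meta-llama-3-3-70b-instruct
```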
| Issue | Solution |
|---|---|
| "No LLM endpoint configured" | Set DATABRICKS_MODEL or AZURE_OPENAI_DEPLOYMENT environment variable |
| "Volume does not exist" | The tool creates volumes automatically; ensure you have CREATE VOLUME permission |
| "PDF generation timeout" | Reduce count or check LLM endpoint availability |
| Low quality content | Provide more detailed description with specific topics and document types |