Generate synthetic PDF documents for RAG and unstructured data use cases. Use when creating test PDFs, demo documents, or evaluation datasets for retrieval systems.
npx claudepluginhub leary-poken/ai-dev-kit --plugin databricks-ai-dev-kit
This skill uses the workspace's default tool permissions.
Generate realistic synthetic PDF documents using LLM for RAG (Retrieval-Augmented Generation) and unstructured data use cases.
This skill uses the generate_pdf_documents MCP tool to create professional PDF documents, each accompanied by a JSON metadata file for RAG evaluation.
Use the generate_pdf_documents MCP tool:
- catalog: "my_catalog"
- schema: "my_schema"
- description: "Technical documentation for a cloud infrastructure platform including setup guides, troubleshooting procedures, and API references."
- count: 10

This generates 10 PDF documents and saves them to /Volumes/my_catalog/my_schema/raw_data/pdf_documents/ (using default volume and folder).
Use the generate_pdf_documents MCP tool:
- catalog: "my_catalog"
- schema: "my_schema"
- description: "HR policy documents..."
- count: 10
- volume: "custom_volume"
- folder: "hr_policies"
- overwrite_folder: true

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| catalog | string | Yes | - | Unity Catalog name |
| schema | string | Yes | - | Schema name |
| description | string | Yes | - | Detailed description of what PDFs should contain |
| count | int | Yes | - | Number of PDFs to generate |
| volume | string | No | raw_data | Volume name (created if not exists) |
| folder | string | No | pdf_documents | Folder within volume for output files |
| doc_size | string | No | MEDIUM | Document size: SMALL (~1 page), MEDIUM (~5 pages), LARGE (~10+ pages) |
| overwrite_folder | bool | No | false | If true, deletes existing folder contents first |
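The parameters and defaults in the table above can be mirrored in a small Python structure, which may help when checking a request before calling the tool. This is only a sketch: the `PdfGenerationRequest` class, its validation rules, and its `output_path` method are hypothetical and not part of the MCP tool. The defaults and the `/Volumes/<catalog>/<schema>/<volume>/<folder>/` layout are taken from this document.

```python
from dataclasses import dataclass

VALID_DOC_SIZES = {"SMALL", "MEDIUM", "LARGE"}


@dataclass
class PdfGenerationRequest:
    """Hypothetical container mirroring the documented tool parameters."""

    catalog: str
    schema: str
    description: str
    count: int
    volume: str = "raw_data"
    folder: str = "pdf_documents"
    doc_size: str = "MEDIUM"
    overwrite_folder: bool = False

    def __post_init__(self) -> None:
        # Assumption: a count below 1 is not meaningful; the tool may behave differently.
        if self.count < 1:
            raise ValueError("count must be a positive integer")
        if self.doc_size not in VALID_DOC_SIZES:
            raise ValueError(f"doc_size must be one of {sorted(VALID_DOC_SIZES)}")

    def output_path(self) -> str:
        """Volume path where the documents are expected to land."""
        return f"/Volumes/{self.catalog}/{self.schema}/{self.volume}/{self.folder}/"


request = PdfGenerationRequest(
    catalog="my_catalog",
    schema="my_schema",
    description="Technical documentation for a cloud infrastructure platform.",
    count=10,
)
print(request.output_path())  # → /Volumes/my_catalog/my_schema/raw_data/pdf_documents/
```

Keeping the defaults in one place makes it easier to see where files will be written when `volume` and `folder` are omitted.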
For each document, the tool creates two files:
- PDF file (<model_id>.pdf): The generated document
- JSON file (<model_id>.json): Metadata for RAG evaluation

{
"title": "API Authentication Guide",
"category": "Technical",
"pdf_path": "/Volumes/catalog/schema/volume/folder/doc_001.pdf",
"question": "What authentication methods are supported by the API?",
"guideline": "Answer should mention OAuth 2.0, API keys, and JWT tokens with their use cases."
}
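Before using a metadata file in an evaluation, it can be worth checking that it carries the fields shown in the example above. The sketch below assumes those five fields are the ones to expect; that list is inferred from the example only, and `missing_fields` is a hypothetical helper, not something the tool provides.

```python
import json

# Assumption: the field list is taken from the example metadata shown above.
REQUIRED_FIELDS = ("title", "category", "pdf_path", "question", "guideline")


def missing_fields(metadata: dict) -> list:
    """Return the expected fields that are absent or empty in a metadata record."""
    return [name for name in REQUIRED_FIELDS if not metadata.get(name)]


sample = json.loads("""
{
  "title": "API Authentication Guide",
  "category": "Technical",
  "pdf_path": "/Volumes/catalog/schema/volume/folder/doc_001.pdf",
  "question": "What authentication methods are supported by the API?",
  "guideline": "Answer should mention OAuth 2.0, API keys, and JWT tokens with their use cases."
}
""")

print(missing_fields(sample))                 # → []
print(missing_fields({"title": "Untitled"}))  # → ['category', 'pdf_path', 'question', 'guideline']
```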
Use the generate_pdf_documents MCP tool:
- catalog: "ai_dev_kit"
- schema: "hr_demo"
- description: "HR policy documents for a technology company including employee handbook, leave policies, performance review procedures, benefits guide, and workplace conduct guidelines."
- count: 15
- folder: "hr_policies"
- overwrite_folder: true

Use the generate_pdf_documents MCP tool:
- catalog: "ai_dev_kit"
- schema: "tech_docs"
- description: "Technical documentation for a SaaS analytics platform including installation guides, API references, troubleshooting procedures, security best practices, and integration tutorials."
- count: 20
- folder: "product_docs"
- overwrite_folder: true

Use the generate_pdf_documents MCP tool:
- catalog: "ai_dev_kit"
- schema: "finance_demo"
- description: "Financial documents for a retail company including quarterly reports, expense policies, budget guidelines, and audit procedures."
- count: 12
- folder: "reports"
- overwrite_folder: true

Use the generate_pdf_documents MCP tool:
- catalog: "ai_dev_kit"
- schema: "training"
- description: "Training materials for new software developers including onboarding guides, coding standards, code review procedures, and deployment workflows."
- count: 8
- folder: "courses"
- overwrite_folder: true

- Use the ai_dev_kit catalog, ask user for schema name
- Use the generate_pdf_documents MCP tool with appropriate parameters

- Detailed descriptions: The more specific your description, the better the generated content
- Appropriate count:
- Folder organization: Use descriptive folder names that indicate content type
  - hr_policies/
  - technical_docs/
  - training_materials/
- Use overwrite_folder: Set to true when regenerating to ensure a clean state
The generated JSON files are designed for RAG evaluation:
- Use the question field to query your RAG system
- Use the guideline field to assess if the RAG response is correct

Example evaluation workflow:
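import glob
import json

# Load questions from JSON files (catalog, schema, volume, folder are your own values)
questions = []
for path in glob.glob(f"/Volumes/{catalog}/{schema}/{volume}/{folder}/*.json"):
    with open(path) as f:
        questions.append(json.load(f))

for q in questions:
    # Query RAG system (rag_system is a placeholder for your own pipeline)
    response = rag_system.query(q["question"])
    # Evaluate using guideline (evaluate_response is a placeholder judge function)
    is_correct = evaluate_response(response, q["guideline"])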
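The workflow above yields one verdict per question; a common next step is to aggregate those verdicts into a single score. The sketch below is one possible way to do that and is not part of the tool: `run_evaluation`, `toy_answer`, and `toy_judge` are hypothetical names, and the toy functions are stand-ins so the example runs on its own.

```python
from typing import Callable, Dict, List


def run_evaluation(
    questions: List[Dict[str, str]],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str], bool],
) -> float:
    """Return the fraction of questions whose answer the judge accepts."""
    if not questions:
        return 0.0
    passed = 0
    for q in questions:
        response = answer_fn(q["question"])
        if judge_fn(response, q["guideline"]):
            passed += 1
    return passed / len(questions)


# Toy stand-in for a RAG system: always returns the same answer.
def toy_answer(question: str) -> str:
    return "The API supports OAuth 2.0, API keys, and JWT tokens."


# Toy stand-in for a judge: checks hard-coded terms taken from the example
# guideline and ignores the guideline text itself. A real judge would
# interpret the guideline, for example with an LLM.
def toy_judge(response: str, guideline: str) -> bool:
    return all(term in response for term in ("OAuth 2.0", "API keys", "JWT"))


questions = [
    {
        "question": "What authentication methods are supported by the API?",
        "guideline": "Answer should mention OAuth 2.0, API keys, and JWT tokens with their use cases.",
    }
]
print(run_evaluation(questions, toy_answer, toy_judge))  # → 1.0
```

Passing the answer and judge functions in as arguments keeps the loop independent of any particular RAG system or scoring method.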
The tool requires LLM configuration via environment variables:
# Databricks Foundation Models (default)
LLM_PROVIDER=DATABRICKS
DATABRICKS_MODEL=databricks-meta-llama-3-3-70b-instruct
# Or Azure OpenAI
LLM_PROVIDER=AZURE
AZURE_OPENAI_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_DEPLOYMENT=gpt-4o
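A quick pre-flight check of these variables can catch a missing setting before a long generation run. The sketch below is an assumption about what "configured" means, based only on the variables listed above; `llm_config_problem` is a hypothetical helper, and the tool's own validation may differ.

```python
import os
from typing import Mapping, Optional


def llm_config_problem(env: Mapping[str, str]) -> Optional[str]:
    """Return a description of what is missing, or None if the settings look complete."""
    # Assumption: DATABRICKS is used when LLM_PROVIDER is unset, since the
    # document describes it as the default provider.
    provider = env.get("LLM_PROVIDER", "DATABRICKS").upper()
    if provider == "DATABRICKS":
        required = ["DATABRICKS_MODEL"]
    elif provider == "AZURE":
        required = [
            "AZURE_OPENAI_ENDPOINT",
            "AZURE_OPENAI_API_KEY",
            "AZURE_OPENAI_DEPLOYMENT",
        ]
    else:
        return f"Unknown LLM_PROVIDER: {provider}"
    missing = [name for name in required if not env.get(name)]
    return f"Missing: {', '.join(missing)}" if missing else None


# Check the current process environment; prints None when nothing is missing.
print(llm_config_problem(os.environ))
```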
| Issue | Solution |
|---|---|
| "No LLM endpoint configured" | Set the DATABRICKS_MODEL or AZURE_OPENAI_DEPLOYMENT environment variable |
| "Volume does not exist" | The tool creates volumes automatically; ensure you have CREATE VOLUME permission |
| "PDF generation timeout" | Reduce count or check LLM endpoint availability |
| Low-quality content | Provide a more detailed description with specific topics and document types |