Deploys MLflow models, custom pyfunc, and GenAI agents to Databricks Model Serving endpoints. Queries endpoints, checks status, integrates UC Functions and Vector Search tools.
Install: npx claudepluginhub databricks-solutions/ai-dev-kit --plugin databricks-ai-dev-kit
Slash command: /databricks-ai-dev-kit:databricks-model-serving
Deploy MLflow models and AI agents to scalable REST API endpoints.
Manages Databricks Model Serving endpoints via CLI: create, configure, query, and maintain for LLM inference, custom ML models, and external models.
Create and manage Databricks Agent Bricks: Knowledge Assistants for document Q&A via RAG, Genie Spaces for natural language to SQL, and Supervisor Agents for multi-agent orchestration. Use for conversational AI apps on Databricks.
Executes Databricks ML workflow: Feature Store engineering, MLflow training/tracking, Unity Catalog registry, Mosaic AI serving for production inference.
| Model Type | Pattern | Reference |
|---|---|---|
| Traditional ML (sklearn, xgboost) | mlflow.sklearn.autolog() | 1-classical-ml.md |
| Custom Python model | mlflow.pyfunc.PythonModel | 2-custom-pyfunc.md |
| GenAI Agent (LangGraph, tool-calling) | ResponsesAgent | 3-genai-agents.md |
ALWAYS use exact endpoint names from this table. NEVER guess or abbreviate.
| Endpoint Name | Provider | Notes |
|---|---|---|
| databricks-gpt-5-2 | OpenAI | Latest GPT, 400K context |
| databricks-gpt-5-1 | OpenAI | Instant + Thinking modes |
| databricks-gpt-5-1-codex-max | OpenAI | Code-specialized (high perf) |
| databricks-gpt-5-1-codex-mini | OpenAI | Code-specialized (cost-opt) |
| databricks-gpt-5 | OpenAI | 400K context, reasoning |
| databricks-gpt-5-mini | OpenAI | Cost-optimized reasoning |
| databricks-gpt-5-nano | OpenAI | High-throughput, lightweight |
| databricks-gpt-oss-120b | OpenAI | Open-weight, 128K context |
| databricks-gpt-oss-20b | OpenAI | Lightweight open-weight |
| databricks-claude-opus-4-6 | Anthropic | Most capable, 1M context |
| databricks-claude-sonnet-4-6 | Anthropic | Hybrid reasoning |
| databricks-claude-sonnet-4-5 | Anthropic | Hybrid reasoning |
| databricks-claude-opus-4-5 | Anthropic | Deep analysis, 200K context |
| databricks-claude-sonnet-4 | Anthropic | Hybrid reasoning |
| databricks-claude-opus-4-1 | Anthropic | 200K context, 32K output |
| databricks-claude-haiku-4-5 | Anthropic | Fastest, cost-effective |
| databricks-claude-3-7-sonnet | Anthropic | Retiring April 2026 |
| databricks-meta-llama-3-3-70b-instruct | Meta | 128K context, multilingual |
| databricks-meta-llama-3-1-405b-instruct | Meta | Retiring May 2026 (PT) |
| databricks-meta-llama-3-1-8b-instruct | Meta | Lightweight, 128K context |
| databricks-llama-4-maverick | Meta | MoE architecture |
| databricks-gemini-3-1-pro | Google | 1M context, hybrid reasoning |
| databricks-gemini-3-pro | Google | 1M context, hybrid reasoning |
| databricks-gemini-3-flash | Google | Fast, cost-efficient |
| databricks-gemini-2-5-pro | Google | 1M context, Deep Think |
| databricks-gemini-2-5-flash | Google | 1M context, hybrid reasoning |
| databricks-gemma-3-12b | Google | 128K context, multilingual |
| databricks-qwen3-next-80b-a3b-instruct | Alibaba | Efficient MoE |
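Guessed endpoint names only fail at query time. A small guard can reject any name that was not copied verbatim from the table above. The following is a minimal sketch. `KNOWN_CHAT_ENDPOINTS` is a hypothetical subset of the table, and `require_exact_endpoint` is an illustrative helper. `difflib.get_close_matches` only suggests candidates and never auto-corrects.

```python
import difflib

# Hypothetical subset of the exact endpoint names from the table above.
KNOWN_CHAT_ENDPOINTS = frozenset({
    "databricks-claude-sonnet-4-5",
    "databricks-claude-haiku-4-5",
    "databricks-gpt-5-mini",
    "databricks-meta-llama-3-3-70b-instruct",
})


def require_exact_endpoint(name: str, known=KNOWN_CHAT_ENDPOINTS) -> str:
    """Return `name` unchanged if it is a known endpoint; otherwise raise.

    Close matches are listed in the error message as suggestions only;
    the caller still has to pick and pass the exact name.
    """
    if name in known:
        return name
    suggestions = difflib.get_close_matches(name, sorted(known), n=3, cutoff=0.6)
    hint = f" Did you mean one of: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unknown endpoint name {name!r}.{hint}")
```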
| Endpoint Name | Dimensions | Max Tokens | Notes |
|---|---|---|---|
| databricks-gte-large-en | 1024 | 8192 | English, not normalized |
| databricks-bge-large-en | 1024 | 512 | English, normalized |
| databricks-qwen3-embedding-0-6b | up to 1024 | ~32K | 100+ languages, instruction-aware |
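The Notes column matters for similarity search. Per the table, databricks-gte-large-en vectors are not normalized, while databricks-bge-large-en vectors are. The following is a minimal sketch that assumes you already hold the embedding vectors as Python lists. It L2-normalizes them so that a plain dot product equals cosine similarity, whichever model produced them. `l2_normalize` and `cosine_similarity` are illustrative helpers, not part of any Databricks API.

```python
import math
from typing import Sequence


def l2_normalize(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit length (an all-zero vector is returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the two normalized vectors, i.e. cosine similarity."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))
```

Normalizing on the client side is harmless for already-normalized vectors, so the same code path can serve both models.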
Suggested defaults:
- Chat/instruct: databricks-meta-llama-3-3-70b-instruct (good balance of quality and cost)
- Embeddings: databricks-gte-large-en
- Code: databricks-gpt-5-1-codex-mini or databricks-gpt-5-1-codex-max

These are pay-per-token endpoints available in every workspace. For production, consider provisioned throughput mode. See the supported models documentation.
| Topic | File | When to Read |
|---|---|---|
| Classical ML | 1-classical-ml.md | sklearn, xgboost, autolog |
| Custom PyFunc | 2-custom-pyfunc.md | Custom preprocessing, signatures |
| GenAI Agents | 3-genai-agents.md | ResponsesAgent, LangGraph |
| Tools Integration | 4-tools-integration.md | UC Functions, Vector Search |
| Development & Testing | 5-development-testing.md | MCP workflow, iteration |
| Logging & Registration | 6-logging-registration.md | mlflow.pyfunc.log_model |
| Deployment | 7-deployment.md | Job-based async deployment |
| Querying Endpoints | 8-querying-endpoints.md | SDK, REST, MCP tools |
| Package Requirements | 9-package-requirements.md | DBR versions, pip |
%pip install -U mlflow==3.6.0 databricks-langchain langgraph==0.3.4 databricks-agents pydantic
dbutils.library.restartPython()
Or via MCP:
execute_code(code="%pip install -U mlflow==3.6.0 databricks-langchain langgraph==0.3.4 databricks-agents pydantic")
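The "Package not found" row in the troubleshooting table recommends specifying exact versions in pip_requirements. One way to do that is to record the versions this install actually resolved, after restartPython(), and pass them when logging the model. The following is a minimal sketch using only the standard library's `importlib.metadata`. `pinned_requirements` is a hypothetical helper, and the package list in the comment is an assumption based on the install line above.

```python
from importlib import metadata
from typing import Callable


def pinned_requirements(
    packages: list[str],
    get_version: Callable[[str], str] = metadata.version,
) -> list[str]:
    """Return 'name==version' strings for the installed packages.

    metadata.version raises PackageNotFoundError for a missing package,
    so a missing dependency surfaces here rather than at serving time.
    """
    return [f"{name}=={get_version(name)}" for name in packages]


# Example, run in the notebook after dbutils.library.restartPython():
# reqs = pinned_requirements(
#     ["mlflow", "databricks-langchain", "langgraph", "databricks-agents", "pydantic"]
# )
# mlflow.pyfunc.log_model(..., pip_requirements=reqs)
```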
Create agent.py locally using the ResponsesAgent pattern (see 3-genai-agents.md).
manage_workspace_files(
action="upload",
local_path="./my_agent",
workspace_path="/Workspace/Users/you@company.com/my_agent"
)
# Run the agent's test script on the cluster
execute_code(
file_path="./my_agent/test_agent.py",
cluster_id="<cluster_id>"
)
# Log and register the model from the cluster
execute_code(
file_path="./my_agent/log_model.py",
cluster_id="<cluster_id>"
)
See 7-deployment.md for job-based deployment that doesn't time out.
manage_serving_endpoint(
action="query",
name="my-agent-endpoint",
messages=[{"role": "user", "content": "Hello!"}]
)
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
# Use the Unity Catalog model registry (three-level model names)
mlflow.set_registry_uri("databricks-uc")
# Enable autolog with auto-registration
mlflow.sklearn.autolog(
log_input_examples=True,
registered_model_name="main.models.my_classifier"
)
# Train - model is logged and registered automatically
# (X_train and y_train are your training features and labels)
model = LogisticRegression()
model.fit(X_train, y_train)
Then deploy via UI or SDK. See 1-classical-ml.md.
If MCP tools are not available, use the SDK/CLI examples in the reference files listed earlier.
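Without the MCP tools, the same chat query can be sent over REST. The following is a minimal standard-library sketch that assumes you supply the workspace URL and a personal access token. It builds a POST to the `/serving-endpoints/{name}/invocations` path of Databricks Model Serving with a chat-style body. `build_query_request` is a hypothetical helper. The request is only constructed here; passing it to `urllib.request.urlopen` would send it.

```python
import json
import urllib.request


def build_query_request(host: str, token: str, endpoint: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to a serving endpoint's invocations URL."""
    url = f"{host.rstrip('/')}/serving-endpoints/{endpoint}/invocations"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_query_request(
    host="https://<workspace-host>",  # placeholder: your workspace URL
    token="<token>",                  # placeholder: a personal access token
    endpoint="databricks-meta-llama-3-3-70b-instruct",
    payload={"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 500},
)
# To send it:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```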
| Tool | Purpose |
|---|---|
| manage_workspace_files (action="upload") | Upload agent files to workspace |
| execute_code | Install packages, test agent, log model |
| Tool | Purpose |
|---|---|
| manage_jobs (action="create") | Create deployment job (one-time) |
| manage_job_runs (action="run_now") | Kick off deployment (async) |
| manage_job_runs (action="get") | Check deployment job status |
| Action | Description | Required Params |
|---|---|---|
| get | Check endpoint status (READY/NOT_READY/NOT_FOUND) | name |
| list | List all endpoints | (none, optional limit) |
| query | Send requests to endpoint | name + one of: messages, inputs, dataframe_records |
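The query row implies that a single payload type accompanies name. The following is a minimal sketch of that check, which can be run before calling either the MCP tool or REST. `validate_query_params` is a hypothetical helper. The "exactly one payload" rule is inferred from the table, not taken from the tool's own documentation.

```python
PAYLOAD_KEYS = ("messages", "inputs", "dataframe_records")


def validate_query_params(params: dict) -> str:
    """Check query params and return the payload key that was supplied.

    Requires a non-empty `name` plus exactly one of
    messages / inputs / dataframe_records.
    """
    if not params.get("name"):
        raise ValueError("query requires 'name'")
    supplied = [key for key in PAYLOAD_KEYS if params.get(key) is not None]
    if len(supplied) != 1:
        raise ValueError(
            f"query requires exactly one of {PAYLOAD_KEYS}, got {supplied or 'none'}"
        )
    return supplied[0]
```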
Example usage:
# Check endpoint status
manage_serving_endpoint(action="get", name="my-agent-endpoint")
# List all endpoints
manage_serving_endpoint(action="list")
# Query a chat/agent endpoint
manage_serving_endpoint(
action="query",
name="my-agent-endpoint",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=500
)
# Query a traditional ML endpoint
manage_serving_endpoint(
action="query",
name="sklearn-classifier",
dataframe_records=[{"age": 25, "income": 50000, "credit_score": 720}]
)
manage_serving_endpoint(action="get", name="my-agent-endpoint")
Returns:
{
"name": "my-agent-endpoint",
"state": "READY",
"served_entities": [...]
}
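Deployment can take around 15 minutes, so it is better to poll the status than to block in one long call. The following is a minimal sketch that assumes a caller-supplied `get_state` function returning the state string shown above (READY, NOT_READY, NOT_FOUND). For example, it could be a thin wrapper around manage_serving_endpoint(action="get", ...) that reads the "state" field; that wrapper is an assumption and is not shown. `wait_until_ready`, the 30-second interval, and the 20-minute timeout are illustrative choices.

```python
import time
from typing import Callable


def wait_until_ready(
    get_state: Callable[[], str],
    timeout_s: float = 20 * 60,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_state() until it returns "READY".

    Raises LookupError on NOT_FOUND and TimeoutError once the deadline passes.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == "READY":
            return state
        if state == "NOT_FOUND":
            raise LookupError("endpoint does not exist")
        if clock() >= deadline:
            raise TimeoutError(f"endpoint still {state} after {timeout_s}s")
        sleep(interval_s)
```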
manage_serving_endpoint(
action="query",
name="my-agent-endpoint",
messages=[
{"role": "user", "content": "What is Databricks?"}
],
max_tokens=500
)
manage_serving_endpoint(
action="query",
name="sklearn-classifier",
dataframe_records=[
{"age": 25, "income": 50000, "credit_score": 720}
]
)
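Traditional ML endpoints take tabular input. Besides dataframe_records (one dict per row, as above), MLflow-based model serving also accepts a column-oriented dataframe_split format when the endpoint is called over REST. The following is a minimal sketch that converts between the two using the standard library. `records_to_split` is a hypothetical helper, and it assumes that every record has the same keys.

```python
def records_to_split(records: list[dict]) -> dict:
    """Convert row dicts into a {"dataframe_split": {"columns", "data"}} payload.

    Column order follows the first record; every record must have the same keys.
    """
    if not records:
        raise ValueError("at least one record is required")
    columns = list(records[0])
    for i, row in enumerate(records):
        if set(row) != set(columns):
            raise ValueError(f"record {i} has keys {sorted(row)}, expected {sorted(columns)}")
    data = [[row[col] for col in columns] for row in records]
    return {"dataframe_split": {"columns": columns, "data": data}}
```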
| Issue | Solution |
|---|---|
| Invalid output format | Use self.create_text_output_item(text, id) - NOT raw dicts! |
| Endpoint NOT_READY | Deployment takes ~15 min. Use manage_serving_endpoint(action="get") to poll. |
| Package not found | Specify exact versions in pip_requirements when logging model |
| Tool timeout | Use job-based deployment, not synchronous calls |
| Auth error on endpoint | Declare the resources the agent uses in log_model (resources=...) so authentication passthrough is configured automatically |
| Model not found | Check Unity Catalog path: catalog.schema.model_name |
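For the "Model not found" case, a quick local check can catch a malformed name before registration or deployment. The following is a minimal sketch that accepts only unquoted identifiers (letters, digits, underscores). Unity Catalog allows other characters in backtick-quoted names, so this check is deliberately stricter than UC itself. `parse_uc_model_name` and `UCModelName` are hypothetical names.

```python
import re
from typing import NamedTuple

# Unquoted identifier: letters, digits, underscores (stricter than UC allows).
_IDENT = r"[A-Za-z0-9_]+"
_UC_MODEL = re.compile(rf"^({_IDENT})\.({_IDENT})\.({_IDENT})$")


class UCModelName(NamedTuple):
    catalog: str
    schema: str
    model: str


def parse_uc_model_name(full_name: str) -> UCModelName:
    """Split 'catalog.schema.model_name' into parts, rejecting other shapes."""
    match = _UC_MODEL.match(full_name)
    if match is None:
        raise ValueError(
            f"{full_name!r} is not a three-level name of the form catalog.schema.model_name"
        )
    return UCModelName(*match.groups())
```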
WRONG - raw dicts don't work:
return ResponsesAgentResponse(output=[{"role": "assistant", "content": "..."}])
CORRECT - use helper methods:
return ResponsesAgentResponse(
output=[self.create_text_output_item(text="...", id="msg_1")]
)
Available helper methods:
- self.create_text_output_item(text, id) - text responses
- self.create_function_call_item(id, call_id, name, arguments) - tool calls
- self.create_function_call_output_item(call_id, output) - tool results