From gr
Queries, tags, evaluates, and manages MLflow traces via MCP tools. Used for debugging, performance analysis, feedback logging, custom scoring, and trace cleanup.
npx claudepluginhub galbaz1/video-research-mcp
How this skill is triggered: by the user, by Claude, or both
Slash command
/gr:mlflow-traces
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Query, tag, evaluate, and manage MLflow traces captured from video-research-mcp Gemini API calls. Uses `mcp__mlflow-mcp__*` MCP tools — no code writing needed for most operations.
Core principle: Search first, then act. Always verify before destructive operations.
| Task | Tool | Key Params |
|---|---|---|
| Find traces | search_traces | experiment_id, filter_string, extract_fields |
| Get details | get_trace | trace_id, extract_fields |
| Tag trace | set_trace_tag | trace_id, key, value |
| Log score | log_feedback | trace_id, name, value, rationale |
| Run scorers | evaluate_traces | experiment_id, trace_ids, scorers |
| List scorers | list_scorers | — |
CRITICAL — only use fields that actually exist:
| Path | Content | Common mistake |
|---|---|---|
| info.trace_id | Trace identifier | — |
| info.state | Status: OK, ERROR | NOT info.status |
| info.request_time | Timestamp | NOT info.timestamp_ms |
| info.execution_duration_ms | Duration in ms | NOT info.execution_duration |
| info.request_preview | First ~100 chars of request | — |
| info.response_preview | First ~100 chars of response | — |
| info.tags | All tags as object | Use info.tags.* for all |
| data.spans.*.name | Span names | Must include data. prefix |
| data.spans.*.status_code | Span status | NOT data.spans.*.status |
| data.spans.*.inputs | Span inputs | Moderate size |
| data.spans.*.outputs | Span outputs | Moderate size |
Always use extract_fields. Video-research-mcp traces contain video URIs, cached content references, and full Gemini prompts and responses, so a single get_trace without extract_fields can flood your context window.
// BAD - pulls everything
get_trace({ trace_id: "tr-..." })
search_traces({ experiment_id: "2" })
// GOOD - selective fields
get_trace({ trace_id: "tr-...",
extract_fields: "info.*,data.spans.*.name,data.spans.*.status_code" })
search_traces({ experiment_id: "2", max_results: 10,
extract_fields: "info.trace_id,info.state,info.execution_duration_ms" })
Never request data.spans.*.attributes unqualified — it silently drops dotted keys and can contain massive payloads.
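The rules above can be checked mechanically before a tool call is made. The following is a minimal sketch, not part of the skill itself: `build_extract_fields`, `KNOWN_MISTAKES`, and `BLOCKED` are hypothetical names, and the lookups are copied from the "Common mistake" column of the field table.

```python
# Hypothetical helper: validates paths before they are joined into the
# comma-separated extract_fields string used by get_trace / search_traces.

# Wrong name -> correct name, taken from the field table's "Common mistake" column.
KNOWN_MISTAKES = {
    "info.status": "info.state",
    "info.timestamp_ms": "info.request_time",
    "info.execution_duration": "info.execution_duration_ms",
    "data.spans.*.status": "data.spans.*.status_code",
}

# Paths the text says never to request unqualified.
BLOCKED = {"data.spans.*.attributes"}


def build_extract_fields(*paths: str) -> str:
    """Return a comma-separated extract_fields value, rejecting known-bad paths."""
    if not paths:
        raise ValueError("at least one field path is required")
    for path in paths:
        if path in KNOWN_MISTAKES:
            raise ValueError(f"{path!r} does not exist; use {KNOWN_MISTAKES[path]!r}")
        if path in BLOCKED:
            raise ValueError(f"{path!r} is too broad; request specific fields instead")
        if path.startswith("spans."):
            raise ValueError(f"{path!r} is missing the 'data.' prefix")
    return ",".join(paths)


fields = build_extract_fields("info.trace_id", "info.state", "info.execution_duration_ms")
```

The resulting string can be passed as the `extract_fields` argument in the GOOD examples above.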
CRITICAL: filter_string and extract_fields use DIFFERENT field names:
| Data | filter_string syntax | extract_fields syntax |
|---|---|---|
| Status | status = 'ERROR' | info.state |
| Timestamp | timestamp_ms > 170000... | info.request_time |
| Duration | execution_time_ms > 5000 | info.execution_duration_ms |
| Tags | tags.reviewed = 'true' | info.tags.* |
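One way to keep the two naming schemes apart is a single lookup keyed by the "Data" column of the table. This is an illustrative sketch only: `FIELD_NAMES`, `filter_clause`, `tag_clause`, and `extract_path` are hypothetical names, and the quoting rule (strings in single quotes, numbers bare) is inferred from the examples in the table.

```python
# Hypothetical lookup: data kind -> (filter_string name, extract_fields path).
FIELD_NAMES = {
    "status": ("status", "info.state"),
    "timestamp": ("timestamp_ms", "info.request_time"),
    "duration": ("execution_time_ms", "info.execution_duration_ms"),
}


def filter_clause(data: str, op: str, value) -> str:
    """Build one filter_string clause using the filter-side field name."""
    name = FIELD_NAMES[data][0]
    rendered = f"'{value}'" if isinstance(value, str) else str(value)
    return f"{name} {op} {rendered}"


def tag_clause(key: str, value: str) -> str:
    """Build a tag filter; tags use the 'tags.<key>' form on the filter side."""
    return f"tags.{key} = '{value}'"


def extract_path(data: str) -> str:
    """Return the extract_fields path for the same piece of data."""
    return FIELD_NAMES[data][1]


slow = filter_clause("duration", ">", 5000)
slow_field = extract_path("duration")
```

Here `slow` goes into `filter_string` and `slow_field` into `extract_fields`, so the two names never get swapped.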
search_traces({ experiment_id: "<id>", filter_string: "status='ERROR'", max_results: 20,
extract_fields: "info.trace_id,info.state,info.execution_duration_ms,info.request_preview" })
get_trace({ trace_id: "tr-abc123",
extract_fields: "info.*,data.spans.*.name,data.spans.*.status_code" })
set_trace_tag({ trace_id: "tr-abc123", key: "needs_investigation", value: "true" })
search_traces({ experiment_id: "<id>", filter_string: "execution_time_ms > 5000",
max_results: 20, extract_fields: "info.trace_id,info.execution_duration_ms,data.spans.*.name" })
log_feedback({ trace_id: "tr-abc123", name: "response_quality", value: 4.5,
source_type: "human", rationale: "Accurate analysis, good structure" })
// List available scorers first
list_scorers()
// Run evaluation
evaluate_traces({ experiment_id: "<id>", trace_ids: "tr-abc,tr-def",
scorers: "Correctness,RelevanceToQuery" })
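In the evaluate_traces example, `trace_ids` and `scorers` are each a single comma-separated string rather than a list. Assuming that format holds, a small hypothetical helper (`csv_param` is not part of the MCP tools) can build those strings from lists of IDs or scorer names:

```python
from typing import Iterable


def csv_param(values: Iterable[str]) -> str:
    """Join values into the comma-separated form seen in the evaluate_traces example."""
    cleaned = [v.strip() for v in values if v.strip()]
    if not cleaned:
        raise ValueError("expected at least one non-empty value")
    return ",".join(cleaned)


trace_ids = csv_param(["tr-abc", "tr-def"])
scorers = csv_param(["Correctness", "RelevanceToQuery"])
```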
// Step 1: Preview
search_traces({ experiment_id: "<id>", filter_string: "timestamp_ms < 1704067200000",
max_results: 10, extract_fields: "info.trace_id,info.request_time" })
// Step 2: Verify count and IDs, then delete
delete_traces({ experiment_id: "<id>", max_timestamp_millis: 1704067200000 })
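The cutoff in the cleanup example is a Unix timestamp in milliseconds. A short sketch of how such a value might be derived from a calendar date follows; `to_epoch_millis` is a hypothetical helper, and the date is chosen to reproduce the value used in the example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to integer milliseconds since the Unix epoch."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid local-time surprises")
    # Integer division by a 1 ms timedelta avoids float rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


cutoff = to_epoch_millis(datetime(2024, 1, 1, tzinfo=timezone.utc))
print(cutoff)  # 1704067200000
```

Using the same computed value for both the preview filter and `max_timestamp_millis` keeps the preview and the delete aligned on one cutoff.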
"info.trace_id,info.state" // Minimal overview
"info.trace_id,info.execution_duration_ms,data.spans.*.name" // Performance
"info.*,data.spans.*.name,data.spans.*.status_code" // Full context (safe)
"info.trace_id,info.tags.*" // Tags only
"info.trace_id,info.assessments.*.feedback.value" // Feedback scores
| Setting | Value |
|---|---|
| Tracking server | http://127.0.0.1:5001 (default) |
| Experiment name | video-research-mcp |
| Env var | MLFLOW_TRACKING_URI |
| Autolog captures | All GeminiClient generate/generate_structured calls |
| Trace spans | Gemini API calls with model, thinking level, tokens, cost |
Traces are captured automatically when MLFLOW_TRACKING_URI is set. No code changes needed — mlflow.gemini.autolog() hooks into the google-genai SDK.
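For orientation, the env-var-gated setup described above might look roughly like the sketch below. This is an assumption about the shape of the startup code, not the project's actual implementation: `enable_gemini_tracing` is a hypothetical name, while `mlflow.set_tracking_uri`, `mlflow.set_experiment`, and `mlflow.gemini.autolog` are existing MLflow calls.

```python
import os
from typing import Mapping


def enable_gemini_tracing(env: Mapping[str, str] = os.environ) -> bool:
    """Turn on Gemini autologging only when MLFLOW_TRACKING_URI is set."""
    uri = env.get("MLFLOW_TRACKING_URI")
    if not uri:
        return False  # tracing stays off; mlflow is never imported

    # Imported lazily so the function is harmless when mlflow is not installed.
    import mlflow
    import mlflow.gemini

    mlflow.set_tracking_uri(uri)
    experiment = env.get("MLFLOW_EXPERIMENT_NAME")
    if experiment:
        # Without this, traces go to the Default experiment (ID 0).
        mlflow.set_experiment(experiment)
    mlflow.gemini.autolog()
    return True
```

Calling `enable_gemini_tracing()` once at server startup would be enough under these assumptions; later google-genai calls are then traced without further changes.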
The MLflow tracking server must be running:
MLFLOW_TRACKING_URI=http://127.0.0.1:5001 mlflow server --port 5001
Then restart Claude Code to reconnect.
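Before restarting, it can help to confirm the server is actually reachable. The sketch below assumes the MLflow server exposes a `/health` endpoint that returns HTTP 200 when up; `health_url` and `tracking_server_is_up` are hypothetical helper names.

```python
import http.client
import urllib.request


def health_url(tracking_uri: str) -> str:
    """Build the health-check URL (assumed to be <tracking_uri>/health)."""
    return tracking_uri.rstrip("/") + "/health"


def tracking_server_is_up(tracking_uri: str = "http://127.0.0.1:5001",
                          timeout: float = 2.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(health_url(tracking_uri), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, HTTP errors, and malformed replies.
        return False
```

A `False` result suggests the `mlflow server` command above is not running or is listening on a different port.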
MLFLOW_TRACKING_URI is set in the server environment
max_results: 1 across experiment IDs
The default experiment is video-research-mcp. If traces land in Default (experiment 0), the MLFLOW_EXPERIMENT_NAME env var is not set.