Builds, evaluates, and monitors AI agents using Opik: architecture patterns, metrics like hallucination and task completion, production observability, debugging, and best practices.
Install with `npx claudepluginhub comet-ml/opik-claude-code-plugin`. This skill uses the workspace's default tool permissions.
This skill covers the agent lifecycle beyond basic tracing: architecture patterns, evaluation, metrics, and production monitoring. All examples use Opik for observability — for SDK details (tracing, integrations, span types), load the `opik` skill.
Trace every component of your agent with appropriate span types:
```python
import opik

@opik.track(name="research_agent")
def agent(query: str) -> str:
    plan = plan_action(query)          # general span
    results = execute_tool(plan)       # tool span
    return generate_response(results)  # llm span

@opik.track  # default span type is "general"
def plan_action(query: str) -> dict:
    return {"query": query}

@opik.track(type="tool")
def execute_tool(action: dict) -> str:
    return search_web(action["query"])

@opik.track(type="llm")
def generate_response(context: str) -> str:
    return llm_call(context)
```
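Calling the traced entry point produces one trace with the nested tool and LLM spans under it. A brief usage sketch, assuming `opik.configure()` has been run once for SDK setup (the query string is illustrative):

```python
opik.configure()  # one-time setup: API key and workspace, or a local deployment
agent("What changed in the Q3 report?")  # logged as a single trace with nested spans
```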
| Component | Span Type | Key Data |
|---|---|---|
| Planning | general | Reasoning steps, decisions |
| Tool calls | tool | Tool name, parameters, results |
| LLM calls | llm | Prompt, response, tokens |
| Retrieval | tool | Query, documents |
| Validation | guardrail | Check results, pass/fail |
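The validation row above uses the guardrail span type; a validation step is traced the same way as the other components. A minimal sketch (the `validate_output` function and its checks are illustrative, not part of the skill):

```python
@opik.track(type="guardrail")
def validate_output(response: str) -> dict:
    # Record each check so the guardrail span carries the pass/fail results
    checks = {
        "non_empty": bool(response.strip()),
        "no_raw_email": "@" not in response,  # placeholder policy check
    }
    return {"passed": all(checks.values()), "checks": checks}
```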
Evaluate agents at multiple levels — end-to-end and per-component:
```python
from opik.evaluation import evaluate
from opik.evaluation.metrics import AnswerRelevance, Hallucination, AgentTaskCompletion

results = evaluate(
    experiment_name="agent-v2",
    dataset=dataset,  # an Opik dataset whose items provide the "input" field
    task=lambda item: {"output": agent(item["input"])},
    scoring_metrics=[
        AnswerRelevance(),
        Hallucination(),
        AgentTaskCompletion(),
    ],
)
```
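The `dataset` argument is an Opik dataset. A minimal sketch of creating one with the SDK client, assuming the standard `get_or_create_dataset` and `insert` calls (the dataset name and items are illustrative):

```python
import opik

client = opik.Opik()
dataset = client.get_or_create_dataset(name="agent-eval")
dataset.insert([
    {"input": "What changed in the Q3 report?"},
    {"input": "Summarize the latest deployment incident."},
])
```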
| Metric | What It Measures |
|---|---|
| AgentTaskCompletion | Did the agent fulfill its task? |
| AgentToolCorrectness | Were tools used correctly? |
| TrajectoryAccuracy | Did actions match expected sequence? |
| AnswerRelevance | Does the answer address the question? |
| Hallucination | Are there unsupported claims? |
Other metric families include Heuristic (Equals, Contains, BLEU, ROUGE, BERTScore, IsJson, etc.), LLM-as-Judge (AnswerRelevance, Hallucination, Usefulness, GEval, etc.), RAG (ContextPrecision, ContextRecall, Faithfulness), and conversation metrics. See references/evaluation.md for the full list.
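Metrics can also be scored directly on a single input/output pair, which is useful for per-component checks outside a full experiment. A minimal sketch, assuming the metric's `score()` method accepts `input`, `output`, and a list of `context` strings (the example texts are illustrative):

```python
from opik.evaluation.metrics import Hallucination

metric = Hallucination()
result = metric.score(
    input="What is the capital of France?",
    output="Paris is the capital of France.",
    context=["France's capital city is Paris."],
)
print(result.value, result.reason)  # numeric score plus the judge's reasoning
```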
| Category | Anti-Pattern |
|---|---|
| Reliability | Unbounded loops, retry storms, silent failures |
| Security | Prompt injection, privilege escalation, data leakage |
| Observability | Late tracing (missing input), orphaned spans |
| Tools | Tool loops, hallucinated tools, parameter errors |
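For the reliability row, a common mitigation is a hard iteration cap on the agent loop so a stuck tool cycle surfaces as a traced, explicit failure instead of an unbounded loop. A minimal sketch (the `MAX_STEPS` value and the `agent_step` helper are illustrative):

```python
import opik

MAX_STEPS = 8  # hard cap on agent iterations; tune per agent

@opik.track(name="bounded_agent_loop")
def run_agent(query: str) -> str:
    state = {"query": query, "done": False, "answer": ""}
    for _ in range(MAX_STEPS):
        state = agent_step(state)  # plans, calls one tool, updates state
        if state["done"]:
            return state["answer"]
    # Fail loudly so the capped run shows up in traces instead of looping silently
    raise RuntimeError(f"Agent did not finish within {MAX_STEPS} steps")
```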
| Topic | Reference File |
|---|---|
| Agent architecture, reliability, security patterns | references/agent-patterns.md |
| Evaluation datasets, experiments, all 41 metrics | references/evaluation.md |
| Production dashboards, alerts, guardrails, cost tracking | references/production.md |