Builds, evaluates, and monitors AI agents using Opik: architecture patterns, metrics like hallucination and task completion, production observability, debugging, and best practices.
Install with `npx claudepluginhub comet-ml/opik-claude-code-plugin`. This skill uses the workspace's default tool permissions.
This skill covers the agent lifecycle beyond basic tracing: architecture patterns, evaluation, metrics, and production monitoring. All examples use Opik for observability — for SDK details (tracing, integrations, span types), load the `opik` skill.
Trace every component of your agent with appropriate span types:
```python
import opik

@opik.track(name="research_agent")
def agent(query: str) -> str:
    plan = plan_action(query)          # general span
    results = execute_tool(plan)       # tool span
    return generate_response(results)  # llm span

@opik.track  # default span type is "general"
def plan_action(query: str) -> dict:
    return {"query": query}

@opik.track(type="tool")
def execute_tool(action: dict) -> str:
    return search_web(action["query"])

@opik.track(type="llm")
def generate_response(context: str) -> str:
    return llm_call(context)
```
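Calling the traced entry point produces one trace with the nested tool and LLM spans under it. A brief usage sketch, assuming `opik.configure()` has been run once for SDK setup (the query string is illustrative):

```python
opik.configure()  # one-time setup: API key and workspace, or a local deployment
agent("What changed in the Q3 report?")  # logged as a single trace with nested spans
```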
| Component | Span Type | Key Data |
|---|---|---|
| Planning | general | Reasoning steps, decisions |
| Tool calls | tool | Tool name, parameters, results |
| LLM calls | llm | Prompt, response, tokens |
| Retrieval | tool | Query, documents |
| Validation | guardrail | Check results, pass/fail |
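The validation row above uses the guardrail span type; a validation step is traced the same way as the other components. A minimal sketch (the `validate_output` function and its checks are illustrative, not part of the skill):

```python
@opik.track(type="guardrail")
def validate_output(response: str) -> dict:
    # Record each check so the guardrail span carries the pass/fail results
    checks = {
        "non_empty": bool(response.strip()),
        "no_raw_email": "@" not in response,  # placeholder policy check
    }
    return {"passed": all(checks.values()), "checks": checks}
```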
Evaluate agents at multiple levels — end-to-end and per-component:
```python
from opik.evaluation import evaluate
from opik.evaluation.metrics import AnswerRelevance, Hallucination, AgentTaskCompletion

results = evaluate(
    experiment_name="agent-v2",
    dataset=dataset,  # an Opik dataset whose items provide the "input" field
    task=lambda item: {"output": agent(item["input"])},
    scoring_metrics=[
        AnswerRelevance(),
        Hallucination(),
        AgentTaskCompletion(),
    ],
)
```
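The `dataset` argument is an Opik dataset. A minimal sketch of creating one with the SDK client, assuming the standard `get_or_create_dataset` and `insert` calls (the dataset name and items are illustrative):

```python
import opik

client = opik.Opik()
dataset = client.get_or_create_dataset(name="agent-eval")
dataset.insert([
    {"input": "What changed in the Q3 report?"},
    {"input": "Summarize the latest deployment incident."},
])
```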
| Metric | What It Measures |
|---|---|
| AgentTaskCompletion | Did the agent fulfill its task? |
| AgentToolCorrectness | Were tools used correctly? |
| TrajectoryAccuracy | Did actions match expected sequence? |
| AnswerRelevance | Does the answer address the question? |
| Hallucination | Are there unsupported claims? |
Other metric families include Heuristic (Equals, Contains, BLEU, ROUGE, BERTScore, IsJson, etc.), LLM-as-Judge (AnswerRelevance, Hallucination, Usefulness, GEval, etc.), RAG (ContextPrecision, ContextRecall, Faithfulness), and conversation metrics. See references/evaluation.md for the full list.
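Metrics can also be scored directly on a single input/output pair, which is useful for per-component checks outside a full experiment. A minimal sketch, assuming the metric's `score()` method accepts `input`, `output`, and a list of `context` strings (the example texts are illustrative):

```python
from opik.evaluation.metrics import Hallucination

metric = Hallucination()
result = metric.score(
    input="What is the capital of France?",
    output="Paris is the capital of France.",
    context=["France's capital city is Paris."],
)
print(result.value, result.reason)  # numeric score plus the judge's reasoning
```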
| Category | Anti-Pattern |
|---|---|
| Reliability | Unbounded loops, retry storms, silent failures |
| Security | Prompt injection, privilege escalation, data leakage |
| Observability | Late tracing (missing input), orphaned spans |
| Tools | Tool loops, hallucinated tools, parameter errors |
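For the reliability row, a common mitigation is a hard iteration cap on the agent loop so a stuck tool cycle surfaces as a traced, explicit failure instead of an unbounded loop. A minimal sketch (the `MAX_STEPS` value and the `agent_step` helper are illustrative):

```python
import opik

MAX_STEPS = 8  # hard cap on agent iterations; tune per agent

@opik.track(name="bounded_agent_loop")
def run_agent(query: str) -> str:
    state = {"query": query, "done": False, "answer": ""}
    for _ in range(MAX_STEPS):
        state = agent_step(state)  # plans, calls one tool, updates state
        if state["done"]:
            return state["answer"]
    # Fail loudly so the capped run shows up in traces instead of looping silently
    raise RuntimeError(f"Agent did not finish within {MAX_STEPS} steps")
```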
| Topic | Reference File |
|---|---|
| Agent architecture, reliability, security patterns | references/agent-patterns.md |
| Evaluation datasets, experiments, all 41 metrics | references/evaluation.md |
| Production dashboards, alerts, guardrails, cost tracking | references/production.md |