From DeepEval
Adds end-to-end eval loops for AI agents and LLM apps: instrument, generate datasets, run pytest eval suites, and iterate on failures. Covers DeepEval SDK, CLI, tracing, and Confident AI reporting.
npx claudepluginhub confident-ai/deepeval --plugin deepeval
Slash command: /deepeval:deepeval
Files: LICENSE, references/artifact-contracts.md, references/choose-use-case.md, references/confident-ai.md, references/datasets.md, references/intake.md, references/iteration-loop.md, references/metrics.md, references/pytest-e2e-evals.md, references/synthetic-data.md, references/traced-evals.md, templates/metrics.py, templates/test_multi_turn_e2e.py, templates/test_single_turn_no_tracing.py, templates/test_single_turn_tracing.py
Use this skill to add an end-to-end eval loop to AI applications: instrument the app, curate or reuse a dataset, create a committed pytest eval suite, run evals, and iterate on failures.
Requires Python 3.9+ and pip install deepeval in the target project. Metrics
and synthetic generation need model credentials. Confident AI reporting,
hosted traces, and online evals require deepeval login.
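The prerequisites above can be checked before any eval work starts. The following is a minimal sketch and not part of DeepEval: the function name `check_prerequisites` is illustrative, and `OPENAI_API_KEY` is only an assumed credential variable (substitute whatever your evaluation model needs). It does not verify `deepeval login` state.

```python
import importlib.util
import os
import shutil
import sys
from typing import Dict, Mapping, Optional, Sequence, Tuple


def check_prerequisites(
    env: Optional[Mapping[str, str]] = None,
    credential_vars: Sequence[str] = ("OPENAI_API_KEY",),  # assumption: adjust per provider
    python_version: Optional[Tuple[int, int]] = None,
) -> Dict[str, bool]:
    """Report which of the stated prerequisites appear to be satisfied."""
    env = os.environ if env is None else env
    version = tuple(sys.version_info[:2]) if python_version is None else python_version
    return {
        # The skill states Python 3.9+ as the minimum.
        "python_3_9_plus": version >= (3, 9),
        # True when `pip install deepeval` has been run in this environment.
        "deepeval_installed": importlib.util.find_spec("deepeval") is not None,
        # True when the `deepeval` CLI can be found on PATH.
        "deepeval_cli_on_path": shutil.which("deepeval") is not None,
        # True when at least one of the assumed credential variables is non-empty.
        "model_credentials_set": any(env.get(name) for name in credential_vars),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'ok' if ok else 'MISSING':8}{name}")
```

The `env` and `python_version` parameters exist only so the check can be exercised without touching the real environment.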
- Dataset generation: `deepeval generate`.
- Tracing: the `deepeval-tracing` skill when traced evals are used.
- Eval execution: `deepeval test run`.
- Tracing instrumentation (framework integrations and manual `@observe`) is handled by the `deepeval-tracing` skill; raw OpenTelemetry export is handled by the `deepeval-otel` skill.
- Use `deepeval generate` for dataset generation. Use `deepeval test run` for pytest eval execution. Do not default to the raw `pytest` command.
- Use a shared `metrics.py` module for committed eval suites.
- Use case selection: `references/choose-use-case.md`.
- Intake: read `references/intake.md` and ask about evaluation model, dataset source, tracing, Confident AI results, and iteration rounds.
- Pytest E2E evals: `references/pytest-e2e-evals.md`.
- Metrics: `references/metrics.md`.
- See `references/artifact-contracts.md` for expected file locations.
- Use `templates/test_multi_turn_e2e.py` for chatbot / multi-turn agent evals.
- Use `templates/test_single_turn_tracing.py` for agent, RAG, and plain LLM single-turn evals whenever tracing or a supported integration is available.
- Use `templates/test_single_turn_no_tracing.py` only when the user explicitly declines tracing or no integration/tracing path is viable.
- Keep metrics in `templates/metrics.py` or the project's existing metrics module, not inline in the eval file.
- Dataset loading: `references/datasets.md`.
- Synthetic data generation: `references/synthetic-data.md`.
- Generate goldens with `deepeval generate`; do not hand-create or make up goldens.
- … `references/datasets.md`.
- Instrument the app with the `deepeval-tracing` skill (framework integrations and manual `@observe`).
- See `references/traced-evals.md` for the traced eval shapes and span metrics.
- … `Golden` input and call `assert_test(golden=golden, metrics=[...])`.
- … `for golden in dataset.evals_iterator(metrics=[...])`.
- … `LLMTestCase`s.
- … `references/pytest-e2e-evals.md`.
- Attach span metrics with `next_*_span(metrics=[...])` or `@observe(metrics=[...])`.
- Start from `templates/` and replace every placeholder before running anything.
- Run `deepeval test run tests/evals/test_<app>.py`.
- Relevant flags: `--num-processes 5`, `--ignore-errors`, `--skip-on-missing-params`, and `--identifier`.
- Follow `references/iteration-loop.md` for the requested number of rounds.

Bootstrap single-turn goldens from docs only when no curated dataset exists:
deepeval generate --method docs --variation single-turn --documents ./docs --output-dir ./tests/evals --file-name .dataset
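The "only when no curated dataset exists" condition can be made explicit in a small wrapper. This is a sketch under assumptions: `curated_dataset` is a hypothetical path to wherever the project keeps an existing dataset, the helper names are illustrative, and the flags are copied from the command above, not taken from a full reading of the CLI reference.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def generate_command(
    documents: str = "./docs",
    output_dir: str = "./tests/evals",
    file_name: str = ".dataset",
) -> List[str]:
    """Build the docs-based, single-turn `deepeval generate` invocation."""
    return [
        "deepeval", "generate",
        "--method", "docs",
        "--variation", "single-turn",
        "--documents", documents,
        "--output-dir", output_dir,
        "--file-name", file_name,
    ]


def needs_bootstrap(curated_dataset: Optional[Path]) -> bool:
    """True only when there is no curated dataset to reuse."""
    return curated_dataset is None or not curated_dataset.exists()


def bootstrap_if_needed(
    curated_dataset: Optional[Path] = None, dry_run: bool = True
) -> Optional[List[str]]:
    """Return the command that was (or would be) run, or None if a dataset exists."""
    if not needs_bootstrap(curated_dataset):
        return None
    command = generate_command()
    if not dry_run:
        # Executes the real CLI; requires deepeval and model credentials.
        subprocess.run(command, check=True)
    return command
```

With `dry_run=True` (the default here) nothing is executed, which makes it easy to review the command before spending model credits.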
Run the eval suite:
deepeval test run tests/evals/test_<app>.py --num-processes 5 --identifier "iterating-on-<purpose>-round-1"
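Across iteration rounds only the identifier changes, so the invocation is easy to template. A minimal sketch, assuming illustrative helper names (`eval_run_command`, `run_round`) and hypothetical values for the app name and purpose; the flags are the ones used in the command above.

```python
import subprocess
from typing import List


def eval_run_command(
    app: str, purpose: str, round_number: int, num_processes: int = 5
) -> List[str]:
    """Build the `deepeval test run` invocation for one iteration round."""
    if round_number < 1:
        raise ValueError("round_number starts at 1")
    return [
        "deepeval", "test", "run", f"tests/evals/test_{app}.py",
        "--num-processes", str(num_processes),
        "--identifier", f"iterating-on-{purpose}-round-{round_number}",
    ]


def run_round(app: str, purpose: str, round_number: int) -> int:
    """Run one round through the CLI and return its exit code."""
    return subprocess.run(eval_run_command(app, purpose, round_number)).returncode


if __name__ == "__main__":
    # Hypothetical app and purpose; printed instead of executed so the
    # sketch is safe to run without deepeval installed.
    for n in range(1, 4):
        print(" ".join(eval_run_command("support_bot", "answer-quality", n)))
```

Keeping the round number in the identifier makes successive runs easy to tell apart when comparing results.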
Open the latest hosted report when Confident AI is enabled:
deepeval view
| Topic | File |
|---|---|
| Intake questions and branching | references/intake.md |
| Use case selection | references/choose-use-case.md |
| Dataset loading | references/datasets.md |
| Synthetic data generation | references/synthetic-data.md |
| Metrics | references/metrics.md |
| Pytest E2E evals | references/pytest-e2e-evals.md |
| Traced evals and span metrics | references/traced-evals.md |
| Confident AI | references/confident-ai.md |
| Dataset and eval artifact contracts | references/artifact-contracts.md |
| Iteration loop | references/iteration-loop.md |
| App type | Template |
|---|---|
| Single-turn tracing | templates/test_single_turn_tracing.py |
| Single-turn no tracing | templates/test_single_turn_no_tracing.py |
| Multi-turn E2E | templates/test_multi_turn_e2e.py |
| Shared metric lists | templates/metrics.py |
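
The template choice comes down to two questions: whether the app is multi-turn, and whether a tracing path is both viable and accepted by the user. A sketch of that decision, where `choose_template` and its parameters are illustrative names and not part of the skill:

```python
def choose_template(
    multi_turn: bool,
    tracing_viable: bool = True,
    user_declined_tracing: bool = False,
) -> str:
    """Map app shape and tracing availability to one of the eval templates."""
    if multi_turn:
        # Chatbots and multi-turn agents use the multi-turn end-to-end template.
        return "templates/test_multi_turn_e2e.py"
    if tracing_viable and not user_declined_tracing:
        # Default for single-turn agent, RAG, and plain LLM evals.
        return "templates/test_single_turn_tracing.py"
    # Fallback only when tracing is declined or no integration/tracing path is viable.
    return "templates/test_single_turn_no_tracing.py"
```

Whichever template is chosen, the metric lists stay in templates/metrics.py or the project's existing metrics module.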