# create-experiment
Scaffolds Dokimos Experiments, wiring together datasets, tasks, evaluators, and reporters for LLM evaluation pipelines, model testing, and end-to-end eval workflows.
Install:

```shell
npx claudepluginhub dokimos-dev/dokimos --plugin create-experiment
```

This skill uses the workspace's default tool permissions.
Scaffold a Dokimos Experiment. The user will describe the evaluation goal via `$ARGUMENTS`.
Before writing code, read these files to understand the API:

- `dokimos-core/src/main/java/dev/dokimos/core/Experiment.java` — the orchestrator
- `dokimos-core/src/main/java/dev/dokimos/core/Task.java`
- `dokimos-examples/src/main/java/dev/dokimos/examples/basic/BasicEvaluationExample.java` — simplest example

An experiment consists of four parts:

- A `Dataset` of examples (create one with the `create-dataset` skill)
- A `Task` that takes an `Example` and returns a `Map<String, Object>` of actual outputs
- `Evaluator` implementations that score the outputs
- A `Reporter` for results (optional; defaults to `NoOpReporter`)

Load the dataset:

```java
Dataset dataset = Dataset.fromJson(Path.of("src/test/resources/datasets/my-dataset.json"));
```
Define the task (the system under test):

```java
Task task = example -> {
    String input = example.input();
    String output = callYourLLM(input);
    return Map.of("output", output);
};
```
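`Task` is effectively a single-method functional interface from an `Example` to a map of outputs, which is why a bare lambda works above. A self-contained sketch of that shape (the `Example` record, `Task` interface, and echo "LLM" below are stand-ins for illustration, not the real Dokimos types):

```java
import java.util.Map;

public class TaskSketch {
    // Stand-in for Dokimos' Example: just carries the input string.
    record Example(String input) {}

    // Stand-in for Dokimos' Task: one abstract method, so lambdas work.
    interface Task {
        Map<String, Object> run(Example example);
    }

    // Stubbed "LLM call" so the sketch runs without a model.
    static String callYourLLM(String input) {
        return "ECHO: " + input;
    }

    static Map<String, Object> runTask(String input) {
        Task task = example -> Map.of("output", callYourLLM(example.input()));
        return task.run(new Example(input));
    }

    public static void main(String[] args) {
        System.out.println(runTask("What is 2+2?"));
    }
}
```

Because the output is a map, a task can return several keys (for example, the answer plus a trace) and let each evaluator pick the key it scores.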
Configure evaluators:

```java
List<Evaluator> evaluators = List.of(
    ExactMatchEvaluator.builder()
        .name("Exact Match")
        .threshold(1.0)
        .build()
);
```
Build and run the experiment:

```java
ExperimentResult result = Experiment.builder()
    .name("My Evaluation")
    .description("Evaluating my LLM on QA tasks")
    .dataset(dataset)
    .task(task)
    .evaluators(evaluators)
    .build()
    .run();

System.out.println("Pass rate: " + result.passRate());
```
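`passRate()` presumably reports the fraction of examples that passed their evaluators. For intuition, a minimal sketch of exact-match scoring against a threshold (assumed semantics, not Dokimos internals: a match scores 1.0, a miss 0.0, and an example passes when its score meets the threshold):

```java
import java.util.List;

public class PassRateSketch {
    // Assumed semantics: exact match scores 1.0, anything else 0.0.
    static double score(String expected, String actual) {
        return expected.equals(actual) ? 1.0 : 0.0;
    }

    // Fraction of (expected, actual) pairs whose score meets the threshold.
    static double passRate(List<String[]> pairs, double threshold) {
        long passed = pairs.stream()
                .filter(p -> score(p[0], p[1]) >= threshold)
                .count();
        return (double) passed / pairs.size();
    }

    public static void main(String[] args) {
        List<String[]> pairs = List.of(
                new String[]{"4", "4"},         // pass
                new String[]{"Paris", "paris"}, // fail: exact match is case-sensitive
                new String[]{"42", "42"}        // pass
        );
        System.out.println(passRate(pairs, 1.0)); // 2 of 3 pass
    }
}
```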
For CI, the same dataset can drive a JUnit parameterized test via `@DatasetSource`:

```java
public class MyEvaluationTest {

    @ParameterizedTest(name = "[{index}] {0}")
    @DatasetSource("classpath:datasets/my-dataset.json")
    void testMyLLM(Example example) {
        String actualOutput = callYourLLM(example.input());
        EvalTestCase testCase = example.toTestCase(actualOutput);
        List<Evaluator> evaluators = List.of(
            ExactMatchEvaluator.builder().build()
        );
        assertEval(testCase, evaluators);
    }
}
```
To send results to a Dokimos server, attach a reporter:

```java
Reporter reporter = DokimosServerReporter.builder()
    .baseUrl("http://localhost:8080")
    .build();

ExperimentResult result = Experiment.builder()
    .name("My Evaluation")
    .dataset(dataset)
    .task(task)
    .evaluators(evaluators)
    .reporter(reporter)
    .build()
    .run();
```
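If `Reporter` is a single-method interface over experiment results (an assumption; check the real interface in `dokimos-core` before relying on it), a custom reporter can be as small as a log line. A sketch with both the interface and the result type stubbed out:

```java
public class ReporterSketch {
    // Stub of the result type: only the fields this sketch needs.
    record ExperimentResult(String name, double passRate) {}

    // Assumed single-method Reporter shape; the real interface lives in dokimos-core.
    interface Reporter {
        void report(ExperimentResult result);
    }

    static String format(ExperimentResult result) {
        return String.format("%s: pass rate %.0f%%", result.name(), result.passRate() * 100);
    }

    public static void main(String[] args) {
        // A lambda satisfies the single-method interface.
        Reporter logReporter = result -> System.out.println(format(result));
        logReporter.report(new ExperimentResult("My Evaluation", 0.8));
    }
}
```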
| Method | Description | Default |
|---|---|---|
| `.name(String)` | Experiment name | `"unnamed"` |
| `.description(String)` | Description | `""` |
| `.dataset(Dataset)` | Test dataset | required |
| `.task(Task)` | System under test | required |
| `.evaluator(Evaluator)` | Add a single evaluator | — |
| `.evaluators(List)` | Add multiple evaluators | — |
| `.reporter(Reporter)` | Result reporter | `NoOpReporter` |
| `.parallelism(int)` | Concurrent examples | 1 |
| `.runs(int)` | Repeat count for variance reduction | 1 |
| `.metadata(String, Object)` | Add experiment metadata | — |
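A plausible reading of `.parallelism(int)` and `.runs(int)` (not verified against Dokimos internals): each example is executed `runs` times, with at most `parallelism` examples in flight at once, so the task runs `examples × runs` times in total. A self-contained sketch of that scheduling with a fixed thread pool:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelRunsSketch {
    // Executes `task` once per (example, run) pair on a bounded pool,
    // returning how many invocations happened in total.
    static int execute(int examples, int runs, int parallelism, Runnable task) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        AtomicInteger invocations = new AtomicInteger();
        for (int e = 0; e < examples; e++) {
            for (int r = 0; r < runs; r++) {
                pool.submit(() -> {
                    task.run();
                    invocations.incrementAndGet();
                });
            }
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        return invocations.get();
    }

    public static void main(String[] args) {
        // 5 examples x 3 runs = 15 invocations, at most 2 concurrent.
        System.out.println(execute(5, 3, 2, () -> {}));
    }
}
```

With `runs > 1`, averaging scores across repeats reduces variance from nondeterministic model outputs, which is presumably the point of that option.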
`$ARGUMENTS` describes what the user wants to evaluate; use it to fill in the `Experiment.builder()` call.