# create-experiment
Scaffolds Dokimos Experiments, wiring together datasets, tasks, evaluators, and reporters for LLM evaluation pipelines, model testing, and end-to-end eval workflows.
Install:

```shell
npx claudepluginhub dokimos-dev/dokimos --plugin create-experiment
```

This skill uses the workspace's default tool permissions.
Scaffold a Dokimos Experiment. The user will describe the evaluation goal via `$ARGUMENTS`.
Before writing code, read these files to understand the API:

- `dokimos-core/src/main/java/dev/dokimos/core/Experiment.java` — the orchestrator
- `dokimos-core/src/main/java/dev/dokimos/core/Task.java`
- `dokimos-examples/src/main/java/dev/dokimos/examples/basic/BasicEvaluationExample.java` — simplest example

An experiment consists of four parts:

- A `Dataset` of examples (create one with the `create-dataset` skill)
- A `Task` that takes an `Example` and returns a `Map<String, Object>` of actual outputs
- `Evaluator` implementations that score the outputs
- A `Reporter` for results (optional; defaults to `NoOpReporter`)

Load the dataset:

```java
Dataset dataset = Dataset.fromJson(Path.of("src/test/resources/datasets/my-dataset.json"));
```
Define the task (the system under test):

```java
Task task = example -> {
    String input = example.input();
    String output = callYourLLM(input);
    return Map.of("output", output);
};
```
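`Task` is effectively a single-method functional interface from an `Example` to a map of outputs, which is why a bare lambda works above. A self-contained sketch of that shape (the `Example` record, `Task` interface, and echo "LLM" below are stand-ins for illustration, not the real Dokimos types):

```java
import java.util.Map;

public class TaskSketch {
    // Stand-in for Dokimos' Example: just carries the input string.
    record Example(String input) {}

    // Stand-in for Dokimos' Task: one abstract method, so lambdas work.
    interface Task {
        Map<String, Object> run(Example example);
    }

    // Stubbed "LLM call" so the sketch runs without a model.
    static String callYourLLM(String input) {
        return "ECHO: " + input;
    }

    static Map<String, Object> runTask(String input) {
        Task task = example -> Map.of("output", callYourLLM(example.input()));
        return task.run(new Example(input));
    }

    public static void main(String[] args) {
        System.out.println(runTask("What is 2+2?"));
    }
}
```

Because the output is a map, a task can return several keys (for example, the answer plus a trace) and let each evaluator pick the key it scores.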
Configure evaluators:

```java
List<Evaluator> evaluators = List.of(
    ExactMatchEvaluator.builder()
        .name("Exact Match")
        .threshold(1.0)
        .build()
);
```
Build and run the experiment:

```java
ExperimentResult result = Experiment.builder()
    .name("My Evaluation")
    .description("Evaluating my LLM on QA tasks")
    .dataset(dataset)
    .task(task)
    .evaluators(evaluators)
    .build()
    .run();

System.out.println("Pass rate: " + result.passRate());
```
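`passRate()` presumably reports the fraction of examples that passed their evaluators. For intuition, a minimal sketch of exact-match scoring against a threshold (assumed semantics, not Dokimos internals: a match scores 1.0, a miss 0.0, and an example passes when its score meets the threshold):

```java
import java.util.List;

public class PassRateSketch {
    // Assumed semantics: exact match scores 1.0, anything else 0.0.
    static double score(String expected, String actual) {
        return expected.equals(actual) ? 1.0 : 0.0;
    }

    // Fraction of (expected, actual) pairs whose score meets the threshold.
    static double passRate(List<String[]> pairs, double threshold) {
        long passed = pairs.stream()
                .filter(p -> score(p[0], p[1]) >= threshold)
                .count();
        return (double) passed / pairs.size();
    }

    public static void main(String[] args) {
        List<String[]> pairs = List.of(
                new String[]{"4", "4"},         // pass
                new String[]{"Paris", "paris"}, // fail: exact match is case-sensitive
                new String[]{"42", "42"}        // pass
        );
        System.out.println(passRate(pairs, 1.0)); // 2 of 3 pass
    }
}
```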
For CI, the same dataset can drive a JUnit parameterized test via `@DatasetSource`:

```java
public class MyEvaluationTest {

    @ParameterizedTest(name = "[{index}] {0}")
    @DatasetSource("classpath:datasets/my-dataset.json")
    void testMyLLM(Example example) {
        String actualOutput = callYourLLM(example.input());
        EvalTestCase testCase = example.toTestCase(actualOutput);
        List<Evaluator> evaluators = List.of(
            ExactMatchEvaluator.builder().build()
        );
        assertEval(testCase, evaluators);
    }
}
```
To send results to a Dokimos server, attach a reporter:

```java
Reporter reporter = DokimosServerReporter.builder()
    .baseUrl("http://localhost:8080")
    .build();

ExperimentResult result = Experiment.builder()
    .name("My Evaluation")
    .dataset(dataset)
    .task(task)
    .evaluators(evaluators)
    .reporter(reporter)
    .build()
    .run();
```
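If `Reporter` is a single-method interface over experiment results (an assumption; check the real interface in `dokimos-core` before relying on it), a custom reporter can be as small as a log line. A sketch with both the interface and the result type stubbed out:

```java
public class ReporterSketch {
    // Stub of the result type: only the fields this sketch needs.
    record ExperimentResult(String name, double passRate) {}

    // Assumed single-method Reporter shape; the real interface lives in dokimos-core.
    interface Reporter {
        void report(ExperimentResult result);
    }

    static String format(ExperimentResult result) {
        return String.format("%s: pass rate %.0f%%", result.name(), result.passRate() * 100);
    }

    public static void main(String[] args) {
        // A lambda satisfies the single-method interface.
        Reporter logReporter = result -> System.out.println(format(result));
        logReporter.report(new ExperimentResult("My Evaluation", 0.8));
    }
}
```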
| Method | Description | Default |
|---|---|---|
| `.name(String)` | Experiment name | `"unnamed"` |
| `.description(String)` | Description | `""` |
| `.dataset(Dataset)` | Test dataset | required |
| `.task(Task)` | System under test | required |
| `.evaluator(Evaluator)` | Add a single evaluator | — |
| `.evaluators(List)` | Add multiple evaluators | — |
| `.reporter(Reporter)` | Result reporter | `NoOpReporter` |
| `.parallelism(int)` | Concurrent examples | 1 |
| `.runs(int)` | Repeat count for variance reduction | 1 |
| `.metadata(String, Object)` | Add experiment metadata | — |
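A plausible reading of `.parallelism(int)` and `.runs(int)` (not verified against Dokimos internals): each example is executed `runs` times, with at most `parallelism` examples in flight at once, so the task runs `examples × runs` times in total. A self-contained sketch of that scheduling with a fixed thread pool:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelRunsSketch {
    // Executes `task` once per (example, run) pair on a bounded pool,
    // returning how many invocations happened in total.
    static int execute(int examples, int runs, int parallelism, Runnable task) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        AtomicInteger invocations = new AtomicInteger();
        for (int e = 0; e < examples; e++) {
            for (int r = 0; r < runs; r++) {
                pool.submit(() -> {
                    task.run();
                    invocations.incrementAndGet();
                });
            }
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        return invocations.get();
    }

    public static void main(String[] args) {
        // 5 examples x 3 runs = 15 invocations, at most 2 concurrent.
        System.out.println(execute(5, 3, 2, () -> {}));
    }
}
```

With `runs > 1`, averaging scores across repeats reduces variance from nondeterministic model outputs, which is presumably the point of that option.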
`$ARGUMENTS` describes what the user wants to evaluate; use it to fill in the `Experiment.builder()` call.