create-tests
Scaffolds JUnit parameterized tests for LLM evaluations using dokimos-junit and @DatasetSource. Enables eval-driven development with datasets as test cases in CI.
Install: npx claudepluginhub dokimos-dev/dokimos --plugin create-tests
This skill uses the workspace's default tool permissions.
Scaffold evaluation tests using `dokimos-junit` that run Dokimos evaluations as JUnit parameterized tests. The user will describe what they want to test via `$ARGUMENTS`.
This approach — eval-driven development — lets users treat LLM evaluations like unit tests: each dataset example becomes a test case, evaluators act as assertions, and the whole thing runs in CI with mvn test.
Key reference files:
- dokimos-core/src/main/java/dev/dokimos/core/Assertions.java
- dokimos-junit/src/main/java/dev/dokimos/junit/DatasetSource.java
- dokimos-core/src/main/java/dev/dokimos/core/Example.java
- dokimos-examples/src/main/java/dev/dokimos/examples/junit/QuestionAnsweringTest.java

Before writing code, read QuestionAnsweringTest.java to see a complete working example.
dokimos-junit bridges Dokimos datasets and JUnit's @ParameterizedTest. Each Example in a dataset becomes a separate test invocation. Inside the test, you call your LLM, convert the example to an EvalTestCase with the actual output, and run evaluators via Assertions.assertEval().
The test fails (throws AssertionError) if any evaluator's score falls below its threshold — just like a regular assertion.
import dev.dokimos.core.*;
import dev.dokimos.core.evaluators.*;
import dev.dokimos.junit.DatasetSource;
import org.junit.jupiter.params.ParameterizedTest;
import java.util.List;
class MyEvalTest {

    @ParameterizedTest(name = "[{index}] {0}")
    @DatasetSource("classpath:datasets/my-dataset.json")
    void shouldPassEvaluations(Example example) {
        // 1. Call your system under test
        String response = callMyLLM(example.input());

        // 2. Convert example to test case with actual output
        EvalTestCase testCase = example.toTestCase(response);

        // 3. Define evaluators
        List<Evaluator> evaluators = List.of(
            ExactMatchEvaluator.builder().build()
        );

        // 4. Assert all evaluators pass
        Assertions.assertEval(testCase, evaluators);
    }
}
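This class runs like any other JUnit test: mvn test executes every dataset example as a separate test invocation. During development you can run just this class with Surefire's standard filter (plain Maven, nothing dokimos-specific): mvn test -Dtest=MyEvalTest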
@DatasetSource supports three source modes: a classpath resource, a file path, or inline data (JSON or JSONL):
// From classpath resource (most common in tests)
@DatasetSource("classpath:datasets/qa.json")
// From file path
@DatasetSource("src/test/resources/datasets/qa.json")
// Inline JSON (useful for small, self-contained tests)
@DatasetSource(json = """
    {
      "examples": [
        {"input": "What is 2+2?", "expectedOutput": "4"},
        {"input": "Capital of France?", "expectedOutput": "Paris"}
      ]
    }
    """)
// Inline JSONL
@DatasetSource(jsonl = """
    {"input": "What is 2+2?", "expectedOutput": "4"}
    {"input": "Capital of France?", "expectedOutput": "Paris"}
    """)
Example provides two toTestCase methods:
// Simple: single output (stored as "output" key)
EvalTestCase testCase = example.toTestCase(response);
// Multi-output: for RAG or multi-field evaluations
EvalTestCase testCase = example.toTestCase(Map.of(
    "output", response,
    "retrievalContext", retrievedDocs
));
The multi-output form is needed when evaluators require additional data beyond the main output (e.g., FaithfulnessEvaluator needs a context key, ContextualRelevanceEvaluator needs a retrieval context key).
@ParameterizedTest(name = "[{index}] {0}")
@DatasetSource("classpath:datasets/qa.json")
void shouldAnswerCorrectly(Example example) {
    String answer = myLLM.generate(example.input());
    EvalTestCase testCase = example.toTestCase(answer);
    Assertions.assertEval(testCase,
        ExactMatchEvaluator.builder().build());
}
When exact matching is too rigid, use LLMJudgeEvaluator for semantic evaluation. It requires evaluationParams (which test case fields to include in the judge prompt) and a JudgeLM instance.
@ParameterizedTest(name = "[{index}] {0}")
@DatasetSource("classpath:datasets/qa.json")
void shouldAnswerCorrectly(Example example) {
    String answer = myLLM.generate(example.input());
    EvalTestCase testCase = example.toTestCase(answer);
    Assertions.assertEval(testCase,
        LLMJudgeEvaluator.builder()
            .name("answer-quality")
            .criteria("Does the actual output correctly answer the question?")
            .evaluationParams(List.of(
                EvalTestCaseParam.INPUT,
                EvalTestCaseParam.EXPECTED_OUTPUT,
                EvalTestCaseParam.ACTUAL_OUTPUT))
            .threshold(0.7)
            .judge(myJudge)
            .build());
}
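In this example, myJudge is the JudgeLM instance passed to the evaluator. As in the integration-test section below, it can be supplied as a lambda that forwards the judge prompt to your own model client; callMyJudgeModel here is a placeholder for your client code, not a Dokimos API:
// Sketch: JudgeLM provided as a lambda, delegating to your own (placeholder) judge-model client
JudgeLM myJudge = prompt -> callMyJudgeModel(prompt);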
For RAG pipelines, pass retrieval context alongside the output. Use a consistent context key and set it explicitly on evaluators that need it.
@ParameterizedTest(name = "[{index}] {0}")
@DatasetSource("classpath:datasets/rag.json")
void shouldBeGroundedInContext(Example example) {
    String query = example.input();
    List<String> retrievedDocs = vectorStore.search(query);
    String answer = myRAG.generate(query, retrievedDocs);

    EvalTestCase testCase = example.toTestCase(Map.of(
        "output", answer,
        "context", retrievedDocs
    ));

    Assertions.assertEval(testCase, List.of(
        FaithfulnessEvaluator.builder()
            .judge(judge)
            .contextKey("context")
            .threshold(0.7)
            .build(),
        ContextualRelevanceEvaluator.builder()
            .judge(judge)
            .retrievalContextKey("context")
            .threshold(0.5)
            .build()
    ));
}
Tests calling real LLM APIs should be tagged so they don't run in every mvn test invocation.
@Tag("integration")
@EnabledIfEnvironmentVariable(named = "OPENAI_API_KEY", matches = ".+")
class MyIntegrationEvalTest {
private static JudgeLM judge;
@BeforeAll
static void setup() {
// Set up your LLM client and judge here
judge = prompt -> callOpenAI(prompt);
}
@ParameterizedTest(name = "[{index}] {0}")
@DatasetSource("classpath:datasets/qa.json")
void shouldPassEvaluations(Example example) {
// ...
}
}
Run integration tests with: mvn verify -Dgroups=integration
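If a plain mvn test should also skip these tests when OPENAI_API_KEY happens to be set, one option is a default tag exclusion in the Surefire configuration. This is ordinary Maven setup, not something dokimos provides; note that an exclusion hard-coded this way takes precedence over -Dgroups, so the evals would then be run from a dedicated profile or module rather than with the command above.
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-surefire-plugin</artifactId>
    <configuration>
        <!-- Skips @Tag("integration") tests in the default build (assumes the JUnit 5 provider) -->
        <excludedGroups>integration</excludedGroups>
    </configuration>
</plugin>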
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-junit</artifactId>
    <version>${dokimos.version}</version>
    <scope>test</scope>
</dependency>
dokimos-junit transitively includes dokimos-core, so you don't need both.
Workflow:
1. Parse $ARGUMENTS to understand what the user wants to evaluate (Q&A, RAG, chat, etc.).
2. Find or create a dataset (use the create-dataset skill if needed) under src/test/resources/datasets/.
3. Write a test class with @ParameterizedTest and @DatasetSource.
4. Call Example.toTestCase() with the appropriate output keys.
5. Call Assertions.assertEval() to run the evaluations.
6. Add @Tag("integration") if calling external APIs.

Checklist before finishing:
- Tests use @ParameterizedTest with @DatasetSource.
- Datasets live in src/test/resources/datasets/ (or inline for small tests).
- Example.toTestCase() is called with the right output structure.
- Assertions.assertEval() is used (not manual assertion on scores).
- LLMJudgeEvaluator has .evaluationParams(...) and .judge(...) set.
- Tests that call external APIs are tagged @Tag("integration").