Help us improve
Share bugs, ideas, or general feedback.
From majestic-tools
Grades skill test outputs against expectations using transcripts and files, extracts implicit claims, and provides evaluation feedback.
npx claudepluginhub majesticlabs-dev/majestic-marketplace --plugin majestic-toolsHow this skill is triggered — by the user, by Claude, or both
Slash command
/majestic-tools:skill-graderThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Evaluate skill test run outputs against expectations and extract implicit claims.
Executes skill evaluations against test cases, scores outputs with judges, and reports results. Useful for benchmarking, regression detection, and verifying skill quality.
Generates structured test assertions and failure diagnostics for skill packages from definition and task prompts via information-isolated surrogate verification.
Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.
Share bugs, ideas, or general feedback.
Evaluate skill test run outputs against expectations and extract implicit claims.
expectations: # List of verifiable statements
- "Output includes X"
- "Skill used script Y"
transcript_path: # Path to execution transcript
outputs_dir: # Directory containing output files
eval_prompt: # Original task prompt
TRANSCRIPT = Read(transcript_path)
OUTPUT_FILES = Glob(outputs_dir + "/**/*")
For each FILE in OUTPUT_FILES:
CONTENT[FILE] = Read(FILE)
Note: eval_prompt, execution steps, errors, final result.
For each EXPECTATION in expectations:
EVIDENCE = search TRANSCRIPT and CONTENT for EXPECTATION
If EVIDENCE confirms EXPECTATION genuinely (not superficially):
verdict = PASS
Else:
verdict = FAIL
PASS criteria:
FAIL criteria:
When uncertain: burden of proof is on the expectation to pass.
Beyond predefined expectations, find implicit claims:
For each CLAIM in (TRANSCRIPT + CONTENT):
CLAIM.type = "factual" | "process" | "quality"
CLAIM.verified = verify(CLAIM, available_evidence)
CLAIM.evidence = supporting_or_contradicting_text
Flag unverifiable claims.
After grading, assess whether the evals themselves could improve. Only surface suggestions when there's a clear gap:
Keep bar high. Flag things the eval author would say "good catch" about.
Write(outputs_dir + "/../grading.json", RESULTS)
{
"expectations": [
{
"text": "The output includes X",
"passed": true,
"evidence": "Found in transcript Step 3: '...'"
}
],
"summary": {
"passed": 2,
"failed": 1,
"total": 3,
"pass_rate": 0.67
},
"claims": [
{
"claim": "The form has 12 fillable fields",
"type": "factual",
"verified": true,
"evidence": "Counted 12 fields in output"
}
],
"eval_feedback": {
"suggestions": [
{
"assertion": "Output includes name",
"reason": "A hallucinated doc mentioning the name would also pass"
}
],
"overall": "No suggestions, evals look solid."
}
}
Field requirements:
| Condition | Action |
|---|---|
| Transcript not found | FAIL all expectations, note in evidence |
| Output files empty | FAIL expectations requiring output content |
| Binary files in outputs | Note as unreadable, skip content check |
| Malformed JSON in outputs | FAIL expectations about JSON structure |