Skill

skill-grader

Grades skill test outputs against expectations using transcripts and files, extracts implicit claims, and provides evaluation feedback.

testing

npx claudepluginhub majesticlabs-dev/majestic-marketplace --plugin majestic-tools

Popularity

Parent stars

Parent forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/majestic-tools:skill-grader

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Read Grep Glob Write

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Evaluate skill test run outputs against expectations and extract implicit claims.

SKILL.md

145 lines · ~946 tokens

Similar Skills

eval-run

Executes skill evaluations against test cases, scores outputs with judges, and reports results. Useful for benchmarking, regression detection, and verifying skill quality.

12 files9 tools

agent-eval-harness

surrogate-verifier

242

Generates structured test assertions and failure diagnostics for skill packages from definition and task prompts via information-isolated surrogate verification.

3 files

armory

skill-forge-eval

Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.

skill-forge

Stats

LanguageShell

Parent stars39

Parent forks7

MaintenanceExcellent

Last CommitMar 21, 2026

Actions

View Source View Plugin View on GitHub View README

Help us improve

Share bugs, ideas, or general feedback.

Stats

Actions

Help us improve

Share bugs, ideas, or general feedback.

Skill Grader

Evaluate skill test run outputs against expectations and extract implicit claims.

Input Schema

expectations:        # List of verifiable statements
  - "Output includes X"
  - "Skill used script Y"
transcript_path:     # Path to execution transcript
outputs_dir:         # Directory containing output files
eval_prompt:         # Original task prompt

Grading Process

Step 1: Read Context

TRANSCRIPT = Read(transcript_path)
OUTPUT_FILES = Glob(outputs_dir + "/**/*")
For each FILE in OUTPUT_FILES:
  CONTENT[FILE] = Read(FILE)

Note: eval_prompt, execution steps, errors, final result.

Step 2: Grade Expectations

For each EXPECTATION in expectations:

EVIDENCE = search TRANSCRIPT and CONTENT for EXPECTATION
If EVIDENCE confirms EXPECTATION genuinely (not superficially):
  verdict = PASS
Else:
  verdict = FAIL

PASS criteria:

Clear evidence in transcript or outputs
Evidence reflects genuine task completion, not surface compliance
A correct filename with wrong content is FAIL, not PASS

FAIL criteria:

No evidence found
Evidence contradicts expectation
Evidence is superficial (right format, wrong substance)
Cannot be verified from available information

When uncertain: burden of proof is on the expectation to pass.

Step 3: Extract Claims

Beyond predefined expectations, find implicit claims:

For each CLAIM in (TRANSCRIPT + CONTENT):
  CLAIM.type = "factual" | "process" | "quality"
  CLAIM.verified = verify(CLAIM, available_evidence)
  CLAIM.evidence = supporting_or_contradicting_text

Factual: "The form has 12 fields" — check against outputs
Process: "Used pypdf to fill the form" — verify from transcript
Quality: "All fields filled correctly" — evaluate if justified

Flag unverifiable claims.

Step 4: Critique the Evals

After grading, assess whether the evals themselves could improve. Only surface suggestions when there's a clear gap:

Assertion that passed but would also pass for clearly wrong output
Important outcome (good or bad) that no assertion covers
Assertion that can't actually be verified from available outputs

Keep bar high. Flag things the eval author would say "good catch" about.

Step 5: Write Results

Write(outputs_dir + "/../grading.json", RESULTS)

Output Schema

{
  "expectations": [
    {
      "text": "The output includes X",
      "passed": true,
      "evidence": "Found in transcript Step 3: '...'"
    }
  ],
  "summary": {
    "passed": 2,
    "failed": 1,
    "total": 3,
    "pass_rate": 0.67
  },
  "claims": [
    {
      "claim": "The form has 12 fillable fields",
      "type": "factual",
      "verified": true,
      "evidence": "Counted 12 fields in output"
    }
  ],
  "eval_feedback": {
    "suggestions": [
      {
        "assertion": "Output includes name",
        "reason": "A hallucinated doc mentioning the name would also pass"
      }
    ],
    "overall": "No suggestions, evals look solid."
  }
}

Field requirements:

expectations[].text, .passed, .evidence — all required (viewer depends on exact names)
summary.pass_rate — float 0.0 to 1.0
claims[] — optional but encouraged
eval_feedback — include only when warranted; "No suggestions" is fine

Error Handling

Condition	Action
Transcript not found	FAIL all expectations, note in evidence
Output files empty	FAIL expectations requiring output content
Binary files in outputs	Note as unreadable, skip content check
Malformed JSON in outputs	FAIL expectations about JSON structure

skill-grader

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

Similar Skills

Help us improve

Help us improve

Find plugins for your project

skill-grader

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

Skill Grader

Input Schema

Grading Process

Step 1: Read Context

Step 2: Grade Expectations

Step 3: Extract Claims

Step 4: Critique the Evals

Step 5: Write Results

Output Schema

Error Handling

Similar Skills

Help us improve

Skill Grader

Input Schema

Grading Process

Step 1: Read Context

Step 2: Grade Expectations

Step 3: Extract Claims

Step 4: Critique the Evals

Step 5: Write Results

Output Schema

Error Handling