Skill

skill-grader

Evaluate skill test run outputs against expectations and extract implicit claims.

From majestic-tools
Install

Run in your terminal:

$ npx claudepluginhub majesticlabs-dev/majestic-marketplace --plugin majestic-tools

Tool Access

This skill is limited to using the following tools:

Read, Grep, Glob, Write

Skill Content

Skill Grader

Input Schema

expectations:        # List of verifiable statements
  - "Output includes X"
  - "Skill used script Y"
transcript_path:     # Path to execution transcript
outputs_dir:         # Directory containing output files
eval_prompt:         # Original task prompt

Grading Process

Step 1: Read Context

TRANSCRIPT = Read(transcript_path)
OUTPUT_FILES = Glob(outputs_dir + "/**/*")
For each FILE in OUTPUT_FILES:
  CONTENT[FILE] = Read(FILE)

While reading, note the eval_prompt, the execution steps taken, any errors, and the final result.
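
A minimal Python sketch of Step 1, assuming the transcript and outputs are ordinary text files on disk. The helper name `read_context` is hypothetical; it stands in for the skill's Read and Glob tools and is not part of the skill itself.

```python
from __future__ import annotations

from pathlib import Path


def read_context(transcript_path: str, outputs_dir: str) -> tuple[str, dict[str, str]]:
    """Load the transcript and every output file (hypothetical helper)."""
    transcript = Path(transcript_path).read_text(encoding="utf-8")
    content: dict[str, str] = {}
    # rglob("*") walks all subdirectories, like Glob(outputs_dir + "/**/*").
    for path in sorted(Path(outputs_dir).rglob("*")):
        if path.is_file():
            # Key each file by its path relative to outputs_dir.
            key = path.relative_to(outputs_dir).as_posix()
            # errors="replace" keeps the loop going on undecodable bytes.
            content[key] = path.read_text(encoding="utf-8", errors="replace")
    return transcript, content
```

Keying by relative path keeps evidence strings short when a file is later quoted in a verdict.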

Step 2: Grade Expectations

For each EXPECTATION in expectations:

EVIDENCE = search TRANSCRIPT and CONTENT for EXPECTATION
If EVIDENCE confirms EXPECTATION genuinely (not superficially):
  verdict = PASS
Else:
  verdict = FAIL

PASS criteria:

  • Clear evidence in transcript or outputs
  • Evidence reflects genuine task completion, not surface compliance
  • A correct filename with wrong content is FAIL, not PASS

FAIL criteria:

  • No evidence found
  • Evidence contradicts expectation
  • Evidence is superficial (right format, wrong substance)
  • Cannot be verified from available information

When uncertain, the burden of proof is on the expectation: if the evidence does not clearly support it, the verdict is FAIL.
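
The decision rule above can be sketched in Python as follows. The `Evidence` class and its fields are hypothetical: whether evidence is genuine is a judgment the grader makes while reading, not something this code computes. The sketch only shows that every path except clear, genuine, non-contradicting evidence ends in FAIL.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    quote: str                 # text found in the transcript or outputs
    genuine: bool              # grader's judgment: right substance, not just right format
    contradicts: bool = False  # True when the quote refutes the expectation


def verdict(evidence: Optional[Evidence]) -> str:
    """Return "PASS" or "FAIL" for one expectation."""
    if evidence is None:
        return "FAIL"  # no evidence found, or it cannot be verified
    if evidence.contradicts:
        return "FAIL"  # evidence contradicts the expectation
    if not evidence.genuine:
        return "FAIL"  # superficial: e.g. correct filename, wrong content
    return "PASS"
```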

Step 3: Extract Claims

Beyond predefined expectations, find implicit claims:

For each CLAIM in (TRANSCRIPT + CONTENT):
  CLAIM.type = "factual" | "process" | "quality"
  CLAIM.verified = verify(CLAIM, available_evidence)
  CLAIM.evidence = supporting_or_contradicting_text

Claim types:

  • Factual: "The form has 12 fields". Check against outputs.
  • Process: "Used pypdf to fill the form". Verify from transcript.
  • Quality: "All fields filled correctly". Evaluate whether it is justified.

Flag unverifiable claims.
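
One way to represent extracted claims, as a sketch rather than the skill's actual data model. The `Claim` class mirrors the claim fields in the output schema. The convention used to mark a claim as unverifiable (verified is false and the evidence string is empty) is an assumption made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass

CLAIM_TYPES = ("factual", "process", "quality")


@dataclass
class Claim:
    claim: str
    type: str
    verified: bool
    evidence: str

    def __post_init__(self) -> None:
        # Reject anything outside the three claim types named in Step 3.
        if self.type not in CLAIM_TYPES:
            raise ValueError(f"unknown claim type: {self.type!r}")


def flag_unverifiable(claims: list[Claim]) -> list[Claim]:
    """Return claims with no evidence either way (assumed convention)."""
    return [c for c in claims if not c.verified and not c.evidence.strip()]
```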

Step 4: Critique the Evals

After grading, assess whether the evals themselves could be improved. Surface suggestions only when there is a clear gap:

  • Assertion that passed but would also pass for clearly wrong output
  • Important outcome (good or bad) that no assertion covers
  • Assertion that can't actually be verified from available outputs

Keep the bar high. Flag only issues that would make the eval author say "good catch".

Step 5: Write Results

Write(outputs_dir + "/../grading.json", RESULTS)
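
In Python, the write step might look like the sketch below. The function name `write_results` is hypothetical. The path `outputs_dir + "/../grading.json"` places the file next to the outputs directory, in its parent.

```python
from __future__ import annotations

import json
from pathlib import Path


def write_results(outputs_dir: str, results: dict) -> Path:
    """Write grading.json beside outputs_dir and return its path (hypothetical helper)."""
    # Resolving first makes ".parent" agree with what "/.." means on disk.
    target = Path(outputs_dir).resolve().parent / "grading.json"
    target.write_text(json.dumps(results, indent=2) + "\n", encoding="utf-8")
    return target
```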

Output Schema

{
  "expectations": [
    {
      "text": "The output includes X",
      "passed": true,
      "evidence": "Found in transcript Step 3: '...'"
    }
  ],
  "summary": {
    "passed": 2,
    "failed": 1,
    "total": 3,
    "pass_rate": 0.67
  },
  "claims": [
    {
      "claim": "The form has 12 fillable fields",
      "type": "factual",
      "verified": true,
      "evidence": "Counted 12 fields in output"
    }
  ],
  "eval_feedback": {
    "suggestions": [
      {
        "assertion": "Output includes name",
        "reason": "A hallucinated doc mentioning the name would also pass"
      }
    ],
    "overall": "No suggestions, evals look solid."
  }
}

Field requirements:

  • expectations[].text, .passed, .evidence — all required (viewer depends on exact names)
  • summary.pass_rate — float 0.0 to 1.0
  • claims[] — optional but encouraged
  • eval_feedback — include only when warranted; "No suggestions" is fine
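
The summary block can be derived from the graded expectations instead of being filled in by hand. The sketch below assumes two things the schema does not state: pass_rate is rounded to two decimals (matching the 0.67 in the example), and an empty list yields 0.0. The function name `summarize` is hypothetical.

```python
from __future__ import annotations

REQUIRED_FIELDS = {"text", "passed", "evidence"}


def summarize(expectations: list[dict]) -> dict:
    """Check required fields and compute the summary counts."""
    for item in expectations:
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            raise ValueError(f"expectation is missing fields: {sorted(missing)}")
    total = len(expectations)
    passed = sum(1 for item in expectations if item["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Assumption: two-decimal rounding; 0.0 when there is nothing to grade.
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }
```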

Error Handling

Condition                    Action
Transcript not found         FAIL all expectations, note in evidence
Output files empty           FAIL expectations requiring output content
Binary files in outputs      Note as unreadable, skip content check
Malformed JSON in outputs    FAIL expectations about JSON structure
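
A sketch of how the output-file conditions above could be detected before grading. The function name `inspect_output` and its return labels are hypothetical, and treating any file that contains a null byte as binary is a common heuristic, not a rule taken from the skill.

```python
from __future__ import annotations

import json
from pathlib import Path


def inspect_output(path: Path) -> str:
    """Classify one output file as "empty", "binary", "malformed_json" or "ok"."""
    data = path.read_bytes()
    if not data:
        return "empty"  # FAIL expectations requiring output content
    if b"\x00" in data:
        return "binary"  # note as unreadable, skip content check
    if path.suffix == ".json":
        try:
            json.loads(data.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            return "malformed_json"  # FAIL expectations about JSON structure
    return "ok"
```

A missing transcript is handled earlier and separately: it fails every expectation, so no per-file classification is needed in that case.
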
Stats

Parent Repo Stars: 30
Parent Repo Forks: 6
Last Commit: Mar 15, 2026