Evaluates codebases across 12 pillars (hire/stress/day-2 lenses) using 3 parallel evaluator agents, and generates an eval doc for /pipeline remediation. Use before hiring or major refactors.
npx claudepluginhub hatmanstack/claude-forge --plugin forge
This skill is limited to using the following tools:
You coordinate a 3-evaluator hiring panel assessment of a codebase. Each evaluator runs as a separate agent with its own context window.
$ARGUMENTS is optional context — the repo path, role level being evaluated, or specific concerns. If empty, evaluate the current working directory.
Ask scoping questions one at a time, preferring multiple choice. Wait for each answer before asking the next.
The code evaluation runs 3 evaluator agents in parallel, each scoring 4 pillars (12 total). These questions calibrate the evaluation.
Question 1 — Known pain points give the evaluators a starting hypothesis instead of scanning cold:
Are there parts of the codebase you already know are problematic?
Things that keep breaking, areas you dread touching, modules that slow down every PR.
A) Yes (tell me which areas and what's wrong)
B) No — scan everything with fresh eyes
Question 2 — Role level sets the scoring bar:
What role level should I evaluate this codebase against?
A) Junior Developer — fundamentals: readability, basic error handling, test presence
B) Mid-Level Developer — patterns: separation of concerns, consistent conventions, test coverage
C) Senior Developer — production: defensive coding, observability, performance awareness, type rigor
D) Staff+ / Principal — systems: architectural coherence, scalability, operational excellence
Question 3 — Focus areas weight what evaluators pay extra attention to (they still score all 12 pillars):
Any specific concerns the evaluators should weight more heavily?
A) Performance — hot paths, algorithmic complexity, resource management
B) Security — input validation, auth patterns, secrets handling
C) Testing — coverage quality, test architecture, edge cases
D) Architecture — separation of concerns, modularity, coupling
E) Multiple (tell me which)
F) None — balanced evaluation across all pillars
Question 4 — Scope and exclusions:
What should the evaluators look at?
A) Full repo, standard exclusions (vendor, generated, node_modules, __pycache__)
B) Full repo, no exclusions
C) Specific directories only (tell me which to include or exclude)
Question 5 — Pillar overrides. By default, /pipeline remediates until all 12 pillars hit 9/10. Some pillars may not be improvable through code changes. The 12 pillars are:
- Hire: Problem-Solution Fit, Architecture, Code Quality, Creativity
- Stress: Pragmatism, Defensiveness, Performance, Type Rigor
- Day 2: Test Value, Reproducibility, Git Hygiene, Onboarding
Any pillars to accept below the default 9/10 threshold?
A) None — require 9/10 on all 12 pillars
B) Specific overrides (tell me which pillars and target scores, e.g., "Creativity: 7, Git Hygiene: accept")
Record overrides in the eval.md frontmatter.
Generate the directory name: YYYY-MM-DD-eval-slug, where the eval-slug part identifies the repo (e.g., eval-ragstack, eval-billing-api).
Location: docs/plans/YYYY-MM-DD-eval-slug/
Create the directory.
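The naming step can be sketched in Python as follows. This is a minimal illustration, not part of the skill itself: the helper names `eval_dir_name` and `eval_dir_path` are hypothetical, and deriving the slug by lowercasing the repo name and hyphenating other characters is an assumption based on the examples above.

```python
import re
from datetime import date
from pathlib import Path


def eval_dir_name(repo_name: str, today: date) -> str:
    """Build a YYYY-MM-DD-eval-slug name (slug rule is an assumption)."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", repo_name.lower()).strip("-")
    return f"{today.isoformat()}-eval-{slug}"


def eval_dir_path(repo_root: Path, repo_name: str, today: date) -> Path:
    """Return docs/plans/<name> under the repo root without creating it."""
    return repo_root / "docs" / "plans" / eval_dir_name(repo_name, today)


# Example with a fixed date so the result is reproducible.
print(eval_dir_name("RAGStack", date(2025, 1, 15)))
```

Creating the directory is then a single `eval_dir_path(...).mkdir(parents=True, exist_ok=True)` call.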
You (the orchestrator) must read the role prompt files and embed their contents in each agent's prompt. Agents cannot access skill directory files.
- skills/pipeline/eval-hire.md (store contents as HIRE_PROMPT)
- skills/pipeline/eval-stress.md (store contents as STRESS_PROMPT)
- skills/pipeline/eval-day2.md (store contents as DAY2_PROMPT)
Then spawn 3 Agents in parallel:
<role_prompt>
[Contents of eval-hire.md]
</role_prompt>
<task>
Evaluate the codebase in the current working directory.
Role level: [from Step 1]
Focus areas: [from Step 1]
Exclusions: [from Step 1]
</task>
<role_prompt>
[Contents of eval-stress.md]
</role_prompt>
<task>
Evaluate the codebase in the current working directory.
Role level: [from Step 1]
Focus areas: [from Step 1]
Exclusions: [from Step 1]
</task>
<role_prompt>
[Contents of eval-day2.md]
</role_prompt>
<task>
Evaluate the codebase in the current working directory.
Role level: [from Step 1]
Focus areas: [from Step 1]
Exclusions: [from Step 1]
</task>
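Because all three prompts share one shape, the embedding step can be pictured as a small template function. The sketch below is illustrative only: `load_role_prompts` and `build_agent_prompt` are hypothetical names, and the base directory passed to the loader is an assumption about where the skill files live.

```python
from pathlib import Path

# Variable names and relative paths as listed in the instructions above.
ROLE_PROMPT_FILES = {
    "HIRE_PROMPT": "skills/pipeline/eval-hire.md",
    "STRESS_PROMPT": "skills/pipeline/eval-stress.md",
    "DAY2_PROMPT": "skills/pipeline/eval-day2.md",
}


def load_role_prompts(base_dir: Path) -> dict[str, str]:
    """Read each role prompt file so its text can be embedded verbatim."""
    return {
        name: (base_dir / rel_path).read_text(encoding="utf-8")
        for name, rel_path in ROLE_PROMPT_FILES.items()
    }


def build_agent_prompt(role_prompt: str, role_level: str,
                       focus_areas: str, exclusions: str) -> str:
    """Wrap a role prompt and the Step 1 answers in the tags shown above."""
    return (
        "<role_prompt>\n"
        f"{role_prompt.strip()}\n"
        "</role_prompt>\n"
        "<task>\n"
        "Evaluate the codebase in the current working directory.\n"
        f"Role level: {role_level}\n"
        f"Focus areas: {focus_areas}\n"
        f"Exclusions: {exclusions}\n"
        "</task>"
    )


# Stand-in role text; a real run would use load_role_prompts(...) output.
print(build_agent_prompt("You are the hire evaluator.", "Senior",
                         "Security", "vendor, node_modules"))
```

Embedding the file contents, rather than passing a path, matters because the agents cannot read the skill directory themselves.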
Verify each evaluator's output contains its completion signal before proceeding:
- EVAL_HIRE_COMPLETE
- EVAL_STRESS_COMPLETE
- EVAL_DAY2_COMPLETE
If any signal is missing, the agent may have been truncated. Report the incomplete evaluator to the user and do NOT write eval.md with partial data.
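The completion check amounts to a substring test per evaluator. A minimal sketch, assuming each evaluator's output is available as a string keyed by a short name (the keys `hire`, `stress`, `day2` and the function name are hypothetical):

```python
REQUIRED_SIGNALS = {
    "hire": "EVAL_HIRE_COMPLETE",
    "stress": "EVAL_STRESS_COMPLETE",
    "day2": "EVAL_DAY2_COMPLETE",
}


def incomplete_evaluators(outputs: dict[str, str]) -> list[str]:
    """Return evaluators whose output is missing its completion signal."""
    return [
        name
        for name, signal in REQUIRED_SIGNALS.items()
        # A missing output counts as incomplete, same as a truncated one.
        if signal not in outputs.get(name, "")
    ]


outputs = {
    "hire": "scores...\nEVAL_HIRE_COMPLETE",
    "stress": "scores... (cut off)",
    "day2": "scores...\nEVAL_DAY2_COMPLETE",
}
missing = incomplete_evaluators(outputs)
if missing:
    print("Incomplete evaluators, not writing eval.md:", ", ".join(missing))
```

An empty list is the only result that should allow eval.md to be written.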
If all signals are present, write docs/plans/YYYY-MM-DD-eval-slug/eval.md:
---
type: repo-eval
target: 9
role_level: [from Step 1]
date: YYYY-MM-DD
pillar_overrides:
# Pillars with custom thresholds (omit for default 9)
# creativity: 7
# git_hygiene: accept
---
# Repo Evaluation: [repo name]
## Configuration
- **Role Level:** [Junior | Mid | Senior | Staff+]
- **Focus Areas:** [list]
- **Exclusions:** [list]
## Combined Scorecard
| # | Lens | Pillar | Score | Target | Status |
|---|------|--------|-------|--------|--------|
| 1 | Hire | Problem-Solution Fit | X/10 | 9 | PASS (≥target) or NEEDS WORK (<target) |
| 2 | Hire | Architecture | X/10 | ... |
| 3 | Hire | Code Quality | X/10 | ... |
| 4 | Hire | Creativity | X/10 | ... |
| 5 | Stress | Pragmatism | X/10 | ... |
| 6 | Stress | Defensiveness | X/10 | ... |
| 7 | Stress | Performance | X/10 | ... |
| 8 | Stress | Type Rigor | X/10 | ... |
| 9 | Day 2 | Test Value | X/10 | ... |
| 10 | Day 2 | Reproducibility | X/10 | ... |
| 11 | Day 2 | Git Hygiene | X/10 | ... |
| 12 | Day 2 | Onboarding | X/10 | ... |
**Pillars at target (≥9):** N/12
**Pillars needing work (<9):** M/12
## Hire Evaluation — The Pragmatist
[Full evaluator output]
## Stress Evaluation — The Oncall Engineer
[Full evaluator output]
## Day 2 Evaluation — The Team Lead
[Full evaluator output]
## Consolidated Remediation Targets
[Merged and deduplicated targets from all 3 evaluators, prioritized by:
1. Lowest score first
2. Highest complexity last
3. Overlapping findings consolidated]
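How the Status column and the two summary counts interact with pillar_overrides can be sketched as below. This is one possible reading, not the skill's definitive logic: treating `accept` as an automatic pass and comparing other overrides as integer targets are assumptions, and the function names are hypothetical.

```python
DEFAULT_TARGET = 9


def pillar_status(score: int, override=None) -> str:
    """Return PASS or NEEDS WORK for one pillar score out of 10."""
    if override == "accept":
        # Assumption: "accept" waives the threshold for this pillar.
        return "PASS"
    target = DEFAULT_TARGET if override is None else int(override)
    return "PASS" if score >= target else "NEEDS WORK"


def summarize(scores: dict[str, int], overrides: dict) -> tuple[int, int]:
    """Count pillars at target and pillars needing work."""
    statuses = [pillar_status(s, overrides.get(p)) for p, s in scores.items()]
    at_target = statuses.count("PASS")
    return at_target, len(statuses) - at_target


# Four illustrative pillars with made-up scores, not real evaluation data.
scores = {"architecture": 8, "code_quality": 9, "creativity": 7, "git_hygiene": 5}
overrides = {"creativity": 7, "git_hygiene": "accept"}
print(summarize(scores, overrides))
```

With these sample values only architecture falls short, since creativity meets its lowered target and git_hygiene is accepted as-is.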
Append an entry to .claude/skill-runs.json in the repo root. If the file does not exist, create it with an empty array first.
{
"skill": "repo-eval",
"date": "YYYY-MM-DD",
"plan": "YYYY-MM-DD-eval-slug"
}
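The append step, including creating the file with an empty array when it is absent, might look like this in Python. It is a sketch under assumptions: `append_skill_run` is a hypothetical helper, and it assumes the file, when present, already holds a valid JSON array.

```python
import json
from pathlib import Path


def append_skill_run(repo_root: Path, run_date: str, plan: str) -> list:
    """Append one repo-eval entry to .claude/skill-runs.json and return the log."""
    log_path = repo_root / ".claude" / "skill-runs.json"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    if not log_path.exists():
        # Create the file with an empty array first, as the step requires.
        log_path.write_text("[]\n", encoding="utf-8")
    runs = json.loads(log_path.read_text(encoding="utf-8"))
    runs.append({"skill": "repo-eval", "date": run_date, "plan": plan})
    log_path.write_text(json.dumps(runs, indent=2) + "\n", encoding="utf-8")
    return runs
```

Calling it twice on the same repo root leaves two entries in the array, so earlier runs are preserved rather than overwritten.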
Evaluation complete: docs/plans/YYYY-MM-DD-eval-slug/eval.md
Scores: [N]/12 pillars at target (≥9)
Lowest: [pillar] at [X]/10
To remediate and bring all pillars to 9/10, run:
/pipeline YYYY-MM-DD-eval-slug
/pipeline after all remediation is complete.