Trigger with "evaluate autoresearch fit", "score this skill for karpathy loop", "is this a good autoresearch candidate", "assess autoresearch viability for", "which skills are best for autonomous loop optimization", "score skills for 3-file architecture", or when the user wants to determine if a skill is a good candidate for applying the Karpathy autoresearch autonomous optimization loop pattern.
From agent-plugin-analyzer (`npx claudepluginhub richfrem/agent-plugins-skills --plugin agent-plugin-analyzer`). This skill is limited to using the following tools:
- evals/evals.json
- evals/results.tsv
- references/program.md
- scripts/update_ranked_skills.py
Assess whether a skill is a viable candidate for the Karpathy 3-File Autoresearch autonomous
optimization loop. Scores each skill on four dimensions, proposes what the 3-file architecture
would look like, and updates the canonical summary-ranked-skills.json via the update script.
The Karpathy autoresearch pattern requires three conditions simultaneously:
Skills that lack these properties cannot run an effective autonomous loop.
The canonical ranked skills list lives at:
plugin-research/experiments/analyze-candidates-for-auto-reseaarch/skills/eval-autoresearch-fit/assets/resources/summary-ranked-skills.json
After every evaluation, update it with the update script (see Step 5).
Each dimension is scored 1-10. Max total = 40.
| Dimension | 10 (Best) | 1 (Worst) |
|---|---|---|
| Objectivity | Binary pass/fail or exact numeric output from a shell command | Purely subjective, requires human taste judgment |
| Execution Speed | Completes in seconds | Requires 30+ min or human input |
| Frequency of Use | Triggered multiple times per day | Rarely needed (monthly or less) |
| Potential Utility | Prevents systemic failures or saves hours per session | Nice-to-have improvement |
Viability thresholds:
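The threshold values are not reproduced here. As a purely illustrative sketch, a verdict mapping might look like the following (the cutoff numbers are assumptions for illustration, not the canonical thresholds):

```python
def verdict(total: int) -> str:
    """Map a 4-40 total score to a verdict band.

    The cutoffs below are hypothetical placeholders; use the canonical
    viability thresholds from this section when actually scoring.
    """
    if total >= 32:
        return "HIGH"
    if total >= 24:
        return "MEDIUM"
    if total >= 16:
        return "LOW"
    return "NOT_VIABLE"

print(verdict(34))  # HIGH under these assumed cutoffs
```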
If $ARGUMENTS is a path to a directory containing SKILL.md, read it directly.
Otherwise find it by name from the repo root:
```bash
PROJECT_ROOT=$(git rev-parse --show-toplevel)
find "$PROJECT_ROOT/plugins" -name "SKILL.md" | grep "$ARGUMENTS" | head -5
```
Read the SKILL.md fully before scoring.
Reason through each dimension explicitly before assigning a number.
- Objectivity (1-10)
- Execution Speed (1-10)
- Frequency of Use (1-10)
- Potential Utility (1-10)
Determine the loop type: DETERMINISTIC, LLM_IN_LOOP, or HYBRID.
Important: if a skill has both a script component and a prompt component, propose splitting it into two separate loops. Label them Loop A (script) and Loop B (prompt). Score each and assess barriers for each separately.
The Spec (program.md):
What is the optimization goal? What constraints apply? What is the NEVER STOP directive?
The Mutation Target: Which single file does the agent modify per iteration? If the skill inherently requires multi-file changes, flag this as a barrier and propose how to isolate it.
The Evaluator (evaluate.py):
Note: this `evaluate.py` is a script you would write when implementing the autoresearch loop for the target skill; it is NOT part of this skill. This skill only describes what it would look like. When ready to build the loop, create `evaluate.py` inside the target skill's `autoresearch/` directory.
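For concreteness, a deterministic evaluator can be very small. This sketch is invented for illustration (the word-count metric and default path are assumptions, not prescribed by any skill):

```python
#!/usr/bin/env python3
"""Hypothetical autoresearch/evaluate.py: prints a single number.

The metric here (rewarding shorter SKILL.md instructions) is a
placeholder; a real evaluator scores whatever program.md defines.
"""
import pathlib
import sys


def score(skill_md: pathlib.Path) -> float:
    """Return a single numeric metric for the mutation target."""
    text = skill_md.read_text()
    # Placeholder metric: fewer words -> higher score.
    return 1000.0 / max(len(text.split()), 1)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python evaluate.py path/to/SKILL.md
    print(score(pathlib.Path(sys.argv[1])))
```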
Produce the assessment in this format:
## Autoresearch Fit Assessment: [Skill Name]
**Plugin:** [plugin-name]
**Skill path:** [relative path from repo root]
### Scores
| Dimension | Score | Rationale |
|---|---|---|
| Objectivity | X/10 | [one line] |
| Execution Speed | X/10 | [one line] |
| Frequency of Use | X/10 | [one line] |
| Potential Utility | X/10 | [one line] |
| **TOTAL** | **X/40** | |
**Verdict: [HIGH / MEDIUM / LOW / NOT_VIABLE]**
**Loop type: [DETERMINISTIC / LLM_IN_LOOP / HYBRID]**
### Proposed 3-File Architecture
**Spec (`program.md`):**
> [2-3 sentences: optimization goal + constraints + NEVER STOP directive]
**Mutation Target:** `[path/to/file]`
**Evaluator command:**
```bash
[shell command that outputs a single number]
```

**Deterministic:** [YES / NO + explanation]
### Recommendation
[1-2 sentences. If MEDIUM: what to address first.]
Then update the JSON using the update script:
```bash
DATA_JSON=$(git rev-parse --show-toplevel)/plugin-research/experiments/analyze-candidates-for-auto-reseaarch/skills/eval-autoresearch-fit/assets/resources/summary-ranked-skills.json
SKILL_DIR=.
python "$SKILL_DIR/scripts/update_ranked_skills.py" \
--json-path "$DATA_JSON" \
--plugin <plugin> \
--skill <skill> \
--objectivity X --speed X --frequency X --utility X \
--verdict HIGH|MEDIUM|LOW|NOT_VIABLE \
--loop-type DETERMINISTIC|LLM_IN_LOOP|HYBRID \
--mutation-target "path/to/file" \
--evaluator-command "python evaluate.py ..." \
--barriers "Barrier 1" "Barrier 2" \
--eval-notes "Key insight from this evaluation" \
  --status EVALUATED
```
```bash
DATA_JSON=$(git rev-parse --show-toplevel)/plugin-research/experiments/analyze-candidates-for-auto-reseaarch/skills/eval-autoresearch-fit/assets/resources/summary-ranked-skills.json

# List all entries with current status
python ./scripts/update_ranked_skills.py --json-path "$DATA_JSON" --list

# Show a specific entry
python ./scripts/update_ranked_skills.py --json-path "$DATA_JSON" \
  --plugin agent-execution-disciplines --skill verification-before-completion --show

# List only PENDING entries (next batch to evaluate)
python ./scripts/update_ranked_skills.py --json-path "$DATA_JSON" \
  --list --filter-status PENDING

# Generate morning report (full ranked table + recommendation)
python ./scripts/update_ranked_skills.py --json-path "$DATA_JSON" --morning-report
```
When the user says "evaluate next batch" or "continue the list":
- Run the update script with `--list --filter-status PENDING` to see remaining skills
- Work through them in ranked order of `total_autoresearch_viability`

When a skill scores HIGH or MEDIUM, scaffold the actual autoresearch loop inside the target skill using the `autoresearch/` convention. This folder lives inside the target skill directory.
Directory convention (inside the target skill):
```
plugins/<plugin>/skills/<skill>/
  SKILL.md            ← mutation target (agent edits this each iteration)
  autoresearch/       ← the loop lives here
    program.md        ← the spec (goal + constraints + NEVER STOP)
    evaluate.py       ← LOCKED evaluator (agent must never modify this)
    results.tsv       ← experiment ledger (one row per iteration)
    tasks/            ← golden task fixtures (LLM_IN_LOOP skills only)
    test-fixtures/    ← deterministic inputs (DETERMINISTIC skills only)
```
Why evaluate.py runs every iteration:
The loop is: agent mutates SKILL.md → run autoresearch/evaluate.py → record metric in
results.tsv → KEEP (commit) or DISCARD (git reset). evaluate.py is locked — the agent only
touches the mutation target.
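The keep/discard step can be sketched in Python. This is a simplification of what a loop harness would do; the file paths and the higher-is-better assumption are illustrative, not part of any skill:

```python
"""Sketch of one autoresearch iteration (simplified; a real harness
reads the goal and metric direction from program.md)."""
import subprocess
from pathlib import Path


def best_metric(ledger_text: str) -> float:
    """Best metric recorded so far in results.tsv (header row skipped)."""
    rows = [r for r in ledger_text.splitlines()[1:] if r.strip()]
    return max((float(r.split("\t")[1]) for r in rows), default=float("-inf"))


def run_iteration(skill_dir: Path) -> str:
    """Evaluate the freshly mutated SKILL.md, then KEEP or DISCARD."""
    auto = skill_dir / "autoresearch"
    metric = float(subprocess.check_output(
        ["python", str(auto / "evaluate.py")], text=True))  # locked evaluator
    ledger = auto / "results.tsv"
    if metric > best_metric(ledger.read_text()):
        commit = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True).strip()
        with ledger.open("a") as f:
            f.write(f"{commit}\t{metric}\tKEEP\titeration note\n")
        subprocess.run(["git", "commit", "-am", f"autoresearch: metric {metric}"],
                       check=True)  # KEEP
        return "KEEP"
    subprocess.run(["git", "reset", "--hard"], check=True)  # DISCARD
    return "DISCARD"
```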
Cost by loop type:
Implement DETERMINISTIC candidates first where possible (fast, free, many trials).
Scaffold steps for HIGH/MEDIUM verdicts:
1. Create `autoresearch/` inside the target skill directory
2. Write `program.md` from the template (goal, metric, mutation target, NEVER STOP)
3. Write `evaluate.py` implementing the evaluator command from the assessment
4. Create `results.tsv` with header: `commit\tmetric\tstatus\tdescription`
5. For DETERMINISTIC skills, add `test-fixtures/` with at least one deterministic input
6. For LLM_IN_LOOP skills, add `tasks/` with at least one human-validated golden task
7. Resolve the target skill under `plugins/` from the repo root; report the path before proceeding
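The scaffold steps above can be sketched as follows (a minimal sketch: `program.md` and `evaluate.py` are created empty here, whereas in practice they come from the template and the assessment):

```python
from pathlib import Path


def scaffold(skill_dir: Path, loop_type: str) -> None:
    """Create the autoresearch/ skeleton inside a target skill directory."""
    auto = skill_dir / "autoresearch"
    auto.mkdir(exist_ok=True)
    (auto / "program.md").touch()   # spec: goal, metric, NEVER STOP
    (auto / "evaluate.py").touch()  # locked evaluator (write from assessment)
    (auto / "results.tsv").write_text("commit\tmetric\tstatus\tdescription\n")
    if loop_type == "DETERMINISTIC":
        (auto / "test-fixtures").mkdir(exist_ok=True)  # deterministic inputs
    elif loop_type == "LLM_IN_LOOP":
        (auto / "tasks").mkdir(exist_ok=True)  # golden task fixtures
```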