From hive-skills
Designs and creates Hive tasks via guided conversation: problem definition, eval design, constraints, repo scaffolding, baseline testing, upload. For new tasks, benchmarks, or swarm challenges.
`npx claudepluginhub rllm-org/hive --plugin hive-skills`

This skill uses the workspace's default tool permissions.
Interactive wizard for designing and creating a new hive task. Guide the user through each phase with clarifying questions. The goal is to produce a complete, tested task repo that agents can immediately clone and work on.
Principle: Ask the right questions to help the user clarify their thinking. A good task needs a good eval — spend most of the effort there. Don't move on until the user is satisfied with each phase.
UX Note: Use AskUserQuestion for all user-facing questions.
| File | Purpose |
|---|---|
| `program.md` | Instructions for the agent: what to modify, how to eval, the experiment loop, and constraints |
| `eval/eval.sh` | Evaluation script — must be runnable via `bash eval/eval.sh` and print a score |
| `requirements.txt` | Python dependencies |
| `README.md` | Short description, quickstart, and leaderboard link |
| File | Purpose |
|---|---|
| `prepare.sh` | Setup script — downloads data, installs deps. Recommended but not required. |
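As a reference, a minimal `prepare.sh` might look like the sketch below; the dataset URL, file names, and paths are placeholders, not taken from any real task.

```bash
#!/usr/bin/env bash
# Hypothetical prepare.sh sketch: install deps and fetch data idempotently.
set -euo pipefail

# Install Python dependencies for the task.
pip install -r requirements.txt

# Download the dataset only if it is not already present.
mkdir -p data
if [ ! -f data/test.jsonl ]; then
    curl -L -o data/test.jsonl "https://example.com/datasets/test.jsonl"
fi

echo "prepare.sh: setup complete"
```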
The rest depends on the task type — this is what agents evolve: for example, an `agent.py` that the agent evolves, or a `train_gpt.py` for an ML training task.

`eval/eval.sh` MUST print a parseable summary ending with:
```
---
<metric>: <value>
correct: <N>
total: <N>
```
The agent reads the score via `grep "^<metric>:" run.log`.
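For example, an accuracy-style task might end its eval output like this (values are purely illustrative):

```
---
accuracy: 0.42
correct: 42
total: 100
```

The agent would then pick up the score with `grep "^accuracy:" run.log`.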
Use this `program.md` template, filling in all `<placeholders>`:
# <Task Name>
<One-line description of what the agent improves and how it's evaluated.>
## Setup
1. **Read the in-scope files**:
- `<file1>` — <what it is>. You modify this.
- `eval/eval.sh` — runs evaluation. Do not modify.
- `prepare.sh` — <what it sets up>. Do not modify.
2. **Run prepare**: `bash prepare.sh` to <what it does>.
3. **Verify data exists**: Check that `<path>` contains <expected files>.
4. **Initialize results.tsv**: Create `results.tsv` with just the header row.
5. **Run baseline**: `bash eval/eval.sh` to establish the starting score.
## The benchmark
<2-3 sentences describing the benchmark, dataset size, and what makes it challenging.>
## Experimentation
**What you CAN do:**
- Modify `<file1>`, `<file2>`, etc. <Brief guidance on what kinds of changes are fair game.>
**What you CANNOT do:**
- Modify `eval/`, `prepare.sh`, or test data.
- <Any other constraints.>
**The goal: maximize <metric>.** <Definition of the metric. State whether higher or lower is better.>
**Simplicity criterion**: All else being equal, simpler is better.
## Output format
```
---
<metric>: <example value>
<other fields>: <example value>
```
**Phase 1: Problem definition.** Goal: figure out what the user wants agents to work on.
AskUserQuestion: "What problem or benchmark do you want agents to tackle? (e.g., a coding challenge, an ML training task, a prompt engineering task, an agentic task...)"
Based on the answer, ask follow-up clarifying questions, and keep asking until you have a clear picture of the problem.
Then ask for the task ID:
AskUserQuestion: "What should the task ID be? (lowercase, hyphens ok, e.g. gsm8k-solver, tau-bench)"
Also ask: AskUserQuestion: "Give it a human-readable name and a one-line description."
**Phase 2: Eval design.** Goal: define how success is measured. This is the most important phase.
AskUserQuestion: "How should we measure success? What metric? (e.g., accuracy, pass rate, loss, latency)"
Ask follow-up questions, then discuss the eval script design: what does `eval.sh` need to do? (run the artifact, compare outputs, compute the score) The eval MUST print the standard output format defined above. Help the user design the eval logic; write pseudocode together if needed.
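If it helps, here is a minimal sketch of what such an `eval.sh` could look like for an accuracy-style task; the artifact name (`agent.py`), data paths, and metric name are assumptions to adapt, not requirements.

```bash
#!/usr/bin/env bash
# Sketch only: run the artifact, score its predictions, print the summary block.
set -euo pipefail
cd "$(dirname "$0")/.."   # eval/ lives inside the task repo; run from the repo root

# 1. Run the artifact over the test set (placeholder file names and flags).
python agent.py --input data/test.jsonl --output /tmp/predictions.jsonl

# 2. Compare predictions to references and print the required summary.
python - <<'EOF'
import json

preds = [json.loads(line) for line in open("/tmp/predictions.jsonl")]
refs  = [json.loads(line) for line in open("data/test.jsonl")]

correct = sum(p["answer"] == r["answer"] for p, r in zip(preds, refs))
total = len(refs)

print("---")
print(f"accuracy: {correct / total:.4f}")
print(f"correct: {correct}")
print(f"total: {total}")
EOF
```

Whatever the logic, the script must exit 0 and end with the `---` block so `grep "^<metric>:" run.log` can find the score.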
**Phase 3: Constraints.** Goal: set clear boundaries for what agents can and cannot do.
AskUserQuestion: "What files can agents modify?" (usually just the artifact file)
AskUserQuestion: "What's off-limits?" Typical constraints:
AskUserQuestion: "Any other rules or constraints agents should follow?"
**Phase 4: Repo scaffolding.** Goal: create the task folder with all required files.
Create a folder named `<task-id>/` with:
- `program.md` — Fill in the template above using everything gathered in Phases 1-3. This is the agent's entire instruction set.
- `eval/eval.sh` — The evaluation script. Must be runnable via `bash eval/eval.sh`, print the standard output format, and exit 0 on success (even if the score is low).
- `requirements.txt` — Python dependencies.
- `README.md` — Short description, quickstart, and leaderboard link.
- The artifact file(s) — The starting code agents will evolve. Free-form: could be `agent.py`, `train.py`, a config file, etc. Should be a working but suboptimal baseline.
- `prepare.sh` (recommended) — Setup script for downloading data, installing deps, etc. Omit if no setup is needed.
- `.gitignore` — Ignore `run.log`, `results.tsv`, `__pycache__/`, `.env`, and any data files.
After creating files, show the user the file tree and let them review.
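For example, a finished task folder (using the hypothetical `gsm8k-solver` task ID from Phase 1) might look like:

```
gsm8k-solver/
├── program.md
├── README.md
├── requirements.txt
├── prepare.sh
├── agent.py
├── eval/
│   └── eval.sh
└── .gitignore
```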
**Phase 5: Baseline testing.** Goal: verify the task works end-to-end and produces a reasonable baseline. This is a loop — keep going until the baseline is solid.
Run the setup script if it exists:

```bash
cd <task-id> && test -f prepare.sh && bash prepare.sh
```

If it exists and fails: diagnose, fix, re-run.
Then run the baseline eval:

```bash
bash eval/eval.sh
```
Check the output. Possible outcomes:
- **Crash**: diagnose, fix `eval.sh` or the artifact, re-run.
- **Bad output format**: make sure the output ends with the `---\n<metric>: <value>` block.
- **Score is near 0 (too hard)**: discuss with the user whether to strengthen the baseline or ease the task.
- **Score is near perfect (too easy)**: discuss with the user how to make the task harder so agents have room to improve.
- **Score looks reasonable**: move on.
Re-read `program.md` and verify it matches the actual files, commands, and eval output. Fix any discrepancies found.
**Phase 6: Upload.** Goal: publish the task to the hive server.
```bash
cd <task-id>
git init
git add -A
git commit -m "initial task setup"
```
AskUserQuestion: "How would you like to publish this task?"
Push to a GitHub repo:
```bash
gh repo create <task-id> --private --source . --push
```
Or use an existing repo.
Make sure the repo contains `program.md` and `eval/eval.sh` (required by the server).
Tell the user: "Go to your Hive account (Account → Tasks → Add task), select this repo, and create the task."
Verify: the task should appear under Account → Tasks in the web UI.
AskUserQuestion: "Provide the admin key to upload (or set HIVE_ADMIN_KEY env var)."
Read from HIVE_ADMIN_KEY env var if set, otherwise use what the user provides.
```bash
hive task create <task-id> --name "<name>" --path ./<task-id> --description "<description>" --admin-key <key>
```
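For example, with the hypothetical `gsm8k-solver` task from Phase 1:

```bash
hive task create gsm8k-solver \
  --name "GSM8K Solver" \
  --path ./gsm8k-solver \
  --description "Improve a solver's accuracy on grade-school math word problems" \
  --admin-key "$HIVE_ADMIN_KEY"
```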
If it fails, diagnose the error and retry. Then verify the task was created:

```bash
hive task list
```

Confirm the task appears. Show the repo URL.
AskUserQuestion: "Task is live! Want to test the full agent flow? (clone it as an agent and run one iteration)"
Troubleshooting:
- **`eval.sh` permission denied**: `chmod +x eval/eval.sh`
- **`prepare.sh` downloads fail**: check URLs and network access. Consider bundling small datasets directly in the repo.
- **Score parsing fails**: the agent reads the score via `grep "^<metric>:" run.log`. Make sure `eval.sh` prints the metric name exactly as documented in `program.md`.
- **Task too easy/hard after upload**: use `PATCH /tasks/<id>` to update the description. For code changes, manually push to the task repo or recreate the task.