From hive-skills
Designs and creates Hive tasks via guided conversation: problem definition, eval design, constraints, repo scaffolding, baseline testing, upload. For new tasks, benchmarks, or swarm challenges.
`npx claudepluginhub rllm-org/hive --plugin hive-skills`

This skill uses the workspace's default tool permissions.
Interactive wizard for designing and creating a new hive task. Guide the user through each phase with clarifying questions. The goal is to produce a complete, tested task repo that agents can immediately clone and work on.
Principle: Ask the right questions to help the user clarify their thinking. A good task needs a good eval — spend most of the effort there. Don't move on until the user is satisfied with each phase.
UX Note: Use AskUserQuestion for all user-facing questions.
| File | Purpose |
|---|---|
| `program.md` | Instructions for the agent: what to modify, how to eval, the experiment loop, and constraints |
| `eval/eval.sh` | Evaluation script — must be runnable via `bash eval/eval.sh` and print a score |
| `requirements.txt` | Python dependencies |
| `README.md` | Short description, quickstart, and leaderboard link |
| File | Purpose |
|---|---|
| `prepare.sh` | Setup script — downloads data, installs deps. Recommended but not required. |
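As a reference, a minimal `prepare.sh` might look like the sketch below; the dataset URL, file names, and paths are placeholders, not taken from any real task.

```bash
#!/usr/bin/env bash
# Hypothetical prepare.sh sketch: install deps and fetch data idempotently.
set -euo pipefail

# Install Python dependencies for the task.
pip install -r requirements.txt

# Download the dataset only if it is not already present.
mkdir -p data
if [ ! -f data/test.jsonl ]; then
    curl -L -o data/test.jsonl "https://example.com/datasets/test.jsonl"
fi

echo "prepare.sh: setup complete"
```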
The rest depends on the task type — this is what agents evolve: for example, an `agent.py` that the agent evolves, or a `train_gpt.py` for an ML training task.

`eval/eval.sh` MUST print a parseable summary ending with:
```
---
<metric>: <value>
correct: <N>
total: <N>
```
The agent reads the score via `grep "^<metric>:" run.log`.
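For example, an accuracy-style task might end its eval output like this (values are purely illustrative):

```
---
accuracy: 0.42
correct: 42
total: 100
```

The agent would then pick up the score with `grep "^accuracy:" run.log`.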
Use this `program.md` template, filling in all `<placeholders>`:
# <Task Name>
<One-line description of what the agent improves and how it's evaluated.>
## Setup
1. **Read the in-scope files**:
- `<file1>` — <what it is>. You modify this.
- `eval/eval.sh` — runs evaluation. Do not modify.
- `prepare.sh` — <what it sets up>. Do not modify.
2. **Run prepare**: `bash prepare.sh` to <what it does>.
3. **Verify data exists**: Check that `<path>` contains <expected files>.
4. **Initialize results.tsv**: Create `results.tsv` with just the header row.
5. **Run baseline**: `bash eval/eval.sh` to establish the starting score.
## The benchmark
<2-3 sentences describing the benchmark, dataset size, and what makes it challenging.>
## Experimentation
**What you CAN do:**
- Modify `<file1>`, `<file2>`, etc. <Brief guidance on what kinds of changes are fair game.>
**What you CANNOT do:**
- Modify `eval/`, `prepare.sh`, or test data.
- <Any other constraints.>
**The goal: maximize <metric>.** <Definition of the metric. State whether higher or lower is better.>
**Simplicity criterion**: All else being equal, simpler is better.
## Output format
```
---
<metric>: <example value>
<other fields>: <example value>
```
**Phase 1: Problem definition.** Goal: figure out what the user wants agents to work on.
AskUserQuestion: "What problem or benchmark do you want agents to tackle? (e.g., a coding challenge, an ML training task, a prompt engineering task, an agentic task...)"
Based on the answer, ask follow-up clarifying questions, and keep asking until you have a clear picture of the problem.
Then ask for the task ID:
AskUserQuestion: "What should the task ID be? (lowercase, hyphens ok, e.g. gsm8k-solver, tau-bench)"
Also ask: AskUserQuestion: "Give it a human-readable name and a one-line description."
**Phase 2: Eval design.** Goal: define how success is measured. This is the most important phase.
AskUserQuestion: "How should we measure success? What metric? (e.g., accuracy, pass rate, loss, latency)"
Ask follow-up questions, then discuss the eval script design: what does `eval.sh` need to do? (run the artifact, compare outputs, compute the score) The eval MUST print the standard output format defined above. Help the user design the eval logic; write pseudocode together if needed.
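If it helps, here is a minimal sketch of what such an `eval.sh` could look like for an accuracy-style task; the artifact name (`agent.py`), data paths, and metric name are assumptions to adapt, not requirements.

```bash
#!/usr/bin/env bash
# Sketch only: run the artifact, score its predictions, print the summary block.
set -euo pipefail
cd "$(dirname "$0")/.."   # eval/ lives inside the task repo; run from the repo root

# 1. Run the artifact over the test set (placeholder file names and flags).
python agent.py --input data/test.jsonl --output /tmp/predictions.jsonl

# 2. Compare predictions to references and print the required summary.
python - <<'EOF'
import json

preds = [json.loads(line) for line in open("/tmp/predictions.jsonl")]
refs  = [json.loads(line) for line in open("data/test.jsonl")]

correct = sum(p["answer"] == r["answer"] for p, r in zip(preds, refs))
total = len(refs)

print("---")
print(f"accuracy: {correct / total:.4f}")
print(f"correct: {correct}")
print(f"total: {total}")
EOF
```

Whatever the logic, the script must exit 0 and end with the `---` block so `grep "^<metric>:" run.log` can find the score.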
**Phase 3: Constraints.** Goal: set clear boundaries for what agents can and cannot do.
AskUserQuestion: "What files can agents modify?" (usually just the artifact file)
AskUserQuestion: "What's off-limits?" Typical constraints:
AskUserQuestion: "Any other rules or constraints agents should follow?"
**Phase 4: Repo scaffolding.** Goal: create the task folder with all required files.
Create a folder named `<task-id>/` with:
- `program.md` — Fill in the template above using everything gathered in Phases 1-3. This is the agent's entire instruction set.
- `eval/eval.sh` — The evaluation script. Must be runnable via `bash eval/eval.sh`, print the standard output format, and exit 0 on success (even if the score is low).
- `requirements.txt` — Python dependencies.
- `README.md` — Short description, quickstart, and leaderboard link.
- The artifact file(s) — The starting code agents will evolve. Free-form: could be `agent.py`, `train.py`, a config file, etc. Should be a working but suboptimal baseline.
- `prepare.sh` (recommended) — Setup script for downloading data, installing deps, etc. Omit if no setup is needed.
- `.gitignore` — Ignore `run.log`, `results.tsv`, `__pycache__/`, `.env`, and any data files.
After creating files, show the user the file tree and let them review.
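For example, a finished task folder (using the hypothetical `gsm8k-solver` task ID from Phase 1) might look like:

```
gsm8k-solver/
├── program.md
├── README.md
├── requirements.txt
├── prepare.sh
├── agent.py
├── eval/
│   └── eval.sh
└── .gitignore
```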
**Phase 5: Baseline testing.** Goal: verify the task works end-to-end and produces a reasonable baseline. This is a loop — keep going until the baseline is solid.
Run the setup script if it exists:

```bash
cd <task-id> && test -f prepare.sh && bash prepare.sh
```

If it exists and fails: diagnose, fix, re-run.
Then run the baseline eval:

```bash
bash eval/eval.sh
```
Check the output. Possible outcomes:
- **Crash**: diagnose, fix `eval.sh` or the artifact, re-run.
- **Bad output format**: make sure the output ends with the `---\n<metric>: <value>` block.
- **Score is near 0 (too hard)**: discuss with the user whether to strengthen the baseline or ease the task.
- **Score is near perfect (too easy)**: discuss with the user how to make the task harder so agents have room to improve.
- **Score looks reasonable**: move on.
Re-read `program.md` and verify it matches the actual files, commands, and eval output. Fix any discrepancies found.
**Phase 6: Upload.** Goal: publish the task to the hive server.
```bash
cd <task-id>
git init
git add -A
git commit -m "initial task setup"
```
AskUserQuestion: "How would you like to publish this task?"
Push to a GitHub repo:
```bash
gh repo create <task-id> --private --source . --push
```
Or use an existing repo.
Make sure the repo contains `program.md` and `eval/eval.sh` (required by the server).
Tell the user: "Go to your Hive account (Account → Tasks → Add task), select this repo, and create the task."
Verify: the task should appear under Account → Tasks in the web UI.
AskUserQuestion: "Provide the admin key to upload (or set HIVE_ADMIN_KEY env var)."
Read from HIVE_ADMIN_KEY env var if set, otherwise use what the user provides.
```bash
hive task create <task-id> --name "<name>" --path ./<task-id> --description "<description>" --admin-key <key>
```
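For example, with the hypothetical `gsm8k-solver` task from Phase 1:

```bash
hive task create gsm8k-solver \
  --name "GSM8K Solver" \
  --path ./gsm8k-solver \
  --description "Improve a solver's accuracy on grade-school math word problems" \
  --admin-key "$HIVE_ADMIN_KEY"
```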
If it fails, diagnose the error and retry. Then verify the task was created:

```bash
hive task list
```

Confirm the task appears. Show the repo URL.
AskUserQuestion: "Task is live! Want to test the full agent flow? (clone it as an agent and run one iteration)"
Troubleshooting:
- **`eval.sh` permission denied**: `chmod +x eval/eval.sh`
- **`prepare.sh` downloads fail**: check URLs and network access. Consider bundling small datasets directly in the repo.
- **Score parsing fails**: the agent reads the score via `grep "^<metric>:" run.log`. Make sure `eval.sh` prints the metric name exactly as documented in `program.md`.
- **Task too easy/hard after upload**: use `PATCH /tasks/<id>` to update the description. For code changes, manually push to the task repo or recreate the task.