Skill

execute

Executes benchmark test cases in sandboxed VMs using AI agents. Spins up containers, scaffolds workspace, runs agent, and extracts solution artifacts.

testing

automation

Popularity

Stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/agentic-usability:execute [project-directory] [--tests TC-001,TC-002] [--run runId]

User invocable

Model invocation disabled

Inline context

Default effort

Uses dynamic context injection — preprocesses shell commands at runtime

Argument hint[project-directory] [--tests TC-001,TC-002] [--run runId]

Tool Access

This skill is limited to the following tools:

Bash(agentic-usability *)ReadGlob

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Run the executor stage of the benchmark pipeline. For each test case and target, this:

SKILL.md

71 lines · ~651 tokens

Stats

LanguageTypeScript

Stars15

MaintenanceExcellent

Last CommitJun 11, 2026

Actions

View Source View Plugin View on GitHub View README

Execute Test Cases

Run the executor stage of the benchmark pipeline. For each test case and target, this:

Creates a sandboxed VM from the target image
Scaffolds workspace (template → setup script → per-test setup instructions)
Uploads PROBLEM.md with the problem statement
Installs the agent CLI inside the sandbox
Uploads public sources (docs, packages) into /workspace/sources/
Runs the executor agent to solve the problem
Extracts solution files from /workspace/solution/
Saves all artifacts and destroys the sandbox

echo "Arguments: $ARGUMENTS"

Options

--tests <ids>: Comma-separated test case IDs to run (e.g., --tests TC-001,TC-003)
--run <runId>: Target a specific run directory (default: latest run)

Per-Test Output Files

Saved to results/<runId>/<target>/<testId>/:

File	Description
`generated-solution.json`	Agent's solution `[{path, content}]`
`agent-notes.md`	Agent's self-reported working notes
`agent-output.log`	Raw agent stdout/stderr
`agent-cmd.log`	Exact command executed
`agent-session.jsonl`	Agent conversation log (if available)
`agent-egress.log.json`	Network traffic logs
`workspace-snapshot.tar.gz`	Full sandbox workspace tarball
`setup.log`	Workspace scaffolding log
`agent-error.log`	Error details (only on failure)
`install-error.log`	Agent install failure (only on error)

Progress Tracking

Progress is tracked in results/<runId>/pipeline-state.json:

completed.execute["<target>"] lists test IDs that have finished
State is saved after each test — safe to interrupt and resume
Use agentic-usability eval --resume to continue from where it stopped

To check which tests completed, read the pipeline state:

results/<runId>/pipeline-state.json → completed.execute.<targetName>

Retry Behavior

Failed tests are retried up to 2 times with backoffs of 1s and 3s before being marked as failed.

Concurrency

Controlled by sandbox.concurrency in config.json. Multiple sandboxes run in parallel.

Run agentic-usability execute -p $ARGUMENTS and report the results.

For detailed internals, see pipeline-guide.md.

execute

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

execute

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

Execute Test Cases

Options

Per-Test Output Files

Progress Tracking

Retry Behavior

Concurrency

Similar Skills

Execute Test Cases

Options

Per-Test Output Files

Progress Tracking

Retry Behavior

Concurrency

Similar Skills