Fans out parallel QA sub-agents against devboy CLI to detect regressions (exit codes, stdout hygiene, error propagation, schema drift) and merges findings into a bug log. Use before releases, after merges, or CI expansions.
npx claudepluginhub meteora-pro/devboy-tools --plugin devboy
This skill uses the workspace's default tool permissions.
Run a batch of narrowly-scoped QA sub-agents against a built `devboy` binary, each chasing one class of regression, and merge their findings into a single bug log. The skill does **not** exercise third-party provider APIs by itself; it focuses on contract, hygiene, and schema-drift regressions that the unit test suite does not catch (exit codes, stdout vs stderr separation, `ProviderUnsupported` fallbacks, schema ↔ executor mismatches, etc.).
Unlike daily-report (trace-driven retrospective) and retro (multi-day pattern detector), this skill is execution-driven: it spins up the CLI many times with crafted inputs, diffs observed behaviour against a documented contract, and records every mismatch.
Not a substitute for:
- `cargo test` / `cargo clippy` / `cargo fmt --check` (the skill assumes those already passed).

The skill takes three optional inputs, all with sensible defaults:
- `--binary <path>`: which `devboy` to test. Default: `./target/release/devboy`, falling back to `$PATH`.
- `--bug-log <path>`: where the merged findings go. Default: `/tmp/devboy-qa/BUGS_FOUND.md`.
- `--classes <a,b,c,…>`: comma-separated list of bug classes to run (see the agent charters below). Default: all of them.

Before spinning up any sub-agents, the main agent confirms the binary is invokable and prints a meaningful `--version`:

    DEVBOY="${DEVBOY:-./target/release/devboy}"
    "$DEVBOY" --version || { echo "devboy binary at $DEVBOY is not runnable" >&2; exit 1; }

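The documented default for `--binary` (an explicit path first, then `./target/release/devboy`, then `$PATH`) can be sketched as a small function. The name `resolve_devboy` is hypothetical and not part of the skill; this is only one way the lookup order might be expressed.

```shell
# resolve_devboy [explicit-path]
# Prints the devboy binary to test, following the documented order:
# explicit path, then ./target/release/devboy, then $PATH.
# The function name is illustrative, not part of devboy.
resolve_devboy() (
  explicit=${1:-}
  if [ -n "$explicit" ]; then
    [ -x "$explicit" ] || { echo "not executable: $explicit" >&2; exit 1; }
    printf '%s\n' "$explicit"
  elif [ -x ./target/release/devboy ]; then
    printf '%s\n' ./target/release/devboy
  else
    # command -v prints the resolved path and fails if devboy is not on PATH.
    command -v devboy || { echo "devboy not found" >&2; exit 1; }
  fi
)
```

A caller would typically write `DEVBOY=$(resolve_devboy "$binary_arg") || exit 1` before running the `--version` check.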
The main agent also snapshots the git SHA of the source tree; every bug-log entry references it, so each finding is attributable to a specific build.
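A minimal sketch of that snapshot, assuming the source tree is a git checkout. The helper name `build_sha` and the `unknown` fallback are assumptions of this example.

```shell
# build_sha [source-dir]
# Prints the short commit SHA of the source tree, or "unknown" when the
# directory is missing, is not a git checkout, or git is unavailable.
build_sha() (
  cd "${1:-.}" 2>/dev/null || { echo unknown; exit 0; }
  git rev-parse --verify --short HEAD 2>/dev/null || echo unknown
)
```

The value can then be interpolated into the `**Run:**` line of each bug-log section.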
The sweep itself is a traced session — retros care about how often a QA pass surfaces new findings and how long it takes:

    result=$(devboy trace begin --skill qa-sweep)
    SESSION_DIR=$(echo "$result" | jq -r .session_dir)
    SESSION_ID=$(echo "$result" | jq -r .session_id)

Record a decision event listing which classes will run.
Each class gets its own sub-agent invocation (Claude Code: Agent tool; other runtimes: equivalent task / subprocess primitive). They run in parallel — the charters are independent and non-destructive. Each sub-agent gets:
- `$HOME`, `$TMPDIR`, `/tmp`: nothing that modifies a real project.

Launch all sub-agents in one fan-out; the main agent then waits for every report.
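For runtimes without a native agent primitive, the fan-out can be approximated with background jobs. In the sketch below, `RUNNER` stands in for whatever launches one sub-agent; it, the function name `fan_out`, and the per-class directory layout are all hypothetical.

```shell
# fan_out RUNNER OUT_DIR CLASS...
# Starts RUNNER once per class in the background, each with a private
# HOME and TMPDIR under OUT_DIR, then waits for all of them.
# Prints the number of runners that exited non-zero.
fan_out() (
  runner=$1; out=$2; shift 2
  pids=""
  for class in "$@"; do
    sandbox="$out/$class"
    mkdir -p "$sandbox/home" "$sandbox/tmp"
    HOME="$sandbox/home" TMPDIR="$sandbox/tmp" \
      "$runner" "$class" >"$sandbox/report.md" 2>"$sandbox/stderr.log" &
    pids="$pids $!"
  done
  failed=0
  for pid in $pids; do
    # wait PID returns that job's exit status.
    wait "$pid" || failed=$((failed + 1))
  done
  echo "$failed"
)
```

Waiting on each PID individually keeps one crashed runner from hiding behind the others; its report and stderr remain in its own sandbox directory.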
### `exit-codes`: Shell scripting readiness

**Charter.** Every error path must return non-zero. Shell scripts cannot differentiate success from failure if the CLI always exits 0.
Probe set (non-exhaustive):
| Case | Invocation |
|---|---|
| unknown skill | `devboy skills show bogus-skill` |
| unknown tool | `devboy tools call bogus_tool '{}'` |
| missing required arg | `devboy tools call` (no NAME) |
| malformed JSON arg | `devboy tools call get_issues 'not-json'` |
| unknown context | `devboy context use unknown` |
| unknown agent flag | `devboy skills install x --agent bogus --dry-run` |
| conflicting flags | `devboy skills install x --global --local --dry-run` |
| no args in install | `devboy skills install` |
| install missing skill | `devboy skills install nonexistent --dry-run` |
| remove missing skill `--strict` | `devboy skills remove nonexistent --global --strict --dry-run` |
| empty stdin for pipe | `echo '' \| devboy format-pipeline` |
| bad stdin JSON | `echo 'bad' \| devboy format-pipeline` |
| 401 from provider | `DEVBOY_GITHUB_TOKEN=bad devboy test github` |
| `tools/call` with `isError: true` | call any tool unsupported by the configured provider |
Expected: every row → exit code != 0.
Report shape: per failing row — the invocation, the stderr, the observed exit code, and (if the agent can infer it) the call chain that dropped the error.
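A probe in this class reduces to "run the command, fail the row if it exits 0". The helper below is a sketch; `expect_nonzero` and its output format are assumptions, not part of the skill.

```shell
# expect_nonzero LABEL CMD...
# Runs CMD, discarding stdout and keeping stderr for the report.
# Prints a FAIL line and returns 1 when the command exits 0, because every
# probe in this class is an error path that must fail.
expect_nonzero() (
  label=$1; shift
  err=$(mktemp) || exit 2
  "$@" >/dev/null 2>"$err"
  status=$?
  if [ "$status" -eq 0 ]; then
    echo "FAIL [$label] exit=0 cmd: $* stderr: $(cat "$err")"
    rm -f "$err"
    exit 1
  fi
  echo "ok   [$label] exit=$status"
  rm -f "$err"
)
```

For example, `expect_nonzero "unknown skill" "$DEVBOY" skills show bogus-skill`, or for the stdin rows, `echo 'bad' | expect_nonzero "bad stdin JSON" "$DEVBOY" format-pipeline`.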
### `stdout-hygiene`: Pipe-ability

**Charter.** Any subcommand whose output is meant to be parsed by a script must not mix tracing logs into stdout. The canonical test is `cmd | jq .`: if `jq` rejects the first line, hygiene is broken.
Probe set:
| Invocation | Expected |
|---|---|
| `devboy tools call get_issues '{"limit":1}'` | pure JSON on stdout |
| `devboy tools list` | tabular / JSON, no interleaved INFO/WARN |
| `devboy tools call get_issue '{"key":"<real>"}'` | pure JSON |
| `devboy format-pipeline` (with valid JSON stdin) | pure TOON / JSON output |
| `devboy proxy status --json` | pure JSON |
| `devboy mcp` (one `initialize` + one `tools/list`) | every line on stdout must parse as JSON-RPC |
Report: for each broken invocation, a diff of where the noise starts and the env / subscriber config that routes it to the wrong stream.
### `error-propagation`: Upstream failures surface to the caller

**Charter.** When a provider returns a real error (`NotFound`, `InvalidData`, `Http`, `Unauthorized`), the caller must see that error, not a generic "no provider supports X" fallback from `should_try_next_provider`.
Probe set:
- `tools call get_pipeline '{"branch":"empty-branch"}'`: expect a concrete `NotFound(...)` message, not "No provider supports 'get_pipeline'".
- `DEVBOY_GITHUB_TOKEN=ghp_bad tools call get_issue '{"key":"gh#1"}'` → `Unauthorized` text.
- `tools call get_merge_request_diffs '{"key":"pr#999999"}'` → concrete `NotFound`, not a provider-skipping fallback.
- `tools call get_structure_forest '{"structureId":1}'` against a GitHub-only setup → `ProviderUnsupported` IS acceptable here (the provider legitimately does not implement the tool).

Report: the invocation, the error text the user saw, and whether it matches the expected variant. False-positive "No provider supports …" messages are the highest-severity finding.
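The three report outcomes can be told apart by inspecting the text the user would see. The sketch below assumes the error variant name appears literally in the message, which the probes above suggest but do not guarantee; `classify_error` is a hypothetical helper.

```shell
# classify_error EXPECTED CMD...
# Captures stdout and stderr of CMD and classifies what the user would see:
#   fallback : the generic "No provider supports" message (highest severity)
#   match    : the expected variant name appears in the output
#   other    : neither; needs a manual look
classify_error() (
  expected=$1; shift
  output=$("$@" 2>&1)
  case $output in
    *"No provider supports"*) echo fallback ;;
    *"$expected"*)            echo match ;;
    *)                        echo other ;;
  esac
)
```

For instance, `classify_error Unauthorized env DEVBOY_GITHUB_TOKEN=ghp_bad "$DEVBOY" tools call get_issue '{"key":"gh#1"}'`. The `get_structure_forest` probe needs the opposite reading, since a `ProviderUnsupported` result is acceptable there.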
### `schema-sync`: Tool schema vs executor params

**Charter.** For every tool in `devboy tools list`, the arg names the schema declares must be exactly the arg names the executor deserialises. camelCase vs snake_case mismatches silently drop parameters through `unwrap_or_default()` in the current code.
Procedure:
- Run `devboy mcp` with an `initialize` + `tools/list` stdin sequence; capture every `inputSchema.properties.<name>`.
- Run `devboy tools call <name> '<payload>'` against a local fake provider (GitHub test repo is OK) and compare the response against what each param should have done.
- `state` value that did not actually filter).

Report: per tool, the full declared schema vs the names the executor actually honours.
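Once the declared names and the names the executor honours are written to two files (one name per line), the camelCase vs snake_case comparison is mechanical. How the honoured list is obtained is left open here; both helper names are hypothetical.

```shell
# camel_to_snake NAME
# Converts a camelCase parameter name to snake_case.
camel_to_snake() {
  printf '%s\n' "$1" | sed 's/\([A-Z]\)/_\1/g' | tr '[:upper:]' '[:lower:]'
}

# case_mismatches DECLARED_FILE HONOURED_FILE
# Prints every declared name the executor does not honour verbatim, and
# flags the ones whose snake_case spelling it does honour, which is the
# silent-drop pattern described in the charter.
case_mismatches() (
  while IFS= read -r name; do
    [ -n "$name" ] || continue
    grep -qxF -- "$name" "$2" && continue
    snake=$(camel_to_snake "$name")
    if grep -qxF -- "$snake" "$2"; then
      echo "$name: executor reads $snake"
    else
      echo "$name: not honoured"
    fi
  done <"$1"
)
```

An empty result means every declared name is honoured as spelled.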
### `help-accuracy`: `--help` vs reality

**Charter.** Every `--help` page should accurately describe what the subcommand does. `--global` is not "upgrade everything across every recorded target"; it is "target `~/.agents/skills/`".
Procedure:
- Run `devboy <subcommand> [--subsubcommand] --help`.

Report: per mismatch, the help text and the contradicting observation.
### `config-resolution`: Where does `.devboy.toml` come from?

**Charter.** All subcommands should agree on which `.devboy.toml` is active. Current bugs: `config list` / `config path` look only in the global config, while `tools call` / `context list` / `test` walk up from cwd, and cwd discovery can walk into an unrelated parent project.
Procedure:
- Create `/tmp/a` (empty), `/tmp/a/b` (contains `.devboy.toml`), `/tmp/a/b/c` (empty, subdir of `b`).
- In each directory, run `config list`, `config path`, `tools call get_issues`, `context list`, `test github`, `doctor`.
- Record which `.devboy.toml` each subcommand resolved.

Report: the resolution matrix. Inconsistencies are bugs.
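When filling in the matrix it helps to have a reference answer for "what would walking up from cwd find?". The function below is a model for comparison only, not devboy's implementation; its name is hypothetical.

```shell
# find_config_upwards START_DIR
# Models cwd-based discovery: prints the nearest .devboy.toml at or above
# START_DIR and returns 1 when none exists up to the filesystem root.
find_config_upwards() (
  dir=$(cd "$1" && pwd -P) || exit 1
  while :; do
    if [ -f "$dir/.devboy.toml" ]; then
      printf '%s\n' "$dir/.devboy.toml"
      exit 0
    fi
    [ "$dir" = "/" ] && exit 1
    dir=$(dirname "$dir")
  done
)
```

Because the walk only stops at `/`, a stray `.devboy.toml` in any ancestor is picked up, which is how discovery can land in an unrelated parent project.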
### `credential-resolution`: env vars vs keychain

**Charter.** `DEVBOY_<PROVIDER>_TOKEN` env vars should be honoured by every subcommand that consults credentials (not just `test <provider>`). `devboy doctor` should report the same credential state that `tools call` actually uses.
Procedure:
- Set `DEVBOY_GITHUB_TOKEN=<valid>`; no keychain entry.
- Run `devboy test github` (expect PASS), `devboy tools call get_issues '{}'` (expect real data), `devboy doctor` (expect "GitHub token present").

Report: per subcommand, "honours env var" / "does not".
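One way to produce the per-subcommand verdict is to run each probe with only the env var available. The sketch below points `HOME` and `XDG_CONFIG_HOME` at an empty sandbox and treats exit status 0 as "honours env var"; that signal is an assumption, and redirecting `HOME` does not by itself remove an OS keychain entry.

```shell
# env_token_verdict TOKEN CMD...
# Runs CMD with DEVBOY_GITHUB_TOKEN set and HOME / XDG_CONFIG_HOME pointed
# at an empty sandbox, then prints the report verdict for that subcommand.
env_token_verdict() (
  token=$1; shift
  sandbox=$(mktemp -d) || exit 2
  HOME="$sandbox" XDG_CONFIG_HOME="$sandbox/.config" \
    DEVBOY_GITHUB_TOKEN="$token" "$@" >/dev/null 2>&1
  status=$?
  rm -rf "$sandbox"
  if [ "$status" -eq 0 ]; then
    echo "honours env var"
  else
    echo "does not"
  fi
)
```

Probes whose success is only visible in the output, such as the `devboy doctor` message, still need their text checked separately.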
### `skills-lifecycle`: Install idempotence

**Charter.** Install is idempotent; re-install of the same bytes is a no-op; re-install of changed bytes respects `--force`; manifest stays in sync with disk.
Procedure (per target: --global, --agent claude, --agent all):
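1. Fresh install: expect `installed` outcomes + manifest present.
2. Re-install the same bytes: expect `unchanged` outcomes; manifest unchanged.
3. Modify an installed file, then re-install without `--force` → expect `skipped` (user-modified).
4. Re-install with `--force` → expect `forced`.

Compare the manifest SHA256 against the shipped `history.json` at each step.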
Report: any step that does not match the contract above.
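The "re-install is a no-op" part of the contract can be checked by hashing the target directory before and after the command. This is a sketch under the assumption that content hashes are enough (it ignores mtimes and permissions); `tree_digest` and `assert_noop` are hypothetical names.

```shell
# tree_digest DIR
# Prints one SHA-256 digest covering every file path and content under DIR.
tree_digest() (
  cd "$1" || exit 1
  if command -v sha256sum >/dev/null 2>&1; then
    sum_cmd="sha256sum"
  else
    sum_cmd="shasum -a 256"   # e.g. macOS, where sha256sum may be absent
  fi
  find . -type f | LC_ALL=C sort | while IFS= read -r f; do
    $sum_cmd "$f"
  done | $sum_cmd | cut -d' ' -f1
)

# assert_noop DIR CMD...
# Runs CMD and fails if any file under DIR was added, removed, or changed.
assert_noop() (
  dir=$1; shift
  before=$(tree_digest "$dir") || exit 1
  "$@" >/dev/null 2>&1
  after=$(tree_digest "$dir") || exit 1
  [ "$before" = "$after" ]
)
```

Run inside the sandboxed `$HOME`, a call such as `assert_noop "$HOME/.agents/skills" "$DEVBOY" skills install x --global` covers the second install of the same bytes.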
Each sub-agent appends a section to the bug log in this format:

    ## qa-sweep / <class-id> — <short title>
    **Run:** <timestamp> — binary <path> @ <git sha>
    **Status:** FOUND | CLEAN
    ### Findings
    - **[SEVERITY]** *Component* — one-sentence summary
      - **Repro:** <commands>
      - **Expected:** <behaviour>
      - **Actual:** <behaviour>
      - **Hint:** <grep-level pointer into the codebase, if any>

Severity scale: BLOCKER (release gate), CRITICAL, MAJOR, MINOR, COSMETIC.
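Because the scale is ordered and `BLOCKER` is the release gate, two small helpers are enough to validate severities and to gate on the log. Both names, and the numeric ranks, are assumptions of this sketch.

```shell
# severity_rank NAME
# Prints 0 (most severe) to 4 for names on the scale; returns 1 otherwise.
severity_rank() {
  case $1 in
    BLOCKER)  echo 0 ;;
    CRITICAL) echo 1 ;;
    MAJOR)    echo 2 ;;
    MINOR)    echo 3 ;;
    COSMETIC) echo 4 ;;
    *)        return 1 ;;
  esac
}

# release_blocked LOG
# Succeeds when the bug log contains at least one BLOCKER finding line.
release_blocked() {
  grep -qF -- '- **[BLOCKER]**' "$1"
}
```

A release script might use `if release_blocked /tmp/devboy-qa/BUGS_FOUND.md; then exit 1; fi`.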
The main agent:
- Ends the trace session (`devboy trace end ... --outcome success --summary "<N> findings across <K> classes"`).

After the bug log is written, the main agent prints a short summary to stdout:

    QA sweep: 17 findings (2 blocker, 3 critical, 9 major, 3 minor)
    - exit-codes: 4 findings (1 critical)
    - stdout-hygiene: 2 findings (1 critical)
    - error-propagation: 2 findings (1 critical)
    - schema-sync: 7 findings (2 blocker)
    - help-accuracy: 1 finding
    - config-resolution: 0 findings ✓
    - credential-resolution: 1 finding
    - skills-lifecycle: 0 findings ✓
    Full log: /tmp/devboy-qa/BUGS_FOUND.md

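The headline of that summary can be derived from the bug log by counting finding lines per severity. The sketch assumes each finding starts with `- **[SEVERITY]**` as in the entry format, and omits severities with no findings; `summary_line` is a hypothetical name.

```shell
# summary_line LOG
# Prints "QA sweep: N findings (...)" from the finding lines in LOG.
summary_line() (
  log=$1; total=0; parts=""
  for sev in BLOCKER CRITICAL MAJOR MINOR COSMETIC; do
    # grep -c prints the count even when it is 0; a missing log yields "".
    n=$(grep -cF -- "- **[$sev]**" "$log" 2>/dev/null)
    n=${n:-0}
    [ "$n" -gt 0 ] || continue
    total=$((total + n))
    label=$(printf '%s' "$sev" | tr '[:upper:]' '[:lower:]')
    parts="${parts:+$parts, }$n $label"
  done
  noun=findings
  [ "$total" -eq 1 ] && noun=finding
  if [ "$total" -eq 0 ]; then
    echo "QA sweep: 0 findings"
  else
    echo "QA sweep: $total $noun ($parts)"
  fi
)
```

Deriving the totals from the log keeps the headline and the per-class lines from drifting apart.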
- `Status: CLEAN` marker. Silent skips are not acceptable.
- `start`, one `decision`, one `note` per sub-agent (status + finding count), one `end`.
- `--allow-writes`. The default test repo for live provider calls must be scoped by env var (`DEVBOY_QA_TEST_REPO=owner/repo`) so a typo cannot spam a production project.
- `trace` redactor already covers the meta-trace; sub-agents must mirror the same discipline on their own stdout / stderr captures.
- `HOME` / `XDG_CONFIG_HOME` / `.devboy.toml` search paths.
- `FOUND BLOCKER: binary swapped mid-sweep`.
- `--reset` flag.
- `unwrap()` / off-by-one / rare deserialisation edge cases are the province of `cargo test`; this skill is for behavioural hygiene only.
- `devboy benchmark` already covers the format pipeline; latency / throughput findings are out of scope.
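The `DEVBOY_QA_TEST_REPO` scoping rule can be enforced with a guard that runs before any live provider call. The exact validation (a single `owner/repo` pair made of letters, digits, `.`, `_`, and `-`) is an assumption of this sketch, as is the name `require_test_repo`.

```shell
# require_test_repo
# Returns 0 only when DEVBOY_QA_TEST_REPO looks like owner/repo.
# Anything else, including an unset variable, is refused with a message
# on stderr so a typo cannot reach a live provider call.
require_test_repo() {
  case ${DEVBOY_QA_TEST_REPO:-} in
    *[!A-Za-z0-9._/-]* | */*/*)
      echo "DEVBOY_QA_TEST_REPO must look like owner/repo" >&2
      return 1 ;;
    ?*/?*)
      return 0 ;;
    *)
      echo "DEVBOY_QA_TEST_REPO must look like owner/repo" >&2
      return 1 ;;
  esac
}
```

A sub-agent would call `require_test_repo || exit 1` once, before its first probe that talks to a provider.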