Reviews printed CLI sampled command outputs for plausibility issues like semantic substring mismatches, silent source drops, ranking failures, wrong URLs, and format bugs missed by rule-based checks.
npx claudepluginhub mvanhorn/cli-printing-press --plugin cli-printing-press
Review the sampled outputs from a printed CLI for plausibility bugs that dogfood, verify, and the rule-based `scorecard --live-check` rules can't catch. Wave B policy: all findings surface as warnings, never errors.
This skill is internal-only (`user-invocable: false`). It is invoked by its parents: the main printing-press skill at shipcheck Phase 4.85, and the polish skill during its diagnostic loop. Run standalone, it would produce floating findings text with no ship verdict, no fixes applied, and no publish offer; the actionable wrappers are /printing-press and /printing-press-polish. The skill carries `context: fork` so the reviewer agent's diagnostic chatter stays isolated from the calling skill's context.
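A minimal frontmatter sketch of that wiring. This is hypothetical: the field names are taken from the prose above, and the exact skill-manifest schema is not confirmed here.

name: output-review     # hypothetical skill name
user-invocable: false   # internal-only; no standalone invocation
context: fork           # reviewer chatter stays out of the caller's context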
The caller passes $CLI_DIR as the argument: an absolute path to the printed CLI's working directory.
The target bug classes are those that rule-based checks miss: typically surfaced by five minutes of hands-on testing, yet slipping past dogfood, verify, and the `scorecard --live-check` rules. The four checks in the reviewer prompt below enumerate them.
# Locate research.json. Adjacent to the binary covers the post-promote
# layout (standalone polish, shipcheck against the library copy). The
# grandparent fallback covers mid-pipeline invocations where $CLI_DIR is
# $PRESS_RUNSTATE/runs/<id>/working/<cli> and research.json lives at
# $PRESS_RUNSTATE/runs/<id>/research.json. Without the fallback, scorecard
# reports `unable: true` mid-pipeline and we SKIP the most informative review.
# Use a bash array so the flag survives paths with spaces.
RESEARCH_ARGS=()
if [ ! -f "$CLI_DIR/research.json" ]; then
  _grandparent="$(dirname "$(dirname "$CLI_DIR")")"
  if [ -f "$_grandparent/research.json" ]; then
    RESEARCH_ARGS=(--research-dir "$_grandparent")
  fi
fi
printing-press scorecard --dir "$CLI_DIR" "${RESEARCH_ARGS[@]}" --live-check --json > /tmp/output-review-livecheck.json 2>&1 || true
If the scorecard call fails or /tmp/output-review-livecheck.json is empty, return the SKIP result (Step 3) without dispatching the reviewer.
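A minimal sketch of that guard, assuming `jq` is available; the `live_check.features` path matches the reviewer prompt below, and the SKIP block itself is defined in Step 3:

# Bail out before dispatching the reviewer when there is nothing to review.
# Sketch only: assumes jq is installed.
if [ ! -s /tmp/output-review-livecheck.json ] || \
   ! jq -e '.live_check.features | length > 0' /tmp/output-review-livecheck.json >/dev/null 2>&1; then
  echo "output-review: SKIP (no usable live-check data)" >&2
  # ...emit the Step 3 SKIP block and stop here...
fi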
Use the Agent tool (general-purpose) with this prompt contract:
Review the sampled outputs from the shipped CLI at
$CLI_DIR. You have these ground-truth sources:
- Sampled command output: read /tmp/output-review-livecheck.json and inspect the `live_check.features[]` array. Each entry has the command, the example invocation, the actual stdout (in `output_sample`, bounded to ~4 KiB), the pass/fail reason, and a `warnings` array (populated by rule-based checks like the raw-HTML-entity detector). A hypothetical entry shape is sketched after this prompt.
- Review only `status: pass` entries. Entries with `status: fail` either crashed, timed out, or had placeholder args (`<id>`, `<url>`) that never produced real output; their sample is empty and there is nothing for you to judge. Phase 5 dogfood handles test-coverage and exit-code concerns.
- $CLI_DIR/research.json: `novel_features` (planned behavior per feature) and `novel_features_built` (verified built commands).
- The CLI binary at $CLI_DIR/<cli-name>-pp-cli: you may invoke additional commands to gather more output when a finding needs verification.

For each of these checks, report findings under 50 words each. Only report issues a human user would notice in 5 minutes of hands-on testing, not every edge case a thorough QA pass might find:

- Output semantically matches query intent. For sampled novel features with a query argument, judge relevance beyond what the mechanical query-token check in live-check already enforced. A feature that passed live-check's `outputMentionsQuery` test still contains some query token somewhere, but "buttermilk" appearing as a substring of "butter" results, or "brownies" returning a chili recipe because the extractor fell back to adjacent content, both slip past the mechanical check. Only flag when a human user would look at the top results and say "this isn't what I asked for." Skip this check when the example has no query argument.
- No obvious format bugs. Does the output contain raw HTML entities, mojibake (question marks or replacement chars in titles), or malformed URLs (pointing at category index pages, feed endpoints, or random-selector routes rather than canonical content permalinks)? Rule-based live-check catches numeric entities; this layer catches the broader class.
- Aggregation commands show all requested sources. For commands with a `--source`/`--site`/`--region` CSV flag: if the user requested N sources, does the output show N, or does stderr explain the missing ones? Silent drops of failed sources are a top failure mode for fan-out commands.
- Result ordering/ranking makes sense. For commands that claim to rank or sort, does the top result look plausibly best given the query? Watch for broken score weights, off-by-one sort bugs, and silent fallback to recency when relevance computation fails.

Return a list of findings. For each: check name, severity (`warning` in Wave B; `error` reserved for Wave C), one-line description, one-sentence fix suggestion. If the CLI passes all four checks, return "PASS — no findings."
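For orientation, a hypothetical shape for one `live_check.features[]` entry, assembled from the fields named above. The `recipes-pp-cli` invocation and all values are invented; the real scorecard schema may differ.

{
  "live_check": {
    "features": [
      {
        "_note": "illustrative only; fields inferred from the prose above",
        "command": "search",
        "example": "recipes-pp-cli search buttermilk",
        "status": "pass",
        "reason": "exit 0; output mentions query token",
        "output_sample": "1. Classic Buttermilk Pancakes  https://example.com/buttermilk-pancakes",
        "warnings": []
      }
    ]
  }
}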
End the skill response with a ---OUTPUT-REVIEW-RESULT--- block the parent parses:
On clean pass:
---OUTPUT-REVIEW-RESULT---
status: PASS
findings: []
---END-OUTPUT-REVIEW-RESULT---
On warnings:
---OUTPUT-REVIEW-RESULT---
status: WARN
findings:
- check: <check-name>
severity: warning
description: <one-line>
suggestion: <one-sentence>
- ...
---END-OUTPUT-REVIEW-RESULT---
On reviewer failure (timeout, agent-budget exhaustion, missing live-check data):
---OUTPUT-REVIEW-RESULT---
status: SKIP
reason: <one-line description>
findings: []
---END-OUTPUT-REVIEW-RESULT---
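A minimal parsing sketch for the parent side, assuming the skill response has been captured to a file (`response.txt` is hypothetical):

# Extract the result block from the captured response and branch on its status line.
status="$(sed -n '/^---OUTPUT-REVIEW-RESULT---$/,/^---END-OUTPUT-REVIEW-RESULT---$/p' response.txt \
  | sed -n 's/^status: //p')"
case "$status" in
  PASS) ;;                                   # clean pass, nothing to surface
  WARN) echo "surface findings to the user" ;;
  *)    echo "treat as SKIP: informational, never blocking" ;;
esac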
Wave B severity policy: every finding surfaces as a warning, never an error, and shipcheck proceeds regardless. The parent writes findings to the proofs file (manuscripts/<api>/<run>/proofs/phase-4.85-findings.md) and surfaces them to the user. Findings are not persisted to scorecard.json; that path is reserved for Wave C.

Non-interactive contract (CI, cron, batch regeneration):
- `status: SKIP` (reviewer crash, timeout, missing data) is informational; shipcheck does not block on it.
- There is no `--auto-approve-warnings` flag yet. The policy is already "warnings don't block" in Wave B, so the flag would have nothing to gate.
- Wave C (a separate future PR) will flip error-severity findings to blocking once calibration data across the library shows a false-positive rate below 10%.
Output-plausibility questions are not pattern-matchable against source. The rule-based live-check layer covers what regexes can (numeric HTML entities, query-token absence). Everything else, from "are these substitution results plausibly correct for the query?" to "does the top search result look related?", is an LLM-shaped question. The token cost is bounded (one dispatch per run, not per command), and the catch rate against the bug classes that motivated this phase justifies the dispatch.
The review is scoped to the sampled entries in `live_check.features[]`. Full command-tree coverage belongs in Phase 5 dogfood.