Auto-remediate CI, staging, and production failures. 3-attempt retry with investigation. Discord alert on exhaustion. Usage: /ci-fix [ci|staging|prod] [branch]
Automatically diagnose and fix CI failures, staging deploy failures, and production deploy failures. Retries up to 3 times. Alerts on Discord #deployments if all attempts fail.
Before attempting any fix, understand the FULL CI/CD landscape.
Read in this order (skip gracefully if missing):
- CLAUDE.md — Test Commands table, Infrastructure table
- .claude/project-profile.md — CI platform, test frameworks (from /calibrate)
- .claude/knowledge/skills/ci-fix.md — past CI failures and patterns
- .claude/qa-knowledge/incidents/ — past incidents from CI failures
- .claude/qa-knowledge/bug-patterns.md — known failure patterns

Don't assume GitHub Actions. Detect the actual CI system:
| Signal | CI Platform | How to get logs | How to check status | How to retry |
|---|---|---|---|---|
| `.github/workflows/*.yml` | GitHub Actions | `gh run view <id> --log-failed` | `gh run list --branch <b>` | `gh run rerun <id>` |
| `.gitlab-ci.yml` | GitLab CI | `glab ci view --branch <b>` | `glab ci status` | `glab ci retry` |
| `Jenkinsfile` | Jenkins | Jenkins API: `/job/<name>/lastBuild/consoleText` | Jenkins API: `/job/<name>/lastBuild/api/json` | Jenkins API: `/job/<name>/build` |
| `.circleci/config.yml` | CircleCI | `circleci` CLI or API | `circleci` CLI | API trigger |
| `bitbucket-pipelines.yml` | Bitbucket Pipelines | Bitbucket API | Bitbucket API | API trigger |
| `.buildkite/pipeline.yml` | Buildkite | `buildkite-agent` or API | API | `bk build retry` |
| `azure-pipelines.yml` | Azure Pipelines | `az pipelines runs show` | `az pipelines runs list` | `az pipelines run` |
| `.travis.yml` | Travis CI | Travis API | Travis API | API trigger |
| `Taskfile.yml` + CI wrapper | Task-based | Depends on the wrapping CI | Depends | Depends |
If MULTIPLE CI systems detected (e.g., GitHub Actions for CI + ArgoCD for deploy), document both.
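The table above can be sketched as a small detection helper. This is a minimal sketch using the signal files as heuristics; a repo can match more than one system, and all matches are reported.

```bash
# Minimal CI-platform detection sketch. Signal files are the ones from the
# table above; a repo may legitimately match several systems at once.
detect_ci() {
  found=""
  [ -d .github/workflows ]       && found="$found github-actions"
  [ -f .gitlab-ci.yml ]          && found="$found gitlab-ci"
  [ -f Jenkinsfile ]             && found="$found jenkins"
  [ -f .circleci/config.yml ]    && found="$found circleci"
  [ -f bitbucket-pipelines.yml ] && found="$found bitbucket-pipelines"
  [ -f .buildkite/pipeline.yml ] && found="$found buildkite"
  [ -f azure-pipelines.yml ]     && found="$found azure-pipelines"
  [ -f .travis.yml ]             && found="$found travis"
  # unquoted echo collapses the leading space; empty → "unknown"
  echo ${found:-unknown}
}
```

If this prints two systems (e.g. `github-actions` plus a deploy tool's config), document both, per the note above.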
Map the project's test stack so fixes target the right layer:
| Signal | Test Layer | Fix Strategy |
|---|---|---|
| `pytest.ini` / `pyproject.toml` `[tool.pytest]` | Python unit/integration | Fix Python code, run `pytest` locally |
| `jest.config.*` / `vitest.config.*` | JS/TS unit tests | Fix JS/TS code, run `npm test` locally |
| `playwright.config.*` / `cypress.config.*` | E2E tests | May need running services; check docker-compose |
| `ruff.toml` / `.flake8` / `.eslintrc.*` | Linting | Auto-fix with `ruff check --fix` / `eslint --fix` |
| `mypy.ini` / `tsconfig.json` | Type checking | Fix type annotations |
| `Dockerfile` / `docker-compose.test.yml` | Containerized tests | Build the container first, then run |
| `tox.ini` | Multi-env Python testing | Run the specific tox env |
| `Makefile` with test targets | Make-based | Run `make test` |
CI often fails because of environment differences. Check:
```bash
# What Python/Node/Go version does CI use?
grep -r "python-version\|node-version\|go-version" .github/workflows/ .gitlab-ci.yml 2>/dev/null

# What services does CI spin up? (postgres, redis, etc.)
grep -r "services:" .github/workflows/ 2>/dev/null
grep -r "image:" .gitlab-ci.yml 2>/dev/null

# What env vars does CI set?
grep -r "env:" .github/workflows/ 2>/dev/null | grep -v "#"

# What secrets are used?
grep -r "secrets\." .github/workflows/ 2>/dev/null
```
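One concrete environment-difference check can be automated. A sketch, assuming GitHub Actions style `python-version:` keys; other platforms pin versions differently, so treat the grep pattern as a heuristic.

```bash
# Sketch: spot Python version drift between the CI config and the local interpreter.
# Assumes GitHub Actions `python-version:` keys; adjust the pattern per platform.
ci_python_version() {
  grep -rhoE 'python-version: *"?[0-9]+\.[0-9]+' .github/workflows/ 2>/dev/null \
    | grep -oE '[0-9]+\.[0-9]+' | sort -u | head -1
}
report_drift() {
  ci=$(ci_python_version)
  local_v=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
  if [ -n "$ci" ] && [ "$ci" != "$local_v" ]; then
    echo "drift: CI pins $ci, local runs $local_v"
  fi
}
```

A "passes locally, fails in CI" test failure is often explained by exactly this kind of drift.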
Before investigating a failure, check if it's a known issue:
```bash
# Check past CI fix incidents (-r: incidents/ is a directory)
grep -rl "ci-fix\|CI\|flaky" .claude/qa-knowledge/incidents/ 2>/dev/null

# Check knowledge base for CI patterns
grep -A3 "## Patterns" .claude/knowledge/skills/ci-fix.md 2>/dev/null
```
If a known flaky test matches the failure, report it instead of debugging.
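The known-flaky check can be sketched as a lookup against the knowledge file. Hypothetical helper: `failing_test` would be parsed from the CI log, and the example test name below is illustrative, not real.

```bash
# Hypothetical check: does the failing test name appear in the knowledge file?
is_known_flaky() {
  grep -q "$1" .claude/knowledge/skills/ci-fix.md 2>/dev/null
}

# "test_checkout_retry" is an example name; the real one comes from the CI log.
if is_known_flaky "test_checkout_retry"; then
  echo "known flaky — report it, skip debugging"
fi
```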
After every fix (or on first run), check for CI best practices:
```
CI recommendations for {project}:

⚠ No dependency caching — CI runs will be slow
  Add: actions/cache for node_modules / .venv / cargo
⚠ No test parallelism — tests run sequentially
  Add: pytest-xdist / jest --workers / go test -parallel
⚠ No test timeout — flaky tests can hang forever
  Add: timeout-minutes in workflow / pytest --timeout=30
⚠ No retry for flaky tests — one flake fails the whole run
  Add: pytest-rerunfailures / jest --retry
⚠ Services not health-checked — tests may start before DB is ready
  Add: health check options to service containers
⚠ No artifact upload on failure — can't debug after the run
  Add: actions/upload-artifact for test reports on failure
```
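Several of these checks reduce to greps over the workflow files. A rough audit sketch, GitHub Actions only; the patterns are heuristics, not a full parse of the workflow YAML.

```bash
# Heuristic hygiene audit for a few of the warnings above (GitHub Actions only).
check_ci_hygiene() {
  wf=.github/workflows
  grep -rq 'actions/cache'   "$wf" 2>/dev/null || echo "no dependency caching"
  grep -rq 'timeout-minutes' "$wf" 2>/dev/null || echo "no job timeout"
  grep -rq 'upload-artifact' "$wf" 2>/dev/null || echo "no artifact upload on failure"
}
```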
Parse from arguments:
- `ci` → CI mode — fix GitHub Actions lint/test/build failures
- `staging` → Staging mode — investigate and fix staging deployment failures
- `prod` / `production` → Prod mode — investigate and fix production deployment failures

```
Failure detected → /ci-fix invoked
→ Determine mode (CI / staging / prod)
→ Attempt 1: diagnose → fix → verify → push/redeploy
  → Green? DONE
  → Still failing? Attempt 2 (different strategy)
→ Attempt 2: re-diagnose → fix → verify → push/redeploy
  → Green? DONE
  → Still failing? Attempt 3 (escalate approach)
→ Attempt 3: deep investigation → fix → verify → push/redeploy
  → Green? DONE
  → Still failing? → Discord alert + QA incident file
```
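The 3-attempt loop can be sketched for the GitHub Actions path. `attempt_fix` is a hypothetical placeholder for one diagnose → fix → commit → push cycle, not a real command.

```bash
# Sketch of the 3-attempt loop (GitHub Actions case).
# attempt_fix is a placeholder for one diagnose → fix → commit → push cycle.
autofix_loop() {
  status=""
  for attempt in 1 2 3; do
    attempt_fix "$attempt" || true
    sleep 90   # crude; a status-polling loop is more reliable
    status=$(gh run list --branch "$(git branch --show-current)" \
               --limit 1 --json conclusion --jq '.[0].conclusion')
    if [ "$status" = "success" ]; then
      echo "green on attempt $attempt"
      return 0
    fi
  done
  echo "exhausted: alert #deployments and write the QA incident file"
  return 1
}
```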
Fix GitHub Actions lint, type, test, and build failures.
```bash
BRANCH=$(git branch --show-current)
gh run list --branch "$BRANCH" --limit 5 --json databaseId,conclusion,status,name
```
Get failed run logs:
```bash
gh run view <FAILED_RUN_ID> --log-failed 2>&1 | tail -200
```
| Category | Indicators | Fix Strategy |
|---|---|---|
| Lint | ruff/eslint errors | Auto-fix tools scoped to specific directories |
| Type errors | tsc/mypy failures | Read error file:line, fix type annotation |
| Test failures | pytest/jest failures | Read failing test + source, fix root cause |
| Build failures | build errors | Read error, fix import/export/config |
| Migration | Alembic/Django errors | Fix migration file |
| Dependency | pip/npm install failures | Fix requirements/package.json |
| Toolkit tests | 15/20-skill-runtime-safety, manifest-coverage | See Toolkit Test Fixes below |
When CI fails on the mechanical test suite (tests/run.sh), these are the common failures and auto-fixes:
| Test | Failure message | Root cause | Auto-fix |
|---|---|---|---|
| `20-skill-runtime-safety` | "regex-unsafe [brackets] in descriptions" | A SKILL.md `description:` or `arguments:` field contains `[text]` | Replace `[text]` with `<text>` in the frontmatter field; brackets break regex matching in Claude Code. |
| `20-skill-runtime-safety` | "Skills missing required frontmatter fields" | SKILL.md missing `user-invocable: true` or `arguments:` | Add the missing field to the YAML frontmatter between `---` markers. Check `git show HEAD~1:path/to/SKILL.md` for the original. |
| `15-manifest-coverage` | "entries mapped to categories" | New file in portable.manifest not listed in any `get_category_items()` category in install.sh | Add the manifest entry to the appropriate category in `install.sh:get_category_items()`. |
| `15-manifest-coverage` | "Install creates all expected symlinks" | New file in portable.manifest but install didn't create the symlink | Usually resolved by the category-mapping fix above. |
| `15-manifest-coverage` | "Entry counts are consistent" | Mismatch between manifest entries and installed files | Check that new manifest entries have matching source files. |
| `19-brownfield-assessment` | "classify_file returns IDENTICAL" | Agent fixture is stale after editing an agent .md file | Update the fixture: `cp agents/{name}.md tests/fixtures/claude-setups/poweruser/.claude/agents/` |
Auto-fix sequence for toolkit tests:
```bash
# 1. Get the exact failure
gh run view <ID> --log-failed 2>&1 | grep -E "FAIL|✗" | head -5

# 2. For bracket issues — find and fix ALL bracket descriptions
grep -rn 'description:.*\[' skills/*/SKILL.md
# Replace [text] with <text> in each match

# 3. For missing frontmatter — compare against the last known good version
git show HEAD~1:path/to/SKILL.md | head -6
# Restore missing fields

# 4. For manifest coverage — add to install.sh categories
grep "get_category_items" install.sh
# Add new entries to the right category

# 5. Verify locally before pushing
bash tests/run.sh --suite 15,20 --scenario a
```
Attempt escalation:
Run ONLY the checks that failed — use commands from CLAUDE.md Test Commands table.
```bash
git add <only_changed_files>
git commit -m "$(cat <<'EOF'
fix: <category> — <what was fixed>

Co-Authored-By: Claude <noreply@anthropic.com>
EOF
)"
git push
```
Wait for CI result:
```bash
sleep 90 && gh run list --branch "$BRANCH" --limit 1 --json conclusion,status
```
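A fixed 90-second sleep can miss slow runs. A polling sketch, assuming GitHub Actions and the `gh` CLI; the retry count and interval are arbitrary choices, not project settings.

```bash
# Polling alternative to a fixed sleep: wait until the latest run on the
# branch completes, up to ~10 minutes (20 polls x 30s).
wait_for_ci() {
  branch="$1"; tries=0
  while [ "$tries" -lt 20 ]; do
    state=$(gh run list --branch "$branch" --limit 1 \
              --json status,conclusion --jq '.[0] | .status + ":" + (.conclusion // "")')
    case "$state" in
      completed:success) echo success; return 0 ;;
      completed:*)       echo failure; return 1 ;;
    esac
    sleep 30
    tries=$((tries + 1))
  done
  echo timeout
  return 1
}
```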
Investigate and fix staging deployment failures.
Same as staging but with extra caution:
```bash
git revert HEAD --no-edit && git push
```

```
mcp__discord-mcp__send-message(
  channel: "deployments",
  message: "Auto-Fix Failed — 3 attempts exhausted\n\nMode: {mode}\nBranch: {branch}\nError: {last_error_summary}\n\nAction needed: Manual review required."
)
```
Create .claude/qa-knowledge/incidents/{date}-autofix-exhausted-{slug}.md with root cause, how QA missed it, and regression test recommendation.
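A sketch of the incident-file write; the `{…}` fields are filled in by the skill at run time, and the slug argument is a hypothetical kebab-case summary of the failure.

```bash
# Sketch: write the exhaustion incident file. The {…} placeholders are
# filled in by the skill; "$1" is a short kebab-case failure slug.
write_incident() {
  slug="$1"
  dir=.claude/qa-knowledge/incidents
  mkdir -p "$dir"
  cat > "$dir/$(date +%F)-autofix-exhausted-$slug.md" <<'EOF'
# Auto-fix exhausted
**Root cause**: {what was wrong}
**How QA missed it**: {detection gap}
**Regression test**: {recommendation}
EOF
}
```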
Auto-fix exhausted after 3 attempts ({mode} mode) on `{branch}`.
QA incident logged.
Manual review required.
Never use:

- `# type: ignore` or `# noqa` to suppress errors — fix them properly
- `railway up` or equivalent manual deploys — always push to git for auto-deploy

Protocol: Read small files fully (Tier 1). For directories, grep first, read only the matches (Tier 2). Skip gracefully if missing. Only write back when something NEW is discovered.
Read:

- knowledge/skills/ci-fix.md — past failure patterns
- qa-knowledge/incidents/ — grep for CI-related incidents, read only the matches

Append to knowledge/skills/ci-fix.md:
### {date} — {failure type}
**CI Platform**: {platform}
**Category**: {lint | type | test | build | migration | dependency | flaky}
**Root cause**: {what was wrong}
**Fix**: {what was done}
**Attempts**: {N}
**Was it a known pattern?**: {yes — matched incident X | no — new pattern}
**Prevention**: {how to avoid next time}
Also update:

- qa-knowledge/bug-patterns.md under the CI section
- qa-knowledge/incidents/{date}-ci-exhaust-{slug}.md

When /calibrate runs, it reads CI configuration to understand the project's test pipeline.
The /ci-fix skill contributes:
If /calibrate hasn't run, /ci-fix does its own discovery and recommends running /calibrate.
If /calibrate has run, /ci-fix reads the profile instead of re-discovering.