Auto-remediate CI, staging, and production failures. 3-attempt retry with investigation. Discord alert on exhaustion. Usage: /ci-fix [ci|staging|prod] [branch]
Automatically diagnose and fix CI failures, staging deploy failures, and production deploy failures. Retries up to 3 times. Alerts on Discord #deployments if all attempts fail.
Before attempting any fix, understand the FULL CI/CD landscape.
Read in this order (skip gracefully if missing):
- CLAUDE.md — Test Commands table, Infrastructure table
- .claude/project-profile.md — CI platform, test frameworks (from /calibrate)
- .claude/knowledge/skills/ci-fix.md — past CI failures and patterns
- .claude/qa-knowledge/incidents/ — past incidents from CI failures
- .claude/qa-knowledge/bug-patterns.md — known failure patterns

Don't assume GitHub Actions. Detect the actual CI system:
| Signal | CI Platform | How to get logs | How to check status | How to retry |
|---|---|---|---|---|
| `.github/workflows/*.yml` | GitHub Actions | `gh run view <id> --log-failed` | `gh run list --branch <b>` | `gh run rerun <id>` |
| `.gitlab-ci.yml` | GitLab CI | `glab ci view --branch <b>` | `glab ci status` | `glab ci retry` |
| `Jenkinsfile` | Jenkins | Jenkins API: `/job/<name>/lastBuild/consoleText` | Jenkins API: `/job/<name>/lastBuild/api/json` | Jenkins API: `/job/<name>/build` |
| `.circleci/config.yml` | CircleCI | `circleci` CLI or API | `circleci` CLI | API trigger |
| `bitbucket-pipelines.yml` | Bitbucket Pipelines | Bitbucket API | Bitbucket API | API trigger |
| `.buildkite/pipeline.yml` | Buildkite | `buildkite-agent` or API | API | `bk build retry` |
| `azure-pipelines.yml` | Azure Pipelines | `az pipelines runs show` | `az pipelines runs list` | `az pipelines run` |
| `.travis.yml` | Travis CI | Travis API | Travis API | API trigger |
| `Taskfile.yml` + CI wrapper | Task-based | Depends on the wrapping CI | Depends | Depends |
If MULTIPLE CI systems detected (e.g., GitHub Actions for CI + ArgoCD for deploy), document both.
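The table above can be sketched as a small detection helper. This is a minimal sketch using the signal files as heuristics; a repo can match more than one system, and all matches are reported.

```bash
# Minimal CI-platform detection sketch. Signal files are the ones from the
# table above; a repo may legitimately match several systems at once.
detect_ci() {
  found=""
  [ -d .github/workflows ]       && found="$found github-actions"
  [ -f .gitlab-ci.yml ]          && found="$found gitlab-ci"
  [ -f Jenkinsfile ]             && found="$found jenkins"
  [ -f .circleci/config.yml ]    && found="$found circleci"
  [ -f bitbucket-pipelines.yml ] && found="$found bitbucket-pipelines"
  [ -f .buildkite/pipeline.yml ] && found="$found buildkite"
  [ -f azure-pipelines.yml ]     && found="$found azure-pipelines"
  [ -f .travis.yml ]             && found="$found travis"
  # unquoted echo collapses the leading space; empty → "unknown"
  echo ${found:-unknown}
}
```

If this prints two systems (e.g. `github-actions` plus a deploy tool's config), document both, per the note above.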
Map the project's test stack so fixes target the right layer:
| Signal | Test Layer | Fix Strategy |
|---|---|---|
| `pytest.ini` / `pyproject.toml` `[tool.pytest]` | Python unit/integration | Fix Python code, run `pytest` locally |
| `jest.config.*` / `vitest.config.*` | JS/TS unit tests | Fix JS/TS code, run `npm test` locally |
| `playwright.config.*` / `cypress.config.*` | E2E tests | May need running services; check docker-compose |
| `ruff.toml` / `.flake8` / `.eslintrc.*` | Linting | Auto-fix with `ruff check --fix` / `eslint --fix` |
| `mypy.ini` / `tsconfig.json` | Type checking | Fix type annotations |
| `Dockerfile` / `docker-compose.test.yml` | Containerized tests | Build the container first, then run |
| `tox.ini` | Multi-env Python testing | Run the specific tox env |
| `Makefile` with test targets | Make-based | Run `make test` |
CI often fails because of environment differences. Check:
```bash
# What Python/Node/Go version does CI use?
grep -r "python-version\|node-version\|go-version" .github/workflows/ .gitlab-ci.yml 2>/dev/null

# What services does CI spin up? (postgres, redis, etc.)
grep -r "services:" .github/workflows/ 2>/dev/null
grep -r "image:" .gitlab-ci.yml 2>/dev/null

# What env vars does CI set?
grep -r "env:" .github/workflows/ 2>/dev/null | grep -v "#"

# What secrets are used?
grep -r "secrets\." .github/workflows/ 2>/dev/null
```
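One concrete environment-difference check can be automated. A sketch, assuming GitHub Actions style `python-version:` keys; other platforms pin versions differently, so treat the grep pattern as a heuristic.

```bash
# Sketch: spot Python version drift between the CI config and the local interpreter.
# Assumes GitHub Actions `python-version:` keys; adjust the pattern per platform.
ci_python_version() {
  grep -rhoE 'python-version: *"?[0-9]+\.[0-9]+' .github/workflows/ 2>/dev/null \
    | grep -oE '[0-9]+\.[0-9]+' | sort -u | head -1
}
report_drift() {
  ci=$(ci_python_version)
  local_v=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
  if [ -n "$ci" ] && [ "$ci" != "$local_v" ]; then
    echo "drift: CI pins $ci, local runs $local_v"
  fi
}
```

A "passes locally, fails in CI" test failure is often explained by exactly this kind of drift.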
Before investigating a failure, check if it's a known issue:
```bash
# Check past CI fix incidents (-r: incidents/ is a directory)
grep -rl "ci-fix\|CI\|flaky" .claude/qa-knowledge/incidents/ 2>/dev/null

# Check knowledge base for CI patterns
grep -A3 "## Patterns" .claude/knowledge/skills/ci-fix.md 2>/dev/null
```
If a known flaky test matches the failure, report it instead of debugging.
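The known-flaky check can be sketched as a lookup against the knowledge file. Hypothetical helper: `failing_test` would be parsed from the CI log, and the example test name below is illustrative, not real.

```bash
# Hypothetical check: does the failing test name appear in the knowledge file?
is_known_flaky() {
  grep -q "$1" .claude/knowledge/skills/ci-fix.md 2>/dev/null
}

# "test_checkout_retry" is an example name; the real one comes from the CI log.
if is_known_flaky "test_checkout_retry"; then
  echo "known flaky — report it, skip debugging"
fi
```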
After every fix (or on first run), check for CI best practices:
```
CI recommendations for {project}:

⚠ No dependency caching — CI runs will be slow
  Add: actions/cache for node_modules / .venv / cargo
⚠ No test parallelism — tests run sequentially
  Add: pytest-xdist / jest --workers / go test -parallel
⚠ No test timeout — flaky tests can hang forever
  Add: timeout-minutes in workflow / pytest --timeout=30
⚠ No retry for flaky tests — one flake fails the whole run
  Add: pytest-rerunfailures / jest --retry
⚠ Services not health-checked — tests may start before DB is ready
  Add: health check options to service containers
⚠ No artifact upload on failure — can't debug after the run
  Add: actions/upload-artifact for test reports on failure
```
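Several of these checks reduce to greps over the workflow files. A rough audit sketch, GitHub Actions only; the patterns are heuristics, not a full parse of the workflow YAML.

```bash
# Heuristic hygiene audit for a few of the warnings above (GitHub Actions only).
check_ci_hygiene() {
  wf=.github/workflows
  grep -rq 'actions/cache'   "$wf" 2>/dev/null || echo "no dependency caching"
  grep -rq 'timeout-minutes' "$wf" 2>/dev/null || echo "no job timeout"
  grep -rq 'upload-artifact' "$wf" 2>/dev/null || echo "no artifact upload on failure"
}
```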
Parse from arguments:
- `ci` → CI mode — fix GitHub Actions lint/test/build failures
- `staging` → Staging mode — investigate and fix staging deployment failures
- `prod` / `production` → Prod mode — investigate and fix production deployment failures

```
Failure detected → /ci-fix invoked
→ Determine mode (CI / staging / prod)
→ Attempt 1: diagnose → fix → verify → push/redeploy
  → Green? DONE
  → Still failing? Attempt 2 (different strategy)
→ Attempt 2: re-diagnose → fix → verify → push/redeploy
  → Green? DONE
  → Still failing? Attempt 3 (escalate approach)
→ Attempt 3: deep investigation → fix → verify → push/redeploy
  → Green? DONE
  → Still failing? → Discord alert + QA incident file
```
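The 3-attempt loop can be sketched for the GitHub Actions path. `attempt_fix` is a hypothetical placeholder for one diagnose → fix → commit → push cycle, not a real command.

```bash
# Sketch of the 3-attempt loop (GitHub Actions case).
# attempt_fix is a placeholder for one diagnose → fix → commit → push cycle.
autofix_loop() {
  status=""
  for attempt in 1 2 3; do
    attempt_fix "$attempt" || true
    sleep 90   # crude; a status-polling loop is more reliable
    status=$(gh run list --branch "$(git branch --show-current)" \
               --limit 1 --json conclusion --jq '.[0].conclusion')
    if [ "$status" = "success" ]; then
      echo "green on attempt $attempt"
      return 0
    fi
  done
  echo "exhausted: alert #deployments and write the QA incident file"
  return 1
}
```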
Fix GitHub Actions lint, type, test, and build failures.
```bash
BRANCH=$(git branch --show-current)
gh run list --branch "$BRANCH" --limit 5 --json databaseId,conclusion,status,name
```
Get failed run logs:
```bash
gh run view <FAILED_RUN_ID> --log-failed 2>&1 | tail -200
```
| Category | Indicators | Fix Strategy |
|---|---|---|
| Lint | ruff/eslint errors | Auto-fix tools scoped to specific directories |
| Type errors | tsc/mypy failures | Read error file:line, fix type annotation |
| Test failures | pytest/jest failures | Read failing test + source, fix root cause |
| Build failures | build errors | Read error, fix import/export/config |
| Migration | Alembic/Django errors | Fix migration file |
| Dependency | pip/npm install failures | Fix requirements/package.json |
| Toolkit tests | 15/20-skill-runtime-safety, manifest-coverage | See Toolkit Test Fixes below |
When CI fails on the mechanical test suite (tests/run.sh), these are the common failures and auto-fixes:
| Test | Failure message | Root cause | Auto-fix |
|---|---|---|---|
| `20-skill-runtime-safety` | "regex-unsafe [brackets] in descriptions" | A SKILL.md `description:` or `arguments:` field contains `[text]` | Replace `[text]` with `<text>` in the frontmatter field; brackets break regex matching in Claude Code. |
| `20-skill-runtime-safety` | "Skills missing required frontmatter fields" | SKILL.md missing `user-invocable: true` or `arguments:` | Add the missing field to the YAML frontmatter between `---` markers. Check `git show HEAD~1:path/to/SKILL.md` for the original. |
| `15-manifest-coverage` | "entries mapped to categories" | New file in portable.manifest not listed in any `get_category_items()` category in install.sh | Add the manifest entry to the appropriate category in `install.sh:get_category_items()`. |
| `15-manifest-coverage` | "Install creates all expected symlinks" | New file in portable.manifest but install didn't create the symlink | Usually resolved by the category-mapping fix above. |
| `15-manifest-coverage` | "Entry counts are consistent" | Mismatch between manifest entries and installed files | Check that new manifest entries have matching source files. |
| `19-brownfield-assessment` | "classify_file returns IDENTICAL" | Agent fixture is stale after editing an agent .md file | Update the fixture: `cp agents/{name}.md tests/fixtures/claude-setups/poweruser/.claude/agents/` |
Auto-fix sequence for toolkit tests:
```bash
# 1. Get the exact failure
gh run view <ID> --log-failed 2>&1 | grep -E "FAIL|✗" | head -5

# 2. For bracket issues — find and fix ALL bracket descriptions
grep -rn 'description:.*\[' skills/*/SKILL.md
# Replace [text] with <text> in each match

# 3. For missing frontmatter — compare against the last known good version
git show HEAD~1:path/to/SKILL.md | head -6
# Restore missing fields

# 4. For manifest coverage — add to install.sh categories
grep "get_category_items" install.sh
# Add new entries to the right category

# 5. Verify locally before pushing
bash tests/run.sh --suite 15,20 --scenario a
```
Attempt escalation:
Run ONLY the checks that failed — use commands from CLAUDE.md Test Commands table.
```bash
git add <only_changed_files>
git commit -m "$(cat <<'EOF'
fix: <category> — <what was fixed>

Co-Authored-By: Claude <noreply@anthropic.com>
EOF
)"
git push
```
Wait for CI result:
```bash
sleep 90 && gh run list --branch "$BRANCH" --limit 1 --json conclusion,status
```
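A fixed 90-second sleep can miss slow runs. A polling sketch, assuming GitHub Actions and the `gh` CLI; the retry count and interval are arbitrary choices, not project settings.

```bash
# Polling alternative to a fixed sleep: wait until the latest run on the
# branch completes, up to ~10 minutes (20 polls x 30s).
wait_for_ci() {
  branch="$1"; tries=0
  while [ "$tries" -lt 20 ]; do
    state=$(gh run list --branch "$branch" --limit 1 \
              --json status,conclusion --jq '.[0] | .status + ":" + (.conclusion // "")')
    case "$state" in
      completed:success) echo success; return 0 ;;
      completed:*)       echo failure; return 1 ;;
    esac
    sleep 30
    tries=$((tries + 1))
  done
  echo timeout
  return 1
}
```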
Investigate and fix staging deployment failures.
Same as staging but with extra caution:
```bash
git revert HEAD --no-edit && git push
```

```
mcp__discord-mcp__send-message(
  channel: "deployments",
  message: "Auto-Fix Failed — 3 attempts exhausted\n\nMode: {mode}\nBranch: {branch}\nError: {last_error_summary}\n\nAction needed: Manual review required."
)
```
Create .claude/qa-knowledge/incidents/{date}-autofix-exhausted-{slug}.md with root cause, how QA missed it, and regression test recommendation.
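A sketch of the incident-file write; the `{…}` fields are filled in by the skill at run time, and the slug argument is a hypothetical kebab-case summary of the failure.

```bash
# Sketch: write the exhaustion incident file. The {…} placeholders are
# filled in by the skill; "$1" is a short kebab-case failure slug.
write_incident() {
  slug="$1"
  dir=.claude/qa-knowledge/incidents
  mkdir -p "$dir"
  cat > "$dir/$(date +%F)-autofix-exhausted-$slug.md" <<'EOF'
# Auto-fix exhausted
**Root cause**: {what was wrong}
**How QA missed it**: {detection gap}
**Regression test**: {recommendation}
EOF
}
```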
Auto-fix exhausted after 3 attempts ({mode} mode) on `{branch}`.
QA incident logged.
Manual review required.
Never use:

- `# type: ignore` or `# noqa` to suppress errors — fix them properly
- `railway up` or equivalent manual deploys — always push to git for auto-deploy

Protocol: Read small files fully (Tier 1). For directories, grep first, read only the matches (Tier 2). Skip gracefully if missing. Only write back when something NEW is discovered.
Read:

- knowledge/skills/ci-fix.md — past failure patterns
- qa-knowledge/incidents/ — grep for CI-related incidents, read only the matches

Append to knowledge/skills/ci-fix.md:
### {date} — {failure type}
**CI Platform**: {platform}
**Category**: {lint | type | test | build | migration | dependency | flaky}
**Root cause**: {what was wrong}
**Fix**: {what was done}
**Attempts**: {N}
**Was it a known pattern?**: {yes — matched incident X | no — new pattern}
**Prevention**: {how to avoid next time}
Also update:

- qa-knowledge/bug-patterns.md under the CI section
- qa-knowledge/incidents/{date}-ci-exhaust-{slug}.md

When /calibrate runs, it reads CI configuration to understand the project's test pipeline.
The /ci-fix skill contributes:
If /calibrate hasn't run, /ci-fix does its own discovery and recommends running /calibrate.
If /calibrate has run, /ci-fix reads the profile instead of re-discovering.