Diagnoses errors, failures, and unexpected behavior across services by gathering evidence, identifying root causes, and producing structured debug reports. Read-only; feeds into /smith-bugfix.
`npx claudepluginhub attckdigital/smith`

This skill uses the workspace's default tool permissions.
A diagnostic-only workflow that systematically investigates errors, failures, and unexpected behavior across Armory's services. Produces a structured debug report stored in the relevant system's `.specify/` folder. Does NOT modify code — the report becomes input to `/smith-bugfix` if a fix is warranted.
Arguments: $ARGUMENTS
Throughout this action, log significant events to the vault session log. Read the session log path from .smith/vault/.current-session. If the file is missing or the vault is not initialized, skip all logging silently.
Append entries using this format:
### [HH:MM:SS] /smith-debug <event>
**User Request:**
> <verbatim user message that triggered this action — capture the exact error description, symptoms, or question the user asked. Include any error messages they pasted.>
**Synthesized Input:** <brief summary of what's being investigated>
**Outcome:** <what happened>
**Artifacts:** <files created/modified>
**Systems affected:** <system IDs>
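A minimal bash sketch of that read-then-append behavior, assuming the file layout above; the event name and field values are placeholder examples, not canonical values:

```bash
#!/usr/bin/env bash
# Resolve the session log path; skip all logging silently if the vault isn't initialized.
SESSION_FILE=".smith/vault/.current-session"
[ -f "$SESSION_FILE" ] || exit 0
SESSION_LOG="$(cat "$SESSION_FILE")"
[ -n "$SESSION_LOG" ] && [ -f "$SESSION_LOG" ] || exit 0

# Append one entry in the documented format (values here are illustrative only).
cat >> "$SESSION_LOG" <<EOF

### [$(date +%H:%M:%S)] /smith-debug triage-started
**User Request:**
> Background reports fail with [Errno 111] Connection refused
**Synthesized Input:** connection failures when running background reports
**Outcome:** clarifying questions asked
**Artifacts:** none yet
**Systems affected:** system-15-command-center
EOF
```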
Log at these points:
Immediately before every Agent tool call in this workflow (especially the 4 triage agents in Phase 3), append a block to the session log. The Agent tool's return value does not expose subagent_type or model to the parent, so this is the only place that information can be captured.
### [HH:MM:SS] Subagent invoked: <description>
**Type:** <subagent_type or "general">
**Model:** <model override passed to Agent, or "inherited" if none>
After the Agent tool returns, the subagent-vault-writeback.sh hook automatically appends a matching "Subagent completed" block with metrics read from the sidechain transcript — do not duplicate that logging in the skill.
Use /smith-debug when:
Do NOT use when:
- the fix is already known (go to `/smith-bugfix` directly)
- you're building something new rather than diagnosing a failure (use `/smith-new`)

If the user says any of the following (or similar phrases), treat it as invoking this command:
When triggered by natural language, synthesize the conversation history into the symptom description and proceed as if that was passed as $ARGUMENTS.
If .smith/vault/ledger/ exists and contains non-empty files, load relevant Ledger sections to inform diagnosis. If the directory is missing, empty, or unreadable, skip silently — the Ledger is purely additive and never required.
Run `ls .smith/vault/ledger/*.md 2>/dev/null` and load whichever of these exist:
- `.smith/vault/ledger/antipatterns.md` (past failure modes — directly useful for narrowing hypotheses)
- `.smith/vault/ledger/edge-cases.md` (known weird states the system has hit before)
- `.smith/vault/ledger/project-quirks.md` (project-specific gotchas — e.g., "this service takes 30s to start, don't assume crash")
- `.smith/vault/ledger/tool-preferences.md` (which diagnostic tools/commands are known to work well in this project)

Use antipatterns.md to avoid re-investigating already-known failure modes from scratch, and project-quirks.md to skip false-positive theories. The Ledger informs judgment, it does not override evidence collected during this run.

If loading the Ledger would exceed the configured context budget, increment context_budget_violations in .smith/vault/ledger/.meta.json by 1. If .meta.json does not exist, create it from the default template first. This signal tells the reconciliation system that the Ledger is too large for the configured budget.

Extract or ask for these structured fields from the user's description:
| Field | Description | Example |
|---|---|---|
| Error message | Exact text of error or unexpected output | [Errno 111] Connection refused |
| Trigger | What the user was doing when it happened | Running background reports |
| Conditions | What else was running, recent changes, environment state | Sentiment analysis running concurrently |
| Frequency | Always, sometimes, new, intermittent | Every time background reports run |
| Affected service(s) | Best guess from the symptom | content-engine, sentiment-engine |
If the user's initial description is missing 2+ of these fields, ask a focused set of clarifying questions BEFORE proceeding. Present them as a numbered list the user can answer quickly:
To investigate this efficiently, I need a few more details:
[1] What exactly were you doing when this happened? (e.g., which button, command, or workflow)
[2] Does this happen every time, or only sometimes?
[3] Were any other operations running at the same time?
[4] When did this start? (always been this way, or recent change?)
Only ask for what's actually missing. If the description already covers 3+ fields, proceed directly — don't slow the user down with unnecessary questions.
If ALL fields are present in the initial description or $ARGUMENTS: skip prompting entirely and proceed to Phase 2.
Determine which Armory system(s) this debug session relates to.
Map the symptom to systems using service-to-system mapping:
- command-center / port 8080 → system-15-command-center
- sentiment-engine / port 8081 → system-15-command-center (scoring subsystem)
- content-strategy / port 8082 → system-12-content-social-engine
- email-pipeline → system-03-email-archive-contact-graph
- communication-triage → system-05-communication-triage
- voice-training → system-04-personal-voice
- openclaw / Jason / port 18789 → cross-system (agent layer)
- social-listening → system-10-social-listening
- trend-intelligence → system-13-trend-intelligence
- n8n / port 5678 → system-01-infrastructure
- postgres / neo4j / qdrant / redis → system-01-infrastructure
- system-01-infrastructure
- system-02-ai-models-layer

If ambiguous: pick the most likely primary system and note secondary systems.
Set the report path:
.specify/systems/<primary-system>/debug/debug-YYYY-MM-DD-<slug>.md
Create the debug/ directory if it doesn't exist.
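A rough sketch of deriving that path; the system ID and slug below are example values, not fixed names:

```bash
# Primary system from the mapping above, plus a short kebab-case symptom slug.
PRIMARY_SYSTEM="system-15-command-center"
SLUG="background-reports-connection-refused"

REPORT_DIR=".specify/systems/$PRIMARY_SYSTEM/debug"
REPORT_PATH="$REPORT_DIR/debug-$(date +%F)-$SLUG.md"   # %F = YYYY-MM-DD

mkdir -p "$REPORT_DIR"   # create debug/ if it doesn't exist
echo "$REPORT_PATH"
```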
Launch up to 4 diagnostic sub-agents in parallel. Each is read-only — no code modifications.
**Agent 3.1: Infrastructure Health** (Model: haiku)
Task: Check the health of all services and resource usage.
- Run: docker compose ps
- Run: bash scripts/health-check.sh
- Run: docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"
- Check if affected service(s) are running and healthy
- Check port availability for affected services (see the sketch after this list)
- Check Colima resource allocation: colima status
- Report: which services are up/down, resource pressure, port conflicts
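One possible shape for the port-availability check; the ports are examples, and `nc`/`lsof` being installed is an assumption:

```bash
# For each affected service port: is something listening, and what process owns it?
for port in 8080 8081 8082; do
  if nc -z localhost "$port" 2>/dev/null; then
    echo "port $port: open"
  else
    echo "port $port: nothing listening"
  fi
  lsof -nP -iTCP:"$port" -sTCP:LISTEN 2>/dev/null
done
```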
**Agent 3.2: Log Analysis** (Model: haiku)
Task: Search service logs for the error and surrounding context.
- Run: docker compose logs <affected-service> --tail 200 --timestamps
- Run: docker compose logs <upstream-dependencies> --tail 100 --timestamps
- Grep logs for the exact error message (see the sketch after this list)
- Grep for related patterns (connection refused, timeout, OOM, restart)
- Look for temporal correlation with other service errors
- Report: relevant log excerpts, error frequency, first occurrence timestamp
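For example, with a placeholder service name and error string:

```bash
SERVICE="content-engine"          # affected service (example)
ERROR="Connection refused"        # exact error text from the user (example)

# Exact error message, with timestamps so the first occurrence is visible.
docker compose logs "$SERVICE" --tail 200 --timestamps | grep -F "$ERROR"

# Related failure patterns in the same window.
docker compose logs "$SERVICE" --tail 200 --timestamps \
  | grep -iE 'connection refused|timed? out|oom|out of memory|restart'
```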
**Agent 3.3: Dependency Trace** (Model: sonnet)
Task: Map the request path and check each hop.
- Identify the full request chain for the failing operation
(e.g., UI → Express → FastAPI → Ollama → Qdrant)
- For each hop:
  - Is the upstream service reachable? (curl health endpoints; see the sketch after this list)
- Is the connection using the right host/port?
- Are there resource contention issues? (shared Ollama, shared PG connections)
- Check docker-compose.yml for network configuration
- Check environment variables for correct service URLs
- Report: which hop fails, why, and what the expected vs actual behavior is
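A minimal sketch of the per-hop reachability check; the chain, the host:port pairs, and the `/health` endpoints are assumptions for illustration:

```bash
# Hypothetical request chain: UI -> Express (8080) -> FastAPI (8081) -> Ollama (11434) -> Qdrant (6333)
for hop in localhost:8080 localhost:8081 localhost:11434 localhost:6333; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://$hop/health")
  echo "$hop -> HTTP $code (000 = unreachable)"
done
```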
**Agent 3.4: Spec & History** (Model: haiku)
Task: Check if this is a known issue or related to recent changes.
- Read the primary system spec.md for known limitations or caveats
- Search specs/debug/ and .specify/systems/*/debug/ for prior debug reports with similar symptoms
- Run: git log --oneline -20 -- <affected-service-paths>
- Check if recent commits could have introduced the issue
- Search GitHub issues: gh issue list --search "<error keywords>" --limit 5
- Report: prior occurrences, related changes, known issues
Not all 4 agents are always needed. Select based on symptom:
| Symptom type | Agents to launch |
|---|---|
| Connection refused / timeout | All 4 |
| Wrong data / unexpected output | 3.2 (logs) + 3.3 (trace) + 3.4 (history) |
| Slow performance | 3.1 (health) + 3.2 (logs) + 3.3 (trace) |
| Service won't start | 3.1 (health) + 3.2 (logs) |
| Intermittent failure | All 4 |
| UI rendering issue | 3.2 (logs) + 3.4 (history) |
After sub-agents return, synthesize findings into a root cause analysis.
Write the report to the path determined in Phase 2:
---
reported: YYYY-MM-DD
status: diagnosed | needs-investigation | cannot-reproduce
severity: blocking | degraded | cosmetic
primary_system: <system-folder-name>
also_affects:
- <other-system-folder-name>
trigger: <what the user was doing>
error: <exact error text>
---
# Debug: <short description>
## Symptom
<Structured description from Phase 1>
## Evidence
### Infrastructure Health
<Agent 3.1 findings — service status, resource usage, port checks>
### Log Analysis
<Agent 3.2 findings — relevant log excerpts, error patterns>
### Dependency Trace
<Agent 3.3 findings — request path analysis, failing hop>
### Spec & History
<Agent 3.4 findings — prior occurrences, recent changes>
## Root Cause
<Identified cause OR ranked hypotheses with evidence for each>
### Confidence: <confirmed | probable | possible>
<Reasoning for the confidence level>
## Recommended Action
- [ ] **Fix via `/smith-bugfix`** — <one-liner description of the fix>
- [ ] **Config change** — <what to change and where>
- [ ] **Known limitation** — <document and accept>
- [ ] **Needs deeper investigation** — <what to investigate next>
## Related
- <links to relevant specs, issues, prior debug reports>
Present the diagnosis summary to the user and ask:
## Diagnosis Complete
**Root cause:** <one-sentence summary>
**Confidence:** <confirmed/probable/possible>
**Report saved:** .specify/systems/<system>/debug/debug-YYYY-MM-DD-<slug>.md
Would you like me to:
[1] Fix it — kick off /smith-bugfix with this diagnosis as context
[2] Investigate deeper — drill into <specific hypothesis or area>
[3] Close — the report is enough for now
Based on the user's choice:
- **[1] Fix it:** kick off `/smith-bugfix` with the diagnosis context and set the report status to `fix-in-progress`.
- **[2] Investigate deeper:** add a `## Follow-up Investigation` section to the report and continue digging into the chosen hypothesis or area.
- **[3] Close:** mark the report `closed` or `documented`.

Then run `bash hooks/workflow-summary.sh --totals-only` and include the two lines it prints (`Total tokens used: ~<n>` and `Total duration: <d>`) verbatim at the bottom of the closing chat message. The `=== Workflow Summary ===` block is appended to the session log file automatically by the workflow-summary.sh Stop hook once the active-workflow file is cleaned up — that's for audit only, do not duplicate it in chat.

After workflow completion (regardless of which Phase 6 option the user selected), trigger a Ledger reflection if enabled. Debug runs surface valuable signal — root causes, false hypotheses, diagnostic dead-ends — that should feed back into antipatterns.md and edge-cases.md for future runs.
- Read `.smith/config.json` — if `ledger.auto_reflect` is true (default), proceed.
- Run the smith-reflect workflow, passing it the `.smith/vault/ledger/` path and the debug report path.
- If `.smith/config.json` is missing or `ledger.auto_reflect` is false, skip silently.

After reflection completes (or is skipped), check whether the Ledger also needs reconciliation:
- Read `.smith/config.json` — if `ledger.reconcile.auto_reconcile` is false, skip.
- Read `.smith/vault/ledger/.meta.json` — check signals against thresholds:
  - `estimated_tokens` > `thresholds.total_tokens_max` (default 30000)
  - `context_budget_violations` > `thresholds.context_violations_threshold` (default 3)
  - `reinforcements_since_reconcile` > `thresholds.reinforcements_threshold` (default 50)
- If `last_reconcile` is less than `minimum_hours_between_reconciles` (default 6) hours ago, skip.
- If any threshold is exceeded, run the reconciliation with `reconcile_model` (default: Haiku).
- If `.meta.json` is missing, or config is missing, skip silently.
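A hedged sketch of the threshold comparison, assuming `jq` is available and that the config and meta files use the key names listed above; the hardcoded defaults stand in for reading `thresholds.*` from config, and the `last_reconcile` age check is omitted for brevity:

```bash
CONFIG=".smith/config.json"
META=".smith/vault/ledger/.meta.json"

# Skip silently if either file is missing or auto-reconcile is disabled.
[ -f "$CONFIG" ] && [ -f "$META" ] || exit 0
[ "$(jq -r '.ledger.reconcile.auto_reconcile // true' "$CONFIG")" = "true" ] || exit 0

tokens=$(jq -r '.estimated_tokens // 0' "$META")
violations=$(jq -r '.context_budget_violations // 0' "$META")
reinforcements=$(jq -r '.reinforcements_since_reconcile // 0' "$META")

# Compare against the documented defaults (30000 / 3 / 50).
if [ "$tokens" -gt 30000 ] || [ "$violations" -gt 3 ] || [ "$reinforcements" -gt 50 ]; then
  echo "Reconciliation needed: run it with reconcile_model (default: Haiku)"
fi
```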