# guidewire-pack
Operate a Guidewire Cloud API integration in production — define SLIs/SLOs for token availability, bind success rate, FNOL p99 latency; route alerts so the on-call gets paged for real outages and never for transient noise; triage 401 spikes, 409 storms, 429 saturation, scope drift, and Gosu OOM cascades from signal to recovery in 15 minutes or less. Use when designing a dashboard for a new integration, writing the on-call runbook, or running a post-incident review. Trigger with "guidewire observability", "guidewire slo", "guidewire on-call", "guidewire 401 spike", "guidewire 409 storm", "guidewire incident".
Install:

```bash
npx claudepluginhub flight505/skill-forge --plugin guidewire-pack
```
Run a production Guidewire Cloud API integration with the dashboards, alerts, and runbooks an on-call engineer can act on at 3am. This skill consolidates the operational layer: what to measure, what to alert on, how to triage the top five incident classes, and how to close the loop with a post-incident review that prevents recurrence rather than performing root-cause theater.
Five operational failures this skill prevents:

- Dashboards track what is easy to measure instead of what users care about
- Every 5xx pages someone; in three weeks the team mutes the channel; a real incident two weeks later goes unnoticed for an hour
- On-call improvises triage at 3am with no decision tree from signal to recovery
- Runbooks are prose instead of concrete commands
- Post-incident reviews perform root-cause theater instead of preventing recurrence

Prerequisites:

- integration_audit table from guidewire-security-and-rbac populated — incident triage depends on knowing what the integration tried to do
- correlation_id propagated end-to-end through every Cloud API call

Build the operational layer in this order. Each step targets one of the five operational failures listed above.
Track what users care about, not what is easy to measure. For a Guidewire integration, the SLIs that matter:
| SLI | Measurement | Target |
|---|---|---|
| Token-endpoint availability | success rate of /oauth/token over a rolling 5min window | ≥99.9% |
| Cloud API write success | 2xx rate on POST/PATCH calls, excluding 4xx caller errors | ≥99.5% |
| Bind success rate | bound / quoted ratio over rolling 1h, excluding referrals | ≥98% |
| FNOL intake p99 latency | end-to-end time from inbound event to claim.created log | ≤2s |
| Quote-to-bind median latency | median time from quote-call to bind-success | ≤30s |
5xx from upstream Cloud API counts against availability; 4xx from caller bugs (validation failures, scope mismatches) does not — those are caller errors, not integration outages. Distinguishing the two is the single biggest cause of either alert fatigue or missed incidents, depending on which way the bias goes.
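A minimal sketch of that classification at the point where SLI counters are emitted; the metric names and the `metrics` client are hypothetical stand-ins:

```python
# Count a Cloud API write against the SLO only when it's an integration
# outage (5xx), not a caller error (4xx).

def classify_for_sli(status_code: int) -> str | None:
    """Return which SLI bucket a Cloud API write response falls into."""
    if 200 <= status_code < 300:
        return "success"
    if 500 <= status_code < 600:
        return "error"   # counts against availability
    return None          # 4xx: caller bug, excluded from the SLO

def record_sli(status_code: int, metrics) -> None:
    bucket = classify_for_sli(status_code)
    if bucket is not None:
        metrics.increment(f"cloud_api.write.{bucket}")
```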
Alerting on "error rate > 1%" pages the on-call every time a transient 502 happens. Alert on SLO burn rate instead — the rate at which the error budget is being consumed.
- Fast burn: 2% error budget consumed in 1 hour → page immediately
- Slow burn: 5% error budget consumed in 6 hours → page during business hours
- Trickle: 10% error budget consumed in 3 days → ticket, not page
A 5-minute outage that consumes 1% of the monthly budget should not page; a 20-minute outage that consumes 4% should. Burn-rate alerts encode this naturally.
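The arithmetic behind those thresholds, assuming a 30-day window: burn rate is the fraction of budget consumed divided by the fraction of the window elapsed, and the raw error-rate threshold is burn rate times (1 - SLO).

```python
# Worked thresholds for a 99.5% SLO (0.5% monthly error budget).
MONTH_HOURS = 30 * 24  # 720

def burn_rate(budget_fraction: float, window_hours: float) -> float:
    return budget_fraction / (window_hours / MONTH_HOURS)

slo = 0.995

for name, budget, hours in [("fast", 0.02, 1), ("slow", 0.05, 6), ("trickle", 0.10, 72)]:
    rate = burn_rate(budget, hours)
    print(f"{name}: {rate:.1f}x burn -> alert when error rate > {rate * (1 - slo):.2%}")
# fast: 14.4x burn -> alert when error rate > 7.20%
# slow: 6.0x burn -> alert when error rate > 3.00%
# trickle: 1.0x burn -> alert when error rate > 0.50%
```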
Each tree is the ~5-step decision sequence on-call follows from signal to recovery. Memorize the entry signal; the body is in the runbook.
T1: 401 spike on Cloud API calls

```
Signal: 401s > 1% for 5min
├─ Check token age in cache: is the integration refreshing? → if no, restart token-cache process
├─ Decode a recent token, verify exp > now + 60s → if no, clock skew or aggressive proxy caching
├─ Check GCC: is the Service Application enabled? → if no, talk to tenant admin
└─ Check secret-rotation history: was a secret rotated in the last 24h? → if yes, run dual-secret swap or restart
```
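The second T1 branch is quick to script. A minimal sketch, assuming a standard JWT access token; the payload is decoded without signature verification, which is fine for triage:

```python
# Decode a cached token's payload and confirm it has at least 60s of life left.
import base64
import json
import time

def token_expiry_ok(jwt_token: str, skew_seconds: int = 60) -> bool:
    payload_b64 = jwt_token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["exp"] > time.time() + skew_seconds
```

If this returns False for a token the cache just served, suspect clock skew or a proxy serving stale responses, per the tree.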
T2: 409 storm on PATCH calls

```
Signal: 409s > 5% for 5min
├─ Are concurrent writers expected? → if yes, scale checksum round-trip retry budget
├─ Is one resource-id producing all 409s? → if yes, it's a hot key; coordinate writes via queue
└─ Is the client retrying the bare PATCH instead of the GET-PATCH cycle? → fix the retry layer
```
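The third T2 branch is the common bug. A minimal sketch of the correct retry layer, assuming a hypothetical `client` and a body-level `checksum` field for optimistic concurrency; adapt to however your tenant's Cloud API expresses the checksum round-trip:

```python
import time

def patch_with_checksum_retry(client, path: str, changes: dict, max_attempts: int = 3):
    """Never retry the bare PATCH: re-GET for a fresh checksum, then re-PATCH."""
    for attempt in range(max_attempts):
        current = client.get(path).json()  # fresh copy carries the current checksum
        body = {**changes, "checksum": current["checksum"]}
        resp = client.patch(path, json=body)
        if resp.status_code != 409:
            return resp
        time.sleep(0.2 * (attempt + 1))  # brief backoff before the next GET-PATCH cycle
    raise RuntimeError(f"{path}: still 409 after {max_attempts} GET-PATCH cycles")
```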
T3: 429 saturation on Cloud API or Hub

```
Signal: 429s > 1% for 10min
├─ Is the Hub /oauth/token endpoint 429ing? → token cache missing single-flight gate; deploy fix immediately
├─ Is the data-plane API 429ing? → check tenant quota in GCC, request increase if legitimate growth
└─ Is one customer driving the saturation? → tenant-side rate limit on the integration's intake
```
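The first T3 branch assumes the token cache has a single-flight gate. A minimal sketch of that gate; `fetch_token` is a hypothetical callable that hits /oauth/token and returns (token, expiry_epoch):

```python
import threading
import time

class TokenCache:
    """One refresher at a time, so token expiry never becomes a thundering herd."""

    def __init__(self, fetch_token, refresh_margin: int = 60):
        self._fetch = fetch_token
        self._margin = refresh_margin
        self._lock = threading.Lock()
        self._token: str | None = None
        self._exp = 0.0

    def get(self) -> str:
        if self._token and time.time() < self._exp - self._margin:
            return self._token
        with self._lock:  # single-flight gate: everyone else waits here
            if self._token and time.time() < self._exp - self._margin:
                return self._token  # another caller already refreshed
            self._token, self._exp = self._fetch()
            return self._token
```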
T4: Scope drift detected

```
Signal: scope-drift alert from auth refresh
├─ What scope is missing? → check GCC > Identity & Access > Applications > [app] > Permissions
├─ Was a permission removed by a tenant admin? → coordinate; restore or accept loss of capability
└─ Was a scope renamed in a tenant config push? → update GW_SCOPES env, redeploy
```
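The drift signal itself can be computed on every refresh. A minimal sketch, assuming a space-delimited `scope` claim and the GW_SCOPES env var named in the tree:

```python
import os

def missing_scopes(token_claims: dict) -> set[str]:
    """Scopes the deployment expects but the fresh token was not granted."""
    expected = set(os.environ.get("GW_SCOPES", "").split())
    granted = set(token_claims.get("scope", "").split())
    return expected - granted  # non-empty: fire the scope-drift alert
```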
T5: GUnit / runServer OOM (dev or staging)

```
Signal: OOMKilled or heap dump generated
├─ Is sample data set abnormally large? → reset dev DB, reload fixtures
├─ Is a recent change loading too many entities (no pagination, no filter)? → revert; add limit
└─ Is the JVM under-provisioned? → bump -Xmx in gradle.properties; restart
```
Each playbook is one page in the on-call runbook. Concrete commands, not prose.
## Playbook: 401 spike — secret rotation suspected
1. Rotate in GCC > Identity & Access > Applications > [app] > Generate Secret
2. `sops secrets.prod.<tenant>.sops.yaml` # set new GW_CLIENT_SECRET (and SECONDARY = old)
3. `git commit -m "rotate(secrets): incident-driven rotation $(date -Iseconds)"`
4. Trigger deploy of token service
5. Confirm token claims show iat > rotation timestamp: `kubectl exec ... -- /token-debug`
6. After 24h of zero failures from primary: remove SECONDARY
7. Open audit row with reason="incident-401-spike"
Each playbook ends with an audit-row insert so the next post-incident review has the full timeline.
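An illustrative audit-row insert for the 401-spike playbook; column names follow the integration_audit query shown later in this skill, and the values are placeholders:

```sql
-- Close out the incident with a row the next PIR can anchor its timeline on.
INSERT INTO integration_audit
    (correlation_id, actor, api_method, api_path, status_code, reason, at)
VALUES
    ('incident-20260415-401', 'oncall-engineer', 'POST', '/oauth/token',
     401, 'incident-401-spike', now());
```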
The post-incident review follows this template:

```markdown
# PIR: <date> — <one-line summary>
## Timeline
- HH:MM — first signal (alert link, dashboard screenshot)
- HH:MM — on-call paged
- HH:MM — root cause identified
- HH:MM — mitigation applied
- HH:MM — confirmation of recovery
## SLO impact
- Error budget consumed: X%
- Customer-impacting requests: N
- Compliance impact: <yes / no — if yes, NAIC notification window>
## Root cause
<the actual root cause, not a symptom; use 5-whys but don't perform; one paragraph max>
## What worked
<what the team did right; preserve these>
## What didn't
<what slowed the response; this drives action items>
## Action items (each with owner + target date)
- [ ] OWNER, BY: <date> — <concrete change>
- [ ] OWNER, BY: <date> — <concrete change>
## Recurrence prevention
<the one change that makes this exact incident impossible next time, not "be more careful">
```
The discipline lies in the action items having owners and dates. PIRs without those are theater.
A production-grade observability layer ships with all of the following:
```
# error_budget = 0.5%/month = ~3.6h/month
# fast-burn: 2% budget in 1h → page
sum:trace.http_request.errors{service:guidewire-integration}.as_count() /
  sum:trace.http_request.hits{service:guidewire-integration}.as_count() > 0.144  # 14.4x normal
```
-- "What was the integration trying to do during the 401 spike?"
SELECT correlation_id, actor, api_method, api_path, status_code, reason, at
FROM integration_audit
WHERE at BETWEEN '2026-04-15T03:00:00Z' AND '2026-04-15T03:30:00Z'
AND status_code = 401
ORDER BY at;
```yaml
# Alert config
title: "GW Cloud API 401 spike"
runbook: https://github.com/acme/guidewire-integration/blob/main/runbooks/401-spike.md
priority: P1
escalation: "platform-oncall"
```
The on-call clicks the runbook link from the page; the runbook is in the repo so it versions with the code; updates land in PRs reviewed like any other change.
| Symptom | First check | Likely cause |
|---|---|---|
| 401 spike on Cloud API | token cache age | reactive refresh, clock skew, secret rotation in flight |
| 409 storm on PATCH | concurrent writers + retry pattern | client retrying bare PATCH instead of GET-then-PATCH |
| 429 on /oauth/token | single-flight gate active? | thundering herd refresh from a stampede on token expiry |
| 429 on data-plane API | per-tenant quota | another integration sharing the tenant quota; growth past quota |
| 403 Forbidden despite valid token | scope-drift gate | tenant admin removed a permission |
| PKIX path building failed | cert chain in JVM trust store | private-CA cert renewal not propagated |
| Gosu OOM in production | recent deploy + heap dump | unbounded entity load (missing filter or pagination) |
| FNOL p99 latency spike | upstream lookup latency | external policy-resolution service degraded |
| Bind success rate drops below SLO | UW-issue volume + product mix | new product or new broker producing high-referral volume; not an outage |
| Audit table empty for an outage window | audit insert path failing | check the audit-write retry budget; never let audit failures block the request |
For deeper coverage (RED method vs USE method tradeoffs, multi-tenant dashboard partitioning, synthetic FNOL probes, chaos-engineering tests for token-cache resilience), see the implementation guide and API reference.
Related skills:

- guidewire-install-auth — produces the auth-side signals this skill alerts on (token cache, scope drift)
- guidewire-sdk-patterns — the structured GwError and 409 retry semantics this skill's triage trees assume
- guidewire-security-and-rbac — the integration_audit table this skill queries during incidents
- guidewire-ci-cd-pipeline — the deployment surface incident response interacts with (rollback, canary)