From marsai-dev-team
Implements comprehensive readiness probes (/readyz) and startup self-probes for V4-Company services. Goes beyond basic K8s liveness: validates every external dependency (database, cache, queue, TLS handshakes) and exposes per-dependency status with latency and TLS info. Designed to be consumed by Tenant Manager post-provisioning. Origin: Monetarie SaaS incident — product-console started successfully but MongoDB was silently unreachable (TLS mismatch with DocumentDB). K8s liveness passed, traffic routed, client hit errors. This skill ensures that never happens again.
npx claudepluginhub v4-company/marsai --plugin marsai-dev-team

This skill uses the workspace's default tool permissions.
Scan the project to detect ALL external dependencies:
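```bash
# TypeScript/Next.js: detect connection patterns
# -w matches whole words only, so "pg" does not match every identifier containing those letters;
# -E enables portable alternation; missing directories are silenced.
grep -rnwE 'MongoClient|mongoose|pg|Pool|redis|ioredis|amqplib|S3Client' package.json src/ app/ lib/ 2>/dev/null
```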
Build dependency map: PostgreSQL (pg/prisma), MongoDB (mongoose/mongodb), Redis/Valkey (ioredis), RabbitMQ (amqplib), S3 (aws-sdk), HTTP clients. For each, detect if TLS is configured (sslmode, tls=true, rediss://, amqps://).
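The TLS markers listed above can be checked mechanically from a connection string. The following is a minimal sketch, not the skill's actual implementation: the function name `detectTlsFromUrl` and the exact set of recognised schemes are assumptions. For PostgreSQL it counts only `sslmode` values `require`, `verify-ca`, and `verify-full` as TLS, following libpq semantics; individual drivers may interpret `sslmode` differently.

```typescript
// Sketch: infer whether a connection string requests TLS.
// The function name and the recognised schemes are illustrative assumptions.
function detectTlsFromUrl(rawUrl: string): boolean {
  const schemeEnd = rawUrl.indexOf("://");
  if (schemeEnd < 0) return false; // not a URL-style connection string
  const scheme = rawUrl.slice(0, schemeEnd).toLowerCase();

  // Parse the query string by hand so multi-host MongoDB URLs do not trip URL parsing.
  const queryStart = rawUrl.indexOf("?");
  const params = new URLSearchParams(queryStart >= 0 ? rawUrl.slice(queryStart + 1) : "");
  const flag = (name: string) => params.get(name)?.toLowerCase();

  switch (scheme) {
    case "rediss":
    case "amqps":
      return true;
    case "postgres":
    case "postgresql": {
      const mode = flag("sslmode");
      return mode === "require" || mode === "verify-ca" || mode === "verify-full";
    }
    case "mongodb":
      return flag("tls") === "true" || flag("ssl") === "true";
    case "mongodb+srv":
      // SRV connection strings enable TLS unless it is switched off explicitly.
      return flag("tls") !== "false" && flag("ssl") !== "false";
    default:
      return false; // redis://, amqp://, and anything unrecognised count as plaintext
  }
}
```

A URL-level check like this only shows what the configuration asks for. The runtime checkers still need to confirm the TLS state from the live connection options.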
SaaS deployment mode: TLS is MANDATORY for all database connections. No exceptions.
```json
{
  "status": "healthy",
  "checks": {
    "postgres": { "status": "up", "latency_ms": 2, "tls": true },
    "mongodb": { "status": "up", "latency_ms": 3, "tls": true },
    "rabbitmq": { "status": "up", "connected": true },
    "valkey": { "status": "up", "latency_ms": 1, "tls": false }
  },
  "version": "1.2.3",
  "deployment_mode": "saas"
}
```
- status: "healthy" if ALL checks pass, "unhealthy" if ANY fails
- checks: per dependency, latency_ms (for connections with ping) and tls (boolean)
- deployment_mode: from DEPLOYMENT_MODE env or inferred from config
- version: from build info or VERSION env

Each checker MUST verify TLS state from the connection options (e.g., mongoClient.options?.tls for TS). This is what would have caught the Monetarie bug.
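The aggregation rules above can be sketched as a small pure function. This is an illustration under assumptions: the names `DependencyCheck`, `ReadyzBody`, and `buildReadyzBody` are hypothetical, the environment is passed in as a plain record, and the `"unknown"` fallbacks stand in for the "inferred from config" and "build info" paths, which the source does not spell out.

```typescript
// Hypothetical types mirroring the JSON contract shown above.
interface DependencyCheck {
  status: "up" | "down";
  latency_ms?: number; // only for connections that support a ping
  tls?: boolean;
  connected?: boolean;
}

interface ReadyzBody {
  status: "healthy" | "unhealthy";
  checks: Record<string, DependencyCheck>;
  version: string;
  deployment_mode: string;
}

// Assumption: env is passed in (e.g. process.env) so the function stays pure and testable.
function buildReadyzBody(
  checks: Record<string, DependencyCheck>,
  env: Record<string, string | undefined>,
): ReadyzBody {
  // "healthy" only if every check is up; a single failure flips the whole response.
  const allUp = Object.values(checks).every((c) => c.status === "up");
  return {
    status: allUp ? "healthy" : "unhealthy",
    checks,
    version: env["VERSION"] ?? "unknown", // placeholder fallback, not from the source
    deployment_mode: env["DEPLOYMENT_MODE"] ?? "unknown", // placeholder fallback
  };
}
```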
RabbitMQ note: For RabbitMQ, TLS detection MUST inspect the connection URL scheme (amqps:// = TLS, amqp:// = plaintext). The checker constructor MUST accept the connection URL and derive tls: true/false from the scheme.
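A minimal sketch of a checker that follows the RabbitMQ rule might look like this. The class name `RabbitMqChecker` and the injected `connect` function are assumptions made so the example stays self-contained; in a real service the injected function would typically wrap the AMQP client's own connect call.

```typescript
// Hypothetical connection factory: resolves with something closable, rejects on failure.
type AmqpConnect = (url: string) => Promise<{ close(): Promise<void> }>;

class RabbitMqChecker {
  readonly tls: boolean;
  private readonly url: string;
  private readonly connect: AmqpConnect;

  constructor(url: string, connect: AmqpConnect) {
    // TLS state is derived from the URL scheme alone, as required above.
    if (url.startsWith("amqps://")) {
      this.tls = true;
    } else if (url.startsWith("amqp://")) {
      this.tls = false;
    } else {
      // The URL is deliberately left out of the message: it may contain credentials.
      throw new Error("RabbitMQ URL must start with amqp:// or amqps://");
    }
    this.url = url;
    this.connect = connect;
  }

  async check(): Promise<{ status: "up" | "down"; connected: boolean; tls: boolean }> {
    try {
      const conn = await this.connect(this.url);
      await conn.close();
      return { status: "up", connected: true, tls: this.tls };
    } catch {
      return { status: "down", connected: false, tls: this.tls };
    }
  }
}
```

Deriving `tls` in the constructor means the value is available for the readiness response even when the broker is unreachable.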
"SaaS deployment mode: TLS is MANDATORY" means two separate things that are both required:
| Concern | Responsibility | Mechanism |
|---|---|---|
| Surface TLS state | /readyz probe | Reports "tls": true/false per dependency in JSON response |
| Enforce TLS | Bootstrap / connection code | MUST refuse to start if DEPLOYMENT_MODE=saas and TLS is not configured |
MUST implement both. Surfacing without enforcement means the service starts silently insecure. Enforcement without surfacing means the Tenant Manager cannot confirm TLS posture post-provisioning. Neither alone is sufficient.
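The enforcement half can be sketched as a guard that bootstrap code runs before opening any connection. The names `DependencyConfig` and `assertTlsForSaas` are hypothetical, and the `tls` flag is assumed to come from the service's own configuration parsing. How the caller reacts to the thrown error (abort the boot, or record the failure) is left to the bootstrap code.

```typescript
// Hypothetical shape: one entry per configured dependency.
interface DependencyConfig {
  name: string;
  tls: boolean; // whether the configured connection requests TLS
}

// Throws when DEPLOYMENT_MODE is "saas" and any dependency is configured without TLS.
function assertTlsForSaas(deploymentMode: string | undefined, deps: DependencyConfig[]): void {
  if (deploymentMode !== "saas") return; // enforcement applies to SaaS mode only
  const insecure = deps.filter((d) => !d.tls).map((d) => d.name);
  if (insecure.length > 0) {
    throw new Error(
      `DEPLOYMENT_MODE=saas requires TLS, but it is not configured for: ${insecure.join(", ")}`,
    );
  }
}
```

Called with `[{ name: "postgres", tls: true }, { name: "valkey", tls: false }]` in SaaS mode, the guard throws and names `valkey`. In any other mode it returns without checking.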
Same pattern at app/api/admin/health/readyz/route.ts: ping each dependency, measure latency, check TLS, return 200/503 with the same JSON contract. Use Response.json() with appropriate status code.
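One way to structure such a handler is sketched below. The `Checker` shape, `runChecks`, and `makeReadyzHandler` are assumptions for illustration: each checker is taken to resolve with its TLS state on a successful ping and to reject on failure. `Response.json()` is the standard Fetch API static method.

```typescript
type CheckResult = { status: "up" | "down"; latency_ms: number; tls?: boolean };
// Hypothetical checker shape: resolves with the TLS state when the ping succeeds, rejects otherwise.
type Checker = () => Promise<{ tls: boolean }>;
type Meta = { version: string; deployment_mode: string };

// Runs every checker in parallel and records status, latency, and TLS per dependency.
async function runChecks(checkers: Record<string, Checker>): Promise<Record<string, CheckResult>> {
  const results: Record<string, CheckResult> = {};
  await Promise.all(
    Object.entries(checkers).map(async ([name, ping]) => {
      const started = Date.now();
      try {
        const { tls } = await ping();
        results[name] = { status: "up", latency_ms: Date.now() - started, tls };
      } catch {
        results[name] = { status: "down", latency_ms: Date.now() - started };
      }
    }),
  );
  return results;
}

// Builds the GET handler: 200 when every dependency is up, 503 otherwise.
function makeReadyzHandler(checkers: Record<string, Checker>, meta: Meta) {
  return async function GET(): Promise<Response> {
    const checks = await runChecks(checkers);
    const healthy = Object.values(checks).every((c) => c.status === "up");
    return Response.json(
      { status: healthy ? "healthy" : "unhealthy", checks, ...meta },
      { status: healthy ? 200 : 503 },
    );
  };
}

// In route.ts this would be wired up roughly as:
//   export const GET = makeReadyzHandler(checkers, { version, deployment_mode });
```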
| Stack | Ready Path | Health Path |
|---|---|---|
| Next.js | /api/admin/health/readyz | same as Ready Path |
Next.js exposes a single /api/admin/health/readyz endpoint which serves both readiness and health checks.
The app MUST run all readiness checks at boot and log results BEFORE accepting traffic.
Key insight: /health is no longer just "process alive." It's "startup self-probe passed AND runtime dependency state is healthy." A pod that starts but can't reach its databases will be restarted by K8s instead of silently serving errors.
- /health reflects the self-probe result.
- Self-probe passes: /health reports healthy and /readyz operates normally.
- Self-probe fails: /health reports unhealthy and K8s restarts the pod via the liveness probe.
- SELF_PROBE_INTERVAL env

Next.js instrumentation.ts register() executes once at process startup and BLOCKS before the first request is served. This IS the self-probe point for Next.js. Use it.
MUST NOT call process.exit() on probe failure inside register(). Doing so prevents K8s from collecting a useful log tail. Instead:
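1. In register(): run all dependency checks; if any fail, set a module-level flag (let startupHealthy = false).
2. The /api/admin/health/readyz route handler checks this flag.
3. K8s probes /api/admin/health/readyz, sees 503, and withholds traffic. No process.exit() needed.

```typescript
// instrumentation.ts
// Assumption: runAllChecks() and DependencyCheck live in a helper module; this path is hypothetical.
import { runAllChecks, type DependencyCheck } from "./lib/health/checks";

let startupHealthy = false;
let startupChecks: Record<string, DependencyCheck> = {};

export async function register() {
  try {
    const results = await runAllChecks();
    startupChecks = results;
    startupHealthy = Object.values(results).every(c => c.status === "up");
  } catch (err) {
    // An unexpected throw counts as a failed probe: the flag stays false.
    console.error("startup self-probe threw", err);
  }
  // log results here; the process stays alive regardless
}

export { startupHealthy, startupChecks };
```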
The /api/admin/health/readyz route imports startupHealthy and startupChecks from instrumentation.ts and returns 200 or 503 accordingly.
These two mechanisms are complementary, not redundant:
| Mechanism | When | Purpose |
|---|---|---|
| Self-probe | STARTUP, before first request | Validates dependencies are reachable before traffic is allowed |
| /readyz | RUNTIME, per request | Validates dependencies are still reachable as K8s readinessProbe |
| /health | RUNTIME, per request | Reflects self-probe result AND runtime circuit-breaker state |
A pod that passes startup self-probe can still fail /readyz later (e.g., DB goes away mid-run). A pod that fails self-probe should never receive traffic in the first place. Both gates are necessary.
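The two gates can be combined in one small decision function, sketched here under assumptions: `resolveReadiness` and its return shape are hypothetical, and a failed startup probe is treated as sticky, in line with the statement that such a pod should never receive traffic.

```typescript
type RuntimeChecks = Record<string, { status: "up" | "down" }>;

// Gate 1: the startup self-probe must have passed.
// Gate 2: every runtime dependency check must currently be up.
function resolveReadiness(
  startupHealthy: boolean,
  runtimeChecks: RuntimeChecks,
): { httpStatus: 200 | 503; reason: string } {
  if (!startupHealthy) {
    return { httpStatus: 503, reason: "startup self-probe failed" };
  }
  const failing = Object.entries(runtimeChecks)
    .filter(([, check]) => check.status !== "up")
    .map(([name]) => name);
  if (failing.length > 0) {
    return { httpStatus: 503, reason: `runtime check failed: ${failing.join(", ")}` };
  }
  return { httpStatus: 200, reason: "ready" };
}
```

A pod whose database disappears mid-run fails the second gate while the first stays green, which is the case the runtime probe exists to catch.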
Verify /readyz endpoint, RunSelfProbe function, and /health self-probe wiring all exist.
| Rationalization | Why It's WRONG | Required Action |
|---|---|---|
| "K8s TCP probe is enough" | TCP ≠ app ready. Monetarie incident: pod alive, Mongo dead. | Implement /readyz |
| "/health covers it" | /health without self-probe is blind to dep failures | Add self-probe, wire to /health |
| "TLS check is overhead" | TLS mismatch = silent failure for every query | Check TLS per dependency |
| "Only backend needs this" | Console (frontend) caused the incident | All apps, no exceptions |
| "Dependencies are reliable" | Networks partition. Configs drift. Certs expire. | Check every time |
| "Too many checks slow startup" | Bounded per-dependency timeouts keep overhead low. Incident costs hours. | No excuse |
| "Service has only one dependency" | One broken dependency = total outage. Complexity argument is irrelevant at zero scale. Self-probe is three lines of code. | Implement self-probe, no exceptions |
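The "bounded per-dependency timeouts" mentioned in the table can be sketched with a generic wrapper. The helper name `withTimeout` and the millisecond values are assumptions; the source does not prescribe a specific timeout.

```typescript
// Rejects if the wrapped work does not settle within `ms` milliseconds.
// Note: the underlying operation is not cancelled, only abandoned by the caller.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

Wrapping each dependency ping, for example `withTimeout(pingPostgres(), 2000, "postgres")` with a hypothetical `pingPostgres`, keeps the worst-case probe time close to the largest single timeout when checks run in parallel.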