Skill

service-reliability-design

Designs service reliability mechanisms using SLO decomposition, bottleneck analysis, load shaping, queue fairness, and observability-by-design. Use for admission control, throttling, workload separation, and operational controls.

backend

monitoring

npx claudepluginhub andrew0928/andrew-skills --plugin architect

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/architect:service-reliability-design

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Use this skill when the main problem is not only correctness, but whether the service can keep its promise under load, delay, spikes, and limited capacity.

Supporting Files

- README.md
- agents/openai.yaml
- agents/overview.yaml
- references/lineup-and-queue-fairness-principles.md
- references/slo-and-toc-principles.md
- references/throttle-and-qos-principles.md

SKILL.md

225 lines · ~2.6k tokens

Stats

Parent stars: 34

Parent forks: 5

Maintenance: Good

Last Commit: May 25, 2026

Andrew Service Reliability Design

Overview

Use this skill when the main problem is not only correctness, but whether the service can keep its promise under load, delay, spikes, and limited capacity.

Treat service quality as an architecture problem. Start from the promised outcome, turn it into measurable indicators, find the real bottleneck, then shape traffic and feedback so the whole system stays predictable instead of collapsing under pressure.

Working Style

Write in Traditional Chinese when the user or repo context is Chinese.
Start from the service promise and user-visible outcome before discussing scaling tricks.
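Distinguish clearly:
- SLA: the external agreement made with users or customers
- SLO: the internal target the service is operated against
- SLI: the measured indicator that shows whether the SLO is being met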
Treat application and domain metrics as design outputs, not as afterthoughts for Ops.
Use bottleneck reasoning before adding capacity blindly.
Prefer global optimization over local optimization.
Separate high-value and low-value workloads when they compete for the same constrained resource.
Decide explicitly how the system should reject, defer, queue, degrade, or reroute work.
Treat queue fairness, feedback quality, and abandonment handling as part of reliability, not only UX polish.
Validate control logic and monitoring logic with small executable models before production rollout.

Workflow

1. Problem Analysis

Define the exact service promise in user-visible terms:
- completion time
- success ratio
- freshness
- availability
- queue wait
- admission fairness
Clarify the pressure shape:
- steady load
- short spikes
- seasonal burst
- shared downstream dependency
- mixed-priority traffic
State what failure looks like:
- delayed too long
- dropped too late
- wrong users admitted first
- overload spreading downstream
- high-priority traffic blocked by low-priority traffic

2. Service Objective Model

Define the SLO before proposing the mechanism.
Decompose the top-level promise into measurable SLIs at real observation points.
For staged or asynchronous flows, separate:
- preparation time
- queue wait time
- execution time
- downstream handoff time
Distinguish internal operating goals from external agreements.
Mark which indicators are:
- hard service guarantees
- early warning signals
- optimization targets
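
A minimal sketch of the decomposition above, assuming each job records five timestamps at its observation points. The `JobTimestamps` fields and the `sli_ratio` helper are hypothetical names chosen for illustration, not part of any specific library.

```python
from dataclasses import dataclass


@dataclass
class JobTimestamps:
    """Hypothetical observation points for one asynchronous job, in seconds."""
    submitted: float
    enqueued: float
    started: float
    finished: float
    handed_off: float


def stage_durations(t: JobTimestamps) -> dict:
    """Split one job's end-to-end time into the four stages plus the total."""
    return {
        "preparation": t.enqueued - t.submitted,
        "queue_wait": t.started - t.enqueued,
        "execution": t.finished - t.started,
        "handoff": t.handed_off - t.finished,
        "end_to_end": t.handed_off - t.submitted,
    }


def sli_ratio(jobs, stage, threshold_s):
    """Fraction of jobs whose duration for `stage` is within `threshold_s`."""
    if not jobs:
        raise ValueError("no observations in this window")
    good = sum(1 for job in jobs if stage_durations(job)[stage] <= threshold_s)
    return good / len(jobs)
```

Comparing `sli_ratio(jobs, "queue_wait", 5)` with `sli_ratio(jobs, "execution", 5)` shows which stage is consuming the end-to-end budget, which is the point of tracing the top-level promise to stage-level indicators.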

3. Bottleneck And Workload Model

Identify where work accumulates and why.
Find the actual constrained resource:
- worker throughput
- queue capacity
- storage IOPS
- external provider quota
- lock contention
- human-facing checkout slots
Model workload classes explicitly when they do not deserve the same service level.
Estimate safe capacity and failure threshold with real units, not vague adjectives.
Use queue buildup, wait-time drift, and backlog growth to infer whether the bottleneck is arrival rate, processing speed, or downstream slowness.
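
As a rough illustration of reading backlog signals, the sketch below assumes per-interval counts of arrivals and completions are available. The function names and the 10% tolerance are assumptions. Counts alone cannot separate slow processing from a slow downstream dependency, so the second label only narrows the search.

```python
def backlog_series(arrivals, completions, initial=0):
    """Backlog after each interval: previous backlog + arrivals - completions."""
    backlog = initial
    series = []
    for arrived, completed in zip(arrivals, completions):
        backlog = max(0, backlog + arrived - completed)
        series.append(backlog)
    return series


def diagnose(arrivals, completions, tolerance=0.1):
    """Compare first-half and second-half average rates of the window."""
    if len(arrivals) < 2 or len(arrivals) != len(completions):
        raise ValueError("need two equal-length series with at least 2 samples")
    half = len(arrivals) // 2

    def mean(values):
        return sum(values) / len(values)

    arrive_before, arrive_after = mean(arrivals[:half]), mean(arrivals[half:])
    done_before, done_after = mean(completions[:half]), mean(completions[half:])

    if arrive_after <= done_after:
        return "keeping up"
    if arrive_after > arrive_before * (1 + tolerance):
        return "arrival rate grew"
    if done_after < done_before * (1 - tolerance):
        return "completion rate dropped"  # slow processing or slow downstream
    return "sustained overload"
```

For example, `diagnose([10, 10, 20, 20], [10, 10, 12, 12])` points at arrival growth, while `diagnose([10, 10, 10, 10], [10, 10, 5, 5])` points at a drop in completion rate with unchanged demand.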

4. Capacity Control And Admission

Decide how the system controls intake before overload becomes irreversible.
Choose the admission strategy that matches the problem:
- immediate reject
- bounded queue
- delayed execution
- rate limit
- token or budget allocation
- feature toggle
- workload separation
Define what "service amount" means in measurable units.
Define what happens after the limit is reached:
- reject
- retry later
- join queue
- degrade feature
- switch to alternate path
Reserve capacity deliberately for the workloads whose SLO matters most.
Prefer simple, explainable control rules over opaque heuristics unless the added complexity is justified.
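
One simple, explainable rule is a per-window budget with capacity reserved for the high-priority class. The sketch below assumes that "service amount" is counted in requests per window and that there are only two classes. `ClassBudget` and its return values are hypothetical.

```python
class ClassBudget:
    """Per-window admission budget with capacity reserved for high priority."""

    def __init__(self, total, reserved_high):
        if not 0 <= reserved_high <= total:
            raise ValueError("reserved_high must be between 0 and total")
        self.total = total
        self.reserved_high = reserved_high
        self.used_high = 0
        self.used_low = 0

    def admit(self, priority):
        """Return 'accept', 'retry_later', or 'reject' for one request."""
        if self.used_high + self.used_low >= self.total:
            return "reject"  # the whole window is spent
        if priority == "high":
            self.used_high += 1
            return "accept"
        # Low priority may only use capacity not reserved for high priority.
        if self.used_low < self.total - self.reserved_high:
            self.used_low += 1
            return "accept"
        return "retry_later"

    def reset_window(self):
        """Call at the start of each window (for example, once per second)."""
        self.used_high = 0
        self.used_low = 0
```

With `ClassBudget(total=5, reserved_high=2)`, the fourth low-priority request in a window gets `retry_later` while two high-priority requests are still accepted, so the behavior after the limit is explicit for each class.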

5. Queue And Lineup Design

Use queueing when deferral is acceptable and fairness or ordering matters.
Define queue semantics explicitly:
- ordering rule
- fairness rule
- max queue length
- admission window
- abandonment timeout
- duplicate join rule
Make user-facing status cheap to query and easy to explain:
- current state
- queue position
- estimated wait
- may-enter signal
Keep queue data structures aligned with the dominant operations.
Support dynamic tuning and operational intervention when real load differs from prediction.
Decide when users should be told not to enter the queue at all because the system already knows they will miss the objective.
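
The queue semantics above can be sketched as a small in-memory lineup. This is an illustration under stated assumptions, not a production design: FIFO ordering by ticket number, a duplicate join keeps the original place, a user who stops polling for `abandon_after_s` seconds is treated as abandoned, and callers pass the clock in as `now`. All names are hypothetical.

```python
from collections import deque


class Lineup:
    def __init__(self, max_length, abandon_after_s, admit_rate_per_s):
        self.max_length = max_length
        self.abandon_after_s = abandon_after_s
        self.admit_rate_per_s = admit_rate_per_s
        self._ticket = {}       # user_id -> ticket number
        self._last_seen = {}    # user_id -> time of last join or status poll
        self._order = deque()   # user_ids in ticket order
        self._next_ticket = 1
        self._last_admitted = 0

    def join(self, user_id, now):
        """Return the user's ticket, or None when the queue is full."""
        if user_id in self._ticket:          # duplicate join keeps the place
            self._last_seen[user_id] = now
            return self._ticket[user_id]
        if len(self._ticket) >= self.max_length:
            return None
        ticket = self._next_ticket
        self._next_ticket += 1
        self._ticket[user_id] = ticket
        self._last_seen[user_id] = now
        self._order.append(user_id)
        return ticket

    def status(self, user_id, now):
        """O(1) status: position is derived from ticket numbers, not a scan."""
        ticket = self._ticket.get(user_id)
        if ticket is None:
            return None
        self._last_seen[user_id] = now       # polling doubles as a heartbeat
        position = ticket - self._last_admitted
        return {"position": position,
                "estimated_wait_s": position / self.admit_rate_per_s}

    def admit_next(self, now):
        """Admit the next user who is still present; skip abandoned ones."""
        while self._order:
            user_id = self._order.popleft()
            ticket = self._ticket.pop(user_id)
            last_seen = self._last_seen.pop(user_id)
            self._last_admitted = ticket
            if now - last_seen <= self.abandon_after_s:
                return user_id
        return None
```

Deriving position from ticket numbers keeps the status query constant-time, at the cost of counting not-yet-detected abandoned users as still ahead, so the reported position is an upper bound.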

6. Observability And Metrics

Treat metric emission, dashboard design, and alert points as part of the design.
Emit application and domain metrics directly from the service when infrastructure metrics cannot express the real promise.
Prefer metrics that explain the situation, such as:
- end-to-end completion time
- stage latency
- queue length
- predicted delay
- accepted vs rejected count
- executed count
- abandonment count
- fairness violation count
- workload split volume
- capacity utilization
Use dashboards to support diagnosis, not decoration.
Define alert rules that map to real operator actions.
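
A minimal sketch of service-emitted metrics that keep wait, execution, rejection, and abandonment distinguishable. It assumes an in-process recorder for a POC. `ServiceMetrics` is a hypothetical class, and the p95 uses the nearest-rank method; a real service would more likely export these through its existing metrics library.

```python
import math
from collections import Counter


class ServiceMetrics:
    def __init__(self):
        self.counts = Counter()
        self.wait_s = []
        self.exec_s = []

    def record(self, event):
        """Count an admission outcome: 'accepted', 'rejected', or 'abandoned'."""
        self.counts[event] += 1

    def record_completed(self, wait_s, exec_s):
        """Record one executed job with its queue wait and execution time."""
        self.counts["executed"] += 1
        self.wait_s.append(wait_s)
        self.exec_s.append(exec_s)

    @staticmethod
    def _p95(values):
        """Nearest-rank 95th percentile; None when there are no samples."""
        if not values:
            return None
        ordered = sorted(values)
        rank = math.ceil(0.95 * len(ordered))
        return ordered[rank - 1]

    def snapshot(self, queue_length):
        offered = self.counts["accepted"] + self.counts["rejected"]
        return {
            "queue_length": queue_length,
            "executed": self.counts["executed"],
            "abandoned": self.counts["abandoned"],
            "rejection_ratio": self.counts["rejected"] / offered if offered else 0.0,
            "p95_wait_s": self._p95(self.wait_s),
            "p95_exec_s": self._p95(self.exec_s),
        }
```

An alert rule can then be written against one snapshot field and paired with one operator action, for example a rising `p95_wait_s` with flat `p95_exec_s` suggesting a lower admission threshold rather than more debugging of the workers.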

7. Operational Controls And Response

Protect the bottleneck instead of feeding it blindly.
Decide what operators or automated controls can change safely:
- worker count
- queue limit
- admission threshold
- per-class capacity
- feature toggle
- manual removal
Define the emergency path when the SLO is already impossible to meet.
Push the signal upstream when admitting more work only increases failure and cost.
Consider cost together with SLO when long-term operation matters.
If a local optimization shifts the bottleneck elsewhere, update the control plan accordingly.
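
The emergency path can start from a very small rule. The sketch below assumes the predicted delay is simply backlog divided by current throughput and that the only two responses are to keep admitting or to signal upstream. The function name and return values are hypothetical.

```python
def intake_decision(backlog, throughput_per_s, slo_wait_s):
    """Decide whether admitting more work can still meet the wait objective."""
    if throughput_per_s <= 0:
        return "signal_upstream"  # nothing is draining; more intake only adds failure
    predicted_wait_s = backlog / throughput_per_s
    if predicted_wait_s <= slo_wait_s:
        return "admit"
    return "signal_upstream"
```

With a 15-second objective and a throughput of 10 jobs per second, a backlog of 100 is still admissible (10 seconds predicted), while a backlog of 200 is not (20 seconds predicted).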

8. POC And Simulation

Build the thinnest executable model that proves the service-control logic and the metrics model.
Use controllable traffic patterns:
- normal load
- burst load
- idle gaps
- mixed-priority traffic
- slow downstream
Prefer CSV, charts, or simple dashboards for early comparison if that is enough to expose the behavior.
Validate both the control mechanism and the observability plan in the same POC.
Use cheap local models before production-scale infrastructure when the goal is to validate the design.
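
A thin executable model can be as small as the tick-based simulation below. It assumes a single bounded queue, a fixed per-tick capacity, and immediate rejection when the queue is full. `simulate` and `to_csv` are hypothetical helpers, and the CSV columns double as a first draft of the metrics model.

```python
import csv
import io


def simulate(arrivals_per_tick, capacity_per_tick, max_queue):
    """Run a bounded-queue model and return one metrics row per tick."""
    queue = 0
    rows = []
    for tick, arrived in enumerate(arrivals_per_tick):
        accepted = min(arrived, max_queue - queue)   # admission at the boundary
        rejected = arrived - accepted
        queue += accepted
        executed = min(queue, capacity_per_tick)     # the bottleneck drains here
        queue -= executed
        rows.append({
            "tick": tick,
            "arrived": arrived,
            "accepted": accepted,
            "rejected": rejected,
            "executed": executed,
            "queue_length": queue,
        })
    return rows


def to_csv(rows):
    """Render rows as CSV text for charting or side-by-side comparison."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Running `simulate([2, 2, 10, 10, 0, 0], capacity_per_tick=3, max_queue=5)` covers normal load, a burst, and an idle gap in one pass, and the per-tick rows show when rejection starts and how long the queue takes to drain.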

9. Evaluation

Compare designs by whether they preserve the intended service objective under realistic pressure.
Prefer the design whose overload behavior is explicit, diagnosable, and operationally controllable.
If queueing is chosen, evaluate both system protection and user fairness.
If throttling is chosen, evaluate both protection strength and lost business value.
If workload separation is chosen, confirm it protects the high-priority SLO rather than only moving metrics around.
Name the assumptions that would invalidate the design later.

Design Checks

Revisit the design if the SLO cannot be stated as something measurable.
Revisit the decomposition if the top-level promise cannot be traced to stage-level indicators.
Revisit the bottleneck analysis if scale-out is proposed before backlog and delay patterns are understood.
Revisit the control model if overload is detected only after downstream damage is already visible.
Revisit the queue design if status queries are expensive, unfairness is unexplained, or abandonment is ignored.
Revisit the workload model if low-value traffic can still consume capacity needed by high-value traffic.
Revisit the metrics if they cannot distinguish wait, execution, rejection, and backlog growth.
Revisit the dashboard if operators can see charts but still cannot decide what to do.
Revisit the POC if it proves only average load and hides the spike behavior that actually matters.

Default Output Shape

When the user asks for service reliability design, respond in this order unless they request another structure:

Problem Analysis
Service Objective Model
Bottleneck And Workload Model
Capacity Control And Admission
Queue And Lineup Design
Observability And Metrics
Operational Controls And Response
POC
Evaluation
Risks And Refactor Triggers

Use these expectations for each section:

Problem Analysis: define the promised outcome, load shape, failure mode, and the real pressure source.
Service Objective Model: define SLA, SLO, SLI, observation points, and hard-vs-soft indicators.
Bottleneck And Workload Model: define constrained resources, workload classes, backlog signals, and safe capacity assumptions.
Capacity Control And Admission: define how work is accepted, deferred, rejected, or degraded.
Queue And Lineup Design: define fairness, position/status feedback, abandonment policy, and queue-boundary rules.
Observability And Metrics: define application metrics, dashboards, alerts, and diagnosis signals.
Operational Controls And Response: define knobs, escalation path, upstream protection, and cost-aware actions.
POC: propose the smallest simulation or runnable slice that exposes overload behavior and metrics usefulness.
Evaluation: compare alternatives and state the decision.
Risks And Refactor Triggers: name what traffic, dependency, or business changes would force redesign.

References

Read references/slo-and-toc-principles.md when the task is about decomposing a service promise into SLIs, finding bottlenecks, or designing dashboards and controls around SLOs.
Read references/throttle-and-qos-principles.md when the task is about rate limiting, admission control, service-amount modeling, or choosing how to reject, defer, or reserve capacity.
Read references/lineup-and-queue-fairness-principles.md when the task is about waiting-room design, fairness, polling status, queue-state modeling, or operator controls for large waiting populations.
Keep article-specific notes in references/.
Promote only stable cross-article guidance back into this SKILL.md.
