Help us improve
Share bugs, ideas, or general feedback.
From rampstack-skills
Implements operational discipline for feature flags: types, naming, lifecycle, targeting, rollouts, governance, stale flag cleanup.
npx claudepluginhub rampstackco/claude-skills --plugin rampstack-skillsHow this skill is triggered — by the user, by Claude, or both
Slash command
/rampstack-skills:feature-flaggingThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
A senior engineer's playbook for using feature flags well, not just frequently.
Produces a feature flag management playbook covering taxonomy, rollout strategy, monitoring, cleanup, and governance. Use when asked to document flag practices or create a rollout plan.
Detects feature flag providers and usage in code, catalogs definitions and categories, audits staleness, recommends naming conventions, A/B testing, and rollout strategies.
Guides feature flag design, gradual rollouts, A/B testing, kill switches. Recommends LaunchDarkly, Unleash, Flagsmith based on flag count and needs for safer deployments.
Share bugs, ideas, or general feedback.
A senior engineer's playbook for using feature flags well, not just frequently.
Feature flags are infrastructure. Treated as such, they enable kill switches, gradual rollouts, A/B experiments, permission gates, and operational toggles without redeploys. Treated casually, they become the largest accumulating technical debt in your codebase: thousands of dead flags, conflicting evaluation logic, brittle targeting, and a permission surface no one fully understands.
This skill is the discipline that prevents the second outcome. It assumes you have a feature flag platform (LaunchDarkly, Flagsmith, Split.io, VWO FME, GrowthBook, Statsig, PostHog, Optimizely; the platform does not matter for the principles). It assumes your engineering team can implement targeting rules and SDK integration. The hard part is the operational discipline, and that is what is here.
When to use this skill: any time you are about to introduce a flag, modify a flag, audit existing flags, or design a flag-related governance policy.
The skill spans the operational lifecycle of a flag from creation through retirement. Flag types and the discipline of not mixing them. Naming conventions that survive code review. Lifecycle expectations baked in at creation, not bolted on later. Targeting rules that compose without fragility. Rollout strategies that match the risk profile of the launch. Stale flag management on a quarterly cadence. Governance and permissions that balance access with audit. Performance considerations so flags do not become a latency tax. Testing patterns that cover both branches. Rollback discipline. Observability across rollouts.
The skill does not cover experiment design; for hypothesis writing, sample size, MDE, and the discipline of arriving at a defensible decision, see the experiment-design skill. It does not cover statistical analysis or variance reduction; those live in the experimentation-analytics skill. It does not cover platform-specific tooling; for MCP commands, auth, and platform-specific configuration, consult the chosen platform's documentation. This skill produces the operational shape; the platform implements it.
Mixing flag types is the root cause of most flag mess. A flag is one of five things; commit at creation and do not let it drift.
Release flag. Code is in production but disabled. The new feature ships dark, gets toggled on for a percentage, then ramps to 100. Lifetime: short, days to weeks. Lifecycle: clean removal after launch. Common name prefix: release_.
Experiment flag. Users are randomly assigned to variants; conversion is measured. The flag controls which variant a user sees. Lifetime: medium, one to six weeks. Lifecycle: variant chosen, code paths consolidated, flag removed. Common name prefix: exp_.
Operational flag. A kill switch or circuit breaker that lets ops disable a misbehaving feature without a redeploy. Lifetime: long-lived, often years. Lifecycle: usually never removed; remains as standby. Common name prefix: ops_.
Permission flag. Controls feature access by plan tier, customer cohort, or region. The free tier sees one set of features; the enterprise tier sees another. Lifetime: long-lived. Lifecycle: managed alongside billing and access infrastructure. Common name prefix: perm_.
Configuration flag. Lets some customers see different behavior based on contractual configuration. White-label tenants, regulated regions, custom rollout schedules per account. Lifetime: long-lived. Lifecycle: governed by sales and product agreements. Common name prefix: cfg_.
Each type has different lifecycle, governance, and removal expectations. Mixing them in one flag (the flag is both a kill switch AND a permission gate AND now we are using it for an experiment) is the most common source of flag mess. When the flag's purpose changes, create a new flag and migrate. Do not overload an existing one.
A flag name encodes type, owner, and purpose. Without that encoding, a flag list at month nine is unreadable and the cleanup playbook from references/stale-flag-cleanup-playbook.md cannot tell what is safe to remove.
The convention this skill recommends is <type>_<owner>_<semantic_name>_<version_or_date>:
release_checkout_redesign_2026q2exp_billing_pricing_v2ops_search_circuit_breakerperm_enterprise_audit_logcfg_tenant_acme_custom_dashboardPick snake_case or kebab-case once, organization-wide, and stick with it. Mixing both in the same platform produces typo-driven bugs that bite at 3 AM. Vague names die a slow death: new_feature, temp_toggle, test_flag, pricing_update_v3. Within months, no one knows which pricing_update shipped and which is dead.
For deeper coverage including the table of typed prefixes, owner conventions, and the migration plan for existing badly-named flags, see references/flag-naming-conventions.md.
Every flag has five life phases. Each phase has explicit entry and exit criteria. Skipping phases is how flag mess accumulates.
Birth. Flag created with explicit metadata: owner, type, target removal date (for release and experiment flags), rollout plan, monitoring approach. The metadata is not optional; without it the flag has no end-of-life.
Adolescence. The feature behind the flag is being built. Code paths exist for both the disabled (current production) branch and the enabled (new) branch. Both are tested. The flag remains off in production.
Launch. Production rollout begins. Percentage starts low (1 or 5 percent), monitored at each step, ramps if metrics hold. Ramp gates documented in references/flag-rollout-strategies.md.
Maturity. The flag is at 100 percent rollout. The new code path is the production path. Monitoring continues for at least 30 days to catch issues that did not show up during the ramp.
Death. The flag is removed from code (PR that deletes the gating logic) and removed from the platform. The audit trail records the removal.
The asymmetry: birth is fast (one PR creates the flag and gates the new code), death requires intentional cleanup. Most flag mess is unfinished death. Birth-and-death have to be planned together; the death plan is part of the birth metadata.
For the per-phase checklist, see references/flag-lifecycle-checklist.md.
A targeting rule is the boolean expression that determines whether a user, account, or request gets the treatment branch. There are four useful target dimensions:
user.email_domain == "acme.com", user.signup_date > "2026-01-01", user.plan == "enterprise".account.tier == "enterprise", account.region == "EU", account.feature_x_enabled == true.request.country == "US", request.device_type == "mobile", request.api_version >= 3.time > "2026-06-01T00:00:00Z", time < "2026-12-31T23:59:59Z".Compose with AND, OR, and NOT. Keep the expressions simple. If your rule needs three nested clauses, your taxonomy is wrong. Either define a segment (a named group of users with shared attributes) and target the segment, or split into multiple flags.
The most common pitfall: targeting on attributes that change frequently. If the rule is user.last_login_date > "2026-04-01", a user who logs in on May 1 sees the treatment, then their value changes the next day, and they see the control again. The user experience whiplashes. Volatile attributes belong in segments computed off snapshots, not in live targeting rules.
For pattern catalog and anti-patterns, see references/targeting-rule-patterns.md.
Different launches deserve different rollout strategies. Match the strategy to the risk profile.
Percentage rollout. The default for release flags. Start at 1 or 5 percent. Watch error rates, latency, and the primary success metric for at least one peak hour. If clean, ramp to 10. Watch. Ramp to 25, then 50, then 100. Total elapsed time depends on traffic, but a typical pattern is 1 percent for one day, then 10 percent for two days, then 50 for two days, then 100. Production rollouts that complete in under four hours are usually rolling too fast.
Cohort rollout. Internal employees first, then beta customers, then free tier, then paid. Each cohort is a named segment; the flag advances through them in sequence. Use this for risky launches where the blast radius matters more than rollout speed.
Geo-staged rollout. One region at a time. The default for compliance-sensitive features (data residency, regulatory disclosures, region-specific UX). Pair with the platform's region targeting capability so the rule does not have to be hand-written.
Time-based rollout. Scheduled flips at known times. Useful for marketing-coordinated launches where the toggle has to fire at a specific moment. Watch the timezone; midnight Eastern is not midnight UTC. Avoid time-based rollouts on holidays unless you have weekend on-call coverage.
The "ramp and watch" rule: at each percentage step, monitor the system for at least one peak hour before advancing. The peak hour is when the rate of any new failure mode is highest; if the system holds at peak for an hour, the next step is safe. If you ramp during off-peak and immediately advance, you have not actually tested the change.
For full strategy detail and a worked example for a checkout redesign, see references/flag-rollout-strategies.md.
Every live flag adds runtime cost: evaluation overhead, two code paths to maintain, mental overhead for engineers who read the code. One flag is fine. Two hundred dead flags is a productivity tax that compounds.
Most platforms support stale flag reporting: flags that have not been evaluated in N days. Threshold defaults: 30 days for release flags and experiment flags, 90 days for operational and permission flags. A flag that has not been evaluated in 30 days is either retired in production (no users in the targeting rule) or completely dead in code (the gating call has been removed but the flag itself was never deleted).
The cleanup playbook is mechanical:
The reason cleanup does not happen by default: removing a flag is a code change PR with review, testing, and deploy overhead. PMs do not reward it (the feature already shipped). SREs eventually get fed up with the noise. The fix is to make removal part of the launch checklist, not a separate effort. When a flag is created, the removal date is in the metadata. The launch ticket has a follow-up sub-ticket scheduled 30 days out that says "remove flag X." If the launch goes well, the sub-ticket gets done.
For the full quarterly cadence including ownership rotation for orphan flags, see references/stale-flag-cleanup-playbook.md.
Different roles need different access. Engineering can create flags broadly. Product can author targeting rules in dev and staging. Production targeting changes deserve a smaller set of approvers.
Recommended permission tiers:
Production-impacting flag changes go through review like deploys, not chat. The change request describes the rule diff, the rollout plan, and the abort criteria. An approver applies the change after review; the rule does not silently appear in production because someone clicked a button.
Audit trail: every flag change logged with actor, timestamp, before/after rule, reason. Most platforms expose this; configure it to write to the central log aggregator alongside other infrastructure events.
Environment promotion: rules tested in dev, promoted to staging, promoted to production. Authoring a production rule directly in production skips both staging tests and the review gate. Treat it as a deploy: dev → staging → production, with the same review and approval at each stage.
Emergency override: who can flip the kill switch at 3 AM? Document, test in incident drills, and confirm the override path actually works. A kill switch nobody can find is not a kill switch.
Two failure modes that come from flags interacting:
Dependency. Flag B's code path requires flag A to be on. Hard to model in most platforms; the SDK does not know about the inter-flag relationship. Avoid where possible; if you need to express it, document in the flag description and add a unit test that fails if A is off and B is on.
Conflict. Two flags target the same surface, last-write-wins. The result depends on evaluation order, which depends on SDK implementation, which depends on platform internals. Detect via shared-key audit: any two flags whose targeting rules overlap on the same code path is a conflict candidate. The platform usually does not flag this; the team has to audit.
Cross-team conflict is the dangerous version: team Alpha's release flag for the checkout redesign interacts with team Bravo's experiment on the pricing page. The interaction is invisible until production users see broken UI because both flags fired. The fix is up-front coordination via the experiment registry described in experiment-design. Discover before launch via planning, not via a 2 AM incident.
For mutually exclusive experiment groups (where the platform enforces that a user is in either test A or test B but never both), see the experiment-design skill's coverage of mutex enforcement.
A flag evaluation is a function call (cheap) plus possibly a network call to the platform (expensive). The performance question is how much you spend in the latter.
Cache aggressively. Most server-side SDKs cache flag values per request, per session, or per user; configure the cache so the network call happens once per relevant scope, not on every flag check. A request that evaluates fifty flags should hit the platform once at request start, not fifty times.
Bulk evaluation: most platforms support evaluating all flags for a user or context in one call. Use it. The latency cost of "fifty individual flag checks" is fifty times the per-check overhead; the latency cost of "one bulk evaluation that returns all fifty" is one round trip plus a fast lookup per check.
SDK selection: server-side SDKs are usually preferable to client-side for sensitive logic where latency, security, and freshness matter. Client-side SDKs are appropriate for UI-only behavior where the user can see the flag value in DevTools anyway. Mixing the two creates inconsistencies; if a user gets a different flag value on the server than the client, the UI breaks.
Latency budget: evaluating fifty flags per request should add no more than 5 ms in the typical case. If it adds more, the SDK is misconfigured (no caching, no bulk evaluation, evaluating in a hot loop). Audit before scaling.
Both branches need tests. The disabled branch is the existing production behavior; the enabled branch is the new code. Unit tests cover both with explicit flag overrides at test setup.
Test the transition. What happens to a user mid-session when their flag value changes? In most well-behaved systems, the user finishes their current request with the value they had at request start, then sees the new value on the next request. Document the assumption explicitly. If your system mid-flips a user inside a session, that is usually a bug.
Integration tests with flag overrides: most platforms support test-only flag overrides that do not affect production. Use them for end-to-end tests that exercise both branches. Without test overrides, integration tests have to share the production flag config, which means you cannot test new features that are still gated off in production.
Common failure: shipping a "wins" experiment but only the test environment had the right targeting rules. The test environment let in 100 percent of test users via a permissive rule; production rules are stricter and let in only 20 percent. The lift you measured does not translate. Catch via staged rollout: the first 1 percent in production should match the test result before you ramp further.
Flags are useful for instant rollback of the gated change. They are not a substitute for code rollback. If a bug ships in code that is not flag-gated (a fix to the existing path, an unrelated change in the same deploy), flipping the flag does nothing. Treat flag rollback as one tool among several, not as the only tool.
If flipping the flag does work, document the rollback. If it does not work, the bug is in non-flag-gated code or in flag dependencies. Diagnose accordingly; do not assume the flag failed.
Speed matters. How fast can you flip a flag in production? The target is under 60 seconds via UI or MCP, under 5 minutes via API plus auth. If your platform takes 15 minutes to propagate a flag change, that is a real incident-response constraint and the team should know.
Practice. Include flag rollback in incident drills. The first time you flip a kill switch should not be at 3 AM with revenue dropping; it should be in a drill with everyone calm.
Log flag evaluations alongside other request data. The flag value at evaluation time goes on the request log as a contextual field. When you investigate a slow request or an error, the log shows which flag values were active.
Alert on flag evaluation rate changes. If a flag's evaluation rate suddenly drops, the rule is misconfigured (no users matching) or the SDK is broken. Either way, the team should know within minutes.
Build dashboards for production rollouts. What percentage of users see treatment versus control, broken down by region, plan, device, etc. Without the dashboard, the rollout is a black box; with it, the team can answer "are we actually rolling out evenly across segments?" without writing a SQL query.
Connect to error tracking. When errors spike during a rollout, the dashboard should show whether errors correlate with the flag value. Spike on treatment branch = roll back the flag. Spike on both branches = the issue is unrelated and the flag rollback will not help.
A short pattern catalog. Detail in references/.
When designing or auditing feature flags, walk these 14 considerations:
The following reference files extend this skill:
references/flag-naming-conventions.md: typed prefixes, owner prefixes, semantic naming patterns, and the migration plan for existing badly-named flags.references/flag-lifecycle-checklist.md: birth-to-death checklist organized by lifecycle phase.references/flag-types-reference.md: the five flag types in detail with worked examples and common pitfalls.references/stale-flag-cleanup-playbook.md: quarterly cleanup process and the orphan-ownership pattern.references/targeting-rule-patterns.md: common patterns and anti-patterns for targeting rules.references/flag-rollout-strategies.md: percentage, cohort, geo, time-based, and combination strategies with a worked example.references/governance-and-permissions.md: permission tiers, environment promotion, approval workflows, and emergency override.The default disposition for any flag should be removal. Flags that earn permanent residence in the codebase (operational kill switches, permission flags tied to billing, configuration flags tied to contracts) are exceptions, not norms. Most release and experiment flags should be removed within 30 days of the launch decision.
The cost of leaving them is real: every dead flag is two code paths to maintain, two test suites to keep green, and one more line of mental overhead for the engineer who reads the file next quarter. The cost compounds across hundreds of flags.
The discipline of removing them is the discipline of caring about the codebase you will work in next quarter. It is the same discipline behind cleaning up after a deploy, removing dead imports, and writing a post-mortem within a week of the incident. None of it ships features today. All of it makes shipping features tomorrow easier.
For platform-specific patterns (which platform handles which flag type best, what the MCP commands look like, where the gotchas live), consult the chosen platform's documentation. For the experiment-design layer above this skill (hypothesis discipline, sample size, decision-making), see the experiment-design skill. For the analytical layer below this skill (variance reduction, sequential testing math, Bayesian alternatives), see the experimentation-analytics skill.