From design-system-ops
Benchmarks design systems against industry standards and public systems like Material Design, Polaris, and Carbon, producing qualitative comparisons across maturity dimensions with named references.
Install: `npx claudepluginhub murphytrueman/design-system-ops`

This skill uses the workspace's default tool permissions.
A skill for benchmarking a design system against industry standards and comparable public systems, producing a qualitative comparison with specific reference points that answer: "How does our system compare to what good looks like?"
Output type: Proposal only. This skill produces analysis and comparisons; it does not make changes. The output is a benchmark report with findings, comparison context, and prioritised improvement areas.
The system-health skill tells you whether your system is healthy on its own terms. But it cannot answer: "Is our token architecture actually good? What do the systems we admire look like at this layer?"
System Benchmark fills this gap. It compares your system against documented public benchmarks — published design system case studies, open-source system architectures, and industry maturity models — to give your findings context. A team that learns their token architecture is two tiers behind what mature enterprise systems typically have now has a specific target and gap to close.
This is not competitive intelligence. Design systems are not products competing in a market. This is calibration — understanding where your system sits on a maturity curve so you can prioritise investment.
Check for `.ds-ops-config.yml` in the project root:

```yaml
benchmark:
  system_type: "enterprise"   # enterprise, product, agency, government
  team_size: 5                # Full-time design system team members
  system_age_months: 24       # How long the system has been in active development
  consumer_count: 12          # Number of teams/products consuming the system
  comparison_targets:         # Specific systems to compare against (optional)
    - "Material Design"
    - "Polaris"
    - "Carbon"
```
If no configuration exists, ask for: the system type, the team size, how long the system has been in active development, the number of consuming teams or products, and any specific comparison targets.
The benchmark assesses twelve dimensions grouped into four pillars. For each dimension, assess the current state and compare it against what mature, publicly documented systems look like at that layer.
1. Token architecture maturity
What to look for: Is there a formal token system? How many tiers (flat list, two-tier, three-tier)? Is aliasing consistent? Is the format standards-compliant (DTCG)? Is validation automated? Does it support multi-brand or multi-theme?
What mature systems look like: Full three-tier architecture with consistent aliasing, DTCG-compliant format, automated validation, and multi-brand support.
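As a rough sketch of what three tiers with consistent aliasing can look like, here is a DTCG-style structure written as a TypeScript object for brevity (real DTCG files are JSON; all token names here are invented, not from any real system):

```ts
// Illustrative three-tier token structure, DTCG-style ($type/$value,
// aliases via "{path.to.token}"). Names are hypothetical.
export const tokens = {
  // Tier 1: primitives (raw values, no meaning attached)
  color: {
    "blue-500": { $type: "color", $value: "#2563eb" },
  },
  // Tier 2: semantic aliases (intent, referencing primitives)
  semantic: {
    "action-primary": { $type: "color", $value: "{color.blue-500}" },
  },
  // Tier 3: component tokens (referencing the semantic tier only)
  component: {
    "button-background": { $type: "color", $value: "{semantic.action-primary}" },
  },
};
```

The point of the tiers is that a rebrand only touches the primitive layer, while components keep referencing stable semantic names.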
2. Component API consistency
What to look for: Are prop naming conventions consistent across components? Are conventions documented? Are they enforced by linting? Are there typed API contracts?
What mature systems look like: Typed, linted, documented API contracts with automated consumer contract testing.
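As a minimal sketch of what a typed prop contract with a shared naming vocabulary can look like (component and prop names are hypothetical):

```ts
// Shared prop vocabulary: every component uses the same names for the
// same concepts, so consumers never guess between `size`, `scale`, `sz`.
type Size = "sm" | "md" | "lg";
type Variant = "primary" | "secondary" | "ghost";

interface ButtonProps {
  variant?: Variant;    // same prop name and values as Badge, Tag, ...
  size?: Size;
  disabled?: boolean;
  onPress?: () => void; // one event-naming convention, applied everywhere
}
```

Lint rules can then enforce that new components reuse this vocabulary rather than inventing synonyms.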
3. Accessibility baseline
What to look for: Are ARIA attributes present and consistent? Is there automated a11y testing in CI? Has manual testing with assistive technology been done? Are keyboard patterns documented? Are reduced motion and high contrast supported?
What mature systems look like: Full WCAG 2.1 AA compliance verified by audit. Screen reader testing in CI. Reduced motion and high contrast support built in.
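One common way to wire automated accessibility checks into CI is jest-axe over rendered output; a minimal sketch, assuming a React + Testing Library setup (`Button` and its path are placeholders):

```tsx
// CI a11y check: run axe against a rendered component and fail the
// build on detectable WCAG violations.
import { render } from "@testing-library/react";
import { axe, toHaveNoViolations } from "jest-axe";
import { Button } from "../src/Button"; // hypothetical component

expect.extend(toHaveNoViolations);

test("Button has no detectable accessibility violations", async () => {
  const { container } = render(<Button>Save</Button>);
  expect(await axe(container)).toHaveNoViolations();
});
```

Automated checks like this catch only a subset of issues, which is why the dimension also asks about manual assistive-technology testing.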
4. Component documentation completeness
What to look for: Is there documentation beyond source code? Is there a documentation site? Does it include usage guidelines, do/don't examples, and API reference? Is there an interactive playground?
What mature systems look like: Documentation site with interactive playground, usage analytics, and searchable component inventory with metadata.
5. Token documentation
What to look for: Are tokens documented beyond code? Are there visual previews? Is semantic intent described? Are there migration guides for token changes?
What mature systems look like: Token documentation with visual previews, intent descriptions, do/don't examples, and automated sync between code and docs.
6. AI readiness
What to look for: Do components have structured metadata? Are descriptions consistent and machine-optimised? Is there a machine-readable manifest? Are composition rules explicit?
What mature systems look like: AI-optimised descriptions, machine-readable composition rules, validation schemas, and components that are self-describing.
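There is no single standard for such a manifest; as one illustrative shape (every field name here is an assumption, not a spec):

```ts
// One possible shape for a machine-readable component manifest entry.
// The point is structure an agent can parse, not this exact schema.
interface ComponentManifestEntry {
  name: string;
  description: string;             // written for retrieval, not marketing copy
  props: Record<string, string[]>; // prop name -> allowed values
  composition: {
    allowedChildren: string[];     // explicit composition rules
    forbiddenAncestors?: string[];
  };
}

const buttonEntry: ComponentManifestEntry = {
  name: "Button",
  description: "Triggers an action. Use the primary variant for the main action on a screen.",
  props: { variant: ["primary", "secondary", "ghost"], size: ["sm", "md", "lg"] },
  composition: { allowedChildren: ["Icon", "Text"] },
};
```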
7. Release process maturity
What to look for: Is there a formal release process? Is semantic versioning applied? Are changelogs consistent? Is there an automated pipeline with quality gates? Are consumers notified?
What mature systems look like: Full release pipeline with canary releases, consumer impact assessment, and automated migration codemods.
8. Contribution model
What to look for: Is there a contribution process? Are there guidelines? Is there a review process with SLAs? Is the model federated with clear ownership?
What mature systems look like: Federated contribution model with automated validation, community engagement metrics, and regular office hours.
9. Deprecation discipline
What to look for: Is there a deprecation process? Are there timelines and recommended replacements? Are there migration guides and codemods? Is consumer adoption tracked?
What mature systems look like: Automated deprecation pipeline — notice, codemod, migration assistance, sunset — with consumer adoption dashboard.
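The codemod step is often a small jscodeshift transform; a minimal sketch, assuming a simple component rename (`OldButton` and `Button` are hypothetical names):

```ts
// Minimal jscodeshift transform: rewrite references to a deprecated
// component name. Real migrations usually also remap changed props
// and update import paths.
import type { API, FileInfo } from "jscodeshift";

export default function transform(file: FileInfo, api: API): string {
  const j = api.jscodeshift;
  const root = j(file.source);

  root.find(j.Identifier, { name: "OldButton" }).forEach((path) => {
    path.node.name = "Button";
  });

  return root.toSource();
}
```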
10. Adoption breadth
What to look for: What percentage of eligible products use the system? Are custom builds the exception or the norm? Do custom builds require approval?
What mature systems look like: Over 90% adoption. Off-system builds are rare and tracked. The system is the default.
11. Adoption depth
What to look for: Do teams use just basic components or the full stack (tokens, components, patterns, layout, content guidelines)? Do team extensions feed back into the system?
What mature systems look like: Full system consumption across all layers. Team extensions follow contribution guidelines and feed back into the system.
12. Developer experience
What to look for: How long does it take to get started? Are there starter templates? Is there good TypeScript support? Are error messages helpful?
What mature systems look like: CLI scaffold, framework templates, IDE plugins, hot-reload dev environment. Under 5 minutes to first component.
Ask about what cannot be observed from code: adoption figures, how governance and contribution work in practice, the team's size and history, and how consumers experience the system.
For each of the twelve dimensions, be honest about the current state: describe what you found, not what you hope is there.
Based on published case studies and documented design system characteristics, these are typical maturity profiles for different system types:
Mature enterprise (Material, Carbon, Polaris level): Strong across nearly all dimensions. Full token architecture, automated pipelines, documented governance, high adoption. Gaps tend to be in AI readiness and developer experience refinement.
Established enterprise (1–3 years, dedicated team): Foundation quality and documentation are typically functional. Governance and deprecation discipline are often the weakest layers. Adoption breadth may be high but depth is inconsistent.
Early enterprise (first year, part-time team): Token architecture and component APIs are still forming. Documentation is partial. Governance is informal. Focus should be on getting foundations right rather than chasing breadth.
Mature product (GitHub Primer, Adobe Spectrum level): Strong developer experience and API consistency. Documentation often excellent. Governance may be lighter (smaller team, less formal process needed).
Government systems (USWDS, GOV.UK level): Accessibility baseline is typically the strongest dimension. Token architecture and AI readiness vary widely. Governance is often strong due to procurement and compliance requirements.
Agency systems: Tend to be strong on developer experience (need fast onboarding) but weaker on governance and deprecation discipline (client turnover disrupts continuity).
Use these profiles to contextualise the system's current state — "your token architecture is typical for an established enterprise system" or "your governance is behind what we'd expect at this maturity level."
Based on the system's type and maturity, select 3–5 comparison targets from publicly documented systems:
Enterprise: Material Design 3 (Google), Carbon (IBM), Polaris (Shopify), Atlassian Design System, Lightning Design System (Salesforce)
Government: US Web Design System (USWDS), Australian Government Design System (AGDS), GOV.UK Design System, Canada Design System
Product: Primer (GitHub), Spectrum (Adobe), Paste (Twilio), Base Web (Uber)
Agency: Orbit (Kiwi.com), Garden (Zendesk)
For each comparison target, document (from published information) what it has at each of the twelve dimensions.
Do not fabricate data about comparison targets. Only include information that is publicly documented or observable from their open-source repositories and documentation sites.
Open with a headline sentence that tells the reader the overall state and where to focus.
Generated by: Design System Ops — system-benchmark
Date: [date]
System: [system name]
Type: [enterprise / product / government / agency]
Team size: [N] | Active since: [date/duration] | Consumers: [N teams/products]
System type: [enterprise / product / government / agency]
Maturity band: [Early / Established / Mature]
Closest comparable: [named public system at a similar stage]
Strongest pillar: [pillar name]
Weakest pillar: [pillar name]
| Pillar | Status | Summary |
|---|---|---|
| Foundation quality | ✅ Strong / ⚠️ Functional / ❌ Weak | [One sentence] |
| Documentation & discoverability | | |
| Governance & process | | |
| Adoption & impact | | |
For each of the twelve dimensions:
[Dimension name]: ✅ Strong / ⚠️ Functional / ❌ Weak / ❌ Absent
Evidence: [What was observed]
Gap: [What specifically needs to happen to reach the next level of maturity]
Comparison: [How this compares to similar systems — "Material Design has X at this layer, which your system lacks"]
| Dimension | Your system | [Target 1] | [Target 2] | [Target 3] |
|---|---|---|---|---|
| Token architecture | [Status] | [What they have] | [What they have] | [What they have] |
| API consistency | [Status] | | | |
| ... | ... | ... | ... | ... |
List the three strongest dimensions with why they are strong and what to protect.
List the three weakest dimensions with: the evidence observed, the gap to the next maturity level, and how the comparison targets handle that layer.
Based on the benchmark, map a progression path:
Current state: [Maturity band] — [key characteristics]
6-month target: Focus on [2–3 specific dimensions] — what "done" looks like for each
12-month target: Focus on [next 2–3 dimensions] — what "done" looks like for each
18-month aspiration: Approaching [comparison target] level in [specific areas]
If recurring is configured, compare against the previous benchmark and note which dimensions have moved.
End the report with:
A note on context: This benchmark compares your system against structural best practices — it does not see the constraints, priorities, or trade-offs that shaped it. Some dimensions may be intentionally deprioritised at your current maturity stage. If any finding flags a known, accepted gap, let me know — I'll calibrate future benchmarks to your system's actual priorities. The goal is to highlight opportunities, not to penalise deliberate focus.