Guides designing Internal Developer Platforms (IDPs), building platform teams, and improving developer experience. Covers Backstage, portal design, and platform engineering principles.
npx claudepluginhub melodic-software/claude-code-plugins --plugin systems-design
Comprehensive guide to designing and building Internal Developer Platforms (IDPs) that improve developer productivity and experience.
Internal Developer Platform (IDP):
A layer on top of infrastructure that provides self-service
capabilities to development teams while maintaining governance.

┌─────────────────────────────────────────────────────────────┐
│                         DEVELOPERS                          │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐   │
│  │ Team A  │    │ Team B  │    │ Team C  │    │ Team D  │   │
│  └────┬────┘    └────┬────┘    └────┬────┘    └────┬────┘   │
│       │              │              │              │        │
│       └──────────────┴───────┬──────┴──────────────┘        │
│                              │                              │
│  ┌───────────────────────────┴───────────────────────────┐  │
│  │              INTERNAL DEVELOPER PLATFORM              │  │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐  │  │
│  │  │ Service  │ │ Template │ │  Self-   │ │  Docs &  │  │  │
│  │  │ Catalog  │ │ Library  │ │ Service  │ │ Discovery│  │  │
│  │  └──────────┘ └──────────┘ └──────────┘ └──────────┘  │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│  ┌───────────────────────────┴───────────────────────────┐  │
│  │                    INFRASTRUCTURE                     │  │
│  │ Kubernetes │ Cloud │ CI/CD │ Observability │ Security │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
Key Value Propositions:
├── Self-service: Developers can provision without tickets
├── Standardization: Consistent patterns across teams
├── Guardrails: Security and compliance built-in
├── Visibility: Centralized service catalog and docs
└── Efficiency: Reduce cognitive load on developers
Infrastructure Team (Traditional):
- Ticket-based requests
- Manual provisioning
- Bespoke solutions per team
- Ops handles deployments
- Documentation scattered
Platform Team (Modern):
- Self-service capabilities
- Automated provisioning
- Standardized templates
- Developers own deployments
- Centralized documentation
Key Shift:
"You Build It, You Run It" + "Platform Handles the How"
Service Catalog:
Centralized registry of all services with ownership, docs, and metadata.

┌─────────────────────────────────────────────────────────────┐
│                       SERVICE CATALOG                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Payment Service                                 [API] │  │
│  │ Owner: Payments Team          │ Tier: Critical        │  │
│  │ Tech: Node.js, PostgreSQL     │ Dependencies: 4       │  │
│  │ [Docs] [API Spec] [Runbook] [Alerts] [Deploy]         │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ User Service                                [Backend] │  │
│  │ Owner: Identity Team          │ Tier: High            │  │
│  │ Tech: Go, MongoDB             │ Dependencies: 2       │  │
│  │ [Docs] [API Spec] [Runbook] [Alerts] [Deploy]         │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
│  Service Metadata:                                          │
│  ├── Owner team and contacts                                │
│  ├── Technical stack                                        │
│  ├── Service tier/criticality                               │
│  ├── Dependencies (upstream/downstream)                     │
│  ├── API specifications                                     │
│  ├── Documentation links                                    │
│  ├── Deployment information                                 │
│  └── Observability dashboards                               │
└─────────────────────────────────────────────────────────────┘
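The metadata fields in the catalog could be modeled roughly as follows. This is a minimal sketch, not a Backstage API: the `ServiceEntry` class, its field names, and the `downstream_of` helper are hypothetical, and the single dependency of the payment service on the user service is assumed purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class ServiceEntry:
    """One catalog record. Field names are illustrative, not a real schema."""
    name: str
    owner: str
    tier: str                              # e.g. "critical", "high"
    tech: tuple[str, ...] = ()
    depends_on: tuple[str, ...] = ()       # upstream services this one calls
    links: dict[str, str] = field(default_factory=dict)  # docs, runbook, ...


def downstream_of(catalog: list[ServiceEntry], name: str) -> list[str]:
    """Return the services that depend on `name` (its downstream consumers)."""
    return sorted(s.name for s in catalog if name in s.depends_on)


# Two entries mirroring the example cards; the dependency edge is assumed.
catalog = [
    ServiceEntry("payment-service", owner="payments-team", tier="critical",
                 tech=("Node.js", "PostgreSQL"), depends_on=("user-service",)),
    ServiceEntry("user-service", owner="identity-team", tier="high",
                 tech=("Go", "MongoDB")),
]

print(downstream_of(catalog, "user-service"))
```

Storing only upstream dependencies and deriving the downstream view keeps each team responsible for its own record while still answering "who breaks if this service changes?".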
Template Library:
Pre-built templates for common patterns that encode best practices.

Template Categories:
├── Application Templates
│   ├── REST API (Go, Node.js, .NET, Python)
│   ├── GraphQL Service
│   ├── gRPC Service
│   ├── Event Consumer
│   ├── Scheduled Job
│   └── Frontend (React, Vue, Angular)
│
├── Infrastructure Templates
│   ├── Database (PostgreSQL, MySQL, MongoDB)
│   ├── Cache (Redis, Memcached)
│   ├── Message Queue (Kafka, RabbitMQ)
│   └── Storage (S3, GCS)
│
└── Integration Templates
    ├── Third-party API client
    ├── Authentication flow
    └── Webhook handler
Template Contents:

┌─────────────────────────────────────────────────────────────┐
│ Template: node-rest-api                                     │
├─────────────────────────────────────────────────────────────┤
│ ├── src/                  Application code                  │
│ ├── tests/                Test setup                        │
│ ├── Dockerfile            Container image                   │
│ ├── helm/                 Kubernetes deployment             │
│ ├── .github/workflows/    CI/CD pipelines                   │
│ ├── docs/                 Documentation templates           │
│ ├── catalog-info.yaml     Backstage registration            │
│ └── terraform/            Infrastructure as Code            │
│                                                             │
│ Built-in:                                                   │
│ ✓ Health checks           ✓ Structured logging              │
│ ✓ OpenTelemetry tracing   ✓ Prometheus metrics              │
│ ✓ Security headers        ✓ Input validation                │
│ ✓ Error handling          ✓ API documentation               │
└─────────────────────────────────────────────────────────────┘
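One way to keep generated repositories aligned with a template is to check that the expected paths still exist. The sketch below assumes the `node-rest-api` layout listed in the box; `REQUIRED_PATHS` and `missing_template_paths` are hypothetical names, and a real check would likely also inspect file contents.

```python
import tempfile
from pathlib import Path

# Paths taken from the node-rest-api template listing.
REQUIRED_PATHS = [
    "src",
    "tests",
    "Dockerfile",
    "helm",
    ".github/workflows",
    "docs",
    "catalog-info.yaml",
    "terraform",
]


def missing_template_paths(repo_root: Path) -> list:
    """Return the template paths that are absent under repo_root."""
    return [p for p in REQUIRED_PATHS if not (repo_root / p).exists()]


if __name__ == "__main__":
    # Demo against a throwaway directory that contains only part of the layout.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "src").mkdir()
        (root / "Dockerfile").touch()
        print("missing:", missing_template_paths(root))
```

A check like this could run in CI for services created from the template, reporting drift rather than blocking the build.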
Self-Service Capabilities:
Actions developers can perform without tickets or approvals.

┌─────────────────────────────────────────────────────────────┐
│                     SELF-SERVICE PORTAL                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Create New Service             [5 min setup, no tickets]   │
│  ├── Choose template                                        │
│  ├── Configure options                                      │
│  ├── Generate repository                                    │
│  ├── Create CI/CD pipeline                                  │
│  ├── Provision infrastructure                               │
│  └── Register in catalog                                    │
│                                                             │
│  Common Self-Service Actions:                               │
│  ┌────────────────┬────────────────┬────────────────┐       │
│  │ Environments   │ Databases      │ Secrets        │       │
│  │ ├── Create env │ ├── Provision  │ ├── Create     │       │
│  │ ├── Clone env  │ ├── Scale      │ ├── Rotate     │       │
│  │ └── Destroy    │ └── Backup     │ └── Access     │       │
│  └────────────────┴────────────────┴────────────────┘       │
│  ┌────────────────┬────────────────┬────────────────┐       │
│  │ Deployments    │ Domains        │ Access         │       │
│  │ ├── Deploy     │ ├── Request    │ ├── Request    │       │
│  │ ├── Rollback   │ ├── Configure  │ ├── Review     │       │
│  │ └── Promote    │ └── Cert       │ └── Audit      │       │
│  └────────────────┴────────────────┴────────────────┘       │
│                                                             │
│  Guardrails (automatic):                                    │
│  ✓ Security scanning       ✓ Compliance checks              │
│  ✓ Cost limits             ✓ Naming conventions             │
│  ✓ Resource quotas         ✓ Approval workflows             │
└─────────────────────────────────────────────────────────────┘
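The idea of automatic guardrails can be sketched as a set of checks that run before a self-service request is fulfilled. Everything here is an assumption for illustration: the `ProvisionRequest` shape, the naming pattern, and the quota and cost thresholds are hypothetical values, not limits taken from any real platform.

```python
import re
from dataclasses import dataclass

# Assumed policy values; a real platform would load these from configuration.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]{2,39}$")
MAX_CPU_CORES = 8
MAX_MONTHLY_COST_USD = 500.0


@dataclass
class ProvisionRequest:
    name: str
    cpu_cores: int
    estimated_monthly_cost_usd: float


def check_guardrails(req: ProvisionRequest) -> list:
    """Return the names of the guardrails the request violates."""
    violations = []
    if not NAME_PATTERN.fullmatch(req.name):
        violations.append("naming convention")
    if req.cpu_cores > MAX_CPU_CORES:
        violations.append("resource quota")
    if req.estimated_monthly_cost_usd > MAX_MONTHLY_COST_USD:
        violations.append("cost limit")
    return violations


def handle(req: ProvisionRequest) -> str:
    """Approve in-policy requests automatically; route the rest to a workflow."""
    violations = check_guardrails(req)
    if not violations:
        return "approved"
    return "needs-approval: " + ", ".join(violations)


print(handle(ProvisionRequest("orders-db", cpu_cores=2, estimated_monthly_cost_usd=120.0)))
print(handle(ProvisionRequest("Orders_DB", cpu_cores=16, estimated_monthly_cost_usd=900.0)))
```

The design choice is that a violation does not reject the request outright. It falls back to the approval workflow, so the common in-policy case stays ticket-free.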
Platform Team Types:

1. Enabling Team (Recommended Start)
   Purpose: Help stream-aligned teams adopt platform
   Size: 3-5 people
   Activities:
   ├── Pair programming with product teams
   ├── Create documentation and guides
   ├── Gather feedback and requirements
   └── Provide training and support

2. Platform Team (Mature)
   Purpose: Build and maintain the platform
   Size: 5-15 people (scale with org)
   Activities:
   ├── Build self-service capabilities
   ├── Maintain templates and tooling
   ├── Define and enforce standards
   └── Operate platform infrastructure

3. Complicated Subsystem Team (Specialized)
   Purpose: Handle complex technical domains
   Size: 3-7 people per domain
   Examples:
   ├── Data platform team
   ├── ML platform team
   └── Security platform team
Team Interaction:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│        ┌──────────────┐         ┌──────────────┐            │
│        │   Stream-    │◄───────►│   Platform   │            │
│        │ Aligned Team │ X-as-a- │     Team     │            │
│        └──────────────┘ Service └──────────────┘            │
│               │                        │                    │
│               │ Collaboration          │ Facilitation       │
│               │                        │                    │
│               ▼                        ▼                    │
│        ┌──────────────┐         ┌──────────────┐            │
│        │ Complicated  │◄───────►│   Enabling   │            │
│        │  Subsystem   │ Service │     Team     │            │
│        └──────────────┘         └──────────────┘            │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Platform Team Competencies:

Technical:
├── Kubernetes and container orchestration
├── Infrastructure as Code (Terraform, Pulumi)
├── CI/CD pipeline design
├── API design and development
├── Observability tooling
├── Security engineering
└── Cloud platforms (AWS, GCP, Azure)

Product:
├── Developer experience research
├── User journey mapping
├── Metrics and analytics
├── Documentation writing
└── Training and enablement

Organizational:
├── Stakeholder management
├── Communication skills
├── Change management
└── Technical leadership
Backstage:
Open-source developer portal framework created at Spotify.

Core Features:
├── Software Catalog (software component registry)
├── Software Templates (scaffolding)
├── TechDocs (docs-as-code)
├── Search (unified search across everything)
└── Plugins (extensible ecosystem)
Architecture:

┌─────────────────────────────────────────────────────────────┐
│                          BACKSTAGE                          │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                   Frontend (React)                    │  │
│  │  ├── Catalog UI   ├── Templates UI   ├── Plugins      │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                   Backend (Node.js)                   │  │
│  │  ├── Catalog API   ├── Auth   ├── Plugin APIs         │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                     Integrations                      │  │
│  │  ├── GitHub      ├── Kubernetes   ├── CI/CD           │  │
│  │  ├── PagerDuty   ├── Prometheus   ├── Custom          │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
Catalog Entity:

# catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-service
  description: Handles payment processing
  annotations:
    github.com/project-slug: org/payment-service
    backstage.io/techdocs-ref: dir:.
spec:
  type: service
  lifecycle: production
  owner: payments-team
  system: payments
  dependsOn:
    - component:user-service
  providesApis:
    - payment-api
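Values such as `component:user-service` and `payment-api` in the entity above are entity references, which Backstage writes in the general form `[kind:][namespace/]name`. The parser below is a simplified sketch of that format for illustration only. `EntityRef` and `parse_entity_ref` are hypothetical names, and the defaults used here (`component` kind, `default` namespace) are assumptions; in Backstage the default kind depends on the field being read.

```python
from typing import NamedTuple


class EntityRef(NamedTuple):
    kind: str
    namespace: str
    name: str


def parse_entity_ref(ref: str,
                     default_kind: str = "component",
                     default_namespace: str = "default") -> EntityRef:
    """Split '[kind:][namespace/]name' into its parts, filling in defaults."""
    kind, sep, rest = ref.partition(":")
    if not sep:                      # no explicit kind prefix
        kind, rest = default_kind, ref
    namespace, sep, name = rest.partition("/")
    if not sep:                      # no explicit namespace
        namespace, name = default_namespace, rest
    if not (kind and namespace and name):
        raise ValueError(f"malformed entity reference: {ref!r}")
    return EntityRef(kind, namespace, name)


print(parse_entity_ref("component:user-service"))
print(parse_entity_ref("payment-api", default_kind="api"))
```

Normalizing references to a full (kind, namespace, name) triple makes it straightforward to build a dependency graph across catalog files that use the short forms.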
Platform Options Comparison:
| Platform | Type | Strengths | Considerations |
|----------|------|-----------|----------------|
| Backstage | OSS | Extensible, active community | Requires customization |
| Port | Commercial | Quick setup, polished UI | Vendor lock-in |
| Cortex | Commercial | SRE focused, scorecards | Enterprise pricing |
| OpsLevel | Commercial | Service maturity | Smaller ecosystem |
| Roadie | Managed | Hosted Backstage | Less control |
Decision Factors:
├── Build vs Buy tolerance
├── Customization requirements
├── Team capacity for maintenance
├── Integration needs
├── Budget constraints
└── Timeline expectations
DORA (DevOps Research and Assessment) Metrics:
The performance bands below are approximate; the published thresholds differ between annual State of DevOps reports.

1. Deployment Frequency
   How often you deploy to production
   ├── Elite: Multiple times per day
   ├── High: Weekly to monthly
   ├── Medium: Monthly to every 6 months
   └── Low: Every 6+ months

2. Lead Time for Changes
   Time from code commit to production
   ├── Elite: < 1 hour
   ├── High: 1 day to 1 week
   ├── Medium: 1 week to 1 month
   └── Low: 1 month to 6 months

3. Mean Time to Recovery (MTTR)
   Time to recover from production failure
   ├── Elite: < 1 hour
   ├── High: < 1 day
   ├── Medium: < 1 week
   └── Low: 1 week to 1 month

4. Change Failure Rate
   Percentage of deployments causing failure
   ├── Elite: 0-15%
   ├── High: 16-30%
   ├── Medium: 31-45%
   └── Low: 46-60%
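Two of these metrics, lead time for changes and change failure rate, can be computed from a simple deployment log. The sketch below assumes a hypothetical `Deployment` record and classifies results using the bands listed above. Where the listed bands leave gaps (for example between 1 hour and 1 day), this sketch folds the gap into the next band, which is an assumption rather than part of any official definition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False


def median_lead_time(deploys: list) -> timedelta:
    """Median time from commit to production deployment."""
    return median(d.deployed_at - d.committed_at for d in deploys)


def change_failure_rate(deploys: list) -> float:
    """Percentage of deployments that caused a failure."""
    return 100.0 * sum(d.failed for d in deploys) / len(deploys)


def classify_lead_time(lead: timedelta) -> str:
    if lead < timedelta(hours=1):
        return "Elite"
    if lead <= timedelta(weeks=1):
        return "High"
    if lead <= timedelta(days=30):
        return "Medium"
    return "Low"


def classify_failure_rate(pct: float) -> str:
    if pct <= 15:
        return "Elite"
    if pct <= 30:
        return "High"
    if pct <= 45:
        return "Medium"
    return "Low"


# Illustrative data only: four deployments of changes committed at the same time.
t0 = datetime(2024, 1, 1, 9, 0)
deploys = [
    Deployment(t0, t0 + timedelta(minutes=30)),
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=3), failed=True),
    Deployment(t0, t0 + timedelta(days=1)),
]

lead = median_lead_time(deploys)
rate = change_failure_rate(deploys)
print(lead, classify_lead_time(lead))
print(rate, classify_failure_rate(rate))
```

The median is used instead of the mean so that a single slow release does not dominate the lead time figure.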
Platform Success Metrics:

Adoption:
├── % of services in catalog
├── % of teams using templates
├── Self-service usage rate
├── Portal active users
└── Template utilization

Efficiency:
├── Time to first deployment (new service)
├── Time to provision infrastructure
├── Ticket reduction rate
├── Toil automation percentage
└── Developer time saved

Satisfaction:
├── Developer NPS
├── Platform satisfaction surveys
├── Support ticket volume
├── Documentation usefulness
└── Onboarding feedback

Quality:
├── Template adoption vs custom builds
├── Security compliance rate
├── Standards adherence
└── Incident rate for platform-built services
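Several of the adoption and efficiency metrics are simple ratios. The sketch below shows one way to compute three of them; the function names, the input counts, and the choice to round to one decimal place are all assumptions, and the numbers are invented for illustration.

```python
def pct(part: float, whole: float) -> float:
    """Percentage of `part` in `whole`, rounded to one decimal (0.0 if empty)."""
    return round(100.0 * part / whole, 1) if whole else 0.0


def adoption_snapshot(total_services: int, cataloged_services: int,
                      total_teams: int, teams_using_templates: int,
                      tickets_before: int, tickets_now: int) -> dict:
    """Compute a few headline platform metrics from raw counts."""
    return {
        "services_in_catalog_pct": pct(cataloged_services, total_services),
        "teams_using_templates_pct": pct(teams_using_templates, total_teams),
        "ticket_reduction_pct": pct(tickets_before - tickets_now, tickets_before),
    }


# Invented counts for illustration only.
snapshot = adoption_snapshot(
    total_services=200, cataloged_services=120,
    total_teams=12, teams_using_templates=9,
    tickets_before=400, tickets_now=250,
)
print(snapshot)
```

Tracking the same snapshot at a fixed interval gives a trend line, which is usually more informative than any single value.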
Phase 1: Foundation (3-6 months)
├── Service catalog (inventory what exists)
├── Basic documentation site
├── Initial template (1-2 golden paths)
├── Platform team formation
└── Metrics baseline

Phase 2: Self-Service (6-12 months)
├── Template library expansion
├── Self-service provisioning
├── CI/CD standardization
├── Developer portal launch
└── Adoption campaigns

Phase 3: Optimization (12-18 months)
├── Advanced templates
├── Platform APIs
├── Automation expansion
├── Cost optimization
└── Advanced analytics

Phase 4: Ecosystem (18+ months)
├── Plugin ecosystem
├── ML/data platform integration
├── Cross-team collaboration features
├── External developer experience
└── Continuous evolution

Success Criteria Per Phase:
Phase 1: 50% service discovery complete
Phase 2: 70% of new services use templates
Phase 3: 80% self-service capability
Phase 4: Platform is indispensable
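The quantitative criteria for Phases 1 to 3 can be expressed as gates that are checked in order. This is a sketch under assumptions: the metric keys are hypothetical names, the phases are treated as strictly sequential, and Phase 4 is left out because "indispensable" is a qualitative judgment.

```python
# (phase number, assumed metric key, minimum percentage from the criteria above)
PHASE_GATES = [
    (1, "service_discovery_pct", 50.0),
    (2, "new_services_using_templates_pct", 70.0),
    (3, "self_service_capability_pct", 80.0),
]


def phases_completed(metrics: dict) -> int:
    """Return the highest phase whose gate, and every earlier gate, is met."""
    completed = 0
    for phase, key, threshold in PHASE_GATES:
        if metrics.get(key, 0.0) < threshold:
            break
        completed = phase
    return completed


# Invented measurements for illustration only.
current = {
    "service_discovery_pct": 85.0,
    "new_services_using_templates_pct": 72.0,
    "self_service_capability_pct": 40.0,
}
print(phases_completed(current))
```

Stopping at the first unmet gate reflects the assumption that later phases build on earlier ones, so strong self-service numbers do not count if catalog coverage is still low.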
Platform Anti-Patterns:

1. "Build It and They Will Come"
   ❌ Building features without user research
   ✅ Start with developer interviews and pain points

2. "One Size Fits All"
   ❌ Forcing every team into same workflow
   ✅ Provide flexibility with sensible defaults

3. "Platform as Gatekeeper"
   ❌ Adding friction and approval gates
   ✅ Enable self-service with guardrails

4. "Technical Purity"
   ❌ Choosing tech for platform team excitement
   ✅ Choose what solves developer problems

5. "Big Bang Launch"
   ❌ Building for 2 years before releasing
   ✅ Iterate quickly with early adopters

6. "Mandates Without Value"
   ❌ Forcing adoption via policy
   ✅ Make platform so good teams want to use it

7. "Documentation Afterthought"
   ❌ Minimal or outdated docs
   ✅ Treat docs as product feature

8. "Ivory Tower Platform"
   ❌ Platform team isolated from users
   ✅ Embed with product teams regularly
Platform Engineering Best Practices:

1. Treat Platform as Product
   ├── Have product owner/manager
   ├── Conduct user research
   ├── Prioritize based on impact
   └── Measure outcomes, not outputs

2. Start with Golden Paths
   ├── Identify most common use cases
   ├── Create templates for those first
   ├── Make golden path easiest choice
   └── Don't block non-golden paths

3. Optimize for Self-Service
   ├── Target <5 minutes for common tasks
   ├── Eliminate manual approvals where safe
   ├── Provide escape hatches when needed
   └── Clear error messages and guidance

4. Build Community
   ├── Developer advocates/champions
   ├── Office hours and support channels
   ├── Contribution guidelines
   └── Celebrate platform wins

5. Measure Everything
   ├── Adoption metrics
   ├── Developer satisfaction
   ├── Time savings
   └── Platform reliability

6. Iterate Rapidly
   ├── Ship early, improve often
   ├── Gather feedback continuously
   ├── Deprecate gracefully
   └── Communicate changes clearly
golden-paths - Designing standardized development workflows
self-service-infrastructure - Infrastructure self-service patterns
slo-sli-error-budget - Platform reliability targets
observability-patterns - Platform observability