Platform Onboarding Workflow
This skill defines the structured process for onboarding services to the internal developer platform. Use it to ensure consistent, compliant service deployments.
Onboarding Phases
| Phase | Focus | Output |
|---|
| 1. Requirements | Gather service requirements | Requirements doc |
| 2. Golden Path Selection | Choose deployment pattern | Selected template |
| 3. Infrastructure Provisioning | Create service resources | Infrastructure ready |
| 4. Observability Setup | Configure monitoring | Dashboards/alerts |
| 5. Security Configuration | Apply security controls | Security validated |
| 6. Documentation | Complete service docs | Runbook ready |
| 7. Handoff | Transfer to service team | Ownership confirmed |
Phase 1: Requirements Gathering
Service Requirements Checklist
## Service Onboarding Request
**Service Name:** [name]
**Team:** [owning team]
**Requested By:** [name]
**Target Date:** YYYY-MM-DD
### Service Information
| Attribute | Value |
|-----------|-------|
| Service type | [API / Worker / Batch / Frontend] |
| Language/runtime | [Go / Node.js / Python / etc.] |
| Criticality | [Tier 1/2/3/4] |
| External traffic | [Yes / No] |
| Data sensitivity | [PII / Financial / Public] |
### Resource Requirements
| Resource | Requirement | Notes |
|----------|-------------|-------|
| CPU | [cores] | [peak/average] |
| Memory | [GB] | [peak/average] |
| Storage | [GB] | [type: SSD/HDD] |
| Database | [type] | [shared/dedicated] |
| Cache | [type] | [shared/dedicated] |
### Dependencies
| Dependency | Type | SLA Required |
|------------|------|--------------|
| [service] | Internal | [Yes/No] |
| [external] | External | [Yes/No] |
### Compliance Requirements
- [ ] SOC2
- [ ] PCI-DSS
- [ ] GDPR
- [ ] HIPAA
- [ ] Other: ____________
Phase 2: Golden Path Selection
Available Golden Paths
| Golden Path | Use Case | Includes |
|---|
| api-service | REST/GraphQL APIs | ALB, EKS, RDS, ElastiCache |
| worker-service | Background processing | SQS, EKS, auto-scaling |
| batch-job | Scheduled jobs | EventBridge, Lambda/Fargate |
| frontend-app | Static sites, SPAs | CloudFront, S3, API Gateway |
| data-pipeline | ETL, streaming | Kinesis, Glue, S3 |
Golden Path Selection Matrix
| Requirement | api-service | worker-service | batch-job |
|---|
| HTTP traffic | Yes | No | No |
| Queue processing | Optional | Yes | Optional |
| Scheduled runs | No | No | Yes |
| Real-time | Yes | Near-real-time | No |
| Auto-scaling | Yes | Yes | N/A |
Selection Template
## Golden Path Selection
**Service:** [name]
**Selected Path:** [api-service / worker-service / etc.]
### Rationale
1. Service type [X] matches [golden path] pattern
2. Traffic requirements of [X] supported by [features]
3. Compliance requirements met by built-in [controls]
### Customizations Required
| Standard Component | Customization | Reason |
|--------------------|---------------|--------|
| [component] | [change] | [why] |
### Approval
- [ ] Platform team reviewed
- [ ] Security team reviewed (if customizations)
- [ ] Architecture team reviewed (if non-standard)
Phase 3: Infrastructure Provisioning
Provisioning Checklist
Terraform/IaC Template
# Example service provisioning
module "service" {
source = "platform/service-template"
service_name = var.service_name
team = var.team
environment = var.environment
golden_path = "api-service"
# Compute
cpu_request = "500m"
memory_request = "512Mi"
replicas_min = 2
replicas_max = 10
# Database
database_enabled = true
database_class = "db.t3.medium"
# Tags
tags = {
Team = var.team
Environment = var.environment
CostCenter = var.cost_center
}
}
Provisioning Verification
# Verify namespace
kubectl get namespace [service-name]
# Verify compute
kubectl get deployment -n [service-name]
# Verify database
aws rds describe-db-instances --db-instance-identifier [service-db]
# Verify DNS
dig [service-name].internal.example.com
Phase 4: Observability Setup
Observability Checklist
Dashboard Template
Standard service dashboard includes:
| Panel | Metrics |
|---|
| Request rate | requests/sec, by status code |
| Error rate | 5xx rate, 4xx rate |
| Latency | p50, p95, p99 |
| Saturation | CPU, memory utilization |
| Dependencies | Upstream/downstream health |
Alert Configuration
| Alert | Condition | Severity | Response |
|---|
| High error rate | 5xx > 1% for 5m | Critical | Page on-call |
| High latency | p99 > 1s for 5m | Warning | Alert team |
| Low availability | uptime < 99.9% | Critical | Page on-call |
| Resource saturation | CPU > 85% for 10m | Warning | Alert team |
SLI/SLO Definition
## Service Level Objectives
**Service:** [name]
**SLO Version:** 1.0
| SLI | Target | Measurement |
|-----|--------|-------------|
| Availability | 99.9% | Successful requests / total requests |
| Latency | p99 < 500ms | Request duration percentile |
| Error rate | < 0.1% | 5xx responses / total responses |
### Error Budget
- Monthly budget: 43.2 minutes downtime
- Current consumption: [X]%
- Actions if budget exceeded: [escalation process]
Phase 5: Security Configuration
Security Checklist
Network Policy Template
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: service-policy
namespace: [service-name]
spec:
podSelector:
matchLabels:
app: [service-name]
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: istio-system
ports:
- port: 8080
egress:
- to:
- namespaceSelector:
matchLabels:
name: database
ports:
- port: 5432
Security Review
## Security Configuration Review
**Service:** [name]
**Reviewer:** @security-team
| Control | Status | Notes |
|---------|--------|-------|
| mTLS enabled | PASS | Istio strict mode |
| Network policies | PASS | Ingress/egress restricted |
| Secrets management | PASS | Using Vault |
| Least privilege IAM | PASS | Scoped to required resources |
| Vulnerability scanning | PASS | Trivy in CI/CD |
Phase 6: Documentation
Required Documentation
| Document | Purpose | Template |
|---|
| Service Overview | What the service does | README.md |
| Runbook | Operational procedures | runbook.md |
| Architecture | Design decisions | architecture.md |
| API Docs | Interface documentation | OpenAPI spec |
| On-call Guide | Incident handling | oncall.md |
Runbook Template
## [Service Name] Runbook
### Service Overview
[Brief description of what the service does]
### Quick Reference
| Item | Value |
|------|-------|
| Repository | [link] |
| Dashboard | [link] |
| Logs | [query link] |
| On-call | [PagerDuty service] |
### Common Operations
#### Restart Service
```bash
kubectl rollout restart deployment/[service] -n [namespace]
Scale Service
kubectl scale deployment/[service] -n [namespace] --replicas=X
Check Logs
kubectl logs -l app=[service] -n [namespace] --tail=100
Troubleshooting
| Symptom | Possible Cause | Resolution |
|---|
| High latency | DB connection pool | Scale DB or optimize queries |
| 5xx errors | Dependency down | Check upstream services |
| OOM kills | Memory leak | Investigate heap, restart |
Escalation
| Level | Contact | When |
|---|
| L1 | [team Slack channel] | First response |
| L2 | [on-call engineer] | Cannot resolve in 15m |
| L3 | [service owner] | Critical/extended outage |
---
## Phase 7: Handoff
### Handoff Checklist
- [ ] Service owner identified and trained
- [ ] On-call rotation set up
- [ ] Access provisioned to team
- [ ] Documentation reviewed by team
- [ ] Shadowing session completed
- [ ] Ownership officially transferred
### Handoff Template
```markdown
## Service Handoff Confirmation
**Service:** [name]
**Date:** YYYY-MM-DD
**Platform Team:** @[name]
**Service Owner:** @[name]
### Completed Items
- [x] Infrastructure provisioned and documented
- [x] Observability configured
- [x] Security controls applied
- [x] Runbook created and reviewed
- [x] On-call rotation configured
- [x] Training session completed
### Outstanding Items
| Item | Owner | Due Date |
|------|-------|----------|
| [item] | [owner] | YYYY-MM-DD |
### Acknowledgment
By signing below, the service owner confirms:
1. Receipt of all documentation
2. Understanding of operational procedures
3. Acceptance of on-call responsibilities
**Service Owner:** _________________ Date: _______
**Platform Team:** _________________ Date: _______
Anti-Rationalization Table
| Rationalization | Why It's WRONG | Required Action |
|---|
| "Skip documentation, code is self-explanatory" | On-call != developers | Complete runbook |
| "We'll add observability later" | Blind deployments = incidents | Observability on day 1 |
| "Golden path doesn't fit exactly" | Customizations add complexity | Justify every deviation |
| "Security can come later" | Later = never for security | Security from start |
| "Team can figure it out" | Assumptions cause outages | Complete handoff process |
Dispatch Specialist
For platform onboarding tasks, dispatch:
Task tool:
subagent_type: "platform-engineer"
model: "opus"
prompt: |
SERVICE ONBOARDING REQUEST
Service: [name]
Team: [team]
Type: [API/Worker/Batch]
Requirements: [summary]
Golden Path: [if known]