Engineering effectiveness metrics: DORA Four Keys (Deployment Frequency, Lead Time, Change Failure Rate, MTTR), SPACE Framework (Satisfaction, Performance, Activity, Communication, Efficiency), Goodhart's Law pitfalls, Velocity vs. Outcomes, Developer Experience measurement.
"If you can't measure it, you can't improve it" — but measuring the wrong things destroys teams. This skill covers the metrics frameworks that correlate with actual engineering effectiveness, and how to avoid turning metrics into gaming incentives.
From Google's DORA (DevOps Research and Assessment) program and its annual State of DevOps Reports (2019 onward), the four metrics that best predict software delivery performance and organizational outcomes.
What: How often does the team successfully deploy to production?
| Performance | Frequency |
|---|---|
| Elite | Multiple times per day |
| High | Once per day to once per week |
| Medium | Once per week to once per month |
| Low | Less than once per month |
Why it matters: High deployment frequency → smaller batches → lower risk → faster feedback.
How to measure:
```bash
# GitHub: count successful production deployments
gh api repos/:owner/:repo/deployments \
  --jq '[.[] | select(.environment == "production")] | length'

# Or: count merges to main as a proxy
git log --after="30 days ago" --merges --oneline main | wc -l
```
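Once you have a deployment count, mapping it onto the performance bands above can be automated. A minimal sketch, assuming a 30-day count; the numeric cutoffs are one reasonable reading of the table, not official DORA thresholds:

```python
# Map a 30-day deployment count onto the DORA bands.
# Cutoffs are illustrative interpretations of the table above.
def dora_deploy_band(deploys_per_30_days: int) -> str:
    per_day = deploys_per_30_days / 30
    if per_day > 1:
        return "Elite"   # multiple times per day
    if deploys_per_30_days >= 4:
        return "High"    # roughly daily to weekly
    if deploys_per_30_days >= 1:
        return "Medium"  # weekly to monthly
    return "Low"         # less than once per month

print(dora_deploy_band(60))  # Elite (two deploys/day)
```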
Common improvement paths:
- Automate the deployment pipeline end to end (no manual gates for routine releases)
- Adopt trunk-based development with small, frequent merges
- Use feature flags to decouple deployment from release
What: Time from first code commit to successful production deployment.
| Performance | Lead Time |
|---|---|
| Elite | < 1 hour |
| High | 1 day to 1 week |
| Medium | 1 week to 1 month |
| Low | 1 to 6 months |
How to measure:
```bash
# Approximate: time from PR creation to merge (average, in hours)
gh pr list --state=merged --json createdAt,mergedAt \
  --jq '[.[] | {duration: (((.mergedAt | fromdateiso8601) - (.createdAt | fromdateiso8601)) / 3600)}] |
        map(.duration) | add / length'

# Better: time from first commit in branch to deployment
# (requires deployment tracking in your CI/CD tool)
```
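The same PR-duration average can be computed off the CLI. A sketch assuming the ISO-8601 `createdAt`/`mergedAt` timestamps that `gh pr list --json` returns:

```python
# Average PR open-to-merge duration in hours, from gh-style timestamps.
from datetime import datetime

def avg_lead_time_hours(prs: list[dict]) -> float:
    hours = [
        (datetime.fromisoformat(p["mergedAt"].replace("Z", "+00:00"))
         - datetime.fromisoformat(p["createdAt"].replace("Z", "+00:00"))
         ).total_seconds() / 3600
        for p in prs
    ]
    return sum(hours) / len(hours)

prs = [  # hypothetical PRs
    {"createdAt": "2024-05-01T09:00:00Z", "mergedAt": "2024-05-01T15:00:00Z"},  # 6h
    {"createdAt": "2024-05-02T09:00:00Z", "mergedAt": "2024-05-03T09:00:00Z"},  # 24h
]
print(avg_lead_time_hours(prs))  # 15.0
```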
Bottlenecks by phase:
| Phase | Common Bottleneck | Fix |
|---|---|---|
| Coding → PR | Large PRs | Break into smaller PRs |
| PR open → merge | Slow reviews | SLA for reviews (e.g., <24h), PR size limit |
| Merge → deploy | Long CI pipeline | Parallelize tests, optimize Docker builds |
| Deploy → stable | Slow rollout | Automated canary, faster health checks |
What: Percentage of deployments that result in a degraded service, requiring hotfix or rollback.
| Performance | Rate |
|---|---|
| Elite | 0–15% |
| High | 16–30% |
| Medium | 16–30% (same range as High; score depends on MTTR) |
| Low | 46–60% |
How to measure:
```bash
# Manual: incidents created within N hours of a deployment
# Automated: correlate deployment timestamps with PagerDuty/OpsGenie incident creation

# Simplified: rollback rate over the measurement period
git log --after="30 days ago" --oneline | grep -iE "revert|rollback|hotfix" | wc -l
# Divide by total deployments in the same period
```
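The division step is trivial but worth pinning down, including the zero-deployment edge case. A sketch where "failed" means whatever your team tags as a hotfix, rollback, or deploy-linked incident:

```python
# Change failure rate: failed deployments over total deployments
# in the same window, as a percentage.
def change_failure_rate(failed: int, total: int) -> float:
    if total == 0:
        return 0.0  # no deployments, nothing to rate
    return failed / total * 100

print(f"{change_failure_rate(3, 40):.1f}%")  # 7.5% -> Elite band (0-15%)
```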
What: How long to recover from a service degradation.
| Performance | MTTR |
|---|---|
| Elite | < 1 hour |
| High | < 1 day |
| Medium | 1 day to 1 week |
| Low | > 1 week |
How to measure: Incident duration from your on-call system (PagerDuty, OpsGenie, Grafana OnCall).
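Given exported incident durations, the computation is a mean; many teams also track the median, since one long outage can dominate the mean. A sketch with hypothetical durations:

```python
# MTTR as the mean of incident durations (minutes) exported from
# an on-call tool; median shown alongside for skew-resistance.
from statistics import mean, median

durations_min = [12, 45, 30, 240, 18]  # hypothetical incidents

print(f"MTTR: {mean(durations_min):.0f} min")     # MTTR: 69 min
print(f"Median: {median(durations_min):.0f} min") # Median: 30 min
```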
Improvement paths:
- Invest in observability and alerting (fast detection shortens every incident)
- Write and drill runbooks for known failure modes
- Automate rollback so recovery is one command, not an ad-hoc fix
| Level | Deploy Freq | Lead Time | CFR | MTTR |
|---|---|---|---|---|
| Elite | Multiple/day | < 1h | 0–15% | < 1h |
| High | Daily/weekly | 1d–1wk | 16–30% | < 1d |
| Medium | Weekly/monthly | 1wk–1mo | 16–30% | 1d–1wk |
| Low | Monthly/less | 1–6mo | 46–60% | > 1wk |
A team's level = the lowest single-metric rating (weakest link).
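The weakest-link rule above can be expressed directly:

```python
# A team's overall DORA level is the lowest rating across the four metrics.
LEVELS = ["Low", "Medium", "High", "Elite"]  # ascending order

def overall_dora_level(ratings: dict[str, str]) -> str:
    return min(ratings.values(), key=LEVELS.index)

ratings = {
    "deploy_frequency": "Elite",
    "lead_time": "High",
    "change_failure_rate": "Medium",
    "mttr": "High",
}
print(overall_dora_level(ratings))  # Medium
```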
From "The SPACE of Developer Productivity" (Forsgren et al., GitHub and Microsoft Research, 2021). Five dimensions of developer productivity:
| Dimension | What It Measures | Example Metrics |
|---|---|---|
| Satisfaction | Wellbeing, engagement, retention | eNPS, survey scores, attrition rate |
| Performance | Outcomes achieved | Feature delivery, quality (defect rate), reliability |
| Activity | Work artifacts produced | PRs merged, commits, code reviews completed |
| Communication | Knowledge flow, collaboration | Cross-team PRs, documentation coverage, review turnaround |
| Efficiency | Flow state, low friction | Interruption rate, build time, onboarding time |
Critical SPACE insight: Never measure only Activity. A team can maximize commits while delivering zero business value.
Healthy signal: S + P improving while A stays constant = efficiency gain.
Warning signal: A increasing while S declining = burnout, unsustainable pace.
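The two trend signals above can be sketched as a small classifier over week-over-week deltas for Satisfaction (S), Performance (P), and Activity (A); the function and its rules are an illustration, not part of the SPACE framework itself:

```python
# Classify SPACE trend deltas per the two signals described above.
def space_signal(dS: float, dP: float, dA: float) -> str:
    if dS > 0 and dP > 0 and dA == 0:
        return "efficiency gain"
    if dA > 0 and dS < 0:
        return "warning: possible burnout"
    return "no clear signal"

print(space_signal(dS=0.5, dP=0.3, dA=0))   # efficiency gain
print(space_signal(dS=-0.4, dP=0.1, dA=2))  # warning: possible burnout
```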
Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure."
| Metric | How Teams Game It | Consequence |
|---|---|---|
| Deployment Frequency | Deploy config-only changes, trivial PRs | High frequency, no value delivery |
| Lead Time | Mark PRs as created late, skip code review | Fast on paper, poor quality |
| Change Failure Rate | Don't declare incidents, "it was a feature" | Hidden failures, no learning |
| MTTR | Close incidents prematurely, reopen later | Looks fast, actually slow |
What velocity is good for: Sprint planning (capacity estimation), not performance measurement.
What velocity is bad for:
- Comparing teams (story points are team-relative, not normalized)
- Performance reviews (invites point inflation)
- External commitments (estimates drift under pressure)
Better alternatives to velocity for effectiveness:
- DORA lead time and deployment frequency (delivery flow)
- Outcome metrics tied to the work (adoption, error rates, business impact)
- SPACE satisfaction and efficiency signals
From "DevEx: What Actually Drives Productivity" (Noda et al., 2023), three core factors:
Quick proxy survey questions (1–7 scale):
1. I can get into a flow state during my work (rarely 1 → often 7)
2. I feel confident that changes I make work correctly before deployment (1 → 7)
3. I understand how my work contributes to company goals (1 → 7)
4. Our development tools support my work effectively (1 → 7)
5. I feel energized by my work rather than drained (1 → 7)
Score < 4 on any item = action required.
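Applying the "score < 4 = action required" rule to collected responses is a one-liner worth automating. A sketch with hypothetical item labels and responses:

```python
# Flag survey items whose team average falls below 4 (action required).
def items_needing_action(scores: dict[str, list[int]]) -> list[str]:
    return [q for q, vals in scores.items() if sum(vals) / len(vals) < 4]

scores = {  # hypothetical responses on the 1-7 scale
    "flow_state": [5, 6, 4],
    "confidence_before_deploy": [3, 2, 4],  # avg 3.0 -> action required
    "goal_clarity": [6, 5, 7],
}
print(items_needing_action(scores))  # ['confidence_before_deploy']
```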
DORA metrics are lagging — they tell you what already happened.
| Leading Indicator | Predicts |
|---|---|
| PR size (lines of code) | Lead time (large PRs → slower review) |
| CI duration | Lead time (slow CI → slow delivery) |
| PR review turnaround | Lead time |
| Test coverage | Change failure rate |
| Incident runbook coverage | MTTR |
| Onboarding time | Team efficiency long-term |
Track leading indicators weekly to catch problems before they show in DORA.
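A weekly check like this can be scripted as a threshold sweep. The thresholds below are illustrative starting points, not DORA-defined values; tune them to your team:

```python
# Weekly leading-indicator check: report which thresholds were breached.
THRESHOLDS = {  # illustrative limits, not official values
    "median_pr_lines": 400,        # large PRs predict slow review
    "ci_minutes_p50": 15,          # slow CI predicts long lead time
    "review_turnaround_hours": 24,
}

def breached(observed: dict[str, float]) -> list[str]:
    return [k for k, limit in THRESHOLDS.items() if observed.get(k, 0) > limit]

week = {"median_pr_lines": 620, "ci_minutes_p50": 11, "review_turnaround_hours": 30}
print(breached(week))  # ['median_pr_lines', 'review_turnaround_hours']
```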
- /dora-baseline — measure current DORA baseline for your team
- /devex-survey — design and run a developer experience survey
- /engineering-review — monthly engineering health review workflow
- dora-implementation skill — technical setup for extracting DORA data from GitHub/GitLab