Orchestrates benchmark execution comparing PopKit vs baseline Claude Code
From popkit-ops. Install with `npx claudepluginhub jrc1883/popkit-ai --plugin popkit-ops`. This skill uses the workspace's default tool permissions.
Files:
IMPLEMENTATION_STATUS.md
README.md
docs/AUTOMATION_GUIDE.md
docs/ORCHESTRATOR_ARCHITECTURE.md
docs/PROMPT_GUIDELINES.md
docs/RESPONSE_FILE_SCHEMA.md
docs/VIBE_CODED_BENCHMARK_RESULTS.md
scripts/benchmark_analyzer.py
scripts/benchmark_orchestrator.py
scripts/benchmark_runner.py
scripts/codebase_manager.py
scripts/report_generator.py
tests/test_benchmark_analyzer.py
tests/test_benchmark_orchestrator.py
tests/test_benchmark_runner.py
tests/test_benchmark_runner_recording_resolution.py
tests/test_codebase_manager.py
tests/test_integration_full_suite.py
tests/test_integration_single_task.py
tests/test_report_generator.py
Automates quantitative measurement of PopKit's value by comparing AI-assisted development with PopKit enabled vs without PopKit (baseline Claude Code). Orchestrates trials in separate windows for side-by-side observation.
# Run benchmark with 3 trials
/popkit-ops:benchmark run jwt-authentication --trials 3
# Custom task
/popkit-ops:benchmark run custom-task --trials 5 --verbose
┌─────────────────────────────────────────────────────────┐
│          Current Claude Session (Orchestrator)          │
│                                                         │
│  1. Load task definition and responses                  │
│  2. For each trial:                                     │
│     ├─ Spawn WITH PopKit window → New terminal          │
│     ├─ Spawn BASELINE window → New terminal             │
│     └─ Monitor via recording files (poll every 3s)      │
│  3. Collect all recordings when complete                │
│  4. Run statistical analysis                            │
│  5. Generate and open HTML report                       │
└─────────────────────────────────────────────────────────┘
           │                           │
           ▼                           ▼
  ┌──────────────────┐        ┌──────────────────┐
  │   WITH PopKit    │        │     BASELINE     │
  │ Terminal Window  │        │ Terminal Window  │
  │                  │        │                  │
  │   Claude Code    │        │   Claude Code    │
  │ + PopKit enabled │        │ PopKit disabled  │
  │                  │        │                  │
  │ Recording: JSON  │        │ Recording: JSONL │
  └──────────────────┘        └──────────────────┘
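The orchestrator's monitoring step can be sketched as a polling loop over the trials' recording files. This is an illustrative sketch only: the `wait_for_trials` name and the `session_end` completion marker are assumptions, not the actual protocol of benchmark_orchestrator.py.

```python
import json
import time
from pathlib import Path

def wait_for_trials(recording_paths, poll_interval=3.0, timeout=600.0):
    """Poll recording files (JSON Lines) until every trial has written a
    final 'session_end' event (hypothetical marker) or the timeout expires.

    Returns True if all trials completed, False on timeout."""
    pending = set(recording_paths)
    deadline = time.monotonic() + timeout
    while pending and time.monotonic() < deadline:
        for path in list(pending):
            p = Path(path)
            if not p.exists():
                continue  # window may not have started recording yet
            lines = p.read_text().splitlines()
            if lines:
                last_event = json.loads(lines[-1])
                if last_event.get("event") == "session_end":
                    pending.discard(path)
        if pending:
            time.sleep(poll_interval)
    return not pending
```

The loop reads only the last line of each file per poll, so the cost stays low even for long sessions.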
# User runs in current Claude session
/popkit-ops:benchmark run jwt-authentication --trials 3
# Orchestrator takes over:
🚀 PopKit Benchmark Suite
▶ Trial 1/3 WITH PopKit - Launching window...
[New terminal window opens → user sees Claude working with PopKit]
▶ Trial 1/3 BASELINE - Launching window...
[New terminal window opens → user sees vanilla Claude working]
⏳ Monitoring trials... (watch the windows work)
✓ Trial 1 WITH PopKit completed (45s)
✓ Trial 1 BASELINE completed (68s)
...
📊 Analyzing results...
📈 Generating HTML report...
🎉 Opening report in browser...
Metrics are extracted by the shared analysis modules:

routine_measurement.py - token tracking
recording_analyzer.py - tool usage breakdown
transcript_parser.py - file edit detection
recording_analyzer.py - error summary
recording_analyzer.py - performance metrics

from scipy import stats
t_statistic, p_value = stats.ttest_ind(with_popkit_values, baseline_values)
is_significant = p_value < 0.05 # p < 0.05 means statistically significant
# Calculate effect size
effect_size = cohens_d(with_popkit_values, baseline_values)
# Interpret (Cohen's conventions):
# d >= 0.2: small effect
# d >= 0.5: medium effect
# d >= 0.8: large effect
95% confidence intervals are calculated for every metric to show the range of variance across trials.
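The effect-size and confidence-interval calculations above can be sketched in a self-contained form. The actual implementation lives in benchmark_analyzer.py; the function names here are illustrative.

```python
import math
from scipy import stats

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (mean_a - mean_b) / pooled_sd

def confidence_interval_95(values):
    """95% CI for the mean, using the t distribution (appropriate for
    the small trial counts a benchmark run produces)."""
    n = len(values)
    mean = sum(values) / n
    sem = stats.sem(values)           # standard error of the mean
    margin = sem * stats.t.ppf(0.975, df=n - 1)
    return (mean - margin, mean + margin)
```

A negative d means the first group's mean is lower, e.g. WITH PopKit completing faster than baseline.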
Tasks are YAML files in packages/popkit-ops/tasks/<category>/<task-id>.yml:
id: jwt-authentication
category: feature-addition
description: Add JWT-based user authentication to Express API
codebase: demo-app-express
initial_state: git checkout baseline-v1.0
user_prompt: |
  Implement JWT authentication with:
  - POST /auth/login endpoint (username/password)
  - JWT token generation with 1-hour expiry
  - Protected middleware for authenticated routes
  - Error handling for invalid credentials
verification:
  - npm test
  - npm run lint
  - npx tsc --noEmit
expected_outcomes:
  - "/auth/login endpoint exists"
  - "Tests pass for authentication flow"
  - "Protected routes return 401 without token"
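Assuming a trial passes only when every `verification` command exits 0, the check can be sketched as below. The real logic lives in benchmark_runner.py; `run_verification` is an illustrative name, and PyYAML is assumed for parsing.

```python
import subprocess

import yaml  # PyYAML, assumed available

def run_verification(task_yaml_path, cwd):
    """Run each verification command from a task file inside the trial's
    working copy; map each command to whether it exited 0."""
    with open(task_yaml_path) as f:
        task = yaml.safe_load(f)
    results = {}
    for command in task.get("verification", []):
        proc = subprocess.run(command, shell=True, cwd=cwd,
                              capture_output=True, text=True)
        results[command] = proc.returncode == 0
    return results
```

A trial's verification passes when `all(results.values())` is true.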
Response files enable automation without user interaction (<task-id>-responses.json):
{
"responses": {
"Auth method": "JWT (jsonwebtoken library)",
"Token storage": "HTTP-only cookies (security best practice)",
"Token expiry": "1 hour (3600 seconds)",
"Error handling": "Standard HTTP status codes (401, 403, 500)"
},
"standardAutoApprove": ["install.*dependencies", "run.*tests", "commit.*changes"]
}
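The response-file protocol above can be sketched as a resolver that an automation hook might call. The matching rules here (case-insensitive substring match on response keys, regex for auto-approve, answering "yes") are assumptions for illustration, not the documented behavior of benchmark_responses.py.

```python
import json
import re

def resolve_response(prompt, responses_path):
    """Return a canned answer for a prompt, or None if the prompt
    is not covered and would need a real user."""
    with open(responses_path) as f:
        config = json.load(f)
    # Specific question/answer pairs take priority.
    for key, answer in config.get("responses", {}).items():
        if key.lower() in prompt.lower():
            return answer
    # Generic approvals fall through to a regex list.
    for pattern in config.get("standardAutoApprove", []):
        if re.search(pattern, prompt, re.IGNORECASE):
            return "yes"
    return None
```

Returning None for anything unmatched keeps the automation conservative: an unexpected question stalls the trial rather than silently approving it.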
POPKIT_RECORD=true # Enable session recording
POPKIT_BENCHMARK_MODE=true # Enable benchmark automation
POPKIT_BENCHMARK_RESPONSES=<path-to-responses.json> # Response file
POPKIT_COMMAND=benchmark-<task-id> # Command name for recording
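A runner would inject these variables into the spawned trial's environment; a minimal sketch follows, assuming the function name and merge behavior (the exact mechanics in benchmark_runner.py may differ).

```python
import os

def benchmark_env(task_id, responses_path, base_env=None):
    """Build the environment for a spawned benchmark trial by layering
    the POPKIT_* variables over the inherited environment."""
    env = dict(base_env if base_env is not None else os.environ)
    env.update({
        "POPKIT_RECORD": "true",
        "POPKIT_BENCHMARK_MODE": "true",
        "POPKIT_BENCHMARK_RESPONSES": responses_path,
        "POPKIT_COMMAND": f"benchmark-{task_id}",
    })
    return env
```

The resulting dict can be passed as the `env=` argument when spawning the trial's terminal process.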
For a benchmark to be considered valid:
benchmark_orchestrator.py - Orchestrates parallel trials in separate windows
benchmark_runner.py - Single trial execution (called by orchestrator)
benchmark_analyzer.py - Statistical analysis
codebase_manager.py - Git worktree management
report_generator.py - Markdown/HTML reports
../../../shared-py/popkit_shared/utils/recording_analyzer.py - Metrics extraction
../../../shared-py/popkit_shared/utils/routine_measurement.py - Token tracking
../../../shared-py/popkit_shared/utils/benchmark_responses.py - Automation

# Unit tests
python -m pytest packages/popkit-ops/skills/pop-benchmark-runner/tests/ -v
# Integration test
/popkit-ops:benchmark run simple-feature --trials 1