Programming as Theory Building Skill
A Claude Code plugin and reusable coding-agent skill that turns code generation from prompt completion into theory-preserving engineering work.
Most coding-agent failures are not syntax failures. They are theory failures: the agent writes code that looks right, but does not understand the invariant the code protects, why the current boundary exists, where the change belongs, or what behavior proves the change is correct.
The skill is grounded in Peter Naur's paper "Programming as Theory Building" (1985). Naur's central claim is that the durable asset in programming is not only the program text, but the programmer's theory of how the program maps real-world affairs into behavior. This skill converts that idea into operational checks for coding agents: map the domain rule, explain the current shape, place the change beside the closest existing facility, and verify the behavior that matters.
The Problem
General coding agents often produce plausible files that satisfy the prompt surface while missing the program's governing invariant. For code generation, that shows up as:
- new helpers or modules that do not match the existing domain boundary,
- tests that prove the happy path but not the business rule,
- speculative abstractions added before the current problem needs them,
- readable code whose design story is hard to extend safely.
The programming-as-theory-building skill narrows the agent's behavior to the question Naur's paper makes unavoidable: what theory of the program is being preserved or extended?
The Solution
The plugin packages one Claude Code skill and one project-level CLAUDE.md guideline file. The skill asks the agent to work through these checks before non-trivial code work:
| Principle | Addresses |
|---|---|
| Rebuild the theory | Context-free patches and wrong assumptions |
| Place by similarity | Misplaced helpers, duplicated domain concepts |
| Keep changes surgical | Drive-by rewrites and unrelated cleanup |
| Avoid speculative flexibility | Bloated abstractions and unused options |
| Verify the theory | Tests that pass without proving the domain rule |
Together, these checks make the agent inspect code paths, names, tests, docs, and runtime behavior before editing. They also discourage one-off abstractions and ask for verification tied to the domain behavior, not just to code that parses and runs.
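As a rough illustration of the "Verify the theory" row, the sketch below contrasts a happy-path test with a test that pins down a domain rule. The `Inventory` class, the `InsufficientStock` exception, and the rule "stock is never oversold" are hypothetical names chosen for this example; they are not part of the plugin or of the benchmark projects.

```python
class InsufficientStock(Exception):
    """Raised when a reservation would exceed available stock."""


class Inventory:
    """Hypothetical in-memory inventory, used only for illustration."""

    def __init__(self, stock: dict[str, int]) -> None:
        self._available = dict(stock)
        self._reserved: dict[str, int] = {}

    def available(self, sku: str) -> int:
        return self._available.get(sku, 0)

    def reserve(self, sku: str, qty: int) -> None:
        if qty <= 0:
            raise ValueError("quantity must be positive")
        if qty > self.available(sku):
            # The invariant: a reservation may never exceed available stock.
            raise InsufficientStock(sku)
        self._available[sku] -= qty
        self._reserved[sku] = self._reserved.get(sku, 0) + qty

    def release(self, sku: str, qty: int) -> None:
        held = self._reserved.get(sku, 0)
        if qty <= 0 or qty > held:
            raise ValueError("cannot release more than is reserved")
        self._reserved[sku] = held - qty
        self._available[sku] += qty


def test_happy_path() -> None:
    # Passes, but says nothing about what happens at the stock boundary.
    inv = Inventory({"widget": 5})
    inv.reserve("widget", 2)
    assert inv.available("widget") == 3


def test_stock_is_never_oversold() -> None:
    # Exercises the domain rule itself: exhaust stock, then try one more unit.
    inv = Inventory({"widget": 5})
    inv.reserve("widget", 5)
    try:
        inv.reserve("widget", 1)
    except InsufficientStock:
        pass
    else:
        raise AssertionError("oversell was allowed")
    assert inv.available("widget") == 0
    # Releasing a reservation must restore exactly the released quantity.
    inv.release("widget", 2)
    assert inv.available("widget") == 2
```

Both tests pass against this sketch, but only the second would fail if the stock check in `reserve` were removed, which is the kind of evidence the skill asks for.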
Benchmark summary
The benchmark compares commerce-backend code generation across three isolated arms:
- skills_off: managed Claude Code skills disabled.
- karpathy_only: only the compact comparison-guidelines skill enabled.
- theory_only: only this Programming as Theory Building skill enabled.
Code generation used Claude Haiku through the Claude Code MODEL=haiku setting for every arm. Each generation ran in a fresh temporary workspace, and generated projects were reviewed by a separate Claude Opus review pass using benchmark-codegen-review-v1.
The copied benchmark now contains three prompt families:
- basic-commerce: the original, looser FastAPI + SQLite inventory reservation/order orchestration prompt.
- strict-production: a later, more explicit prompt that specifies endpoints, status codes, error bodies, expiration behavior, stock restoration, 401 auth behavior, and pagination semantics. This maps to benchmark/prompts/strict-commerce.md.
- strict-commerce-no-mcp: the same strict prompt run after MCP usage was disabled in the harness, also using benchmark/prompts/strict-commerce.md. It is reported separately because the execution environment changed.
Because the prompt and, for the last family, the execution environment changed between runs, the headline result is reported by prompt family rather than as one flattened average.
| Prompt family | Arm | n | Avg weighted | Functional | Executability | Test quality | Verdict summary |
|---|---|---|---|---|---|---|---|
| basic-commerce | skills_off | 40 | 71.0 | 61.4 | 68.9 | 65.8 | 12 good, 27 mixed, 1 poor |
| basic-commerce | karpathy_only | 40 | 73.9 | 63.8 | 71.0 | 70.5 | 19 good, 21 mixed |
| basic-commerce | theory_only | 40 | 77.9 | 68.6 | 78.5 | 76.1 | 27 good, 13 mixed |
| strict-production | skills_off | 19 | 80.9 | 76.6 | 74.2 | 80.3 | 4 excellent, 7 good, 8 mixed |
| strict-production | karpathy_only | 19 | 82.5 | 77.5 | 80.5 | 83.2 | 5 excellent, 5 good, 9 mixed |
| strict-production | theory_only | 20 | 83.4 | 81.8 | 77.8 | 83.8 | 4 excellent, 12 good, 4 mixed |
| strict-commerce-no-mcp | skills_off | 10 | 78.5 | 64.3 | 73.9 | 88.0 | 2 excellent, 2 good, 6 mixed |
| strict-commerce-no-mcp | karpathy_only | 9 | 84.6 | 82.8 | 83.7 | 82.9 | 3 excellent, 4 good, 2 mixed |
| strict-commerce-no-mcp | theory_only | 10 | 88.5 | 89.5 | 91.2 | 88.9 | 4 excellent, 6 good |
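To make the per-family reporting concrete, the following sketch recomputes two views from the n and Avg weighted columns of the table above: a single flattened, n-weighted average per arm, and the per-family gap between an arm and skills_off. It assumes each Avg weighted value is the mean over that row's n runs; the function names are illustrative and are not part of the benchmark harness.

```python
from collections import defaultdict

# (prompt family, arm, n, average weighted score), copied from the table above.
ROWS = [
    ("basic-commerce", "skills_off", 40, 71.0),
    ("basic-commerce", "karpathy_only", 40, 73.9),
    ("basic-commerce", "theory_only", 40, 77.9),
    ("strict-production", "skills_off", 19, 80.9),
    ("strict-production", "karpathy_only", 19, 82.5),
    ("strict-production", "theory_only", 20, 83.4),
    ("strict-commerce-no-mcp", "skills_off", 10, 78.5),
    ("strict-commerce-no-mcp", "karpathy_only", 9, 84.6),
    ("strict-commerce-no-mcp", "theory_only", 10, 88.5),
]


def flattened_average(rows, arm):
    """n-weighted mean across all prompt families for one arm."""
    selected = [(n, avg) for _, a, n, avg in rows if a == arm]
    total_n = sum(n for n, _ in selected)
    return sum(n * avg for n, avg in selected) / total_n


def per_family_delta(rows, arm, baseline="skills_off"):
    """Score gap between `arm` and `baseline` within each prompt family."""
    by_family = defaultdict(dict)
    for family, a, _, avg in rows:
        by_family[family][a] = avg
    return {
        family: round(scores[arm] - scores[baseline], 1)
        for family, scores in by_family.items()
    }


if __name__ == "__main__":
    for arm in ("skills_off", "karpathy_only", "theory_only"):
        print(arm, round(flattened_average(ROWS, arm), 1))
    print(per_family_delta(ROWS, "theory_only"))
```

The flattened number is dominated by the 40-run basic-commerce family and mixes prompts of different strictness, while the per-family gaps keep each comparison within one prompt and one execution environment.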
Interpreting the result