Plugin

builder-ai

AI product quality enforcement: 8 skills and 5 agents for LLM product teams

What's Inside

Skills8

Use before launching any LLM feature or when monthly API costs are growing unexpectedly. Requires token count measurement, call volume analysis, and cost projection at 10× scale. Blocks "it's cheap enough now" completions.

ai-safety-review

/ai-safety-review

Use before shipping any LLM feature that touches users. Reviews prompt injection, hallucination risk, output misuse, agentic scope, and abuse vectors. Blocks "nobody will try that" completions.

context-optimization

/context-optimization

Use when prompt cost is too high, latency is above threshold, or context window limits are being approached. Requires measurement before and after each reduction. Blocks "I shortened the prompt so it should be cheaper" completions.

eval-before-ship

/eval-before-ship

Use before merging, deploying, or demo'ing any LLM feature. Requires documented eval results — pass rate, failure analysis, baseline comparison. Blocks "it looked good when I tested it" completions.

fallback-required

/fallback-required

Use before merging any PR that adds an LLM API call. Every call must handle timeout, malformed output, low confidence, and refusal — with a defined, user-safe fallback for each. Blocks "add error handling later" completions.

Stats

Version1.0.0

ReleasedJun 8, 2026

LanguageShell

Stars2

Forks2

MaintenanceExcellent

LicenseMIT

Last CommitJul 14, 2026

AddedJun 11, 2026

Actions

View on GitHub View README Plugin Marketplace JSON Homepage

Available In

rbraga0111

builder-ai v1.0.0

Your AI assistant will skip the eval, change the prompt without versioning it, add no fallback, and ship without a safety review.

This pack makes that impossible.

Drop one folder into your project. Your AI coding assistant now enforces production standards for every LLM feature — not as suggestions it can ignore, but as gates it cannot pass without evidence.

Mac / Linux / WSL:

bash <(curl -fsSL https://raw.githubusercontent.com/RBraga01/builder-ai/master/install.sh)

Windows PowerShell:

irm https://raw.githubusercontent.com/RBraga01/builder-ai/master/install.ps1 | iex

Works on Claude Code, Codex CLI, Cursor, and OpenCode. Works alongside A Team, builder-design, builder-product, and builder-growth.

The Four Ways LLM Features Fail in Production

Every one of these has happened to a team that was confident before launch:

1. Shipped without an eval "I tested it and it looked good." The feature worked on the 8 examples you chose. It failed on 30% of real traffic. Nobody knew which prompt change caused it because there was no baseline.

2. Prompt changed, nobody noticed "Small tweak." A single instruction shifted pass rate from 89% to 72%. There was no previous version to compare against. The regression took three weeks to diagnose.

3. No fallback when the model misbehaved Timeout at peak load → blank response → support ticket at 3am. "We'll add error handling after launch." You added it at 3am.

4. Shipped without a safety review "Nobody will try that." Someone did, on day two. The injection vector was in the document upload — not the user message — and it had been there since the first commit.

builder-ai makes each of these a gate your AI assistant must pass before marking any LLM task complete.

What's in the Pack

Hard Gates — Cannot Be Skipped

Skill	What It Blocks
`eval-before-ship`	No LLM feature merges without a named eval suite, documented pass rate, failure analysis, and baseline
`prompt-versioning`	No prompt goes to production without a version file in `prompts/` and a CHANGELOG entry
`fallback-required`	No LLM call ships without tested fallback paths for timeout, malformed output, low confidence, and refusal

Workflow Skills — How to Do the Work Right

Skill	What It Enforces
`rag-pipeline-design`	Data audit + query audit before any pipeline decision — no "standard chunking" shortcuts
`model-benchmarking`	Task-specific benchmarking across three tiers before committing to a model
`context-optimization`	Measure → reduce by hierarchy → measure again — not guessing at token savings
`ai-cost-audit`	Token count + call volume + cost at 10× scale before launch, not after the billing alert
`ai-safety-review`	Four-category review with tested attack surfaces before any feature reaches users

Agents — Specialist Roles

Agent	Role	Model
`prompt-engineer`	Writes, versions, and iterates prompts with eval criteria	Sonnet
`eval-designer`	Designs evaluation suites and writes eval harnesses	Sonnet
`rag-architect`	Designs and debugs retrieval pipelines	Opus
`model-selector`	Benchmarks models and recommends the cost-optimal choice	Sonnet
`ai-safety-reviewer`	Reviews for injection, hallucination, abuse, and agentic scope	Opus

How Enforcement Works

Each hard gate defines exactly what an agent must produce — not a checklist to tick, a formatted evidence block it must fill in with real numbers.

An agent reading eval-before-ship cannot say "task complete" without producing:

Eval complete.
Suite: evals/email-classifier/test-set.jsonl — 200 examples
Model: claude-sonnet-4-6, temperature: 0.0, seed: 42
Pass rate: 178/200 = 89% (threshold: ≥ 85% ✓)
Top failure mode: format violation (12 cases — emails > 2000 tokens)
Baseline: v1 = 82% → v2 = 89%, delta: +7pp ✓
Results stored: evals/email-classifier/results-2026-06-07.md

"It looks good" does not fill that template. That is the entire point.

Each skill also lists the Rationalization Red Flags — the exact things teams say when they want to skip the gate — and explains why each one is wrong. The agent has already read the rebuttals.

builder-ai

What's Inside

builder-ai

Popularity

What's Inside

Confidence

README

builder-ai v1.0.0

The Four Ways LLM Features Fail in Production

What's in the Pack

Hard Gates — Cannot Be Skipped

Workflow Skills — How to Do the Work Right

Agents — Specialist Roles

How Enforcement Works

Similar Plugins

fullstack-dev-skills

everything-claude-code

ponytail

godot-skills

claude-md-management

nature-skills

More by RBraga01

a-team

builder-growth

builder-design

builder-product

builder-ai v1.0.0

The Four Ways LLM Features Fail in Production

What's in the Pack

Hard Gates — Cannot Be Skipped

Workflow Skills — How to Do the Work Right

Agents — Specialist Roles

How Enforcement Works

Popularity

Health & Quality

More by RBraga01

a-team

builder-growth

builder-design

builder-product

Similar Plugins

fullstack-dev-skills

everything-claude-code

ponytail

godot-skills

claude-md-management

nature-skills