By RBraga01
AI product quality enforcement: 8 skills and 5 agents for LLM product teams
npx claudepluginhub rbraga01/a-team --plugin builder-ai
ai-cost-audit: Use before launching any LLM feature or when monthly API costs are growing unexpectedly. Requires token count measurement, call volume analysis, and cost projection at 10× scale. Blocks "it's cheap enough now" completions.
ai-safety-review: Use before shipping any LLM feature that touches users. Reviews prompt injection, hallucination risk, output misuse, agentic scope, and abuse vectors. Blocks "nobody will try that" completions.
context-optimization: Use when prompt cost is too high, latency is above threshold, or context window limits are being approached. Requires measurement before and after each reduction. Blocks "I shortened the prompt so it should be cheaper" completions.
eval-before-ship: Use before merging, deploying, or demoing any LLM feature. Requires documented eval results: pass rate, failure analysis, and baseline comparison. Blocks "it looked good when I tested it" completions.
fallback-required: Use before merging any PR that adds an LLM API call. Every call must handle timeout, malformed output, low confidence, and refusal, with a defined, user-safe fallback for each. Blocks "add error handling later" completions.
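The four failure cases named above (timeout, malformed output, low confidence, refusal) can be handled in one wrapper. The following is a minimal sketch under stated assumptions, not builder-ai's own implementation: `call_model`, `LLMResult`, the JSON shape with `label` and `confidence` fields, the 0.7 threshold, and the refusal marker strings are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical values for illustration; tune them per feature.
FALLBACK_LABEL = "needs_human_review"
MIN_CONFIDENCE = 0.7
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")


@dataclass
class LLMResult:
    label: str
    source: str  # "model", or the name of the fallback path taken


def classify_with_fallback(call_model: Callable[[str], str], text: str) -> LLMResult:
    """Call the model and return a user-safe result for every failure mode.

    `call_model` is a hypothetical client function that returns the raw
    model output as a string and raises TimeoutError on timeout.
    """
    try:
        raw = call_model(text)
    except TimeoutError:
        return LLMResult(FALLBACK_LABEL, "fallback:timeout")

    # Refusal: the model answered, but declined the task.
    if any(marker in raw.lower() for marker in REFUSAL_MARKERS):
        return LLMResult(FALLBACK_LABEL, "fallback:refusal")

    # Malformed output: not JSON, or missing the fields we rely on.
    try:
        parsed = json.loads(raw)
        label = str(parsed["label"])
        confidence = float(parsed["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return LLMResult(FALLBACK_LABEL, "fallback:malformed")

    # Low confidence: well-formed, but not trustworthy enough to show.
    if confidence < MIN_CONFIDENCE:
        return LLMResult(FALLBACK_LABEL, "fallback:low_confidence")

    return LLMResult(label, "model")
```

Recording which path produced the result (`source`) makes it possible to write one test per fallback, which is what "tested fallback paths" would require in practice.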
Your AI assistant will skip the eval, change the prompt without versioning it, add no fallback, and ship without a safety review.
This pack makes that impossible.
Drop one folder into your project. Your AI coding assistant now enforces production standards for every LLM feature — not as suggestions it can ignore, but as gates it cannot pass without evidence.
Mac / Linux / WSL:
bash <(curl -fsSL https://raw.githubusercontent.com/RBraga01/builder-ai/master/install.sh)
Windows PowerShell:
irm https://raw.githubusercontent.com/RBraga01/builder-ai/master/install.ps1 | iex
Works on Claude Code, Codex CLI, Cursor, and OpenCode. Works alongside A Team, builder-design, builder-product, and builder-growth.
Every one of these has happened to a team that was confident before launch:
1. Shipped without an eval. "I tested it and it looked good." The feature worked on the 8 examples you chose. It failed on 30% of real traffic. Nobody knew which prompt change caused it because there was no baseline.
2. Prompt changed, nobody noticed. "Small tweak." A single instruction shifted pass rate from 89% to 72%. There was no previous version to compare against. The regression took three weeks to diagnose.
3. No fallback when the model misbehaved. Timeout at peak load → blank response → support ticket at 3am. "We'll add error handling after launch." You added it at 3am.
4. Shipped without a safety review. "Nobody will try that." Someone did, on day two. The injection vector was in the document upload, not the user message, and it had been there since the first commit.
builder-ai makes each of these a gate your AI assistant must pass before marking any LLM task complete.
| Skill | What It Blocks |
|---|---|
| eval-before-ship | No LLM feature merges without a named eval suite, documented pass rate, failure analysis, and baseline |
| prompt-versioning | No prompt goes to production without a version file in prompts/ and a CHANGELOG entry |
| fallback-required | No LLM call ships without tested fallback paths for timeout, malformed output, low confidence, and refusal |
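
A gate like prompt-versioning comes down to a file check. Here is a minimal sketch, assuming a layout the README does not spell out: prompt files named `prompts/<name>.v<N>.md` and a `prompts/CHANGELOG.md` that mentions each version file by name. The naming scheme, the CHANGELOG location, and the function `check_prompt_versioned` are all hypothetical.

```python
from pathlib import Path


def check_prompt_versioned(repo_root: Path, prompt_name: str, version: int) -> list[str]:
    """Return a list of problems; an empty list means the gate passes.

    Assumed (hypothetical) layout:
      prompts/<prompt_name>.v<version>.md   the versioned prompt
      prompts/CHANGELOG.md                  mentions that file name
    """
    problems = []
    prompts_dir = repo_root / "prompts"
    version_file = prompts_dir / f"{prompt_name}.v{version}.md"
    changelog = prompts_dir / "CHANGELOG.md"

    if not version_file.is_file():
        problems.append(f"missing version file: {version_file.name}")

    if not changelog.is_file():
        problems.append("missing prompts/CHANGELOG.md")
    elif version_file.name not in changelog.read_text(encoding="utf-8"):
        problems.append(f"no CHANGELOG entry for {version_file.name}")

    return problems
```

Returning the list of problems, rather than a bare boolean, gives the agent something concrete to fix before it can retry the gate.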
| Skill | What It Enforces |
|---|---|
| rag-pipeline-design | Data audit + query audit before any pipeline decision; no "standard chunking" shortcuts |
| model-benchmarking | Task-specific benchmarking across three tiers before committing to a model |
| context-optimization | Measure → reduce by hierarchy → measure again, rather than guessing at token savings |
| ai-cost-audit | Token count + call volume + cost at 10× scale before launch, not after the billing alert |
| ai-safety-review | Four-category review with tested attack surfaces before any feature reaches users |
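
The ai-cost-audit row combines three measurements into one projection. The sketch below shows the arithmetic only. `CallProfile` and `monthly_cost_usd` are hypothetical names, and the token counts, call volume, and per-million-token prices are placeholder numbers, not any provider's real pricing.

```python
from dataclasses import dataclass


@dataclass
class CallProfile:
    input_tokens: int        # measured average per call
    output_tokens: int       # measured average per call
    calls_per_month: int     # measured or forecast volume
    usd_per_m_input: float   # placeholder price per 1M input tokens
    usd_per_m_output: float  # placeholder price per 1M output tokens


def monthly_cost_usd(p: CallProfile, scale: int = 1) -> float:
    """Projected monthly spend at `scale` times the current call volume."""
    cost_per_call_micro_usd = (
        p.input_tokens * p.usd_per_m_input
        + p.output_tokens * p.usd_per_m_output
    )
    # Prices are per million tokens, so divide once at the end.
    return cost_per_call_micro_usd * p.calls_per_month * scale / 1_000_000


# Illustrative numbers only; substitute measured token counts and
# your provider's current price sheet.
profile = CallProfile(
    input_tokens=1_500,
    output_tokens=300,
    calls_per_month=100_000,
    usd_per_m_input=3.0,
    usd_per_m_output=15.0,
)
print(f"now: ${monthly_cost_usd(profile):,.2f}/month")
print(f"10x: ${monthly_cost_usd(profile, scale=10):,.2f}/month")
```

Running the projection at `scale=10` before launch is the point of the gate: a per-call cost that looks negligible today is multiplied by volume growth, not added to it.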
| Agent | Role | Model |
|---|---|---|
| prompt-engineer | Writes, versions, and iterates prompts with eval criteria | Sonnet |
| eval-designer | Designs evaluation suites and writes eval harnesses | Sonnet |
| rag-architect | Designs and debugs retrieval pipelines | Opus |
| model-selector | Benchmarks models and recommends the cost-optimal choice | Sonnet |
| ai-safety-reviewer | Reviews for injection, hallucination, abuse, and agentic scope | Opus |
Each hard gate defines exactly what an agent must produce: not a checklist to tick, but a formatted evidence block it must fill in with real numbers.
An agent reading eval-before-ship cannot say "task complete" without producing:
Eval complete.
Suite: evals/email-classifier/test-set.jsonl — 200 examples
Model: claude-sonnet-4-6, temperature: 0.0, seed: 42
Pass rate: 178/200 = 89% (threshold: ≥ 85% ✓)
Top failure mode: format violation (12 cases — emails > 2000 tokens)
Baseline: v1 = 82% → v2 = 89%, delta: +7pp ✓
Results stored: evals/email-classifier/results-2026-06-07.md
"It looks good" does not fill that template. That is the entire point.
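
The numbers in that evidence block can be checked mechanically. Here is a minimal sketch of such a check. `EvalEvidence` and `gate` are hypothetical names, the fields mirror the template above rather than any schema builder-ai defines, and the rule that the pass rate must not fall below the baseline is an assumption.

```python
from dataclasses import dataclass


@dataclass
class EvalEvidence:
    suite_path: str
    passed: int
    total: int
    threshold_pct: float
    baseline_pct: float
    top_failure_mode: str


def gate(e: EvalEvidence) -> tuple[bool, str]:
    """Return (passes, summary). Fails on missing evidence or a missed bar."""
    if e.total <= 0 or not e.suite_path or not e.top_failure_mode:
        return False, "incomplete evidence"
    pass_pct = 100 * e.passed / e.total
    delta_pp = pass_pct - e.baseline_pct
    # Assumption: a regression against the baseline also blocks the gate.
    ok = pass_pct >= e.threshold_pct and delta_pp >= 0
    summary = (
        f"Pass rate: {e.passed}/{e.total} = {pass_pct:.0f}% "
        f"(threshold: >= {e.threshold_pct:.0f}%), "
        f"delta vs baseline: {delta_pp:+.0f}pp"
    )
    return ok, summary


# The numbers from the evidence block above.
evidence = EvalEvidence(
    suite_path="evals/email-classifier/test-set.jsonl",
    passed=178,
    total=200,
    threshold_pct=85,
    baseline_pct=82,
    top_failure_mode="format violation",
)
print(gate(evidence))
```

With these inputs the check passes (89% against an 85% threshold, +7pp over baseline). The regression described earlier, 72% against an 89% baseline, would fail on both conditions.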
Each skill also lists the Rationalization Red Flags — the exact things teams say when they want to skip the gate — and explains why each one is wrong. The agent has already read the rebuttals.