By breethomas
Build custom LLM-as-judge evaluators for failure modes like tone or faithfulness, evaluate RAG pipelines using Recall@k and faithfulness metrics, generate synthetic test data via dimension-based tuples for LLM pipelines, and analyze traces to judge passes, categorize failures, and prioritize fixes in AI features.
npx claudepluginhub breethomas/bette-think
Build an LLM-as-Judge evaluator for one specific failure mode. Binary pass/fail only. Use when a failure mode requires interpretation (tone, faithfulness, relevance, completeness) and cannot be checked with code. Do NOT use when the failure can be checked with regex, schema validation, or execution tests. Do NOT use before completing error analysis (/upgrade-evals).
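The binary pass/fail contract above can be sketched in a few lines. This is a minimal illustration, not the plugin's actual implementation: the prompt template and function names are assumptions, and the model call itself is left out — only prompt construction and strict verdict parsing are shown.

```python
# Hedged sketch of a binary LLM-as-judge for one failure mode.
# The prompt wording and function names here are illustrative assumptions,
# not the plugin's real prompts. The model call itself is omitted.

JUDGE_PROMPT = """You are judging ONE failure mode only: {failure_mode}.
Read the response below and answer with exactly PASS or FAIL.

Response:
{response}

Verdict:"""

def build_judge_prompt(failure_mode: str, response: str) -> str:
    """Format a single-failure-mode judge prompt."""
    return JUDGE_PROMPT.format(failure_mode=failure_mode, response=response)

def parse_verdict(raw: str) -> bool:
    """Map the judge's raw output to binary pass/fail; refuse anything else."""
    verdict = raw.strip().upper()
    if verdict.startswith("PASS"):
        return True
    if verdict.startswith("FAIL"):
        return False
    raise ValueError(f"Unparseable verdict: {raw!r}")
```

Forcing the output space to exactly PASS or FAIL (and raising on anything else) is what keeps the judge checkable — no partial-credit scales to calibrate.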
Evaluate RAG pipeline retrieval and generation quality separately. Measure Recall@k, Precision@k, MRR, NDCG@k for retrieval. Assess faithfulness and relevance for generation. Use when the AI feature uses retrieval (search, knowledge base, document QA). Do NOT use for non-RAG AI features.
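The retrieval metrics named above are standard and easy to compute from ranked document IDs plus a relevant set. A minimal sketch (binary relevance assumed for NDCG; not the plugin's code):

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs that appear in the top-k retrieved."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant doc; 0 if none retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """NDCG with binary relevance: log-discounted gain vs. ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0
```

Scoring retrieval separately like this tells you whether a bad answer came from missing context (fix the retriever) or from the model ignoring good context (fix generation).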
Create diverse synthetic test inputs using dimension-based tuple generation. Use when bootstrapping an eval dataset, when real user data is sparse, or when stress-testing specific failure hypotheses. Do NOT use when you already have 100+ representative real traces (use stratified sampling instead).
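Dimension-based tuple generation is just a cross product over a few axes of variation, sampled down to a budget. A minimal sketch with made-up dimensions (the axes and names below are illustrative, not the skill's actual schema):

```python
from itertools import product
import random

# Hypothetical dimensions for a customer-support QA feature.
DIMENSIONS = {
    "persona": ["new user", "power user", "frustrated customer"],
    "topic": ["billing", "login", "data export"],
    "phrasing": ["terse", "verbose", "typo-ridden"],
}

def generate_tuples(dimensions, n, seed=0):
    """Sample n distinct dimension combinations to seed synthetic inputs."""
    combos = list(product(*dimensions.values()))
    rng = random.Random(seed)
    sample = rng.sample(combos, min(n, len(combos)))
    return [dict(zip(dimensions.keys(), combo)) for combo in sample]
```

Each sampled tuple then becomes the scaffold for one synthetic prompt (e.g. "a frustrated customer asks a typo-ridden billing question"), which keeps the dataset diverse instead of clustering around one easy case.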
Systematic error analysis on real AI traces. Read traces, judge pass/fail, let failure categories emerge from data, compute failure rates, decide what to fix. Use when you have 50+ test cases or are seeing production failures. Do NOT use when you have fewer than 20 test cases (use /start-evals first).
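The tail end of that workflow — compute failure rates, decide what to fix — reduces to counting judged traces per emergent category. A minimal sketch under the assumption that each trace has already been judged and (if failed) labeled:

```python
from collections import Counter

def failure_rates(judged_traces):
    """judged_traces: list of (passed: bool, category: str | None) pairs.
    Returns per-category failure rates over ALL traces, most common first,
    so the biggest failure mode is the first fix candidate."""
    total = len(judged_traces)
    failures = Counter(cat for passed, cat in judged_traces if not passed)
    return {cat: count / total for cat, count in failures.most_common()}
```

Letting categories emerge from the data (open coding) rather than predefining them is the point of the skill; this helper only does the final tally.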
"Attempt the impossible in order to improve your work." — Bette Davis
PM frameworks and strategic sparring for Claude Code.
This repo is now part of Bette. Install the unified plugin to get all 57 skills, including everything in this repo.
/plugin marketplace add breethomas/bette
/plugin install bette@breethomas
30 skills and frameworks from Marty Cagan, Teresa Torres, Elena Verna, Brian Balfour, Ryan Singer, Hamel Husain, and more. Your sparring partner, not your assistant.
Top skills: strategy-session, spec, shape-up, four-risks, agency-ladder, start-evals, competitive-research, calibrate, now-next-later, growth-loops
7 agents for autonomous research and analysis.
Browse: skills/ · frameworks/ · thought-leaders/
/plugin uninstall pm-thought-partner@breethomas
/plugin marketplace add breethomas/bette
/plugin install bette@breethomas
All your skills are still there, plus 27 more.
MIT
Part of the Bette system. Fasten your seatbelts.
Strategic thinking partner for product decisions. Works through problems conversationally, challenges assumptions, helps you ship faster. Grounded in frameworks from Marty Cagan, Teresa Torres, Elena Verna, Brian Balfour, Chip Huyen, Ryan Singer, Hamel Husain, and more. Complete eval chain from first 20 test cases through error analysis, LLM judges, and RAG evaluation. Plus backlog automation with Linear/GitHub integration.
Share bugs, ideas, or general feedback.
Advanced PM skills: AI Product Canvas, Multi-Source Signal Synthesiser, Experiment Designer, Design Handoff Brief. For senior PMs working on complex or AI-powered products.
18 production-ready Claude Code skills for Product Managers. Discovery, build, measure, communicate.
16 product management skills for PMs and founders: user interviews, PRD writing, scope cutting, feature prioritization, positioning, strategy, metrics, and more.
A deterministic thinking partner that challenges assumptions and applies 150+ mental models to sharpen decisions, solve problems, and think more clearly. Features orientation detection, cognitive operations framework, and structured diagnostic workflows.
Adversarial thinking partner for founders and executives. Stress-tests plans, prepares for board meetings, navigates hard decisions, and forces honest post-mortems.