Skills for behavioral evaluation of LLMs using Petri and Bloom
```shell
npx claudepluginhub k3nnethfrancis/machine-psychology-fieldkit
```

A Claude Code plugin with skills for behavioral evaluation of LLMs using Petri and Bloom.
Install from the marketplace:

```shell
# Add the repo as a marketplace
claude plugin marketplace add https://github.com/k3nnethfrancis/machine-psychology-fieldkit
# Install the plugin
claude plugin install machine-psychology-fieldkit
```

Or run from a local clone:

```shell
# Clone the repo
git clone https://github.com/k3nnethfrancis/machine-psychology-fieldkit.git
# Run Claude Code with the plugin directory
claude --plugin-dir /path/to/machine-psychology-fieldkit
```

Verify the installation:

```shell
claude plugin list
```

You should see `machine-psychology-fieldkit` in the list.
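The check above can also be scripted, for example in a setup script. This is a minimal sketch, assuming `claude plugin list` prints the plugin name verbatim somewhere in its output (the exact output format is not specified here). `plugin_listed` is a hypothetical helper name, not part of the plugin or the CLI.

```shell
# Hypothetical helper: succeeds if the plugin name appears on stdin.
# Assumes the name is printed verbatim somewhere in the listing.
plugin_listed() {
  grep -q -F "machine-psychology-fieldkit"
}

# Usage (requires the claude CLI on PATH):
#   claude plugin list | plugin_listed && echo "plugin installed"
```

Reading from stdin keeps the helper independent of the CLI, so it can be exercised against saved output as well.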
```shell
# Clone both repos
git clone https://github.com/anthropics/petri.git
git clone https://github.com/anthropics/bloom.git
```

Set up Petri:

```shell
cd petri
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -e ".[dev]"
# Set API key
export ANTHROPIC_API_KEY="your-key-here"
```

Set up Bloom, starting from the directory that contains the clone:

```shell
cd bloom
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -e .
# Set API key
export ANTHROPIC_API_KEY="your-key-here"
```
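Both setups depend on `ANTHROPIC_API_KEY` being exported, so it can help to check for it before starting a run. The following is a minimal sketch of such a pre-flight check. `check_api_key` is a hypothetical helper, not something Petri or Bloom provides, and treating the `your-key-here` placeholder as invalid is an assumption made for illustration.

```shell
# Hypothetical pre-flight check: fail early if the key is missing or
# still set to the placeholder value from the setup instructions.
check_api_key() {
  if [ -z "${ANTHROPIC_API_KEY:-}" ] || [ "$ANTHROPIC_API_KEY" = "your-key-here" ]; then
    echo "ANTHROPIC_API_KEY is not set to a real key" >&2
    return 1
  fi
}

# Usage:
#   check_api_key && echo "key found"
```

The function only tests that a non-placeholder value is present; it does not verify that the key is accepted by the API.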
Run adversarial audits with Petri. The skill helps you:
Quick start:
```shell
cd petri
inspect eval src/petri/tasks/petri.py --model anthropic/claude-sonnet-4-20250514
```
Generate evaluation scenarios with Bloom. The skill helps you:
Quick start:
```shell
cd bloom
python -m bloom.run --config configs/your_config.yaml
```
Once installed, Claude Code automatically activates these skills when you're working on behavioral evaluation tasks. You can also invoke them directly by typing /petri-collaborator or /bloom-collaborator.
| Use Case | Tool |
|---|---|
| Broad audit across 36 dimensions | Petri |
| Test a specific behavior hypothesis | Bloom |
| Compare models on standard battery | Petri |
| Measure robustness across framings | Bloom |
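For the "compare models" row, one option is to repeat the Petri quick-start command once per model. This is a minimal sketch that only prints the commands (a dry run) and does not execute them. `petri_commands` is a hypothetical helper, and the model identifiers are whatever you pass in; whether this is how Petri intends model comparisons to be run is an assumption.

```shell
# Hypothetical helper: print one Petri quick-start command per model.
# Nothing is executed; the output is plain text for review.
petri_commands() {
  for model in "$@"; do
    printf 'inspect eval src/petri/tasks/petri.py --model %s\n' "$model"
  done
}

# Usage, from inside the petri directory:
#   petri_commands anthropic/claude-sonnet-4-20250514
```

Printing the commands instead of running them lets you review the list before spending API credits.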
License: MIT