Builds and runs evaluators for AI/LLM apps using Phoenix in Python or TypeScript. Covers code/LLM judges, batch eval, experiments, datasets, validation, and production.
From awesome-copilot. Install: `npx claudepluginhub ctr26/dotfiles --plugin awesome-copilot`. This skill uses the workspace's default tool permissions.
References:
- references/axial-coding.md
- references/common-mistakes-python.md
- references/error-analysis-multi-turn.md
- references/error-analysis.md
- references/evaluate-dataframe-python.md
- references/evaluators-code-python.md
- references/evaluators-code-typescript.md
- references/evaluators-custom-templates.md
- references/evaluators-llm-python.md
- references/evaluators-llm-typescript.md
- references/evaluators-overview.md
- references/evaluators-pre-built.md
- references/evaluators-rag.md
- references/experiments-datasets-python.md
- references/experiments-datasets-typescript.md
- references/experiments-overview.md
- references/experiments-running-python.md
- references/experiments-running-typescript.md
- references/experiments-synthetic-python.md
- references/experiments-synthetic-typescript.md
Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.
- Starting Fresh: observe-tracing-setup → error-analysis → axial-coding → evaluators-overview
- Building an Evaluator: fundamentals → common-mistakes-python → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript} (code-first sketch below)
- RAG Systems: evaluators-rag → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)
- Production: production-overview → production-guardrails → production-continuous
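The code-first step starts with deterministic checks before any LLM judge. A minimal sketch of a binary code evaluator in plain Python; the citation check is a hypothetical failure mode used for illustration, not part of the Phoenix API:

```python
import re

def contains_citation(output: str) -> dict:
    """Binary code evaluator: pass only if the answer cites a source like [1]."""
    # Deterministic regex check: runs for free on every trace, no judge needed.
    has_citation = bool(re.search(r"\[\d+\]", output))
    return {"label": "pass" if has_citation else "fail", "score": int(has_citation)}

print(contains_citation("RAG reduces hallucinations [1]."))  # pass
print(contains_citation("RAG reduces hallucinations."))      # fail
```

Returning a binary label rather than a 1-5 rating matches the Binary > Likert principle below and makes results trivially aggregatable.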
| Prefix | Description |
|---|---|
| fundamentals-* | Types, scores, anti-patterns |
| observe-* | Tracing, sampling |
| error-analysis-* | Finding failures |
| axial-coding-* | Categorizing failures |
| evaluators-* | Code, LLM, RAG evaluators |
| experiments-* | Datasets, running experiments |
| validation-* | Validating evaluator accuracy against human labels |
| production-* | CI/CD, monitoring |
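For nuance that deterministic code can't capture (faithfulness, hallucination), reach for an LLM judge. A sketch using the pre-built hallucination judge from `phoenix.evals`; this assumes a recent `arize-phoenix-evals` install and an `OPENAI_API_KEY` in the environment, so verify the template and rail names against the current Phoenix docs:

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Rows to judge: the hallucination template expects input/reference/output columns.
df = pd.DataFrame([{
    "input": "What is Phoenix?",
    "reference": "Phoenix is an open-source LLM observability library.",
    "output": "Phoenix is a closed-source database.",  # unfaithful on purpose
}])

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o"),  # judge model choice is an assumption
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),  # binary label set
    provide_explanation=True,  # keep explanations for error analysis
)
print(results[["label", "explanation"]])
```

Rails constrain the judge to a fixed label set, which keeps the output binary and aggregatable rather than a free-form rating.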
| Principle | Action |
|---|---|
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |
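Validating a judge means scoring it against human labels, not trusting it by default. A sketch of the >80% TPR/TNR check in plain pandas, treating "fail" as the positive class; the column names and sample labels here are hypothetical:

```python
import pandas as pd

# Hypothetical validation set: human ground truth vs. judge predictions,
# both binary (pass/fail) per the Binary > Likert principle.
df = pd.DataFrame({
    "human": ["pass", "pass", "fail", "fail", "pass", "fail"],
    "judge": ["pass", "fail", "fail", "fail", "pass", "pass"],
})

tp = ((df.human == "fail") & (df.judge == "fail")).sum()  # real failures caught
fn = ((df.human == "fail") & (df.judge == "pass")).sum()  # failures missed
tn = ((df.human == "pass") & (df.judge == "pass")).sum()  # correct passes
fp = ((df.human == "pass") & (df.judge == "fail")).sum()  # false alarms

tpr = tp / (tp + fn)  # share of human-labeled failures the judge catches
tnr = tn / (tn + fp)  # share of human-labeled passes the judge confirms
print(f"TPR={tpr:.0%} TNR={tnr:.0%}")  # trust the judge only if both exceed 80%
```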