From probabl-skills
Audits ML experiments by executing read-only bare-expression Python files with skore reports and streaming a markdown digest. Triggered after experiment iteration, on user request, or after re-runs.
npx claudepluginhub probabl-ai/skills --plugin probabl-skills
Slash command: /probabl-skills:audit-ml-pipeline
Per-experiment, human-readable, agent-executable narrative of a skore
report — produced by executing a bare-expression # %% file and
reading the digest. Read-only against the skore Project.
| Came here from… | After audit, next is… |
|---|---|
| `iterate-ml-experiment` § 4 record-outcome | → Read audit digest, fill Status block + JOURNAL row + `overview/summary.md` |
| User free-text ("audit 02", "re-audit 04") | → Surface metrics to the user; no further dispatch |
| Re-run of an existing experiment | → Re-execute the existing audit file; surface diff if metrics changed |
The audit is dispatched FIRST in § 4, before any scratch probes.
The digest carries the checks summary and the metrics summary — it
replaces ad-hoc scratch/<ts>_inspect_*.py files for the metric
extraction step.
| Path | Durability | Who writes it | What it holds |
|---|---|---|---|
| `audit/<NN>_<short_name>.py` | Durable (in git) | This skill, once per experiment | The bare-expression cells. Source of truth. Can be opened as a notebook in JupyterLab / VS Code for the rich HTML view |
| `scratch/audit/<stem>/audit.md` | Ephemeral (gitignored), optional | `run_audit.py` when given a 2nd arg | Per-cell markdown digest: source + stdout + last-expression repr. Same content as stdout |
| Stdout from `run_audit.py` | Captured by the bash tool | `run_audit.py` (always) | Streamed digest; the agent reads this directly from the tool output |
Mnemonic: audit/ is source (in git); scratch/audit/ and
stdout are output. Never put the source .py under
scratch/audit/. Never commit anything under scratch/audit/.
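The source/output split above can be checked mechanically. The following is a minimal sketch, not part of the skill: the helper name `digest_path_for` and the stem pattern (two digits, underscore, lowercase snake case) are assumptions made for illustration.

```python
import re
from pathlib import PurePosixPath

# Assumed stem shape: two digits, underscore, snake_case name
# (e.g. "02_target_transform"). The skill itself does not define this regex.
STEM_RE = re.compile(r"^\d{2}_[a-z0-9_]+$")


def digest_path_for(source: str) -> str:
    """Map audit/<stem>.py to scratch/audit/<stem>/audit.md.

    Raises ValueError when the source is not directly under audit/,
    for instance when it was dropped under scratch/audit/ by mistake.
    """
    path = PurePosixPath(source)
    if len(path.parts) != 2 or path.parts[0] != "audit" or path.suffix != ".py":
        raise ValueError(f"audit source must live directly under audit/: {source}")
    if not STEM_RE.match(path.stem):
        raise ValueError(f"unexpected stem: {path.stem}")
    return str(PurePosixPath("scratch", "audit", path.stem, "audit.md"))
```

A source path under `scratch/audit/` is rejected, which mirrors the "never put the source .py under scratch/audit/" rule.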
The central rule. Surfaced as the first Stop condition below.
Allowed in `audit/<stem>.py`:
- `skore.Project(...)`: open the project this experiment wrote to.
- `project.summarize()`: list (key, id) pairs.
- `project.get(id)`: load a specific report by id.
- Any `report.*` accessor.
- Imports from `<pkg>` (read-only inspection).

Forbidden in `audit/<stem>.py`:
- `skore.evaluate(...)`: duplicates the report under the same key and pollutes `summarize()`.
- `project.put(...)`: same.
- Writes outside `scratch/audit/<stem>/`: no `data/` writes, no `reports/` writes, no edits to `src/<pkg>/`. The audit is a viewer.
- Any mutation of `report` that survives the cell (e.g. monkey-patching skore symbols).

The runner renders every cell's source + last-expression repr + stdout to the digest. A forbidden call surfaces in the digest (as a put row in a later `summarize()` cell, or as an **error:** section). The contract is visible, not invisible.
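One way to confirm the read-only contract on a drafted file is a static scan. This is a sketch under stated assumptions: `find_forbidden_calls` is a hypothetical helper, and matching on the bare attribute name is deliberately coarse (any `*.evaluate(...)` or `*.put(...)` call is flagged, whatever object it is called on).

```python
import ast

# Attribute names treated as forbidden inside an audit file.
FORBIDDEN_CALLS = {"evaluate", "put"}


def find_forbidden_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, call name) for every forbidden call in the source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr in FORBIDDEN_CALLS:
                hits.append((node.lineno, node.func.attr))
    return sorted(hits)


# The snippets are only parsed, never executed, so skore is not imported here.
clean = "project = skore.Project(...)\nsummary = project.summarize()\nsummary\n"
dirty = clean + "project.put('02_target_transform', report)\n"

clean_hits = find_forbidden_calls(clean)
dirty_hits = find_forbidden_calls(dirty)
```

Parsing instead of grepping avoids false hits on comments and strings that merely mention `put`.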
Sibling read-only consumers (different output shapes, same
discipline): scratch/<ts>_*.py probes, iterate-from-skore's
Backlog enrichment walk. See evaluate-ml-pipeline § Stop
conditions for the three-consumer rule.
- No `skore.evaluate(...)` or `project.put(...)` in an audit file.
- `project.get(...)` is by id, not key. For hub mode, read the id from the URL printed by `project.put()`: `https://…/<workspace>/<project>/<type-plural>/<N>` → id is `skore:report:<type-singular>:<N>` (the URL segment is plural; the id uses the singular: drop the trailing s, e.g. cross-validations → cross-validation, estimators → estimator). Hardcode `REPORT_ID` in the audit file; no `summarize()` traversal is needed. For local mode, read the "id" column of `project.summarize()` for the matching key row. A KeyError from `get("<stem>")` means the lookup shape is wrong (get is by id), not that the report is missing.
- Every skore / skrub / sklearn symbol must come from python-api this turn. Cache hits under `scratch/api/skore/<version>/` count (Shape 0); inline memory does not.
- If ipython / pyright aren't importable, do NOT fabricate audit outputs by writing `print()` calls as a workaround. Do NOT type `pixi add ...` / `uv add ...` yourself: install is owned by python-env-manager § Agent feature. Request via G-AGENT-FEATURE (binary: install / skip); resume only when python-env-manager returns "ready".
- No `print()`. The runner captures each cell's last bare expression via `result.result` and renders its repr. Wrapping in `print(repr(...))` lands the value in stdout instead of the output section, which is mixed and harder to scan. Use bare expressions; statement-only cells (variable binding) are fine.
- No `audit_NN_<short_name>_v2.py`. When an experiment is re-run, the audit file is overwritten in place: same stem, same audit.
- The digest goes into `scratch/audit/<stem>/`, NOT into `audit/`. The durable artifact is `audit/<stem>.py`; the rendered digest is ephemeral.
- `audit/` is read-only against workspace data. No writes to `data/`, `reports/`, or outside `scratch/audit/<stem>/`.
- No `warnings.filterwarnings(...)` unless the user explicitly asks: the runner streams cell stderr into the digest and that's signal. See python-code-style § Stop conditions.

| Shortcut | Why it's wrong |
|---|---|
| `report = project.get(REPORT_ID); print(repr(report))` | Runner captures bare expressions via `result.result`, not stdout. `print(repr(...))` mixes stdout and output sections. Use `report` on its own line |
| Drop `.frame()` from `report.checks.summarize()` / `report.metrics.summarize()` | `__repr__` of the Display objects is `<…Display at 0x…>`. `.frame()` returns a DataFrame whose repr carries the actual values |
| `project.get(KEY)` raised KeyError → re-run evaluate + put "to refresh" | Lookup shape is wrong (get is by id, not key). Hub: read the id from the URL printed by `put()`. Local: read `summary["id"]` for the matching key row. Never re-run evaluate + put to recover |
| Write `pixi add --feature agent ipython pyright` directly from this skill | Install commands owned by python-env-manager. This skill requests via G-AGENT-FEATURE; it does not install |
| Dump the audit .py into `scratch/audit/<stem>/` | .py is durable in git; `scratch/` is gitignored. Source in `audit/`; digest in `scratch/audit/<stem>/` |
| Register a Jupyter kernel "to be safe" | Current runner is in-process; no kernel. Registering creates an orphan kernelspec |
| Add a fix-up cell that mutates `data/` or `reports/` | Audit files are read-only. State mutations belong in a `scratch/<ts>_*.py` probe or the experiment script |
| Substitute `<SKORE_PROJECT_INIT>` in `audit/<stem>.py` without reading `experiments/<stem>.py` first | Audit must open the same Project. Always Read `experiments/<stem>.py` this turn and copy the literal Project init block byte-identical (modulo formatting) |
| Hub mode: put `skore.login(mode="hub")` after `skore.Project(...)` | `Project(...)` constructor authenticates at init time; without prior login, it fails. Order is fixed: login first, Project second |
| § 4 dispatched audit → write scratch probe first to "double-check metrics" | The audit IS the metric-extraction step in § 4. Scratch probes for metrics are the anti-pattern this dispatch replaces |
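The bare-expression rows above are easier to see with a toy cell executor. This is a simplified stand-in, not the code in `run_audit.py`: the function `run_cell` is hypothetical and only illustrates why a trailing bare expression yields a value while `print(repr(...))` yields stdout text and no value.

```python
import ast
import contextlib
import io


def run_cell(source: str, namespace: dict) -> tuple[str, object]:
    """Execute one cell; return (captured stdout, value of the trailing bare expression)."""
    tree = ast.parse(source)
    last_value = None
    stdout = io.StringIO()
    with contextlib.redirect_stdout(stdout):
        if tree.body and isinstance(tree.body[-1], ast.Expr):
            # Split off the trailing expression so its value can be captured.
            last = ast.Expression(tree.body.pop().value)
            exec(compile(tree, "<cell>", "exec"), namespace)
            last_value = eval(compile(last, "<cell>", "eval"), namespace)
        else:
            # Statement-only cell (e.g. a variable binding): nothing to capture.
            exec(compile(tree, "<cell>", "exec"), namespace)
    return stdout.getvalue(), last_value


ns: dict = {}
bare_out, bare_value = run_cell("x = [1, 2]\nx", ns)
print_out, print_value = run_cell("print(repr(x))", ns)
```

In the second call the value ends up in the stdout string and the captured result is `None`, which is the mixing the table warns about.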
Pre-flight (audit-ml-pipeline):
- [ ] Experiment stem confirmed: <NN_short_name>
Evidence: journal/NN_<short_name>.md exists AND state ≥ done
| "n/a — user invoked re-audit on existing stem"
- [ ] Four-way pairing complete:
journal/NN_<short_name>.md — design note (state ≥ done)
experiments/NN_<short_name>.py — script
tests/smoke/test_NN_<short_name>.py — smoke test (passing)
audit/NN_<short_name>.py — about to be written / refreshed
Evidence: ls / Glob on each path
- [ ] Report present in skore Project under key=<NN_short_name>
Evidence: scratch/<ts>_check_report.py probe ran
project.summarize() this turn; row with
key == "<NN_short_name>" appears.
"Run finished, put() landed" is NOT sufficient.
- [ ] Agent feature available:
`pixi run -e agent ipython -c "print(0)"` exit 0
`pixi run -e agent pyright --version` exit 0
Evidence: tool output of each
| JOURNAL.md Status `agent feature: installed`
Missing → STOP, delegate to python-env-manager G-AGENT-FEATURE
- [ ] python-api consulted for skore symbols used:
Project, summarize, get, report.checks.summarize, report.metrics.summarize
Evidence: Read scratch/api/skore/<version>/<topic>.md (this turn)
| Write the same (this turn)
| "n/a — cache hit, file already on disk + Read this turn"
- [ ] Template copy + substitution decided:
<pkg> → package name from src/<pkg>/
<NN>_<short_name> → experiment stem
<SKORE_PROJECT_INIT> → literal block copied from experiments/<stem>.py
Evidence: Read experiments/<stem>.py this turn for the Project init block;
Read templates/audit.py this turn before Write audit/<stem>.py
- [ ] Read-only contract acknowledged: audit file contains
summarize / get / report.* only — no evaluate, no put
Evidence: explicit grep / Read confirmation of the drafted file
- [ ] Execution command shape confirmed:
pixi run -e agent python \
.agents/skills/audit-ml-pipeline/scripts/run_audit.py \
audit/<stem>.py [scratch/audit/<stem>/audit.md]
(Second arg is optional — the runner always streams to stdout.)
Evidence: command emitted in the response before running
- [ ] Pre-flight re-emitted with evidence before final message.
Evidence: this checklist appears in the end-of-turn summary.
The audit file is jupytext percent format (# %%). Filename:
audit/NN_<short_name>.py — stem matches the experiment exactly.
Template: templates/audit.py.
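For readers unfamiliar with the percent format, a file is plain Python in which lines starting with `# %%` open a new cell. The sketch below splits such a file into cell bodies; `split_percent_cells` is a hypothetical helper written for illustration and is not claimed to match how the runner parses files.

```python
def split_percent_cells(source: str) -> list[str]:
    """Split percent-format source into cell bodies on '# %%' marker lines."""
    cells: list[list[str]] = [[]]
    for line in source.splitlines():
        if line.startswith("# %%"):
            cells.append([])  # a marker line opens a new cell; the marker is dropped
        else:
            cells[-1].append(line)
    bodies = ["\n".join(lines).strip() for lines in cells]
    return [body for body in bodies if body]  # skip empty cells


# Illustrative content only; the string is split, never executed.
example = """# %%
import skore

# %%
project = skore.Project(...)
project
"""

cells = split_percent_cells(example)
```

The second cell ends with `project` on its own line, which is the bare-expression shape the audit file relies on.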
| Placeholder | Replaced with |
|---|---|
| `<pkg>` | The importable package name (from `src/<pkg>/`) |
| `<NN>_<short_name>` | The experiment stem (e.g. `02_target_transform`) |
| `<SKORE_PROJECT_INIT>` | The full Project init block (including any preceding `skore.login(...)` call for hub mode), copied byte-identical from `experiments/<stem>.py` |
| `<project-name>` | The `name=` argument from `experiments/<stem>.py` (read it; don't invent) |
| `<hub-workspace>` | Hub-mode only. From JOURNAL.md Status Workspace decisions `skore hub workspace:` row |
<SKORE_PROJECT_INIT> and <project-name> are the most error-prone
substitutions: the audit must open the same Project the experiment
wrote to. Always Read experiments/<stem>.py this turn to lift
the literal init block; never reconstruct from memory of the
skore mode: decision alone.
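A small guard can catch a placeholder that was never substituted. This is a sketch only: `fill_template` is a hypothetical helper, the template string below is illustrative and is not the content of `templates/audit.py`, and the placeholder regex is coarse (it flags anything shaped like `<name>`).

```python
import re

# Matches <pkg>, <NN>, <short_name>, <SKORE_PROJECT_INIT>, <project-name>, ...
PLACEHOLDER_RE = re.compile(r"<[A-Za-z_][A-Za-z0-9_\-]*>")


def fill_template(template: str, values: dict[str, str]) -> str:
    """Substitute each placeholder literally, then fail if any placeholder remains."""
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    leftover = PLACEHOLDER_RE.findall(template)
    if leftover:
        raise ValueError(f"unsubstituted placeholders: {sorted(set(leftover))}")
    return template


# Illustrative template, not the real skeleton.
template = "# %%\nimport skore\nimport <pkg>\n\n# %%\n<SKORE_PROJECT_INIT>\nproject\n"
filled = fill_template(
    template,
    {
        "<pkg>": "mypkg",  # hypothetical package name
        # In practice this block is copied verbatim from experiments/<stem>.py.
        "<SKORE_PROJECT_INIT>": "project = skore.Project(...)",
    },
)
```

The value for `<SKORE_PROJECT_INIT>` should always be lifted from the experiment script read this turn; the guard only catches omissions, not a wrong block.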
Brief outline; full anatomy with concrete examples →
references/cell_anatomy.md.
- `import skore`, `from <pkg> import ...`.
- `project = skore.Project(...)`; then `project` on its own line.
- `summary = project.summarize()`; then `summary`.
- `REPORT_ID` from the URL printed by `project.put()` (hub: `"skore:report:<type-singular>:<N>"`; the URL path segment is plural, the id uses the singular, e.g. cross-validations → cross-validation, estimators → estimator; local: read `summary["id"]` for the matching key row), then `report = project.get(REPORT_ID)`; then `report`.
- `report.checks.summarize().frame()`. Each row carries `documentation_url`; the actionable mitigation for an issue / tip lives at that link.
- `report.metrics.summarize().frame()`.

That's the whole template. `.frame()` is load-bearing on the checks and metrics cells (cells 6 and 7): without it the digest shows `<…Display object at 0x…>`.
Details: → references/cell_anatomy.md.
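The hub-mode plural-to-singular rule can be written down as a tiny function. This is a sketch, assuming the URL always ends in `<type-plural>/<N>` as described; the helper name `report_id_from_hub_url` and the host `hub.example` are made up for illustration.

```python
from urllib.parse import urlparse


def report_id_from_hub_url(url: str) -> str:
    """Derive 'skore:report:<type-singular>:<N>' from the last two URL path segments."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"URL has no <type-plural>/<N> tail: {url}")
    type_plural, number = segments[-2], segments[-1]
    if not type_plural.endswith("s") or not number.isdigit():
        raise ValueError(f"unexpected URL tail: {type_plural}/{number}")
    # Drop the trailing "s": cross-validations -> cross-validation.
    return f"skore:report:{type_plural[:-1]}:{number}"


cv_id = report_id_from_hub_url("https://hub.example/ws/proj/cross-validations/12")
est_id = report_id_from_hub_url("https://hub.example/ws/proj/estimators/3")
```

The result is what gets hardcoded as `REPORT_ID` in the audit file.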
iterate-from-skore's canonical source
The rendered digest at scratch/audit/<stem>/audit.md is the
single source of truth that iterate-from-skore mines to
populate the JOURNAL Backlog. That skill reads the digest as text,
walks the checks + metrics sections, and follows each check's
documentation_url to draft Backlog rows. It does NOT re-open the
Project, does NOT call report.* accessors, and does NOT write
scratch/<ts>_*.py probes for metric extraction.
The contract is deliberately narrow: checks (with their doc URLs)
pixi run -e agent python \
.agents/skills/audit-ml-pipeline/scripts/run_audit.py \
audit/<stem>.py
The runner streams the digest to stdout — the agent reads it
directly from the bash tool's output. Pass a second arg
scratch/audit/<stem>/audit.md to also write to a file (parent
created if missing).
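When the command is assembled programmatically, building it as an argument list avoids quoting mistakes. A minimal sketch for the pixi form shown above; the helper `audit_command` is hypothetical.

```python
def audit_command(stem: str, write_digest: bool = False) -> list[str]:
    """Build the argv for the audit runner; the digest file argument is optional."""
    argv = [
        "pixi", "run", "-e", "agent", "python",
        ".agents/skills/audit-ml-pipeline/scripts/run_audit.py",
        f"audit/{stem}.py",
    ]
    if write_digest:
        # Second positional argument: also write the digest to a file.
        argv.append(f"scratch/audit/{stem}/audit.md")
    return argv


stdout_only = audit_command("02_target_transform")
with_file = audit_command("02_target_transform", write_digest=True)
```

The list can be passed to `subprocess.run(argv, capture_output=True, text=True)`; the digest is then available as the captured stdout.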
For non-pixi workspaces, swap the activation prefix per
python-env-manager § Agent feature.
What the runner does internally (parsing, IPython shell setup,
matplotlib backend fix, progress-bar suppression, displayhook
patch, pandas widening, error capture) → references/runner_internals.md.
- Experiment re-run (a new `put()` under the same key) → re-execute the matching audit file. iterate-ml-experiment § 4 fires this on every record-outcome.
- `scratch/audit/<stem>/` is overwritten on every execution. No version history; the source .py + git history is the audit trail.

Extends organize-ml-workspace's pairing rule from three artifacts to four:
journal/NN_<short_name>.md — design note
experiments/NN_<short_name>.py — script
tests/smoke/test_NN_<short_name>.py — smoke test
audit/NN_<short_name>.py — audit ← this skill
Identical stems, 1:1. By the time the experiment shows done in
JOURNAL.md AND its summary is refreshed in
overview/summary.md, all four exist.
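The four-way pairing lends itself to a quick existence check. The sketch below uses hypothetical helpers (`pairing_paths`, `missing_artifacts`) and exercises them against a temporary directory in which the audit file is deliberately absent.

```python
import tempfile
from pathlib import Path


def pairing_paths(stem: str) -> dict[str, Path]:
    """Relative paths of the four artifacts that share one experiment stem."""
    return {
        "journal": Path("journal") / f"{stem}.md",
        "experiment": Path("experiments") / f"{stem}.py",
        "smoke test": Path("tests") / "smoke" / f"test_{stem}.py",
        "audit": Path("audit") / f"{stem}.py",
    }


def missing_artifacts(root: Path, stem: str) -> list[str]:
    """Names of the paired artifacts that do not exist under root."""
    return [name for name, rel in pairing_paths(stem).items() if not (root / rel).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, rel in pairing_paths("02_target_transform").items():
        if name != "audit":  # leave the audit file out on purpose
            (root / rel).parent.mkdir(parents=True, exist_ok=True)
            (root / rel).touch()
    missing = missing_artifacts(root, "02_target_transform")
```

An empty list means the pairing is complete; existence alone does not check the journal state or whether the smoke test passes.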
| Caller | When |
|---|---|
| `iterate-ml-experiment` § 4 record-outcome | Automatic; dispatched FIRST (replaces scratch probes for metric extraction). Agent feature must be available |
| `iterate-ml-experiment` § 0 (bootstrap) | After the first baseline run, dispatch here for `audit/01_baseline.py` |
| User free-text | "audit experiment 02", "show me what 03", "re-audit 04" — resolves directly |
| Callee | Why |
|---|---|
| `python-api` | Every skore symbol (`Project`, `project.summarize`, `project.get`, `report.checks.summarize`, `report.metrics.summarize`, `.frame()`). Cache hits first |
| `python-env-manager` § Agent feature | When ipython / pyright are missing: G-AGENT-FEATURE gate |
| `python-code-style` | After writing / editing `audit/<stem>.py`; bundled ruff.toml carries `audit/**` per-file ignores |
Quick lookup; detailed recovery steps in references/failure_modes.md.
| Symptom | Cause | Fix |
|---|---|---|
| `project.get(key)` raises KeyError / TypeError | Lookup by key, not id; local vs hub shape differs | → references/failure_modes.md § "project.get(key) raises" |
| `ModuleNotFoundError: No module named 'IPython'` | Agent feature not installed | Delegate to python-env-manager; never pip install here |
| Cell renders as `<Display object at 0x…>` | `*.summarize()` called without `.frame()` | Add `.frame()` |
| AttributeError for a `report.*` accessor | Symbol from memory; skore version drift | → references/failure_modes.md § "AttributeError" |
| `RuntimeError: No report under key=...` | `put()` landed in a different Project | → references/failure_modes.md § "wrong Project" |
| Report differs across runs with unchanged source | Non-deterministic step / different data slice | Not a bug here; surface to user |
| Hub mode: `skore.login()` auth error | Token expired / first-time login | → references/failure_modes.md § "skore.login fails" |
| Hub mode: `TypeError: workspace kwarg` | Hub form left local-mode kwarg | → references/failure_modes.md § "TypeError workspace" |
| Hub mode: report missing in `summarize()` after `put()` | Wrong hub workspace OR no read access | → references/failure_modes.md § "report missing" |
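The KeyError row in the table above comes down to passing a key where an id is expected. The sketch below shows the local-mode lookup shape on a plain list of dicts. That list is a stand-in chosen for illustration; the actual return type of `project.summarize()` should be confirmed via python-api, and the id strings are placeholders.

```python
def id_for_key(rows: list[dict], key: str) -> str:
    """Return the id of the single row whose key matches; fail loudly otherwise."""
    matches = [row["id"] for row in rows if row["key"] == key]
    if not matches:
        raise LookupError(f"no report under key={key!r}")
    if len(matches) > 1:
        # Typically the trace of a duplicated put() under the same key.
        raise LookupError(f"{len(matches)} reports under key={key!r}; pick one id explicitly")
    return matches[0]


# Stand-in rows with placeholder ids.
rows = [
    {"key": "01_baseline", "id": "id-a"},
    {"key": "02_target_transform", "id": "id-b"},
]
report_id = id_for_key(rows, "02_target_transform")
```

The resolved id, not the key, is what goes to `project.get(...)`.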
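Owned by other skills, not this one:
- Evaluating and putting reports (`evaluate-ml-pipeline`).
- Installing ipython / pyright (`python-env-manager` owns).
- `pyrightconfig.json` (`python-env-manager` owns).
- Drafting Backlog rows from the digest (`iterate-from-skore`).
- `journal/NN_*.md` (`iterate-ml-experiment`).
- Smoke tests (`smoke-test-ml-pipeline`).

| Skill | Relationship |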
|---|---|
| `iterate-ml-experiment` | Caller. § 4 dispatches here FIRST; the digest feeds summary.md refresh |
| `iterate-from-skore` | Downstream consumer of this skill's digest. audit-ml-pipeline opens the Project and renders the digest; iterate-from-skore parses the digest as text and drafts Backlog rows from each surfaced check. Never opens the Project itself |
| `evaluate-ml-pipeline` | Producer side. `skore.evaluate` + `project.put` live only in `experiments/NN_*.py` |
| `organize-ml-workspace` | Workspace layout; four-way stem pairing |
| `python-env-manager` | Agent feature install (G-AGENT-FEATURE). This skill requests; that skill installs |
| `python-api` | skore symbol lookups. Cache hits first |
| `python-code-style` | ruff after writing/editing `audit/<stem>.py` |
| `data-science-python-stack` | Catalogues ipython + pyright under the agent feature |
- `templates/audit.py`: per-experiment audit file skeleton. Copy it and substitute the placeholders.
- `scripts/run_audit.py`: the in-process cell runner. Source of truth for the execution contract; don't reimplement.
- `references/cell_anatomy.md`: concrete cell examples (right / wrong shapes), full 7-cell sequence, why `.frame()` matters, bare-expression rules.
- `references/runner_internals.md`: what `run_audit.py` does internally: parsing, IPython shell + NoOpDisplayHook, matplotlib Agg backend, progress-bar suppression, pandas widening, per-cell capture, error rendering.
- `references/failure_modes.md`: detailed recovery for every symptom in § Failure modes.