By probabl-ai
Run the full ML experimentation lifecycle in Python — propose experiments, declare pipelines with skrub, train and evaluate with skore, audit results, and mine diagnostics into actionable backlog items — all inside a structured journal-driven workspace with opinionated tooling for environment management, code style, and smoke testing.
npx claudepluginhub probabl-ai/skills --plugin probabl-skills

Owns the `audit/` folder: one `# %%` (jupytext percent) Python file per experiment, aligned 1:1 with `experiments/NN_<short_name>.py` and `journal/NN_<short_name>.md`, that loads the experiment's skore report **read-only** and uses bare-last-expression cells whose `__repr__` carries the audit's signal. The agent executes the audit file via the bundled in-process runner (`audit-ml-pipeline/scripts/run_audit.py`, built on IPython `InteractiveShell.run_cell`), which streams a markdown digest of each cell's stdout + last-expression repr to stdout (optionally also to a file). The digest fuels narrative work (the `overview/summary.md` refresh, follow-up questions about a past experiment, cross-experiment comparison). Stops at "audit/NN_*.py is placed, executed, and the digest is available." Never calls `skore.evaluate(...)` or `project.put(...)`.

TRIGGER (any of):
- `iterate-ml-experiment` § 4 record-outcome: audit is dispatched FIRST (replaces scratch probes for metric extraction).
- The user asks "audit experiment 02", "show me what 03 looks like", "re-audit 04 against the new report".
- An experiment was re-run (same `put()` key overwritten) and the matching audit file needs re-execution.
- The user wants a human-readable narrative of a past experiment without firing the full `iterate-from-skore` flow.

SKIP when: the design note isn't approved yet (route to `iterate-ml-experiment`); the experiment hasn't been run (no report on disk); the agent feature isn't installed (delegate to `python-env-manager` § "Agent feature"); the user is mining the report to source the *next* experiment (`iterate-from-skore`).

HOW TO USE: confirm the four-way stem pairing exists (`journal/NN_*.md` approved + `experiments/NN_*.py` exists + smoke test passed + report under that key in the Project), then place `audit/NN_<short_name>.py` from `templates/audit.py`, substituting the package name + the literal Project init block copied from `experiments/<stem>.py`. Execute via the bundled runner: `pixi run -e agent python .agents/skills/audit-ml-pipeline/scripts/run_audit.py audit/<stem>.py`. **Read the Stop conditions and emit the Pre-flight checklist before any write or shell command.** Always invoke `python-api` for skore symbol signatures; never write them from memory.
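The file-level half of the stem pairing can be checked mechanically before an audit file is placed. The following is a minimal sketch, assuming the `journal/` and `experiments/` layout named in the description; `missing_pair_files` is a hypothetical helper, not part of the plugin, and it cannot tell whether the design note is approved, the smoke test passed, or the report exists in the Project.

```python
import tempfile
from pathlib import Path


def missing_pair_files(workspace: Path, stem: str) -> list[str]:
    """Return the relative paths a stem still lacks (hypothetical helper)."""
    expected = [
        Path("journal") / f"{stem}.md",      # design note
        Path("experiments") / f"{stem}.py",  # experiment script
    ]
    return [p.as_posix() for p in expected if not (workspace / p).is_file()]


# Demo on a throwaway workspace where only the experiment script exists.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "experiments").mkdir()
    (root / "experiments" / "01_baseline.py").write_text("# %%\n")
    missing = missing_pair_files(root, "01_baseline")

print(missing)  # ['journal/01_baseline.md']
```

An empty result only means the files are on disk; the remaining checks still have to be confirmed by reading the journal and the Project.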
Declare the pipeline from data source to predictor as a **skrub DataOps graph** (not as a bare `sklearn.Pipeline`). Every step is either a pure-Python function (stateless) attached via `.skb.apply_func`, or a sklearn-compatible estimator (stateful) attached via `.skb.apply`. Stops at the declared object: no fit, split, tuning, persistence, or evaluation.

TRIGGER (any of):
- Writing or editing code that declares any link in the chain *data source → predictor*: loaders, preprocessing, encoders / imputers / scalers, feature steps, composition objects (`Pipeline`, `ColumnTransformer`, skrub `tabular_pipeline`, `nn.Module`), or the final estimator.
- A pure-Python data-processing function destined for the pipeline path (cleans / derives / reshapes), whether wrapped via `FunctionTransformer`, `@skrub.deferred` / `skrub.var`, a custom `BaseEstimator` subclass, or just called in the training path before the estimator.
- A step is added, removed, swapped, or reordered inside an existing pipeline declaration.
- A bare `sklearn.Pipeline` / `make_pipeline` is being used as the top level: fire to redirect into a skrub DataOps graph.
- The user asks to build / declare / set up a pipeline / classifier / regressor for X.

SKIP when: `.fit(...)` calls / training loops / `Trainer.fit` / epoch loops; train/test split or cross-validation splitting; hyperparameter search; persistence (`joblib.dump`, checkpointing); evaluation / metrics / scoring; inference over a pre-trained model; pure EDA; library-choice questions with no concrete declaration in play.

HOW TO USE: consult before the first declarative line and on every structural edit (added/swapped step, changed input columns, changed estimator family). Don't re-consult for cosmetic edits. **First, read the Stop conditions and emit the Pre-flight checklist as visible text before any code.** Always invoke `python-api` to confirm skrub / sklearn symbol names and signatures before typing; don't guess from memory.
Opinionated Python stack for data-science / ML work: one library per job, organized into tiers (mandatory / user choice / optional / transitive). SKILL.md is the index; per-library `references/<library>.md` files carry scope, "pick this when" / "pick something else when", and pairings.

TRIGGER when (any of these):
1. **A library import fails** in this stack's domain: the answer is install, not substitute (see § "Missing dependency").
2. **A library choice has to be made**, explicitly (the user asks "which library for X?") or implicitly (code is about to introduce a new dependency, or the project is being scaffolded and the tabular library hasn't been picked yet).
3. Starting a new Python data-science / ML project.
4. The user or current code reaches for a substitute outside the stack (xgboost, lightgbm, black, isort, flake8, poetry, hatch), or reaches for `mlflow` to log params/metrics, or for `cross_val_score` + handwritten reporting. Redirect: tracking → `skore` Project API, evaluation / reporting → `skore` report classes, `mlflow` stays only for model serving / registry.

SKIP when: the project is non-Python; the work is web / backend / infra unrelated to data science; the library is already chosen and installed and the task is implementation inside it (bug fix, feature work, refactor) with no new dependency in play.

HOW TO USE: **read this SKILL.md end-to-end before recommending or installing anything**. Picking from a single index entry hides the tier (whether the library is mandatory, a user choice, optional, or already transitively present) and the pairings, and both matter. Then read the linked `references/<library>.md` for the chosen library's scope and tradeoffs. Don't silently substitute one library for another; if no entry fits, surface the gap to the user.
Methodology for evaluating a single sklearn-compatible learner (in particular, the `SkrubLearner` produced by `build-ml-pipeline`). Owns: which entry point to call (`skore.evaluate` first, the explicit report classes when needed), which cross-validator to pick from scikit-learn's catalogue, how to consume the structural metadata (`groups`, `times`, …) attached at build time via `.skb.mark_as_X(split_kwargs=...)`. Stops at "what does the report say". Defaults (metrics, plots) come from skore; only override on explicit user request.

TRIGGER when: code calls `cross_val_score`, `cross_validate`, `classification_report`, or any handwritten metric print (`print(mean_squared_error(...))`); code calls `.skb.cross_validate(...)` (route through skore for richer output); user asks how to score, evaluate, or compare a single learner; user asks how to pick a cross-validator; user wants to see a report / metrics / diagnostic plots for a fitted learner.

SKIP when: declaring the pipeline (use `build-ml-pipeline`); hyperparameter / model search (separate skill); fitting, persisting, or serving the final model; tracking or comparing experiments across multiple runs over time (separate skill).

HOW TO USE: invoke before any evaluation call. **First, read the "Stop conditions" block at the top of the body and emit the Pre-flight checklist as visible text in your response; both are mandatory before any evaluation code is written.** The structural facts about the data (group keys, time ordering) should already be encoded at the X marker via `split_kwargs`; if they aren't and you can't tell from the data, return to `build-ml-pipeline` and ask the user. For symbol-level lookups (skore symbols and splitters), defer to `python-api`; don't guess names from memory.
Source the next ML experiment proposal by **reading the audit digest** at `scratch/audit/<stem>/audit.md` (produced by `audit-ml-pipeline` at § 4 record-outcome). For every row in the digest's `## Checks summary` whose `severity` is `issue` or `tip`, follow the row's `documentation_url` to draft a Backlog row whose `Item` is the mitigation the docs recommend. The `## Metrics summary` provides context for the human summary paragraph but does not drive Backlog rows on its own. Returns the enriched Backlog rows + a one-paragraph summary back to `iterate-ml-experiment`, which writes the rows into `JOURNAL.md` and re-presents the sourcing menu so the user can promote a `B<N>` row. Stops at "Backlog enriched, summary returned"; never writes a per-experiment design note, never picks the "winning" finding; the user picks via `B<N>`.

TRIGGER when: `iterate-ml-experiment` is picking a sourcing strategy and the user picks `skore` from the menu; the user says "mine the report", "what does skore see?", "fill the backlog from the diagnostic"; the previous experiment has finished and the user wants the report converted into actionable backlog items.

SKIP when: the previous experiment hasn't run yet (no audit digest on disk); the user has a concrete modelling idea (use `iterate-from-user`); the task is the *mechanics* of running / opening a report (route to `evaluate-ml-pipeline`); the user wants a narrative read of one specific section of the report (route to `evaluate-ml-pipeline`).

HOW TO USE: read the existing `scratch/audit/<stem>/audit.md` digest as text: do NOT re-open the skore Project, do NOT call `report.*` accessors. For each row in the `## Checks summary` section whose severity is `issue` or `tip`, follow the `documentation_url` (via WebFetch) and draft one Backlog row citing `audit:<stem>:checks.<code>`. Dedupe against rows already in `JOURNAL.md` Backlog by source citation. Return the candidate rows + a one-paragraph human summary. The parent skill writes the rows to `JOURNAL.md` and re-shows the sourcing menu.
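The filter-and-dedupe step described above can be sketched in plain Python. This is an illustration under stated assumptions: the digest's `## Checks summary` is taken to be a markdown pipe table with `code`, `severity`, and `documentation_url` columns, which the description implies but does not spell out; the check codes, the `info` severity, and the `example.org` URLs are invented placeholders; and `parse_checks` / `backlog_candidates` are hypothetical names, not functions shipped with the skill.

```python
def parse_checks(digest: str) -> list[dict[str, str]]:
    """Parse the pipe table under '## Checks summary' into row dicts."""
    rows, header, in_section = [], None, False
    for line in digest.splitlines():
        if line.startswith("## "):
            in_section = line.strip() == "## Checks summary"
            header = None
            continue
        if not in_section or not line.strip().startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if header is None:
            header = cells  # first table line names the columns
        elif not all(set(cell) <= set("-: ") for cell in cells):
            rows.append(dict(zip(header, cells)))  # skip the |---| separator
    return rows


def backlog_candidates(digest: str, stem: str, already_cited: set[str]) -> list[dict[str, str]]:
    """Keep issue/tip rows whose citation is not yet in the Backlog."""
    seen = set(already_cited)
    candidates = []
    for row in parse_checks(digest):
        if row.get("severity") not in {"issue", "tip"}:
            continue
        citation = f"audit:{stem}:checks.{row['code']}"
        if citation in seen:
            continue
        seen.add(citation)
        candidates.append({"source": citation, "documentation_url": row["documentation_url"]})
    return candidates


# Toy digest; every value below is a placeholder, not real audit output.
toy_digest = """## Metrics summary
| metric | value |
|---|---|
| placeholder | n/a |

## Checks summary
| code | severity | documentation_url |
|---|---|---|
| C001 | issue | https://example.org/c001 |
| C002 | info | https://example.org/c002 |
| C003 | tip | https://example.org/c003 |
"""
candidates = backlog_candidates(toy_digest, "01_baseline", {"audit:01_baseline:checks.C003"})
print(candidates)
```

Only `C001` survives here: `C002` has a severity other than `issue` or `tip`, and `C003` is already cited. Drafting the `Item` text still requires reading the page behind each `documentation_url`.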
A collection of skills for ML experimentation in Python, organized around skrub, scikit-learn, and skore, and more broadly around the PyData ecosystem.
One command, 55+ agents — including Claude Code, Codex, Cursor, OpenCode, Gemini CLI, and Mistral Vibe:
npx skills add probabl-ai/skills
That's the skills CLI from Vercel
Labs. It auto-detects which coding agents you have installed and drops the
skills into each one's skills directory — no per-agent configuration.
Install the full bundle. The skills cross-reference each other (the
iteration-loop skill dispatches to its sourcing strategies, the test router
dispatches to the smoke-test skill, several skills point to `python-api` for
symbol lookups), and the Agent Skills spec doesn't yet carry a `requires`
field, so the CLI can't auto-resolve those references. A partial install
will leave dangling pointers.
Useful flags once you have it:
npx skills add probabl-ai/skills --list # preview the catalog
npx skills add probabl-ai/skills -g # global install (~/<agent>/skills/)
npx skills add probabl-ai/skills -a claude-code -a codex # target specific agents
npx skills update # pull the latest
See the full agent list and command reference in the
skills CLI docs.
If you only use Claude Code and prefer the native plugin flow, this repo is also a Claude Code plugin marketplace:
/plugin marketplace add probabl-ai/skills
/plugin install probabl-skills@probabl-skills
/plugin update pulls new releases.
| Skill | Description |
|---|---|
| build-ml-pipeline | Declare the pipeline from data source to predictor as a skrub DataOps graph. Stops at the declared object — no fit, split, tuning, or persistence. |
| evaluate-ml-pipeline | Evaluate a single sklearn-compatible learner: pick the right entry point (skore.evaluate first), the right cross-validator, and consume report metadata. |
| test-ml-pipeline | Router that owns the tests/ folder of an ML workspace and the experiment ↔ test pairing rule. Dispatches to a per-category subskill. |
| smoke-test-ml-pipeline | Diagnostic-by-construction pytest that catches the "load → featurize → split" anti-pattern by predicting on a disjoint, no-buffer slice of the real data source. |
| audit-ml-pipeline | Owns the audit/ folder: one # %% file per experiment that loads its skore report read-only and uses bare-last-expression cells. The agent executes via an in-process IPython runner (scripts/run_audit.py) that streams a markdown digest. Read-only — never calls evaluate or put. |
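The bare-last-expression convention used by `audit-ml-pipeline` can be illustrated with the standard library alone. The sketch below is not the bundled `scripts/run_audit.py`, which runs cells through IPython; it is a simplified stand-in that splits a `# %%` file into cells, captures each cell's stdout, and records the repr of a trailing bare expression. The function names, the digest layout, and the toy cell contents are assumptions made for the example.

```python
import ast
import contextlib
import io
from typing import Optional


def split_cells(source: str) -> list[str]:
    """Split jupytext percent-format source into non-empty cell bodies."""
    cells, current = [], []
    for line in source.splitlines():
        if line.startswith("# %%"):
            if any(part.strip() for part in current):
                cells.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(part.strip() for part in current):
        cells.append("\n".join(current))
    return cells


def run_cell(code: str, namespace: dict) -> tuple[str, Optional[str]]:
    """Run one cell; return (captured stdout, repr of a bare last expression)."""
    tree = ast.parse(code)
    last = tree.body.pop() if tree.body and isinstance(tree.body[-1], ast.Expr) else None
    buffer = io.StringIO()
    value = None
    with contextlib.redirect_stdout(buffer):
        exec(compile(tree, "<cell>", "exec"), namespace)
        if last is not None:
            value = eval(compile(ast.Expression(last.value), "<cell>", "eval"), namespace)
    return buffer.getvalue(), None if value is None else repr(value)


def digest(source: str) -> str:
    """Build a small markdown digest, one heading per cell."""
    namespace: dict = {}
    parts = []
    for index, code in enumerate(split_cells(source), start=1):
        stdout, result = run_cell(code, namespace)
        parts.append(f"### Cell {index}")
        if stdout:
            parts.append("stdout: " + stdout.rstrip())
        if result is not None:
            parts.append("repr: " + result)
    return "\n".join(parts)


# Toy audit file; the dictionary stands in for a loaded report.
audit_source = '''# %%
summary = {"n_rows": 3}
print("loaded")

# %%
summary
'''
report = digest(audit_source)
print(report)
```

The second cell contains nothing but `summary`, so its repr is what lands in the digest. The same pattern lets an audit file surface a report's own `__repr__` without any explicit `print`.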
| Skill | Description |
|---|---|
| iterate-ml-experiment | Drives the iteration loop on top of an ML workspace — owns journal/JOURNAL.md and per-experiment design notes, and dispatches to a sourcing strategy below. |
| iterate-from-skore | Source the next experiment by reading the audit digest at scratch/audit/<stem>/audit.md — every issue / tip row drives a Backlog item, following the row's documentation_url for the mitigation. |
| iterate-from-user | Source the next experiment from the user directly — free-text, a scientific article URL, or a resource link (GitHub issue / spec / reference repo). |
| Skill | Description |
|---|---|
| organize-ml-workspace | Decide where files live: reusable code, per-experiment scripts (jupytext-style # %%), reports. One file per experiment. |
| python-code-style | Place the project's ruff.toml template and run ruff (lint + format) on touched files. numpydoc for docstrings. |
| python-env-manager | Detect the project's env manager (pixi / uv / poetry / hatch / conda / pip+venv) and issue the right install command. Defaults to pixi when bootstrapping. |
| data-science-python-stack | Opinionated one-library-per-job Python stack, organized into mandatory / user-choice / optional / transitive tiers. |
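Detecting an environment manager, as `python-env-manager` is described as doing, usually comes down to looking for files each tool leaves in the project root. The sketch below is one possible approach and not the skill's actual logic: the marker files, their priority order, and the `detect_env_manager` name are all assumptions made for illustration. The fallback to pixi for an empty project mirrors the stated bootstrapping default.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Marker files checked in order; this mapping is an illustrative assumption.
MARKERS = [
    ("pixi", ["pixi.toml", "pixi.lock"]),
    ("uv", ["uv.lock"]),
    ("poetry", ["poetry.lock"]),
    ("hatch", ["hatch.toml"]),
    ("conda", ["environment.yml"]),
    ("pip+venv", ["requirements.txt", ".venv"]),
]


def detect_env_manager(project: Path) -> Optional[str]:
    """Return the first manager whose marker exists, or None if none is found."""
    for manager, names in MARKERS:
        if any((project / name).exists() for name in names):
            return manager
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # A fresh directory has no markers, so fall back to the bootstrap default.
    bootstrap = detect_env_manager(root) or "pixi"
    # Once a uv lockfile appears, detection picks it up.
    (root / "uv.lock").write_text("")
    detected = detect_env_manager(root)

print(bootstrap, detected)  # pixi uv
```

A project configured only through `pyproject.toml` tables would need that file parsed as well, which this sketch leaves out.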