From probabl-skills
Declares ML pipelines as skrub DataOps graphs instead of bare sklearn Pipelines, using .skb.apply_func for stateless steps and .skb.apply for stateful estimators. Stops at declaration — no fit, split, tuning, or evaluation.
npx claudepluginhub probabl-ai/skills --plugin probabl-skills
Slash command: /probabl-skills:build-ml-pipeline
Declarative shape of a Python ML pipeline from data source to predictor.
Read these once; they're referenced throughout.
X marker: the .skb.mark_as_X() call that anchors the
predict-time slice. Everything upstream runs identically at
fit and predict; everything downstream is per-prediction work.
(group, time) set.
learner.predict({"data_dir": …})).
| You came here for… | → next |
|---|---|
| Declared pipeline → CV strategy | → evaluate-ml-pipeline § G-CV-SPLITTER |
| Declared pipeline → smoke test | → test-ml-pipeline → smoke-test-ml-pipeline |
| Symbol lookup mid-declaration | → python-api (Shape 1 / 1b / 3) |
| Missing skrub/sklearn import | → python-env-manager § install |
| Modified pipeline.py / features.py / data.py | → python-code-style (ruff + NumPyDoc) |
Always re-emit the Pre-flight checklist with evidence before declaring the turn done.
The 90% case. Copy + adapt; replace TARGET_COL and the regressor.
```python
import skrub
from sklearn.ensemble import HistGradientBoostingRegressor

from <pkg>.data import TARGET_COL, load_raw


def build_learner(data_dir_preview=None):
    """Return the unfit learner (skrub SkrubLearner)."""
    data_dir = (
        skrub.var("data_dir", value=str(data_dir_preview))
        if data_dir_preview is not None
        else skrub.var("data_dir")
    )

    # Layer 1 + 2: load + mark X / y on the source frame.
    # No cross-row feature steps → marker sits here.
    data = data_dir.skb.apply_func(load_raw)
    X = data.drop(columns=[TARGET_COL]).skb.mark_as_X()
    y = data[TARGET_COL].skb.mark_as_y()

    # Layer 3: estimator at the tail. Feature engineering (if any)
    # chains between mark_as_X and the final .skb.apply.
    predictions = X.skb.apply(
        HistGradientBoostingRegressor(random_state=0), y=y
    )
    return predictions.skb.make_learner()
```
For history-dependent / panel / cold-start cases (≠ IID):
→ references/layer_examples.md § history-dependent.
For loader-baked-shift counter-example (what NOT to do):
→ references/layer_examples.md § counter-example.
Each Stop condition: rule → symptom → recovery. Scan top to bottom; any match means STOP.
S1. import skrub raising means python-env-manager is next, not a substitute library.
- Symptom: ModuleNotFoundError: No module named 'skrub'.
- Recovery: python-env-manager for the install command. Do NOT substitute with sklearn.Pipeline / make_pipeline / FunctionTransformer; that silently rewrites this skill out of the project.
S2. Every skrub / sklearn symbol needs a python-api lookup this turn.
- Symptom: tabular_learner (renamed in 0.7+), mark_as_y(col) (signature dropped the positional in 0.9+), or any name "you remember".
- Recovery: python-api. Recognition is not a lookup; names drift between releases.
S3. No KFold / StratifiedKFold / train_test_split / any splitter import in pipeline code.
- Symptom: from sklearn.model_selection import KFold in pipeline.py.
- Recovery: splitting is evaluate-ml-pipeline's territory. This skill only wires split_kwargs AT the X marker (see Rule 2).
S4. skrub.X(...) / skrub.y(...) are not acceptable graph roots; use skrub.var("<source>", value=preview) instead.
- Symptom: skrub.X(df) / skrub.y(s).
- Recovery: skrub.var("data_dir", value=...) → .skb.apply_func(load_fn) → .skb.mark_as_X(). The shortcuts (1) bake the marker at the source, defeating Layer 1; (2) force a pre-loaded binding, breaking predict-time replay; (3) silently re-enable the late-mark_as_X bug for cross-row features.
S5. Late mark_as_X is forbidden when any feature step is cross-row. If a step looks at other rows (lag / rolling / cross-row join / target-shift, drop_nulls on shifted col), the X marker goes UPSTREAM of that step. The step references the cross-row source as an additional apply_func argument (Layer 1 source → Layer 3 feature, via the marker bypass).
- Symptom: len(predictions) != n_predict_grid_rows; OR a feature_steps=[] toggle appears in build_learner "to make predict work for cold-start"; OR a temp-dir gymnastic at predict time to fake history; OR a wrapper estimator whose only job is to filter NaN rows the pipeline itself produced. (Don't be misled by syntax: pl.col("x").shift(k) IS cross-row.)
- Recovery: fix the graph (marker upstream of the cross-row step) and drop feature_steps=[]. Confirm with the smoke test (smoke-test-ml-pipeline): a pipeline with the marker in the right place passes by construction.
S6. Layer 1 audit: every apply_func upstream of mark_as_X passes the constructive test.
- Symptom: target.shift(-HORIZON), a drop_nulls("y"), or any task-specific filter.
S7. Probe code is written to scratch/<YYYY-MM-DD>_<HHMMSS>_<short>.py and runs via pixi run python scratch/<ts>_<short>.py.
- Symptom: pixi run python -c or python -c.
- Recovery: write to scratch/<ts>_*.py instead (see python-api § Stop conditions). No 2-line carve-out.
S8. No warnings.filterwarnings(...) in pipeline.py or scratch probes unless the user explicitly asks. See python-code-style § Stop conditions.
| Shortcut | Why it's wrong |
|---|---|
| tabular_learner from memory | Renamed to tabular_pipeline in skrub 0.7+. Typed from memory → ImportError on modern installs |
| mark_as_y(target_column) positional arg | Dropped in 0.9+. Use .skb.select("...") BEFORE the mark |
| skrub.X(df) / skrub.y(s) as roots | Forbidden (S4). Use skrub.var("<source>", value=...) |
| value="data/train.parquet" literal in pipeline.py | Resolves against CWD; breaks runs from non-root dirs. Expose data_dir_preview as kwarg; caller passes PROJECT_ROOT / "data" |
| feature_steps=[] toggle "to make predict work" | S5 symptom. Fix the graph, not the predict-time bypass |
| skore.evaluate(learner, X, y, ...) | SkrubLearner takes an env-dict. Use data={"data_dir": ..., ...} |
| bare sklearn.Pipeline as top-level | Rewrite as skrub DataOps graph (Rule 1) |
| Inline pixi run python -c "..." | S7. Write to scratch/<ts>_*.py instead |
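The scratch file name pattern scratch/<YYYY-MM-DD>_<HHMMSS>_<short>.py can be generated rather than typed by hand. A minimal sketch, assuming a hypothetical helper named scratch_path (not part of any skill or library) and assuming `<short>` should be a lowercase slug:

```python
from datetime import datetime
from pathlib import Path
import re


def scratch_path(short, now=None, root="scratch"):
    """Build scratch/<YYYY-MM-DD>_<HHMMSS>_<short>.py (hypothetical helper)."""
    now = now or datetime.now()
    # Reduce the label to a lowercase slug so it is safe in a file name.
    slug = re.sub(r"[^a-z0-9]+", "_", short.lower()).strip("_")
    return Path(root) / f"{now:%Y-%m-%d}_{now:%H%M%S}_{slug}.py"


# A fixed timestamp keeps the example deterministic.
path = scratch_path("check tier1", now=datetime(2025, 1, 31, 9, 5, 7))
print(path.as_posix())  # scratch/2025-01-31_090507_check_tier1.py
```

The probe is then written to that path and run with pixi run python on the same path, which is the form the checklist accepts as evidence.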
Each ticked box requires an actual tool call this turn. Empty Evidence = unchecked.
Pre-flight (build-ml-pipeline):
- [ ] Tier 1 mandatory libs importable: sklearn, skrub, skore
Evidence: scratch/<ts>_check_tier1.py + `pixi run python …` output.
**Inline `python -c` is NOT evidence.**
- [ ] Tabular library identified: pandas | polars
Evidence: JOURNAL.md Status (Workspace decisions) | user quote
| "n/a — pandas already in loader signature"
- [ ] python-api consulted for skrub symbols this turn
Evidence: Read scratch/api/skrub/<v>/<topic>.md (this turn)
| "n/a — no new skrub symbol this turn"
- [ ] python-api consulted for sklearn symbols this turn
Evidence: Read scratch/api/sklearn/<v>/<topic>.md (this turn)
| "n/a — no new sklearn symbol this turn"
- [ ] Source-binding pattern chosen
Evidence: list each planned `skrub.var("<name>")` and state
whether it's a source identifier (e.g. `data_dir`)
or a predict-grid descriptor. IID: one `skrub.var`
rooted on the loaded frame is enough.
- [ ] X-marker placement decided
Evidence: name the DataOp node where `.skb.mark_as_X()` lands.
IID: on the loaded source frame. Panel / cold-start:
on the predict-grid node, BEFORE any history-dep step.
- [ ] (Cross-row pipelines only) Each cross-row step references the
upstream history DataOp as an extra `apply_func` arg
Evidence: name each step + its history-DataOp argument
| "n/a — no cross-row steps"
- [ ] Layer 1 audit — every `apply_func` upstream of `mark_as_X`
passes the constructive test (S6)
Evidence: per-step "external consumer would derive this: yes/no"
- [ ] Preview value handling
Evidence: `build_learner` exposes `data_dir_preview=None` kwarg;
no relative-path literal baked into `pipeline.py`
- [ ] split_kwargs at the X marker decided: groups | time | none
Evidence: name the column(s) wired OR "n/a — i.i.d., no group
or time structure"
- [ ] Smoke test wired (`tests/smoke/test_NN_<short_name>.py`)
Evidence: per `smoke-test-ml-pipeline`; trivial assertions if no
history-dep
- [ ] Pre-flight re-emitted with evidence before final message.
Evidence: this checklist appears in the end-of-turn summary.
Declare the pipeline as a skrub DataOps graph rooted at one or
more skrub.var(...) calls — not as a bare
sklearn.Pipeline. The skrub.X(...) / skrub.y(...) shortcuts
are not acceptable roots (see S4). Look up the underlying
signatures via python-api.
Reference: https://skrub-data.org/stable/data_ops.html
→ next: Rule 2 (where the marker goes).
The marker is the shared-vs-predict-specific boundary.
One question to place the marker: does any feature step look at rows other than the one currently being processed?
| Answer | Placement | Pattern |
|---|---|---|
| No (per-row math, stateful encoders that learn at fit and apply per-row) | Marker on the loaded source frame | Canonical IID example above |
| Yes (lag / rolling / cross-row join / target-shift) | Marker UPSTREAM of every cross-row step | Three-layer model below |
The three logical layers:
Layer 1 — Sources. One skrub.var(...) per input identifier:
raw history file(s) / URL(s) / table name(s), side tables, and —
for time-series / cold-start panels — the predict-time-grid
description (start/end range, list of (group_id, time)).
The loader for each source is its first .skb.apply_func.
Loaders are pure functions of a single source identifier.
Do not load + featurize in one apply_func — that fuses
Layers 2 + 3 with the loader and breaks predict-time replay.
Layer 2 — Predict-time grid + X marker. A DataOp whose rows are exactly the predict grid.
For time-series / cold-start panels this is the (group, time) grid derived from
Layer 1's predict-time bounds.
mark_as_X and mark_as_y go here. Target derivation that
requires history (and drop_nulls on y) belongs to a small
stateful BaseEstimator with fit_transform → {X, y} /
transform → {X, y=None}, attached at this layer.
Layer 3 — Feature engineering. apply_func chained on the
X-branch after mark_as_X. History-dependent steps take the
X DataOp as their first argument and the relevant Layer-1
source DataOp(s) as additional arguments — history is
referenced, not bound to X. The same history node materializes
the full available history at fit and at predict, so a backward
lag computed for a row in the predict grid sees real values from
the train history — no cold-start NaN.
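The difference between binding history to X and referencing it can be shown without skrub. The sketch below is plain Python under stated assumptions: grid rows are (group, time) tuples, history is a dict keyed by (group, time), and lag_from_grid_only / lag_from_history are hypothetical names used only for illustration.

```python
def lag_from_grid_only(grid_values, k=1):
    """Cross-row lag computed on the predict grid alone (the late-marker shape)."""
    if k < 1:
        raise ValueError("k must be >= 1")
    # The first k rows have no predecessor inside the grid, so they come out empty.
    return [None] * k + grid_values[:-k]


def lag_from_history(grid, history, k=1):
    """Same lag, looked up in a separate history mapping (the referenced shape)."""
    # One output per grid row; the value k steps back comes from history.
    return [history.get((group, time - k)) for group, time in grid]


# Hypothetical data: observed target per (group, time) up to t=3.
history = {("a", 1): 10.0, ("a", 2): 11.0, ("a", 3): 12.0}
# Predict grid: a single future row for group "a".
grid = [("a", 4)]

print(lag_from_grid_only([None], k=1))  # [None]: cold-start gap
print(lag_from_history(grid, history))  # [12.0]: real value from the train history
```

The second function always returns one value per grid row, which is the property the smoke test checks when it compares the number of predictions with the number of predict-grid rows.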
Worked examples (full code, IID + history-dependent +
counter-example): → references/layer_examples.md. Also see
python-api/references/pre_mark_alignment.md for the
production-style three-layer walkthrough drawn from this
workspace's 01_baseline.
Preview value is a caller-supplied parameter, not a literal in
pipeline.py. value= controls what learner.skb.preview()
sees during interactive iteration — nothing else. A literal like
value="data/train.parquet" resolves against CWD and silently
breaks runs not started from the project root. Expose the preview
as an optional kwarg on build_learner and leave it None for
production fit / cross-validate.
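To see why a relative literal is fragile, the sketch below resolves the same relative path from two working directories and compares it with a path anchored on an explicit root. It uses only the standard library; project_root is a temporary directory standing in for the real project root, and the notebooks subdirectory is a made-up example of a non-root launch location.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    project_root = Path(tmp).resolve()
    (project_root / "notebooks").mkdir()
    original_cwd = Path.cwd()
    relative = Path("data/train.parquet")  # the kind of literal the text warns about

    try:
        os.chdir(project_root)
        from_root = relative.resolve()
        os.chdir(project_root / "notebooks")
        from_subdir = relative.resolve()
    finally:
        os.chdir(original_cwd)  # always restore the caller's working directory

    # Anchoring on an explicit root gives the same answer from anywhere.
    anchored = project_root / "data"

    print(from_root == from_subdir)      # False: the literal follows the CWD
    print(from_root.parent == anchored)  # True, but only for the run started at the root
```

Passing the anchored directory as data_dir_preview from the caller keeps pipeline.py free of any path that depends on where the process was started.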
Downstream evaluation contract. A SkrubLearner does NOT
implement sklearn's fit(X, y) signature — it takes an
environment dict. Pair with
skore.evaluate(learner, data={"data_dir": ..., ...}, splitter=...),
never with skore.evaluate(learner, X, y, ...) (raises). See
evaluate-ml-pipeline; confirm signatures via python-api.
Cross-validation metadata at the X marker. If the data has
group structure (subjects, sessions, customer IDs, repeated
measures) or temporal ordering, attach the relevant column at
.skb.mark_as_X(split_kwargs={...}):
```python
X = data.drop(columns=[...]).skb.mark_as_X(
    split_kwargs={"groups": data["customer_id"]},
)
```
Keys map to the cross-validator's split(X, y, **split_kwargs)
(e.g. groups). Ask the user when you can't tell from data
alone whether such structure exists — name suspect columns
(anything ending in _id, columns called subject / session /
region, any date / timestamp for temporal ordering) and
ask whether to wire them. Don't silently leave split_kwargs
empty when group structure is plausible — that produces optimistic
CV downstream. Choosing the splitter itself is
evaluate-ml-pipeline's job; this skill only wires the metadata.
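The column-name hints listed here can be turned into a quick screening pass before asking the user. A minimal sketch, assuming a hypothetical helper suspect_split_columns that matches on names only; it never inspects values, so it can only suggest candidates and may produce false positives.

```python
GROUP_NAMES = {"subject", "session", "region"}
TIME_HINTS = ("date", "timestamp")


def suspect_split_columns(columns):
    """Return candidate group and time columns, judged by name only."""
    groups, times = [], []
    for name in columns:
        lowered = name.lower()
        if lowered.endswith("_id") or lowered in GROUP_NAMES:
            groups.append(name)
        elif any(hint in lowered for hint in TIME_HINTS):
            times.append(name)
    return {"groups": groups, "time": times}


# Hypothetical column list for illustration.
columns = ["customer_id", "price", "order_date", "region", "qty"]
print(suspect_split_columns(columns))
# {'groups': ['customer_id', 'region'], 'time': ['order_date']}
```

A non-empty result is a prompt for a question to the user, not a decision to wire the column.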
When editing an existing pipeline that uses skrub.X /
skrub.y or binds materialized data: do not auto-rewrite.
Surface the source-bound alternative and ask whether to refactor.
Full catalogue: → references/source-binding.md.
→ next: Rule 3 (attach mechanism).
Attach via .skb. Two attach points:
- .skb.apply_func(fn): wraps a callable that transforms data.
- .skb.apply(estimator): wraps any sklearn-compatible estimator
(transformer in the middle, or the final predictor).
When to use skrub.deferred instead of apply_func: rarely,
only when the callable must combine multiple DataOps at once
(e.g. a custom join over two tables). Even then, check whether a
skrub joiner (Joiner / AggJoiner / MultiAggJoiner) covers it
first. Default: .skb.apply_func. Details:
→ references/source-binding.md.
→ next: Rule 4 (function vs estimator).
The only decision rule for picking apply_func vs apply:
```python
# Stateless: pure function + apply_func
import numpy as np
X = X.skb.apply_func(lambda df: df.assign(log_price=np.log1p(df["price"])))

# Stateful: estimator + apply
from sklearn.preprocessing import StandardScaler
X = X.skb.apply(StandardScaler())
```
If a step would silently learn from the test set when called as a plain function, it is stateful — promote it.
→ next: Rule 5 (leakage check).
Any computation using statistics learned from the data (means, medians, quantiles, vocabularies, target distribution) MUST be stateful. Calling such a computation as a plain function over the whole frame leaks test into training.
```python
# WRONG: pct rank fits on the full frame, leaks test into training
X = X.skb.apply_func(lambda df: df.assign(p=df["x"].rank(pct=True)))

# RIGHT: quantile transformer learns on training fold only
from sklearn.preprocessing import QuantileTransformer
X = X.skb.apply(QuantileTransformer(output_distribution="uniform"))
```
Classic traps by name:
- target-derived statistics (fit on training y only),
- KBinsDiscretizer(strategy="quantile"),
- OrdinalEncoder / LabelEncoder whose categories come from the full dataset rather than fit on training only.
Litmus test: would this output change if I called it on the
training subset alone vs the whole frame? If yes → stateful →
.skb.apply with an estimator, never .skb.apply_func.
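The litmus test can be run mechanically: apply the candidate function to the whole frame and to the training subset alone, then compare the training rows. A plain-Python sketch under assumptions: the data is a list of floats, the first n_train entries play the training subset, and changes_with_scope is a hypothetical helper.

```python
import math


def changes_with_scope(fn, values, n_train):
    """True if fn's output on the training rows depends on rows outside them."""
    on_whole = fn(values)[:n_train]
    on_train = fn(values[:n_train])
    return on_whole != on_train


def log1p_all(values):
    # Per-row math: each output depends only on its own input.
    return [math.log1p(v) for v in values]


def pct_rank(values):
    # Rank as a fraction of the frame: depends on every other row.
    ordered = sorted(values)
    return [(ordered.index(v) + 1) / len(values) for v in values]


values = [3.0, 1.0, 2.0, 10.0, 0.5]
print(changes_with_scope(log1p_all, values, n_train=3))  # False: apply_func is fine
print(changes_with_scope(pct_rank, values, n_train=3))   # True: needs an estimator
```

A True result means the step learns something from the rows it is given, so it belongs in an estimator that is fit on the training fold only.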
→ next: Decision flow.
- Stateless (pure function) → .skb.apply_func.
- Stateful (learns from the data) → .skb.apply.
→ next: Reproducibility (when touching shared modules).
iterate-ml-experiment enforces a hard rule: every done row in
JOURNAL.md History must stay runnable on main and produce the
same result. When touching a shared module under src/<pkg>/,
default behavior must preserve prior experiments' shape.
Three options, picked by judgment (full procedures + worked
examples: → references/reproducibility_mechanics.md):
iterate-ml-experiment § 3's smoke-test gate runs all of
tests/smoke/, not just the new one. A prior smoke test going
red after a change = default behavior not preserved. Fix before
declaring the new experiment ready.
→ next: Common patterns (for recurring shapes).
Short catalogue. Look up exact symbols in python-api. Full
catalogue with code: → references/common_patterns.md.
- cols= on .skb.apply (one apply per group), not ColumnTransformer.
- skrub.tabular_pipeline(...) or TableVectorizer + estimator first; specialize column-by-column only when default is insufficient.
- One skrub.var(...) per table; join with skrub Joiner / AggJoiner / MultiAggJoiner via .skb.apply(...).
- StackingClassifier, CalibratedClassifierCV, TransformedTargetRegressor. Wrap the predictor first, then attach via .skb.apply as the final step.
- skrub.choose_from / choose_int / choose_float / optional inside the declaration. Don't import GridSearchCV here; the tuning skill owns search.
- BaseEstimator + TransformerMixin. For a stateless op, write a function and use .skb.apply_func.
| Skill | Relationship |
|---|---|
| python-api | Authoritative lookup of sklearn / skrub / skore. Invoke whenever picking a symbol; cache hits first (Shape 0) |
| evaluate-ml-pipeline | Owns skore.evaluate, CV selection, metric defaults. Consumes the split_kwargs wired at the X marker |
| smoke-test-ml-pipeline | Executable proof of Rule 2's early-mark. Smoke failure → route back here; fix the topology, don't loosen the assertion |
| test-ml-pipeline | Router for tests/. Smoke test pairs 1:1 with the experiment script |
| python-env-manager | Detection + install commands. Invoke when import skrub raises |
| python-code-style | Must be invoked after writing or editing pipeline.py / features.py / data.py. Direct pixi run ruff check drops the NumPyDoc convention |
- references/source-binding.md: full catalogue of source-binding patterns (encouraged / discouraged / OK-but-offer-upgrade) + the apply_func vs deferred decision.
- references/layer_examples.md: worked code for the IID flat-table case, the loader-baked-shift counter-example, and the history-dependent three-layer pattern.
- references/reproducibility_mechanics.md: full Option 1 / 2 / 3 procedures with code, plus the tripwire criterion.
- references/common_patterns.md: full catalogue of recurring pipeline shapes with code snippets.
Companion skill (planned):
review-ml-pipeline: methodological review of an existing declaration (leakage audit, statelessness check, step ordering, scope creep). When it flags a problem, return here to fix.