From pymc-modeling
Load when the user is comparing Bayesian models, computing LOO-CV / ELPD, calling az.loo or az.compare, doing model stacking/averaging, or computing Bayes factors. Covers the ArviZ 1.0 LOO/ELPD/stacking APIs exclusively (no waic). Triggers include: model comparison, LOO, ELPD, az.compare, az.loo, loo_expectations, loo_metrics, loo_r2, Pareto k, stacking, Bayes factor, cross-validation, predictive accuracy, information criterion.
How this skill is triggered — by the user, by Claude, or both
Slash command
/pymc-modeling:model-evaluationThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
CRITICAL: ArviZ 1.0 replaces InferenceData with xarray.DataTree. `az.waic` is removed entirely — use PSIS-LOO-CV exclusively. Default credible interval is 0.89 ETI (not 0.94 HDI), controlled via `ci_prob=` and `ci_kind=` (replaces the old `hdi_prob=`).
CRITICAL: ArviZ 1.0 replaces InferenceData with xarray.DataTree. az.waic is removed entirely — use PSIS-LOO-CV exclusively. Default credible interval is 0.89 ETI (not 0.94 HDI), controlled via ci_prob= and ci_kind= (replaces the old hdi_prob=).
For model building context, prior selection, and convergence diagnostics, see the pymc-modeling skill.
Leave-one-out cross-validation via Pareto-smoothed importance sampling (PSIS).
import arviz as az
import arviz_stats # registers the .azstats accessor
# dt is a DataTree from pm.sample()
loo_result = az.loo(dt)
print(loo_result)
# Returns: ELPDData with elpd_loo, se, p_loo, n_data_points, pareto_k
# Equivalent via the xarray accessor (DataArray / Dataset / DataTree all supported):
loo_result = dt.azstats.loo()
Pareto k values indicate reliability of PSIS approximation for each observation:
| k value | Interpretation | Action |
|---|---|---|
| k < 0.5 | Good | LOO estimate reliable |
| 0.5 < k < 0.7 | Marginal | Results usable but less accurate |
| 0.7 < k < 1.0 | Bad | Estimate unreliable — use moment matching or k-fold |
| k > 1.0 | Very bad | PSIS fails entirely — must use k-fold CV |
# Check Pareto k values
print(loo_result.pareto_k)
# Plot Pareto k diagnostics
az.plot_khat(loo_result)
# Count problematic observations
import numpy as np
k_values = loo_result.pareto_k.values
print(f"k > 0.7: {np.sum(k_values > 0.7)} observations")
Automatically refit problematic observations using moment matching:
# Requires log_likelihood in the DataTree
loo_mm = az.loo_moment_match(dt)
This importance-weights the posterior for each problematic observation, improving the PSIS approximation without refitting the model. Much faster than k-fold.
When LOO is unreliable for many observations, use exact k-fold CV:
# Perform 10-fold cross-validation
kfold_result = az.loo_kfold(dt, K=10)
print(kfold_result)
This refits the model K times, so it is K times slower than LOO. Use only when LOO diagnostics indicate problems.
Compare multiple models on predictive accuracy:
# dt1, dt2, dt3 are DataTree objects from pm.sample()
comparison = az.compare(
{"linear": dt1, "quadratic": dt2, "spline": dt3},
scale="log", # log scale (default) or deviance
)
print(comparison)
Note: az.compare in ArviZ 1.0 only supports LOO, so the ic= argument has been dropped.
| Column | Meaning |
|---|---|
rank | Model rank (0 = best) |
elpd_loo | Expected log pointwise predictive density |
p_loo | Effective number of parameters |
d_loo | Difference in ELPD from best model |
weight | Stacking weight (sums to 1) |
se | Standard error of ELPD |
dse | Standard error of the ELPD difference |
warning | True if any Pareto k > 0.7 |
d_loo = 0: best model|d_loo| < 4: models are practically indistinguishable — prefer simpler one|d_loo| > 4 and |d_loo/dse| > 2: meaningful difference in predictive accuracywarning = True: LOO unreliable for this model — investigate Pareto k values# Visualize comparison
az.plot_compare(comparison)
# Detailed forest plot of ELPD differences
az.plot_elpd({"linear": dt1, "quadratic": dt2, "spline": dt3})
See references/model_comparison.md for detailed usage.
Stacking minimizes KL divergence from the true predictive distribution to the weighted mixture. This is the recommended default.
comparison = az.compare({"m1": dt1, "m2": dt2, "m3": dt3})
# Stacking weights are in the "weight" column by default
print(comparison["weight"])
Alternative weighting based on Bayesian bootstrap of ELPD:
comparison = az.compare(
{"m1": dt1, "m2": dt2, "m3": dt3},
method="BB-pseudo-BMA",
)
| Method | Use When |
|---|---|
| Stacking | Default. Best for prediction when true model is not in the set |
| Pseudo-BMA+ | Want Bayesian uncertainty over weights |
| Equal weights | Models represent different scientific hypotheses to average over |
weights = comparison["weight"].values
# Manually mix posterior predictive samples
# weighted by stacking weights
See references/stacking.md for detailed averaging workflows.
Bayes factors compare marginal likelihoods. Conceptually different from LOO (predictive accuracy vs. evidence).
# Bayes factors are difficult to compute reliably
# Bridge sampling is the most reliable method but requires specialized setup
# For most applied work, LOO-CV is preferred
# Approximate Bayes factor from LOO (rough):
# BF ~ exp(elpd_loo_m1 - elpd_loo_m2)
# This is a very rough approximation — use with caution
LOO probability integral transform checks if the model is calibrated:
az.plot_loo_pit(dt, var_names=["observed_data_name"])
This is a powerful diagnostic that LOO uniquely provides — it checks calibration without held-out data.
Compute LOO-weighted posterior expectations (mean, variance, quantile) for each observation. Requires both posterior_predictive and log_likelihood groups on the DataTree:
# LOO-weighted posterior predictive mean for each observation
loo_mean = az.loo_expectations(dt, kind="mean")
loo_var = az.loo_expectations(dt, kind="var")
loo_q = az.loo_expectations(dt, kind="quantile", probs=[0.055, 0.945])
Compute common LOO-based predictive metrics (RMSE, MAE, etc.) from posterior_predictive and log_likelihood:
metrics = az.loo_metrics(dt, kind="rmse")
import arviz_stats registers an .azstats accessor on DataArray, Dataset, and DataTree. This gives a fluent xarray-native interface alongside the top-level az.* functions (which remain available after import arviz as az):
import arviz_stats # registers accessor; required even if you already imported arviz
dt.azstats.loo() # same as az.loo(dt)
dt["posterior"].azstats.rhat() # on a Dataset
dt["posterior"].azstats.ess()
dt["posterior"].azstats.summary()
dt["posterior"].azstats.hdi()
dt["posterior"].azstats.eti()
Bayesian R-squared via LOO:
r2 = az.loo_r2(dt)
print(f"LOO-R2: {r2.mean():.3f} [{r2.quantile(0.055):.3f}, {r2.quantile(0.945):.3f}]")
Compute LOO-based scoring rules (CRPS, log score):
score = az.loo_score(dt, score_func="crps")
LOO with subsampling for large datasets:
# When n > 10000, subsample for speed
loo_sub = az.loo_subsample(dt, observations=1000)
Exact refit LOO for observations with high Pareto k:
# Refits the model for problematic observations
loo_exact = az.reloo(dt, loo_result, model=model)
import arviz as az
# 1. Compute LOO
loo = az.loo(dt)
print(loo)
# 2. Check Pareto k
az.plot_khat(loo)
# 3. If k > 0.7, try moment matching
if (loo.pareto_k > 0.7).any():
loo = az.loo_moment_match(dt)
# 4. LOO-PIT calibration
az.plot_loo_pit(dt, var_names=["y"])
# 5. Compare models
comparison = az.compare({"model_a": dt_a, "model_b": dt_b})
az.plot_compare(comparison)
print(comparison)
# 6. Predictive R2
r2 = az.loo_r2(dt)
npx claudepluginhub pymc-labs/pymc-modelingCreates bite-sized, testable implementation plans from specs or requirements, with file structure and task decomposition. Activates before coding multi-step tasks.