Help us improve
Share bugs, ideas, or general feedback.
From godmode
Guides ML experiments: defines hypotheses, manages hyperparameters, validates datasets, detects bias, tracks training, evaluates models, compares results for PyTorch/TensorFlow/scikit-learn.
npx claudepluginhub arbazkhan971/godmodeHow this skill is triggered — by the user, by Claude, or both
Slash command
/godmode:mlThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
- `/godmode:ml`, "train a model", "compare experiments"
Audits ML pipeline reproducibility, experiment tracking hygiene, and model versioning. Advises on serving patterns and prompt evaluation across MLflow, W&B, SageMaker, Vertex AI.
Assesses ML pipeline stage and applies patterns for data pipelines, model training, serving, MLOps, evaluation, and debugging with validations like schema checks, drift detection, and skew guards.
Turns model work into production ML systems with data contracts, repeatable training, quality gates, deployable artifacts, and monitoring. Useful for ranking, search, recommendations, classifiers, forecasting, embeddings, LLMs, anomaly detection, and batch analytics.
Share bugs, ideas, or general feedback.
/godmode:ml, "train a model", "compare experiments"ID: EXP-<YYYY-MM-DD>-<NNN>
Hypothesis: <what you expect and why>
Objective: <metric to optimize>
Baseline: <current best or naive baseline>
Task: classification|regression|ranking|generation
Framework: PyTorch|TensorFlow|scikit-learn|JAX|XGBoost
# Check for ML frameworks
pip list 2>/dev/null | grep -iE "torch|tensorflow|sklearn"
cat requirements.txt 2>/dev/null | grep -iE "torch|tf"
search:
strategy: grid|random|bayesian|hyperband
space:
learning_rate: [1e-5, 1e-4, 1e-3, 1e-2]
batch_size: [16, 32, 64, 128]
dropout: uniform(0.1, 0.5)
hidden_size: [128, 256, 512, 1024]
trials: <total>
IF trials > 50: use Bayesian or Hyperband (not grid). IF search space > 4 dimensions: use random search minimum.
Total samples: <N>
Split: train=<N>(<pct>%) / val=<N>(<pct>%) / test=<N>
Quality checks:
Missing values: <count per feature>
Duplicates: <count exact duplicates>
Outliers: <count, method used>
Class balance: <ratio of majority/minority>
IF class imbalance > 10:1: use stratified sampling
Protected attributes: <gender, race, age, geography>
Per-attribute:
| Attribute | Group | Samples | Accuracy | FPR | FNR |
IF max_group_accuracy - min_group_accuracy > 5%:
FLAG bias. Investigate feature correlations.
IF FNR disparity > 10% across groups:
BLOCK deployment until mitigated.
Epoch: <current>/<total>
Training loss: <value> (trend: decreasing|plateau)
Validation loss: <value> (trend)
Primary metric: <value> (best: <val> at epoch <N>)
IF val_loss increases 3 consecutive epochs: early stop. IF train_loss << val_loss (gap > 2x): overfitting.
Test set: <N samples> (used ONCE for final eval)
Accuracy: <val> Precision: <val> Recall: <val>
F1: <val> AUC-ROC: <val> AUC-PR: <val>
Statistical significance vs baseline:
p=<val> (paired bootstrap, 10K iterations)
IF p > 0.05: improvement not significant, iterate. IF improvement < 1% absolute: likely noise.
| Experiment | F1 | AUC | Latency | Size | Params |
Winner selection: best accuracy/latency tradeoff.
Commit: "ml: EXP-<ID> — <metric>=<value> (<delta>)"
IF best found: -> /godmode:mlops to deploy.
IF bias detected: address before deployment.
Append .godmode/ml-results.tsv:
timestamp experiment_id model metric baseline result status
KEEP if: significant improvement AND bias passes
AND no data leakage.
DISCARD if: no significance OR bias violation
OR leakage found. Log both.
STOP when FIRST of:
- Best model beats baseline significantly
- Bias check passes all attributes
- 3 consecutive experiments show no improvement
On failure: git reset --hard HEAD~1. Never pause.
| Failure | Action |
|---|---|
| Worse than baseline | Check leakage, preprocessing, balance |
| Training diverges | Reduce LR 10x, check NaN, normalize |
| Fails in production | Compare data distributions, check drift |