npx claudepluginhub arbazkhan971/godmode
This skill uses the workspace's default tool permissions.
references/ml-evaluation.md
/godmode:ml, "train a model", "compare experiments"
ID: EXP-<YYYY-MM-DD>-<NNN>
Hypothesis: <what you expect and why>
Objective: <metric to optimize>
Baseline: <current best or naive baseline>
Task: classification|regression|ranking|generation
Framework: PyTorch|TensorFlow|scikit-learn|JAX|XGBoost
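As a small illustration of the ID format in the experiment template, the sketch below builds an `EXP-<YYYY-MM-DD>-<NNN>` string. The function name `experiment_id` is hypothetical, and reading `<NNN>` as a zero-padded three-digit daily sequence number is an assumption, not something the template states.

```python
from datetime import date


def experiment_id(day: date, sequence: int) -> str:
    """Format an experiment ID as EXP-<YYYY-MM-DD>-<NNN>.

    Assumption: <NNN> is a zero-padded, three-digit sequence number.
    """
    return f"EXP-{day.isoformat()}-{sequence:03d}"
```

For example, `experiment_id(date(2024, 1, 5), 7)` gives `EXP-2024-01-05-007`.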
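# Check for ML frameworks (pip lists scikit-learn under that name, not "sklearn")
pip list 2>/dev/null | grep -iE "torch|tensorflow|scikit-learn|jax|xgboost"
grep -iE "torch|tensorflow|scikit-learn|jax|xgboost" requirements.txt 2>/dev/null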
search:
  strategy: grid|random|bayesian|hyperband
  space:
    learning_rate: [1e-5, 1e-4, 1e-3, 1e-2]
    batch_size: [16, 32, 64, 128]
    dropout: uniform(0.1, 0.5)
    hidden_size: [128, 256, 512, 1024]
  trials: <total>
IF trials > 50: use Bayesian or Hyperband (not grid).
IF search space > 4 dimensions: use random search at minimum (not grid).
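The two strategy rules above can be sketched as a small selector. This is one possible reading: the function name `choose_search_strategy` is hypothetical, the trial-budget rule is checked first, and `"bayesian"` is returned where the rule allows either Bayesian or Hyperband.

```python
def choose_search_strategy(trials: int, dimensions: int) -> str:
    """Pick a search strategy from the trial budget and search-space size."""
    if trials > 50:
        # Large budgets rule out grid search. The rule allows Bayesian or
        # Hyperband; returning "bayesian" here is an arbitrary choice.
        return "bayesian"
    if dimensions > 4:
        # Grid size grows exponentially with the number of dimensions,
        # so random search is the minimum acceptable strategy.
        return "random"
    # Small budget and small space: an exhaustive grid is still feasible.
    return "grid"
```

With the four-parameter space shown in the config (learning rate, batch size, dropout, hidden size), the dimension rule does not fire, so the trial budget alone decides.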
Total samples: <N>
Split: train=<N>(<pct>%) / val=<N>(<pct>%) / test=<N>(<pct>%)
Quality checks:
Missing values: <count per feature>
Duplicates: <count exact duplicates>
Outliers: <count, method used>
Class balance: <ratio of majority/minority>
IF class imbalance > 10:1: use stratified sampling
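A minimal sketch of the quality checks above, using only the standard library. It assumes rows are tuples of feature values with `None` marking a missing value, and that `labels` is non-empty. The name `quality_report` and the dictionary keys are hypothetical. Outlier detection is left out because the template does not fix a method.

```python
from collections import Counter


def quality_report(rows, labels):
    """Summarize missing values, exact duplicates, and class balance.

    rows:   list of tuples of feature values (None = missing)  [assumption]
    labels: non-empty list of class labels
    """
    n_features = len(rows[0]) if rows else 0
    # Missing values: count per feature (column).
    missing = [sum(1 for r in rows if r[i] is None) for i in range(n_features)]
    # Duplicates: every repeat of a row beyond its first occurrence.
    duplicates = sum(c - 1 for c in Counter(rows).values() if c > 1)
    # Class balance: majority count divided by minority count.
    class_counts = Counter(labels)
    ratio = max(class_counts.values()) / min(class_counts.values())
    return {
        "missing_per_feature": missing,
        "exact_duplicates": duplicates,
        "imbalance_ratio": ratio,
        "use_stratified_sampling": ratio > 10,  # the 10:1 rule
    }
```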
Protected attributes: <gender, race, age, geography>
Per-attribute:
| Attribute | Group | Samples | Accuracy | FPR | FNR |
|---|---|---|---|---|---|
IF max_group_accuracy - min_group_accuracy > 5%:
FLAG bias. Investigate feature correlations.
IF FNR disparity > 10% across groups:
BLOCK deployment until mitigated.
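The per-group table and the two thresholds above might be computed as in the sketch below. It assumes binary labels encoded as 0/1 and reads both thresholds as absolute differences (percentage points). The names `group_metrics` and `bias_verdict` are hypothetical.

```python
from collections import defaultdict


def group_metrics(y_true, y_pred, groups):
    """Return {group: {samples, accuracy, fpr, fnr}} for binary 0/1 labels."""
    counts = defaultdict(lambda: {"tp": 0, "tn": 0, "fp": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # "t"/"f" = prediction correct or not, "p"/"n" = predicted class.
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1
    out = {}
    for g, c in counts.items():
        n = sum(c.values())
        negatives = c["fp"] + c["tn"]
        positives = c["fn"] + c["tp"]
        out[g] = {
            "samples": n,
            "accuracy": (c["tp"] + c["tn"]) / n,
            "fpr": c["fp"] / negatives if negatives else 0.0,
            "fnr": c["fn"] / positives if positives else 0.0,
        }
    return out


def bias_verdict(metrics):
    """Apply the thresholds: FNR gap > 10 points blocks, accuracy gap > 5 flags."""
    accs = [m["accuracy"] for m in metrics.values()]
    fnrs = [m["fnr"] for m in metrics.values()]
    if max(fnrs) - min(fnrs) > 0.10:
        return "BLOCK"
    if max(accs) - min(accs) > 0.05:
        return "FLAG"
    return "PASS"
```

Run `group_metrics` once per protected attribute, passing that attribute's group label for each sample as `groups`.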
Epoch: <current>/<total>
Training loss: <value> (trend: decreasing|plateau)
Validation loss: <value> (trend)
Primary metric: <value> (best: <val> at epoch <N>)
IF val_loss increases for 3 consecutive epochs: early stop.
IF train_loss << val_loss (val_loss > 2x train_loss): overfitting.
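The two monitoring rules above can be expressed as plain checks over the loss history. This is a framework-agnostic sketch. The function names and the `patience` and `max_ratio` parameters are hypothetical, with defaults taken from the rules (3 epochs, 2x gap).

```python
def should_early_stop(val_losses, patience=3):
    """True if validation loss rose in each of the last `patience` epochs."""
    if len(val_losses) < patience + 1:
        return False  # not enough history to see `patience` increases
    recent = val_losses[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def is_overfitting(train_loss, val_loss, max_ratio=2.0):
    """True if validation loss exceeds `max_ratio` times the training loss."""
    return train_loss > 0 and val_loss / train_loss > max_ratio
```

Call both at the end of each epoch with the running list of validation losses and the latest train/validation pair.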
Test set: <N samples> (used ONCE for final eval)
Accuracy: <val> Precision: <val> Recall: <val>
F1: <val> AUC-ROC: <val> AUC-PR: <val>
Statistical significance vs baseline:
p=<val> (paired bootstrap, 10K iterations)
IF p > 0.05: improvement not significant, iterate.
IF improvement < 1% absolute: likely noise.
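One common form of the paired bootstrap named above is sketched here. It assumes per-example scores (for instance 1 for a correct prediction, 0 otherwise) for both the baseline and the new model on the same test set. It reports a one-sided p-value as the fraction of resamples in which the new model fails to beat the baseline. The function name and the fixed seed are illustrative choices. Corpus-level metrics such as F1 would need the metric recomputed on each resample instead of summing differences.

```python
import random


def paired_bootstrap_p(baseline_scores, model_scores, iterations=10_000, seed=0):
    """One-sided paired bootstrap p-value for 'model beats baseline'.

    Both inputs are per-example scores on the SAME test examples, in the
    same order. Assumption: the metric is a mean of per-example scores.
    """
    assert len(baseline_scores) == len(model_scores)
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(model_scores)
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    not_better = 0
    for _ in range(iterations):
        # Resample example indices with replacement, keeping pairs together.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / iterations
```

A returned value above 0.05 corresponds to the "not significant, iterate" branch of the rule.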
| Experiment | F1 | AUC | Latency | Size | Params |
|---|---|---|---|---|---|
Winner selection: best accuracy/latency tradeoff.
Commit: "ml: EXP-<ID> — <metric>=<value> (<delta>)"
IF best found: -> /godmode:mlops to deploy.
IF bias detected: address before deployment.
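The winner-selection line above does not define "best accuracy/latency tradeoff". One possible reading, offered only as an assumption, is to shortlist the Pareto-optimal experiments (none is both more accurate and faster) and choose among them by hand. The function name and the dictionary keys `name`, `f1`, and `latency_ms` are hypothetical.

```python
def pareto_front(experiments):
    """Return names of experiments not dominated on (F1 up, latency down).

    experiments: list of dicts with keys 'name', 'f1', 'latency_ms'.
    """
    front = []
    for e in experiments:
        dominated = any(
            o["f1"] >= e["f1"]
            and o["latency_ms"] <= e["latency_ms"]
            # Strictly better on at least one axis, so ties do not dominate.
            and (o["f1"] > e["f1"] or o["latency_ms"] < e["latency_ms"])
            for o in experiments
        )
        if not dominated:
            front.append(e["name"])
    return front
```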
Append .godmode/ml-results.tsv:
timestamp experiment_id model metric baseline result status
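Appending a row to the results log could look like the sketch below. The column order follows the line above. Writing a header row when the file is new, using UTC ISO-8601 timestamps, and the function name `append_result` are all assumptions.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

FIELDS = ["timestamp", "experiment_id", "model", "metric",
          "baseline", "result", "status"]


def append_result(path, experiment_id, model, metric, baseline, result, status):
    """Append one tab-separated row, creating the file and header if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: a header row is wanted the first time the file is written.
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        if write_header:
            writer.writerow(FIELDS)
        timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        writer.writerow([timestamp, experiment_id, model, metric,
                         baseline, result, status])
```

A call such as `append_result(".godmode/ml-results.tsv", "EXP-2024-01-05-001", "xgboost", "f1", 0.80, 0.83, "keep")` adds one line to the log.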
KEEP if: significant improvement AND bias passes AND no data leakage.
DISCARD if: no significance OR bias violation OR leakage found. Log both.
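The KEEP/DISCARD rule above reduces to a single boolean check. In this sketch, "significant improvement" is read as a positive delta with p at or below 0.05, matching the significance threshold used in evaluation. The function name `decide` and its parameters are hypothetical.

```python
def decide(p_value, improvement, bias_passes, leakage_found, alpha=0.05):
    """Return "KEEP" or "DISCARD" for one experiment.

    Assumption: 'significant' means improvement > 0 and p_value <= alpha.
    """
    significant = improvement > 0 and p_value <= alpha
    if significant and bias_passes and not leakage_found:
        return "KEEP"
    # Any failed condition discards the run; the caller still logs it.
    return "DISCARD"
```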
STOP when FIRST of:
- Best model beats baseline significantly
- Bias check passes all attributes
- 3 consecutive experiments show no improvement
On failure: git reset --hard HEAD~1. Never pause.
| Failure | Action |
|---|---|
| Worse than baseline | Check leakage, preprocessing, balance |
| Training diverges | Reduce LR 10x, check NaN, normalize |
| Fails in production | Compare data distributions, check drift |