Use after every experiment run. This is the core data science skill — reading results, understanding errors, and forming the next hypothesis. If you skip this, you're not doing science, you're doing random search.
A metric is a symptom. Diagnosis finds the cause. You don't treat symptoms — you understand the disease.
When you see "Brier improved by 0.004," that is not a conclusion. It's the beginning of a question: where did the improvement come from? Which predictions changed? Did calibration improve, or did discrimination improve? Is the gain robust across folds, or driven by one?
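For instance, a Brier delta can be pulled apart in a few lines. The sketch below is illustrative, not part of the harness: it assumes `p_old`, `p_new`, and `y` are NumPy arrays holding the previous run's probabilities, the current run's probabilities, and the binary targets, and it uses the Murphy decomposition to separate calibration (reliability) from discrimination (resolution).

```python
import numpy as np

def brier(p, y):
    """Mean squared error of probabilistic predictions."""
    return np.mean((p - y) ** 2)

def murphy_decomposition(p, y, n_bins=10):
    """Brier = reliability - resolution + uncertainty.
    Reliability tracks calibration (lower is better);
    resolution tracks discrimination (higher is better)."""
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    base = y.mean()
    rel = res = 0.0
    for b in range(n_bins):
        m = bins == b
        if m.any():
            rel += m.mean() * (p[m].mean() - y[m].mean()) ** 2
            res += m.mean() * (y[m].mean() - base) ** 2
    return rel, res, base * (1 - base)

# Which predictions actually changed, and did calibration or discrimination move?
moved = np.argsort(-np.abs(p_new - p_old))[:20]   # the 20 most-shifted examples
print("largest prediction shifts on rows:", moved)
print("Brier old/new:", brier(p_old, y), brier(p_new, y))
print("old (reliability, resolution, uncertainty):", murphy_decomposition(p_old, y))
print("new (reliability, resolution, uncertainty):", murphy_decomposition(p_new, y))
```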
After every run:
```
pipeline(action="diagnostics")
pipeline(action="compare_latest")
```
Start here, but don't stop here.
This is where the real signal is. Fold patterns tell you whether a change is structural (every fold moves together) or driven by a single slice of the data (one fold dominates the delta).

```
pipeline(action="diagnostics")   # includes calibration curves
```
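When the harness exposes raw out-of-fold predictions, you can go one level deeper than the built-in report. A minimal sketch, assuming a hypothetical `fold_preds` dict of `(y_true, p_baseline, p_candidate)` arrays per fold; it uses scikit-learn's `brier_score_loss` and `calibration_curve`:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Per-fold deltas: a gain spread across folds reads very differently
# from a gain concentrated in one fold.
deltas = {
    fold: brier_score_loss(y, p_new) - brier_score_loss(y, p_old)
    for fold, (y, p_old, p_new) in fold_preds.items()
}
print("per-fold dBrier:", {k: round(v, 4) for k, v in deltas.items()})
print("spread across folds:", round(max(deltas.values()) - min(deltas.values()), 4))

# Pooled calibration curve: points hugging the diagonal mean the predicted
# probabilities are trustworthy; systematic bowing suggests discrimination
# improved while calibration did not.
y_all = np.concatenate([y for y, _, _ in fold_preds.values()])
p_all = np.concatenate([p for _, _, p in fold_preds.values()])
frac_pos, mean_pred = calibration_curve(y_all, p_all, n_bins=10)
for mp, fp in zip(mean_pred, frac_pos):
    print(f"predicted {mp:.2f} -> observed {fp:.2f}")
```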
This is the purpose of diagnosis. Every diagnostic finding should generate a question:
| Finding | Question it generates |
|---|---|
| Fold 3 always worst | What's different about fold 3's data? Is it a time period, a subgroup, a distribution shift? |
| Train-test gap on tree models only | Are trees overfitting to interactions that don't generalize? Would regularization or max_depth reduction help? |
| Feature X has high importance but low univariate correlation | The model found a non-linear or conditional relationship. What interaction or transformation makes this explicit? |
| Two models have 0.98 correlation | They're learning the same thing. Can I differentiate them with different feature sets? Or should I drop one and add a different model family? |
| ECE degraded while Brier improved | Discrimination improved but calibration suffered. Is the calibration method appropriate? Would a different calibrator help? |
| New feature has zero importance | Is it redundant with an existing feature? Or does the model need a different functional form (e.g., binned instead of continuous)? |
| All folds improved slightly | Genuine structural improvement, but small. Is there more headroom in this direction, or is this the ceiling? |
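Several of these questions can be answered in a couple of lines before committing to a full experiment. A sketch for two of the checks above, with illustrative names only (`oof_a` and `oof_b` for two models' out-of-fold predictions, `X` and `y` for the feature matrix and target, `feature_x` and `feature_z` as stand-ins for real columns):

```python
import numpy as np
import pandas as pd

# "Two models have 0.98 correlation": check whether they learn the same thing.
corr = np.corrcoef(oof_a, oof_b)[0, 1]
print(f"prediction correlation: {corr:.3f}")  # near 1.0 -> the second model adds little

# "High importance, low univariate correlation": look for a conditional effect by
# checking whether feature_x's relationship with the target shifts across feature_z.
df = X.assign(target=y)
terciles = pd.qcut(df["feature_z"], 3, labels=["low", "mid", "high"])
for level, grp in df.groupby(terciles, observed=True):
    print(level, round(grp["feature_x"].corr(grp["target"]), 3))
```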
Write down the next hypothesis before closing the diagnosis. If you finish diagnosing and don't know what to try next, you didn't dig deep enough.
When available, look at where the model's errors concentrate:
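A minimal sketch, assuming out-of-fold probabilities `p_oof`, targets `y`, and a couple of candidate slicing columns already aligned row-for-row (the column names are illustrative):

```python
import pandas as pd

errors = pd.DataFrame({
    "sq_error": (p_oof - y) ** 2,   # per-row squared error
    "month": months,                # illustrative slicing columns
    "segment": segments,
})

# Slices with much higher mean error than the rest are leads for the next hypothesis.
for col in ["month", "segment"]:
    by_slice = (errors.groupby(col)["sq_error"]
                      .agg(["mean", "count"])
                      .sort_values("mean", ascending=False))
    print(f"\nerror concentration by {col}:\n{by_slice.head()}")
```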
After every experiment, write:
```markdown
## Diagnosis: [Experiment ID]

### What happened
[Metric changes — aggregate and per-fold]

### Why it happened
[Mechanism — not just "the metric went down" but WHY]

### What was confirmed
[Parts of the hypothesis that were supported]

### What was surprising
[Things you didn't predict — these are the most valuable]

### Next hypothesis
[What to investigate based on these findings]
```
Write key diagnostic findings to the notebook so they persist across sessions:
```
notebook(action="write", type="finding", content="[diagnosis insight]", experiment_id="...")
```
Not every diagnostic detail — just the insights that should inform future work.