Generates standardized model cards in HuggingFace and NVIDIA Model Card++ formats for ML models, covering details, intended uses, training data, metrics, limitations, and ethics. Use when preparing models for deployment or handoff.
Install with `npx claudepluginhub andikarachman/data-science-plugin --plugin ds`. This skill uses the workspace's default tool permissions.
Generate a standardized model card that documents a trained ML model's purpose, performance, limitations, and ethical considerations. Based on HuggingFace Model Card format and NVIDIA Model Card++ extensions.
| Field | Description |
|---|---|
| Name | Human-readable model name |
| Version | Model version (e.g., v1.0.0) |
| Type | Algorithm family (e.g., gradient boosting, neural network, linear regression) |
| Framework | Library used (scikit-learn, statsmodels, aeon, xgboost, etc.) |
| Task | What the model does (classification, regression, forecasting, anomaly detection, etc.) |
| Date trained | When the model was last trained |
| Author | Who developed the model |
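A minimal sketch of capturing these fields programmatically before rendering the card (the values below are placeholders, not a required schema):

```python
# Placeholder values; the keys mirror the fields in the table above.
model_details = {
    "name": "Churn Risk Classifier",
    "version": "v1.0.0",
    "type": "gradient boosting",
    "framework": "xgboost",
    "task": "binary classification",
    "date_trained": "2024-05-01",
    "author": "Data Science Team",
}
```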
Document the model's intended use case clearly:
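One common way to structure this, loosely following the HuggingFace convention of separating direct use from out-of-scope use (the wording below is placeholder text, not project guidance):

```python
# Placeholder text; replace with the model's actual intended and out-of-scope uses.
intended_use = {
    "primary_use": "Score active customers for churn risk on a weekly batch cadence.",
    "intended_users": "Retention analysts and CRM automation.",
    "out_of_scope": "Credit or lending decisions; real-time scoring of brand-new customers.",
}
```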
| Field | Description |
|---|---|
| Source | Where the training data comes from |
| Date range | Time period of training data |
| Size | Number of samples and features |
| Data hash | SHA-256 hash for version tracking |
| Preprocessing | Key transformations applied |
| Known biases | Any known biases in the training data |
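The data hash is easy to automate. A sketch using the standard library, with an assumed file path and placeholder field values:

```python
import hashlib
from pathlib import Path

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large training sets do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

training_data = {
    "source": "data/train.parquet",             # assumed path for illustration
    "date_range": "2022-01-01 to 2024-04-30",   # placeholder
    "size": "120,000 samples x 42 features",    # placeholder
    "data_hash": sha256_of_file("data/train.parquet"),
    "preprocessing": "median imputation, standard scaling",
    "known_biases": "under-represents customers acquired after 2023-10",
}
```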
| Field | Description |
|---|---|
| Source | Same or different from training? |
| Date range | Time period of evaluation data |
| Size | Number of samples |
| Split strategy | How train/eval was split |
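Recording the split strategy next to the numbers keeps the evaluation reproducible. A sketch assuming `X` and `y` are already loaded; for forecasting tasks, a chronological split is usually more appropriate than a random one:

```python
from sklearn.model_selection import train_test_split

# Sketch: X and y are assumed to be the loaded feature matrix and labels.
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

evaluation_data = {
    "source": "held-out split of the training source",
    "size": len(X_eval),
    "split_strategy": "stratified 80/20 random split, random_state=42",
}
```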
Report performance metrics with context:
| Metric | Value | Baseline | Improvement | Confidence Interval |
|---|---|---|---|---|
| [Primary] | | | | |
| [Secondary] | | | | |
Include: a baseline comparison, confidence intervals, and metrics broken out for meaningful subgroups.
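A sketch of how the Baseline and Confidence Interval columns might be produced, using a majority-class baseline and a bootstrap interval (assumes the model and the split from the previous steps):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Assumes model, X_train/y_train, X_eval/y_eval exist from earlier steps.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_eval, baseline.predict(X_eval))
model_acc = accuracy_score(y_eval, model.predict(X_eval))

# Bootstrap the evaluation set for a rough 95% confidence interval.
rng = np.random.default_rng(42)
y_true, y_pred = np.asarray(y_eval), model.predict(X_eval)
boot = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), len(y_true))
    boot.append(accuracy_score(y_true[idx], y_pred[idx]))
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
```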
Document known limitations honestly:
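To make a limitation specific, measure performance on the affected slice. A sketch for the missing-values case, assuming the evaluation arrays from earlier and a model that tolerates missing inputs (e.g., gradient boosting):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Quantify the limitation instead of stating it vaguely.
X_arr, y_arr = np.asarray(X_eval, dtype=float), np.asarray(y_eval)
missing_frac = np.isnan(X_arr).mean(axis=1)   # fraction of missing features per row
high_missing = missing_frac > 0.5

overall_acc = accuracy_score(y_arr, model.predict(X_eval))
if high_missing.any():
    slice_acc = accuracy_score(y_arr[high_missing], model.predict(X_arr[high_missing]))
    # Report both numbers, e.g. "accuracy drops from 0.91 overall to 0.76
    # on rows with more than 50% missing features."
```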
Provide concrete usage examples:
```python
# Example: Loading and using the model
import joblib

model = joblib.load("path/to/model.pkl")
# X_new must match the feature layout used during training.
predictions = model.predict(X_new)
```
Include:
Before shipping, verify:
| Mistake | Impact | Fix |
|---|---|---|
| Vague limitations ("may not work for all data") | Users can't assess risk | Be specific: "Accuracy drops 15% on samples with >50% missing values" |
| Missing subgroup metrics | Hides fairness issues | Report metrics for all meaningful slices |
| No baseline comparison | Can't assess model value | Always include baseline performance |
| Outdated training data dates | Users assume data is fresh | Include data recency and staleness risk |
| Missing dependency versions | Can't recreate environment | Pin exact versions in requirements |
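For the last row in particular, exact versions can be captured at card-generation time with the standard library (the package names below are examples):

```python
import platform
from importlib.metadata import version

# Example packages only; list whatever the model actually depends on.
packages = ["scikit-learn", "joblib", "numpy"]
pins = [f"{pkg}=={version(pkg)}" for pkg in packages]

environment = {
    "python": platform.python_version(),
    "requirements": pins,   # e.g. ["scikit-learn==1.4.2", "joblib==1.4.0", ...]
}
```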