From tao-skill-bank
Integrates a HuggingFace Computer Vision model into the NVIDIA TAO Toolkit ecosystem (tao-core, tao-pytorch, tao-deploy/TensorRT). Handles the full 7-phase pipeline from HF model inspection through ONNX export, packaging, and containerized validation.
Slash command: /tao-skill-bank:tao-port-huggingface-model
Integrate a HuggingFace (HF) Computer Vision model into the NVIDIA TAO Toolkit ecosystem. Work the phases iteratively — not purely linearly — via a build → test → debug → fix → retest loop at every step: when something fails, diagnose and fix before moving on; when it passes, move to the next step.
This SKILL.md is the workflow coordinator. Each phase has a dedicated references/phase-N-*.md with the full step-by-step content, code, docker invocations, and gates. Read the matching reference at the start of each phase — the summaries below are not sufficient.
All work is strictly local. Do NOT push/commit/branch on any remote (GitLab, GitHub, HuggingFace), create merge/pull requests or issues, or upload/publish Docker images to any registry or artifact store. You may only read/clone from remotes — all edits, Docker builds, and test runs stay on the local machine.
The user clones the four TAO repos (tao-core, tao-pytorch, tao-deploy, tao-dataservices) independently into one working directory. The tao-core/ submodule nested inside each repo points to the original unmodified commit; modifications exist only in the top-level tao-core/. Always install from the top-level tao-core/, never from <repo>/tao-core/: the nested submodule does not contain the modifications, so installing it silently drops them. Override rules: (1) mount the working directory with -v $(pwd):/workspace; (2) run pip install /workspace/tao-core FIRST, before installing tao-pytorch or tao-deploy; (3) put the top-level tao-core first on PYTHONPATH, e.g. -e PYTHONPATH=/workspace/tao-core:/workspace/tao-pytorch. See references/cross-cutting.md for the directory tree.
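The three override rules can be sketched as a small helper that assembles a docker run command line. This is a minimal illustration, not the canonical flag set (that lives in references/cross-cutting.md); the function names are hypothetical, and GPU or shared-memory flags are deliberately left out.

```python
from pathlib import PurePosixPath

WORKSPACE = "/workspace"  # container-side mount point named in the text


def tao_core_install_path(candidate: str) -> str:
    """Accept only the top-level tao-core; reject nested submodule copies."""
    expected = PurePosixPath(WORKSPACE) / "tao-core"
    if PurePosixPath(candidate) != expected:
        raise ValueError(f"install from {expected}, not {candidate}")
    return str(expected)


def docker_run_argv(image: str, host_dir: str, inner_cmd: str) -> list:
    """Build a docker-run argv that applies the three override rules."""
    pythonpath = f"{WORKSPACE}/tao-core:{WORKSPACE}/tao-pytorch"
    install = f"pip install {tao_core_install_path(WORKSPACE + '/tao-core')}"
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{WORKSPACE}",      # rule 1: mount the working directory
        "-e", f"PYTHONPATH={pythonpath}",      # rule 3: top-level tao-core first
        image,
        "bash", "-c", f"{install} && {inner_cmd}",  # rule 2: tao-core installed first
    ]
```

Passing the result to subprocess.run keeps the install and the actual command in one container session, which matters because --rm discards anything installed once the container exits.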
Every test, smoke run, and end-to-end validation executes inside a locally prepared TAO Toolkit container (tao-pytorch-base:latest, tao-deploy-base:latest, and optionally tao-dataservices-base:latest, all prepared in Phase 0). The platform skills own how to run them; this skill specifies what to run. Default platform: local-docker. Phase 0 delegates the driver / CUDA / NVIDIA Container Toolkit (NCT) preflight to tao-setup-nvidia-gpu-host. See references/cross-cutting.md for the authoritative-skill table, bind-mount rationale, and canonical docker-run flag set.
| Phase | Goal | Reference |
|---|---|---|
| 0 | Prerequisites + TAO Toolkit images + local image tags | phase-0-prereqs.md |
| 1 | Inputs, HF-inspection container, validate model + dataset | phase-1-inspection.md, hf-inspection.md |
| 2 | Closest existing TAO reference model | phase-2-codebase.md, task-type-guide.md |
| 3 | tao-core config + tao-pytorch trainer / eval / inference | phase-3-implementation.md, tao-patterns.md, repo-structure.md |
| 4 | ONNX export + tao-deploy TRT engine / inference / eval | phase-4-deploy.md |
| 5 | Packaging (console_scripts) + L0 tests | phase-5-packaging.md |
| 6 | Container testing + end-to-end validation | phase-6-container-tests.md, docker-patterns.md |
| 7 | (conditional) Accuracy / latency / size tuning | phase-7-optimization.md |
Cross-cutting refs: workflow-consistency.md (CLI flow, config field paths, cross-phase dependencies); cross-cutting.md (platform, isolation, module pitfalls, debugging).
IMPORTANT — Continuous Execution Through Phase 6: do NOT stop after Phases 3–5 to wait for the user to run tests. Phase 6 is mandatory — not complete until tests pass inside the containers and the end-to-end pipeline is validated.
At every step: write code → test immediately (import check, unit test, or dry-run) → if it fails, read traceback → diagnose → fix → retest; if it passes, move on. Do NOT accumulate untested code — testing only at the end compounds bugs.
When something fails, consult the symptom → likely-cause → fix table in references/cross-cutting.md before trying random fixes — it covers ModuleNotFoundError, BACKBONE_REGISTRY KeyError, shape mismatch, NaN loss, ONNX/TRT build failures, TRT-vs-PyTorch accuracy gaps, OOM, DDP hangs, checkpoint load failures, and stale-submodule config issues.
All Python work runs inside Docker containers — no host venvs, no pip installs into host Python (the host needs only Docker, from tao-setup-nvidia-gpu-host). Three contexts: A (Phase 1 HF inspection in tao-hf-inspect, python:3.12-slim fallback), B (Phase 3/4/6 smoke/L0/e2e in the prepared container, source via pip install /workspace/tao-core && python setup.py develop), C (host-bind-mount scratch). See references/cross-cutting.md for the contexts in full and the four numbered rules verbatim (--check-only host packages; Phase 1 --user $(id -u):$(id -g) vs. root; HOME=/workspace/PIP_USER=1 fallback; distro package-manager list; root:root trade-off; tao-hf-inspect cleanup).
Goal: verify Python 3.10+ and git; delegate the driver / CUDA / Docker / NVIDIA Container Toolkit host check to tao-setup-nvidia-gpu-host; verify NGC docker login for nvcr.io. Then ask the user for the TAO Toolkit image references (tao-pytorch, tao-deploy, optionally tao-dataservices), pull, and prepare local tags tao-pytorch-base:latest, tao-deploy-base:latest, tao-dataservices-base:latest for later phases — preparation removes the pre-installed released TAO packages so the user's /workspace/... clones install/load via pip install /workspace/tao-core && python setup.py develop. Hard stop on any failed check. Required user inputs: the image references + credentials (NGC login, HF_TOKEN). Full commands, prompt wording, and per-image Dockerfile snippets: the Phase 0 reference.
Gate: all prerequisite checks pass; the user supplied the required image references; tao-pytorch-base:latest and tao-deploy-base:latest exist locally; tao-dataservices-base:latest exists if dataservices work is anticipated.
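The Phase 0 gate can be expressed as a small check, sketched below under stated assumptions. Only the Python and git checks are done here because the text delegates driver, CUDA, Docker, and Container Toolkit checks to tao-setup-nvidia-gpu-host. The function names are hypothetical; docker image inspect is the standard Docker CLI command that exits non-zero when a tag is absent locally.

```python
import shutil
import sys

REQUIRED_TAGS = ("tao-pytorch-base:latest", "tao-deploy-base:latest")
OPTIONAL_TAGS = ("tao-dataservices-base:latest",)


def host_prereqs() -> dict:
    """Checks this skill performs itself (the GPU host check is delegated)."""
    return {
        "python>=3.10": sys.version_info >= (3, 10),
        "git": shutil.which("git") is not None,
    }


def image_inspect_argv(tag: str) -> list:
    """Command whose exit status tells whether a local image tag exists."""
    return ["docker", "image", "inspect", tag]


def gate_passes(prereqs: dict, present_tags, need_dataservices: bool = False) -> bool:
    """Hard stop unless every check passed and every required tag exists."""
    required = set(REQUIRED_TAGS)
    if need_dataservices:
        required |= set(OPTIONAL_TAGS)
    return all(prereqs.values()) and required <= set(present_tags)
```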
Goal: decide whether to proceed at all. Gather credentials, locate/clone the four TAO repos, create a consistent working branch, launch the tao-hf-inspect container (Context A), validate the HF model is CV with a supported pipeline_tag, extract config + state-dict schema, sanity-check ONNX export, clean up. Full steps: Phase 1 references.
Reject if: pipeline_tag is NLP / audio / LLM (non-CV); AutoConfig raises; or ONNX export fundamentally cannot work (no rewrite path).
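The reject rules above amount to a three-way decision, sketched here as a pure function. The allowlist of pipeline tags is an assumption made for illustration (the authoritative list is in the Phase 1 references), and InspectionResult is a hypothetical container for facts gathered inside the inspection container, for example from huggingface_hub's model_info(...).pipeline_tag and from whether AutoConfig.from_pretrained raised.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed allowlist of HF pipeline tags treated as Computer Vision.
CV_PIPELINE_TAGS = frozenset({
    "image-classification",
    "object-detection",
    "image-segmentation",
    "depth-estimation",
    "zero-shot-object-detection",
})


@dataclass(frozen=True)
class InspectionResult:
    pipeline_tag: Optional[str]
    config_loads: bool            # False if AutoConfig raised
    onnx_exportable: bool         # result of the export sanity check
    onnx_rewrite_possible: bool = False


def rejection_reason(result: InspectionResult) -> Optional[str]:
    """Return why the model must be rejected, or None to proceed."""
    if result.pipeline_tag not in CV_PIPELINE_TAGS:
        return f"non-CV pipeline_tag: {result.pipeline_tag!r}"
    if not result.config_loads:
        return "AutoConfig raised"
    if not result.onnx_exportable and not result.onnx_rewrite_possible:
        return "ONNX export cannot work and no rewrite path exists"
    return None
```

A failed export with a known rewrite path is not a rejection, which matches the gate's "or failure understood" wording.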
Gate: all 4 TAO repos located/cloned with a consistent branch; pipeline_tag confirmed CV; model_type, image_size, hidden_size, num_labels extracted; state-dict keys documented + HF→TAO remapping plan drafted; ONNX export sanity check passed (or failure understood); user confirmed model_short_name + task type. (Full checklist: phase-1-inspection.md.) Present findings and get user confirmation first.
Goal: find the closest existing TAO reference model for the detected pipeline_tag, read its implementation across tao-core / tao-pytorch / tao-deploy, and decide whether the backbone exists in backbone_v2/ or is new.
The HF pipeline_tag → TAO reference model mapping (classification → classification_pyt, detection → dino/rtdetr, segmentation → segformer, instance → mask2former, panoptic → oneformer, zero-shot → grounding_dino, depth → mono_depth) drives everything downstream (config, architecture, loss, ONNX shape, TRT builder, deploy classes, metrics, dataset format). See the Phase 2 references for the full reference list (12 files per model), the backbone_v2/ and tao-dataservices coverage checks, and per-task architecture.
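The mapping above is small enough to keep as a lookup table. The sketch below uses the shorthand task names exactly as written in the text; they are not guaranteed to be the literal pipeline_tag strings returned by the HF Hub, and reference_models is a hypothetical helper name.

```python
# Task shorthand (as used in the text) -> candidate TAO reference models.
TASK_TO_TAO_REFERENCE = {
    "classification": ("classification_pyt",),
    "detection": ("dino", "rtdetr"),
    "segmentation": ("segformer",),
    "instance": ("mask2former",),
    "panoptic": ("oneformer",),
    "zero-shot": ("grounding_dino",),
    "depth": ("mono_depth",),
}


def reference_models(task: str) -> tuple:
    """Return the TAO reference model(s) to study for a task."""
    try:
        return TASK_TO_TAO_REFERENCE[task]
    except KeyError:
        known = ", ".join(sorted(TASK_TO_TAO_REFERENCE))
        raise ValueError(f"no TAO reference for task {task!r}; known: {known}") from None
```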
If a new backbone is needed, decide the strategy (timm wrap > re-implement > HF black-box wrap) before Phase 3 — it changes weight loading, ONNX export, deploy. Never dual-inherit from transformers.PreTrainedModel and BackboneBase (metaclass conflict — compose instead).
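Why composition is preferred can be shown with stand-in classes. The sketch below does not import transformers or TAO: FakePreTrainedModel and FakeBackboneBase are hypothetical stand-ins that merely have unrelated metaclasses, which is the situation the text describes for the real classes. Python refuses to create a class whose bases have incompatible metaclasses, while holding the HF model as an attribute works without issue.

```python
class HFMeta(type):
    """Stand-in for whatever metaclass the HF base class uses."""


class TaoMeta(type):
    """Stand-in for whatever metaclass the TAO base class uses."""


class FakePreTrainedModel(metaclass=HFMeta):
    def __call__(self, pixels):
        return [p * 2 for p in pixels]  # placeholder "forward pass"


class FakeBackboneBase(metaclass=TaoMeta):
    def forward(self, pixels):
        raise NotImplementedError


def dual_inherit_fails() -> bool:
    """Show that inheriting from both bases raises a metaclass conflict."""
    try:
        class Bad(FakePreTrainedModel, FakeBackboneBase):
            pass
    except TypeError:
        return True
    return False


class ComposedBackbone(FakeBackboneBase):
    """Inherit from the TAO-side base only; hold the HF model as a member."""

    def __init__(self, hf_model):
        self.hf_model = hf_model

    def forward(self, pixels):
        return self.hf_model(pixels)
```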
Gate: reference TAO model identified + all 12 reference locations read; task-type implications understood (architecture, loss, ONNX outputs, deploy classes, metrics, dataset); backbone coverage decided (reuse / wrap timm / new); dataservices coverage checked. (Full checklist: phase-2-codebase.md.)
Goal: write the tao-core config schema + the tao-pytorch trainer / native inference / evaluation, smoke-testing between steps. (<model_name> = snake_case short-name; <ModelName> = PascalCase.)
Steps 1–7 (each builds on the previous, smoke-test between): tao-core config (1), tao-pytorch trainer (2), multi-GPU/multi-node (3), native inference → result.csv (4), native evaluation → results.json (5), MLOps for training and eval/infer → status.json (6–7). The ExperimentConfig(CommonExperimentConfig) must contain model, dataset, train, evaluate, inference, export, gen_trt_engine, quantize. Fields set to ??? are MISSING placeholders that the user must supply via YAML or CLI. The values that must match across stages (augmentation.mean/std, model.head.in_channels, the checkpoint name, and onnx_file) are listed in the consistency checklist below.
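A quick structural check on a loaded spec can catch a missing section or an unfilled ??? field before a smoke test runs. The sketch below works on a plain nested dict (for example, the result of parsing experiment_spec.yaml) and uses only the standard library; "???" is the marker OmegaConf uses for mandatory values. The helper names are hypothetical.

```python
MISSING = "???"  # OmegaConf's marker for a mandatory, not-yet-supplied value

REQUIRED_SECTIONS = (
    "model", "dataset", "train", "evaluate",
    "inference", "export", "gen_trt_engine", "quantize",
)


def missing_sections(cfg: dict) -> list:
    """Top-level sections the ExperimentConfig must contain but the spec lacks."""
    return [name for name in REQUIRED_SECTIONS if name not in cfg]


def unresolved_fields(cfg: dict, prefix: str = "") -> list:
    """Dotted paths of every field still set to the MISSING marker."""
    found = []
    for key, value in cfg.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            found.extend(unresolved_fields(value, path + "."))
        elif value == MISSING:
            found.append(path)
    return found
```

The dotted paths returned by unresolved_fields have the same shape as Hydra-style CLI overrides, so each one tells the user what still needs a value.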
Full per-step bodies, code, the canonical experiment_spec.yaml, and smoke-test commands: the Phase 3 references.
Gates: Step 1 — ExperimentConfig imports cleanly in-container; Step 2 — build_model(cfg) runs + PLModel instantiates in-container; Phase 3 — all 7 steps complete, smoke tests pass, no missing __init__.py.
Goal: ONNX export from tao-pytorch, then TRT engine builder + inference + evaluation in tao-deploy reusing the tao-core ExperimentConfig.
Steps 8–11: ONNX exporter (8 — task-specific input/output names, batch_size=-1 ⇒ dynamic batch); TRT engine builder (9 — subclass EngineBuilder or reuse ClassificationEngineBuilder; write specs/{gen_trt_engine,inference,evaluate}.yaml, same ExperimentConfig schema, augmentation.mean/std MUST match training); TRT inference → result.csv (10); TRT eval → results.json (11). See the Phase 4 reference for full code and the Phase 3+4 gate (3 in-container checks: imports, model build + forward, ONNX round-trip).
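The batch_size=-1 convention from Step 8 can be sketched as a helper that prepares keyword arguments for torch.onnx.export (input_names, output_names, and dynamic_axes are real parameters of that function). The helper names are hypothetical, the default names follow the classification convention in the consistency checklist, and marking axis 0 of every output as dynamic is an assumption that may not hold for every task-specific head.

```python
def onnx_export_kwargs(batch_size, input_names=("input",), output_names=("output",)):
    """Keyword arguments for torch.onnx.export; -1 requests a dynamic batch axis."""
    kwargs = {
        "input_names": list(input_names),
        "output_names": list(output_names),
    }
    if batch_size == -1:
        # Assumption: axis 0 is the batch axis of every named tensor.
        kwargs["dynamic_axes"] = {
            name: {0: "batch"} for name in (*input_names, *output_names)
        }
    return kwargs


def dummy_input_shape(batch_size, channels, height, width):
    """Shape of the tracing input; a dynamic export still traces with batch 1."""
    return (1 if batch_size == -1 else batch_size, channels, height, width)
```

Inside the container the result would be used as torch.onnx.export(model, dummy, onnx_path, **onnx_export_kwargs(-1)), where dummy is a tensor built from dummy_input_shape.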
Module pitfalls: tao-pytorch and tao-deploy have separate hydra_runner and monitor_status — use the deploy versions in deploy scripts. ExperimentConfig comes from nvidia_tao_core in both (same schema/field paths).
Phase 3+4 gate: all three in-container checks pass (tao-pytorch imports + model + ONNX export; tao-deploy imports).
Goal: register the model as a console_script in both repos and add unit tests.
Steps 12–15: register '<model_name>=...:main' in console_scripts of tao-pytorch/setup.py (12) and tao-deploy/setup.py (13, creating the deploy entrypoint/<model_name>.py via entrypoint_hydra); deploy L0 tests (14); trainer L0 tests — Trainer(..., fast_dev_run=True) + @pytest.mark.cv_unit @pytest.mark.<model_name> (15). See the Phase 5 reference for exact entry-point strings, code, and L0 test file lists.
Gate: entrypoints registered; pytest files exist and follow the marker convention. Do NOT stop — go directly to Phase 6.
Before Docker testing, verify the chain train → export → gen_trt_engine → inference / evaluate (the *_model_latest.pth → .onnx → .engine artifact flow + the config fields each stage reads/writes — full diagram in references/cross-cutting.md).
Consistency checklist (verify before proceeding): self.checkpoint_filename → the *_latest.pth name evaluate.checkpoint / export.checkpoint reference; augmentation.mean/std identical across training spec, inference.yaml, evaluate.yaml, engine-builder preprocess_mode; ONNX input_names=['input'] / output_names=['output'] (detection/instance-seg use task-specific names); export.input_width/input_height match dataset.img_size; model.head.in_channels matches model_params_mapping.py; classes.txt at dataset.root_dir readable by both repos; all __init__.py exist (incl. scripts/__init__.py for get_subtasks() via pkgutil). Full paths: workflow-consistency.md.
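Two items from the checklist lend themselves to an automated check: identical mean/std across specs, and export dimensions matching the dataset image size. The sketch below operates on specs parsed into nested dicts. The dotted locations dataset.augmentation.mean and dataset.augmentation.std are assumptions (the real field paths are in workflow-consistency.md), as is treating dataset.img_size as a single integer for square inputs.

```python
MEAN = "dataset.augmentation.mean"  # assumed path
STD = "dataset.augmentation.std"    # assumed path


def get_path(cfg: dict, dotted: str):
    """Walk a nested dict using a dotted field path."""
    node = cfg
    for key in dotted.split("."):
        node = node[key]
    return node


def consistency_errors(train: dict, inference: dict, evaluate: dict) -> list:
    """Return human-readable mismatches; an empty list means consistent."""
    errors = []
    specs = (("train", train), ("inference", inference), ("evaluate", evaluate))
    for field in (MEAN, STD):
        values = {name: tuple(get_path(spec, field)) for name, spec in specs}
        if len(set(values.values())) != 1:
            errors.append(f"{field} differs across specs: {values}")
    size = get_path(train, "dataset.img_size")  # assumed to be one int
    width = get_path(train, "export.input_width")
    height = get_path(train, "export.input_height")
    if (width, height) != (size, size):
        errors.append(f"export {width}x{height} does not match img_size {size}")
    return errors
```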
Mandatory — start immediately after Phase 5. All TAO models ship as Docker images; code that only works outside a container is incomplete. Testing runs directly inside the TAO Toolkit container — no image build in the loop: mount → install source (setup.py develop) → run pytest / pylint / pydocstyle / flake8 directly. Use vanilla commands, NOT the ci/run_functional_tests.py / ci/run_static_tests.py wrappers (internal-mirror-only; public github.com/NVIDIA-TAO/ mirrors have no ci/ dir).
Steps 16–25: verify local image tags exist (16); unit tests for tao-core / tao-pytorch (-m cv_unit, --shm-size=16G) / tao-deploy (17–19); lint (20); wheels (21); end-to-end — train dry-run + export in one tao-pytorch session, then gen_trt_engine + inference + evaluate in one tao-deploy session (same session critical — --rm discards installs) (22); cross-check native vs TRT (23); debug shells (24); optional release images (25).
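The native-vs-TRT cross-check in Step 23 reduces to comparing two metric dictionaries against a tolerance. The sketch below assumes each results.json has been loaded into a flat mapping of metric name to number, which is an assumption about the file layout; the function name and the tolerance value in the usage note are likewise illustrative.

```python
def metric_gaps(native: dict, trt: dict, tolerance: float) -> dict:
    """Return {metric: abs difference} for metrics that disagree beyond tolerance.

    Metrics present on only one side are reported with a gap of None so that
    a silently missing metric cannot pass the check.
    """
    gaps = {}
    for name in sorted(set(native) | set(trt)):
        if name not in native or name not in trt:
            gaps[name] = None
            continue
        diff = abs(native[name] - trt[name])
        if diff > tolerance:
            gaps[name] = diff
    return gaps
```

For example, metric_gaps({"accuracy": 0.912}, {"accuracy": 0.909}, tolerance=0.01) returns an empty dict, so the two runs agree at that tolerance.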
Full commands (every docker run, per-container env-vars, exact pytest / lint invocations + full-suite variants, the train/export/gen_trt_engine/inference/evaluate one-liner with all CLI overrides, the ci/ note, the fix-and-retest loop) and build scripts / runner patterns: see the Phase 6 references.
Phase 6 gate (Done criteria): tao-core / tao-pytorch / tao-deploy unit tests pass in their containers; static tests pass (or only legacy lint warnings); wheels build; end-to-end <model_name>_model_latest.pth → model.onnx → model.engine → non-empty result.csv + results.json; native vs TRT agree within tolerance.
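The artifact part of the done criteria can be verified mechanically. The sketch below assumes, for simplicity, that all five artifacts land in one results directory; in practice each stage may write to its own directory, so treat the layout and the function names as hypothetical.

```python
from pathlib import Path


def expected_artifacts(model_name: str) -> list:
    """File names of the end-to-end chain, in pipeline order."""
    return [
        f"{model_name}_model_latest.pth",
        "model.onnx",
        "model.engine",
        "result.csv",
        "results.json",
    ]


def missing_or_empty(results_dir, model_name: str) -> list:
    """Artifacts that are absent or zero bytes; an empty list means the chain is complete."""
    problems = []
    for name in expected_artifacts(model_name):
        path = Path(results_dir) / name
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(name)
    return problems
```

Because the list is in pipeline order, the first entry returned points at the earliest stage that failed to produce output.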
Enter only if Phase 6 passes but accuracy / latency / size needs improvement. Ask the user for target metrics first.
Diagnostic categories: accuracy too low; TRT-vs-native gap; training too slow; inference too slow. Techniques: Step 27 — hyperparameter tuning; Step 28 — INT8 quantization (PTQ via torchao / modelopt, TRT INT8 + calibration); Step 29 — channel pruning + retrain; Step 30 — knowledge distillation; Step 31 — resolution tuning (TAO interpolates ViT positional embeddings automatically). See the Phase 7 reference for each category's checks, config blocks, YAML overrides, decision tree, and rationale.
$ARGUMENTS
If provided, interpret $ARGUMENTS as the HuggingFace model ID or URL to start Phase 1. If credentials or model short-name are not included, ask the user for them before proceeding.
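Since the argument may be either a model ID or a URL, it helps to normalize both forms to a plain ID before Phase 1. The sketch below is one way to do that with the standard library; hf_model_id is a hypothetical helper, and it assumes URLs are on huggingface.co with the model ID in the first one or two path segments.

```python
from urllib.parse import urlparse


def hf_model_id(argument: str) -> str:
    """Normalize 'org/name', 'name', or a huggingface.co URL to a model ID."""
    text = argument.strip()
    if text.startswith(("http://", "https://")):
        parsed = urlparse(text)
        if parsed.netloc != "huggingface.co":
            raise ValueError(f"not a HuggingFace URL: {text}")
        # Keep only the ID; drop trailing segments such as /tree/main.
        parts = [p for p in parsed.path.split("/") if p][:2]
    else:
        parts = [p for p in text.split("/") if p]
    if len(parts) not in (1, 2):
        raise ValueError(f"cannot read a model ID from {argument!r}")
    return "/".join(parts)
```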