A Logger protocol decoupling the training code from any specific tracking backend (W&B, MLflow, Aim, TensorBoard) — with TensorBoard as the zero-dependency default. Activate when the user asks about "experiment tracking", "W&B integration", "TensorBoard setup", "MLflow", or "switch tracking backend", or wants tracking without lock-in.
```
npx claudepluginhub curryfromuestc/curry-train --plugin curry-train
```
curryTrain does **not** lock into any one tracking backend. The reasons (subagent B's research):
stage5-run-journal) — backends are display layers, not source of truth.

```python
from pathlib import Path
from typing import Protocol


class TrackingBackend(Protocol):
    """A minimal interface for any experiment-tracking sink."""

    def log_metrics(self, metrics: dict[str, float], step: int) -> None: ...
    def log_artifact(self, name: str, path: str | Path) -> None: ...
    def log_config(self, cfg: dict) -> None: ...
    def finish(self) -> None: ...
```
Every backend implements this Protocol. Training code calls only Protocol methods. New backends are drop-in.
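To illustrate how drop-in a new backend can be, here is a hypothetical `JsonlBackend` (not part of curryTrain) that satisfies the Protocol by appending every event to a JSON-lines file:

```python
import json
from pathlib import Path


class JsonlBackend:
    """Hypothetical backend: append every tracking event to a .jsonl file."""

    def __init__(self, path: Path):
        self.path = Path(path)
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def _write(self, record: dict) -> None:
        with self.path.open("a") as f:
            f.write(json.dumps(record) + "\n")

    def log_metrics(self, metrics: dict[str, float], step: int) -> None:
        self._write({"kind": "metrics", "step": step, **metrics})

    def log_artifact(self, name: str, path) -> None:
        self._write({"kind": "artifact", "name": name, "path": str(path)})

    def log_config(self, cfg: dict) -> None:
        self._write({"kind": "config", "config": cfg})

    def finish(self) -> None:
        pass  # nothing to flush; each call opens, writes, and closes the file
```

Training code never learns it exists — it only ever calls the four Protocol methods.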
```python
import json
from pathlib import Path


class TensorBoardBackend:
    def __init__(self, log_dir: Path):
        from torch.utils.tensorboard import SummaryWriter
        self.writer = SummaryWriter(log_dir=str(log_dir))

    def log_metrics(self, metrics, step):
        for k, v in metrics.items():
            self.writer.add_scalar(k, v, step)

    def log_artifact(self, name, path):
        # TB doesn't really do artifacts; record the path as text instead
        self.writer.add_text(f"artifact/{name}", str(path))

    def log_config(self, cfg):
        self.writer.add_text("config", json.dumps(cfg, indent=2))

    def finish(self):
        self.writer.close()
```
```python
class WandbBackend:
    def __init__(self, project: str, name: str, config: dict):
        import wandb
        self.run = wandb.init(project=project, name=name, config=config)

    def log_metrics(self, metrics, step):
        self.run.log(metrics, step=step)

    def log_artifact(self, name, path):
        import wandb
        art = wandb.Artifact(name=name, type="checkpoint")
        art.add_file(str(path))
        self.run.log_artifact(art)

    def log_config(self, cfg):
        self.run.config.update(cfg)

    def finish(self):
        self.run.finish()
```
```python
class CompositeBackend:
    def __init__(self, backends: list[TrackingBackend]):
        self.backends = backends

    def log_metrics(self, metrics, step):
        for b in self.backends:
            b.log_metrics(metrics, step)

    # ... and so on
```
This is what enables "TB locally + W&B for sharing" without changing training code.
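The `# ... and so on` elides the remaining fan-out methods; spelled out, a sketch (under the same Protocol, not necessarily the real class) looks like:

```python
class CompositeBackend:
    """Fan every Protocol call out to each child backend."""

    def __init__(self, backends):
        self.backends = list(backends)

    def log_metrics(self, metrics, step):
        for b in self.backends:
            b.log_metrics(metrics, step)

    def log_artifact(self, name, path):
        for b in self.backends:
            b.log_artifact(name, path)

    def log_config(self, cfg):
        for b in self.backends:
            b.log_config(cfg)

    def finish(self):
        for b in self.backends:
            b.finish()
```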
```yaml
# configs/logging/tb_only.yaml
backend:
  _target_: curry_train.infra.tracking.TensorBoardBackend
  log_dir: ${paths.runs}/${experiment.name}/${now:%Y-%m-%d_%H-%M-%S}/tb
```
```yaml
# configs/logging/tb_plus_wandb.yaml
backend:
  _target_: curry_train.infra.tracking.CompositeBackend
  backends:
    - _target_: curry_train.infra.tracking.TensorBoardBackend
      log_dir: ${paths.runs}/${experiment.name}/${now:%Y-%m-%d_%H-%M-%S}/tb
    - _target_: curry_train.infra.tracking.WandbBackend
      project: curry-train
      name: ${experiment.name}
```
Switch with `python train.py logging=tb_plus_wandb`.
```python
# After fabric and Run setup:
backend = hydra.utils.instantiate(cfg.logging.backend)
backend.log_config(OmegaConf.to_container(cfg))

with Run(cfg) as run:
    for step, batch in enumerate(loader):
        ...
        loss_value = loss.item()
        run.log_metric(step=step, loss=loss_value)
        backend.log_metrics({"train/loss": loss_value}, step=step)

backend.finish()
```
The journal always gets the data (canonical record); the backend is a viewer.
Default to TensorBoard. It has zero installation friction and survives any service shutdown.
If the user wants W&B for sharing, suggest the composite — TB and W&B, not W&B alone.
Don't walk users through W&B account setup; that's outside curryTrain's scope. Do tell them to set `WANDB_API_KEY` as an environment variable.
Confirm the journal is writing parallel data — if W&B goes down mid-run, the journal still has everything.
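One way to guarantee a flaky remote backend can never take the run down is a guarding wrapper (a hypothetical sketch, not part of curryTrain) that swallows backend errors while the journal keeps writing:

```python
import logging


class SafeBackend:
    """Hypothetical wrapper: log and swallow any backend failure so
    training (and the separately written journal) keeps going."""

    def __init__(self, inner):
        self.inner = inner

    def _guard(self, method, *args):
        try:
            getattr(self.inner, method)(*args)
        except Exception:
            logging.getLogger(__name__).warning("tracking call %s failed", method)

    def log_metrics(self, metrics, step): self._guard("log_metrics", metrics, step)
    def log_artifact(self, name, path): self._guard("log_artifact", name, path)
    def log_config(self, cfg): self._guard("log_config", cfg)
    def finish(self): self._guard("finish")
```

Wrapping only the `WandbBackend` inside the composite keeps local TensorBoard logging strict while making the remote leg best-effort.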
For runs-diff, point infra-tracking-backend users at the journal, not at W&B's UI — runs-diff reads the journal, never a backend.
The `log_metrics` call is synchronous. Per-step overhead is small (< 1 ms for TB).

Related skills:
- skills/stage5-run-journal — the canonical local record.
- skills/infra-hydra-config — the logging config group.
- skills/runs-diff — reads the journal, not the backend.
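The synchronous dispatch cost can be sanity-checked with a stub backend (a rough sketch; `NullBackend` and `per_call_overhead` are illustrative names, and real TensorBoard overhead also includes SummaryWriter's disk writes):

```python
import time


class NullBackend:
    """No-op backend: isolates the cost of the Protocol call itself."""
    def log_metrics(self, metrics, step): pass
    def log_artifact(self, name, path): pass
    def log_config(self, cfg): pass
    def finish(self): pass


def per_call_overhead(backend, n: int = 10_000) -> float:
    """Average seconds per log_metrics call."""
    start = time.perf_counter()
    for step in range(n):
        backend.log_metrics({"train/loss": 0.0}, step)
    return (time.perf_counter() - start) / n

# per_call_overhead(NullBackend())  # dispatch alone is typically microseconds
```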