From chipdev-method
Guides the construction of a cycle-accurate performance model — scheduler model, module lifecycle, idle/advance contract, pipeline primitives, sparse memory, transaction memory pool, configuration system, and dump infrastructure. Activate when the user explicitly invokes /chipdev-method:build-perf-model, or asks "怎么写 cycle-accurate 模拟器" ("how do I write a cycle-accurate simulator"), "性能模拟器怎么写" ("how do I write a performance simulator"), "how to model timing", "should I use SC_THREAD or advance loop", or is implementing a microarchitecture timing model.
npx claudepluginhub curryfromuestc/dev-guide --plugin chipdev-method
Use this skill when the user is building a cycle-accurate simulator — the artifact intended for microarchitecture exploration and RTL timing alignment. Do not autoload.
The purpose is to give the user a small set of strong defaults: a scheduler model, a module contract, a primitive library for pipelines, and a configuration discipline. These defaults compose into a simulator that remains tractable past 20+ modules and remains performant under multi-instance fanout.
If the user is building a behavior model instead, redirect them to
build-behavior-model. The two have different state contracts and different
performance constraints.
When triggered:
If the user has not yet defined contracts, push them back to
define-contracts first. Performance model implementation without
generated bases is a Phase-4 refactor waiting to happen.
[abstract]A cycle-accurate simulator has four mandatory subsystems and three optional but recommended ones.
Mandatory: 1. A scheduler with a fixed-order advance loop. 2. A module lifecycle and idle()/advance() contract. 3. A pipeline primitive library (DelayQueue, StagingFifo, RingBuffer). 4. A sparse physical memory model.
Recommended: 5. A transaction memory pool for high-frequency allocation. 6. A dump manager with module-and-interface granularity. 7. An idle detection mechanism that stops simulation automatically when all modules are quiescent.
[industry-pattern]Three scheduling disciplines. They are mutually exclusive; pick one before the framework solidifies.
| Style | When to use | Example projects |
|---|---|---|
| Custom non-blocking advance | Default for new projects: each module exposes advance(int cycles) and idle(); the scheduler calls advance only for non-idle modules. No coroutines, no SC_THREAD. | Many in-house cycle-accurate models; aligns with invariant 6 (determinism). |
| SystemC SC_THREAD | Only when the workload is genuinely concurrent and modules naturally express as state machines with wait(). | Classic SystemC TLM examples; legacy IP integration. |
| Discrete-event | When events are sparse (e.g., NoC traffic generators) and time advances irregularly. | gem5 (SimObject + tick scheduling); SST (event-driven). |
Default recommendation: custom non-blocking advance. It's faster, easier to debug, and avoids stack switches. The detailed contract is below.
[abstract]Every module follows this lifecycle:
constructor → end_of_elaboration() → reset() → [idle()/advance() loop]
The base class for each module is generated from the hierarchy DSL
(see define-contracts). Hand-writing the base class is a smell.
[abstract]Every module implements two methods. The scheduler relies on this contract.
class ModuleBase {
virtual void advance(int cycles) = 0; // tick the module N cycles
virtual bool idle(); // is this module quiescent?
};
Rules:
1. idle() returns true only when no input queue holds work AND no internal pipeline stage is busy AND no output queue holds pending data.
2. advance(cycles) performs exactly cycles ticks of work. Implementation choice: a tight while(cycles--) loop for cycle-by-cycle precision, or a single bulk update for transaction-level precision — but be explicit about which.
3. The scheduler never calls advance() on a module whose idle() == true. This is the primary performance optimization in the framework, often 20–40% throughput
improvement on multi-instance configurations.
[abstract]The scheduler walks all modules in a fixed registered order, advances the non-idle ones, and detects collective quiescence.
void Scheduler::run_all() {
bool any_active = false;
for (auto* m : registered_modules) {
if (mask_filter(m)) continue; // optional debug filter
if (!m->idle()) {
m->advance(interval_ns);
any_active = true;
}
}
if (any_active) idle_count = 0;
else idle_count += interval_ns;
if (idle_count > idle_threshold) notify_simulation_done();
schedule_next_tick(interval_ns);
}
Key parameters live in the configuration file (see "Runtime configuration"):
sim_interval_ns — how many nanoseconds advance per tick.
idle_threshold_ns — how long all-idle must hold before stopping.
Why fixed-order, no priority queue: cycle-accurate workloads are dominated by the per-module compute, not by scheduling overhead. A linear walk is cache-friendly and reproducible.
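A minimal sketch of how the contract composes with the scheduler walk above. The `Drain1` module and `tick_all` helper are illustrative stand-ins, not framework API; the real base class is generated from the hierarchy DSL.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Minimal stand-in for the generated base class.
struct ModuleBase {
    virtual void advance(int cycles) = 0;
    virtual bool idle() = 0;
    virtual ~ModuleBase() = default;
};

// Toy module: drains one queued item per cycle.
struct Drain1 : ModuleBase {
    std::deque<int> in_q;
    std::vector<int> out;
    void advance(int cycles) override {
        while (cycles--)
            if (!in_q.empty()) { out.push_back(in_q.front()); in_q.pop_front(); }
    }
    bool idle() override { return in_q.empty(); }  // no queued work => quiescent
};

// Fixed-order walk, as in Scheduler::run_all(): idle modules are skipped entirely.
int tick_all(const std::vector<ModuleBase*>& mods, int cycles) {
    int active = 0;
    for (auto* m : mods)
        if (!m->idle()) { m->advance(cycles); ++active; }
    return active;
}
```

Once every module in the queue is idle, `tick_all` returns 0, which is exactly the condition the scheduler's idle_threshold counter accumulates on.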
[abstract]Three primitives cover almost every flop-pipeline / FIFO / ring-buffer modeling need. Implementing them once means modules don't reinvent them.
DelayQueue<T> — fixed-latency pipeline
Models a flop chain with a fixed cycle latency. Internally uses differential delay encoding: each entry stores its delay relative to the previous entry, so advancing by one cycle is O(1) regardless of queue depth.
DelayQueue<Packet> pipe;
pipe.insert(5, packet); // arrives 5 cycles later
pipe.advance(1, [this](Packet& p) { // tick 1 cycle, drain ready entries
send_output(p);
});
When to use: ALU latency modeling, fixed-latency NoC hops, configurable retry timers.
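A sketch of the differential encoding itself — the internals here are an assumption matching the description above, not the framework's actual code. The key point: only the head entry's delta is decremented per cycle, so depth never matters.

```cpp
#include <cassert>
#include <deque>
#include <utility>
#include <vector>

// Sketch of DelayQueue<T>: each entry stores its delay *relative to the
// previous entry*. Assumes inserts arrive in non-decreasing absolute-delay
// order, which holds for a fixed-latency pipeline.
template <class T>
class DelayQueue {
    std::deque<std::pair<int, T>> q;  // {delta from previous entry, payload}
    int tail_abs = 0;                 // remaining absolute delay of the last entry
public:
    void insert(int delay, const T& v) {
        q.emplace_back(delay - tail_abs, v);  // store the differential delay
        tail_abs = delay;
    }
    template <class Fn>
    void advance(int cycles, Fn drain) {      // tick, draining matured entries
        while (cycles--) {
            if (q.empty()) continue;
            --tail_abs;
            if (--q.front().first <= 0) {
                // drain every entry maturing this cycle (delta 0 = same cycle)
                while (!q.empty() && q.front().first <= 0) {
                    drain(q.front().second);
                    q.pop_front();
                }
            }
        }
        if (q.empty()) tail_abs = 0;
    }
};
```

Advancing touches at most the head delta plus the entries that actually mature, independent of how many entries sit behind them.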
StagingFifo<T> — bounded FIFO with backpressure
Models a hardware FIFO with a fixed depth. The producer-side proc_xxx callback returns false when full, which signals the upstream module to stall.
StagingFifo<RequestT> in_fifo(/*depth=*/4);
// Producer side
bool proc_request(RequestT* trans, int delay) {
return in_fifo.stage_transaction(trans); // false if full
}
// Consumer side, in advance():
auto* req = in_fifo.nb_read();
if (req) process(*req);
When to use: input port buffering, backpressure-driven pipelines.
Pair with set_callback() to notify the upstream Conn when a slot
becomes available — this drives the resume of the upstream protocol.
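A sketch of how stage_transaction / nb_read / set_callback could fit together — the internals are assumed, only the surface matches the usage above. The callback fires when a pop frees a slot in a previously-full FIFO, which is the hook the upstream Conn uses to resume.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Sketch of StagingFifo<T>: bounded depth, producer gets `false` when full,
// and a callback fires when a slot frees up so the upstream can resume.
template <class T>
class StagingFifo {
    std::deque<T*> q;
    size_t depth;
    std::function<void()> on_slot_free;  // upstream resume hook
public:
    explicit StagingFifo(size_t d) : depth(d) {}
    void set_callback(std::function<void()> cb) { on_slot_free = std::move(cb); }
    bool stage_transaction(T* t) {
        if (q.size() >= depth) return false;  // full -> upstream must stall
        q.push_back(t);
        return true;
    }
    T* nb_read() {                            // non-blocking consumer read
        if (q.empty()) return nullptr;
        T* t = q.front();
        q.pop_front();
        bool was_full = (q.size() + 1 == depth);
        if (was_full && on_slot_free) on_slot_free();  // wake the producer
        return t;
    }
    bool empty() const { return q.empty(); }
};
```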
RingBuffer<T> — hardware ring buffer
Models a circular hardware buffer with Push, Pop, and optional random EntryRemove(index) for out-of-order completion.
RingBuffer<CmdEntry> cmd_ring(/*size=*/0x1000, /*depth=*/256, "cmd_ring");
cmd_ring.Push(entry, /*slots=*/4);
cmd_ring.Front();
cmd_ring.Pop();
cmd_ring.EntryRemove(2); // out-of-order removal
When to use: command rings, scoreboard queues, reorder buffers.
[abstract]The memory model has two responsibilities and one optimization.
Responsibility 1: present a flat byte-addressable space to all modules,
with read(addr, size, buf) and write(addr, size, buf).
Responsibility 2: distinguish access source — driver / core / dump — so that monitor hooks and access-counters can be filtered.
Optimization: sparse allocation. Use std::map<page_addr, vector<uint8_t>>
keyed on aligned page addresses. Allocate a 4 MB (or chosen) page on first
touch. This supports terabyte-scale address spaces without committing
physical memory.
class PhysicalMem {
static constexpr size_t PAGE_SIZE = 4 * 1024 * 1024;
std::map<paddr_t, std::vector<uint8_t>> pages;
// read/write transparently allocate pages on first touch
};
Why this matters: chip simulators routinely use 128 TB+ address spaces (HBM + on-chip + MMIO). Pre-allocating is impossible. Sparse allocation is the only viable model.
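A runnable sketch of the sparse scheme, with read/write handling accesses that straddle page boundaries. PAGE_SIZE is shrunk to 4 KB here for illustration; the access-source tagging described above is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

using paddr_t = uint64_t;

// Sketch of sparse PhysicalMem: pages materialize on first touch, so a huge
// address space only costs what is actually accessed.
class PhysicalMem {
    static constexpr size_t PAGE_SIZE = 4096;
    std::map<paddr_t, std::vector<uint8_t>> pages;

    std::vector<uint8_t>& page_of(paddr_t addr) {
        paddr_t base = addr & ~(paddr_t)(PAGE_SIZE - 1);  // align down
        auto& pg = pages[base];                           // allocate on first touch
        if (pg.empty()) pg.assign(PAGE_SIZE, 0);
        return pg;
    }
public:
    void write(paddr_t addr, size_t size, const uint8_t* buf) {
        while (size) {
            size_t off = addr % PAGE_SIZE;
            size_t n = std::min(size, PAGE_SIZE - off);   // may straddle pages
            std::memcpy(page_of(addr).data() + off, buf, n);
            addr += n; buf += n; size -= n;
        }
    }
    void read(paddr_t addr, size_t size, uint8_t* buf) {
        while (size) {
            size_t off = addr % PAGE_SIZE;
            size_t n = std::min(size, PAGE_SIZE - off);
            std::memcpy(buf, page_of(addr).data() + off, n);
            addr += n; buf += n; size -= n;
        }
    }
    size_t resident_pages() const { return pages.size(); }
};
```

An 8-byte write near the top of a 128 TB space commits at most two pages, regardless of the nominal address-space size.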
[industry-pattern]Cycle-accurate simulators allocate millions of transaction objects per second. Routing every allocation through the system allocator costs 20–30% of simulation time.
A bucketed pool with size-class binning (1, 2, 4, 8, 16, 32, 64 bytes
for small classes; 16-byte aligned for larger) replaces new/delete:
template<class T>
T* allocate() {
constexpr size_t size = bucket_size(sizeof(T) + sizeof(Header));
void* mem = bucket_allocate<bucket_id(size), size>();
return new(mem) T();
}
Each allocation prefixes a small header (8 bytes typically) carrying the bucket ID and a magic value for sanity-checking on free. Free returns the block to the bucket's free-list; no system-call.
Performance impact: 30–50% reduction in transaction-allocation overhead in typical workloads.
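A sketch of the free-list side described above. The header layout, magic value, and two size classes here are illustrative assumptions; the point is that free() is a pointer push with no system call, and the magic word catches double-free and header corruption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Sketch of a bucketed pool: each block carries a header with its bucket id
// and a magic word; freed blocks are threaded onto a per-bucket free-list.
struct Header {
    uint32_t bucket;
    uint32_t magic;
};
constexpr uint32_t kMagic = 0xFEEDBEEF;

struct Bucket {
    void* free_list = nullptr;  // intrusive list threaded through freed blocks
    size_t block_size;
};

// Two illustrative size classes (the text describes 1..64-byte classes
// plus 16-byte-aligned larger ones).
static Bucket buckets[2] = {{nullptr, 32}, {nullptr, 64}};

void* pool_alloc(size_t n) {
    uint32_t id = (n + sizeof(Header) <= 32) ? 0 : 1;
    Bucket& b = buckets[id];
    void* mem;
    if (b.free_list) {                       // fast path: reuse a freed block
        mem = b.free_list;
        b.free_list = *static_cast<void**>(mem);
    } else {
        mem = ::operator new(b.block_size);  // slow path: grow the bucket
    }
    auto* h = static_cast<Header*>(mem);
    h->bucket = id;
    h->magic = kMagic;
    return h + 1;                            // caller sees the payload only
}

void pool_free(void* p) {
    auto* h = static_cast<Header*>(p) - 1;
    assert(h->magic == kMagic);              // catch double-free / corruption
    Bucket& b = buckets[h->bucket];
    *reinterpret_cast<void**>(h) = b.free_list;  // push onto the free-list
    b.free_list = h;
}
```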
[abstract]Three patterns cover the majority of modules. Pick by the module's role.
Composite — aggregates fine-grained sub-blocks. Advance walks sub-blocks in the right order; idle aggregates sub-block idle status.
void CompositeImpl::advance(int cycles) {
while (cycles--) {
sub_a->main();
sub_b->main();
sub_c->main();
drain_output_queue();
}
}
bool CompositeImpl::idle() {
m_is_idle = sub_a->idle() && sub_b->idle() && sub_c->idle()
&& output_queue.empty();
return ModuleBase::idle();
}
Pipeline — advances pipeline stages from output back to input within one cycle to prevent same-cycle data punch-through. Then arbitrates outputs.
void PipelineImpl::advance(int cycles) {
output_stage->advance();
middle_stage->advance();
input_stage->advance();
if (!output_stage->mOutFifo.empty() && !conn->busy()) {
auto* t = mempool()->allocate<Out_TData>();
*t = output_stage->mOutFifo.front();
conn->send_out(t, /*delay=*/1);
output_stage->mOutFifo.pop_front();
}
}
Service-style — has internal request queues fed by proc_* callbacks. Advance drains queues, dispatches work, and records pending completions.
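A sketch of that split, with hypothetical names: the proc_* callback only enqueues, and all timing — dispatch and completion tracking — lives in advance().

```cpp
#include <cassert>
#include <deque>
#include <utility>
#include <vector>

struct Req { int id; int latency; };

// Sketch of a service-style module: callbacks enqueue, advance() does timing.
class ServiceImpl {
    std::deque<Req> req_q;                     // fed by proc_request()
    std::vector<std::pair<int, int>> inflight; // {remaining_cycles, id}
public:
    std::vector<int> completed;

    bool proc_request(const Req& r) {  // Conn callback: enqueue only, no timing
        req_q.push_back(r);
        return true;
    }
    void advance(int cycles) {
        while (cycles--) {
            // 1. retire completions that mature this cycle
            for (auto it = inflight.begin(); it != inflight.end();) {
                if (--it->first <= 0) {
                    completed.push_back(it->second);
                    it = inflight.erase(it);
                } else ++it;
            }
            // 2. dispatch one queued request per cycle (issue width = 1 here)
            if (!req_q.empty()) {
                inflight.push_back({req_q.front().latency, req_q.front().id});
                req_q.pop_front();
            }
        }
    }
    bool idle() const { return req_q.empty() && inflight.empty(); }
};
```

Note that idle() checks both the request queue and the in-flight set — forgetting either is exactly the stale-idle bug called out in the anti-patterns.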
[abstract]Two valid patterns. Use both.
conn — cross-module
if (!conn->output_busy()) {
  auto* trans = mempool()->allocate<Output_TData>();
  *trans = data;
  conn->send_output(trans, /*delay=*/1);
}
The conn is generated; the mempool is from the framework. The handshake
type is determined by the protocol annotation in the IDF (see
define-contracts).
Parent-owned queue — intra-module
Sub-blocks of a composite module communicate through the parent's owned queue or a "Toplevel" interface object. Avoid having sub-blocks call each other directly.
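A minimal sketch of the parent-owned style, with hypothetical names: neither sub-block holds a pointer to the other; both see only the Toplevel object the parent owns, so the wiring stays a property of the composite.

```cpp
#include <cassert>
#include <deque>

// Queue owned by the parent, not by either sub-block.
struct Toplevel {
    std::deque<int> a_to_b;
};

struct SubA {
    Toplevel* top;
    void main() { top->a_to_b.push_back(42); }  // produce into parent's queue
};

struct SubB {
    Toplevel* top;
    int last = 0;
    void main() {  // consume from the parent's queue, never from SubA directly
        if (!top->a_to_b.empty()) {
            last = top->a_to_b.front();
            top->a_to_b.pop_front();
        }
    }
};

struct Parent {
    Toplevel top;
    SubA a{&top};
    SubB b{&top};
    void advance_one_cycle() { a.main(); b.main(); }  // fixed intra-cycle order
};
```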
[abstract]A JSON config file pinned next to the executable, layered:
{
"sim": { "sim_interval_ns": 100, "idle_threshold_ns": 1000 },
"mem": { "size": "0x2000000000000" },
"dump": { "blocks": ["cmd_processor"], "interfaces": [] },
"flags": { "multi_thread": false }
}
Use JSON Pointer or equivalent path syntax for module access. Never read
configuration files inside hot paths — cache values during
end_of_elaboration.
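A sketch of the caching discipline, with a plain path-to-value map standing in for the JSON loader (the real config object and its API are not shown here): the config is consulted exactly once, in end_of_elaboration, and hot-path code reads plain members.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for the JSON config with JSON-Pointer-style paths.
struct Config {
    std::map<std::string, long> values;
    long get(const std::string& path) const { return values.at(path); }
};

class SchedulerCfg {
    const Config* cfg;
    long sim_interval_ns = 0;    // cached copies, filled once
    long idle_threshold_ns = 0;
public:
    explicit SchedulerCfg(const Config* c) : cfg(c) {}
    void end_of_elaboration() {
        // the only place the config object is consulted
        sim_interval_ns   = cfg->get("/sim/sim_interval_ns");
        idle_threshold_ns = cfg->get("/sim/idle_threshold_ns");
    }
    long interval() const { return sim_interval_ns; }    // hot path: plain load
    long threshold() const { return idle_threshold_ns; } // no map lookup, no parse
};
```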
[abstract]Anti-patterns:
1. wait() or yield in advance(). Breaks the determinism guarantee; debugging concurrent traces is brutal. Always run to completion per advance.
2. idle() returns false because of a stale flag. Simulation never terminates because one queue forgot to clear. Audit every state transition for the matching idle update.
3. Reading configuration inside advance(). Tens of millions of cfg["x"] calls dominate the profile. Cache config in end_of_elaboration.
4. new/delete in the hot path. 20–30% perf loss. Use the mempool.
5. Duplicating functional behavior in the perf model (see build-behavior-model).
See also:
define-contracts — the contracts your performance model consumes.
build-behavior-model — the sister artifact, when it exists.
align-and-difftest — only if your perf model also serves as the difftest reference (rare; consider a behavior model instead).
references/scheduler-comparison.md — long-form scheduler comparison.
references/pipeline-primitives.md — long-form primitive APIs and idioms.
references/case-gem5.md, references/case-sst.md — alternative event-driven scheduler designs.