From chipdev-method
Guides the construction of a cycle-accurate performance model — scheduler model, module lifecycle, idle/advance contract, pipeline primitives, sparse memory, transaction memory pool, configuration system, and dump infrastructure. Activate when the user explicitly invokes /chipdev-method:build-perf-model, or asks "怎么写 cycle-accurate 模拟器" ("how do I write a cycle-accurate simulator"), "性能模拟器怎么写" ("how do I write a performance simulator"), "how to model timing", "should I use SC_THREAD or advance loop", or is implementing a microarchitecture timing model.
npx claudepluginhub curryfromuestc/dev-guide --plugin chipdev-method
Use this skill when the user is building a cycle-accurate simulator — the artifact intended for microarchitecture exploration and RTL timing alignment. Do not autoload.
The purpose is to give the user a small set of strong defaults: a scheduler model, a module contract, a primitive library for pipelines, and a configuration discipline. These defaults compose into a simulator that remains tractable past 20+ modules and remains performant under multi-instance fanout.
If the user is building a behavior model instead, redirect them to
build-behavior-model. The two have different state contracts and different
performance constraints.
When triggered:
If the user has not yet defined contracts, push them back to
define-contracts first. Performance model implementation without
generated bases is a Phase-4 refactor waiting to happen.
[abstract]A cycle-accurate simulator has four mandatory subsystems and three optional but recommended ones.
Mandatory: 1. A scheduler with a fixed-order advance loop. 2. A module lifecycle and idle()/advance() contract. 3. A pipeline primitive library (DelayQueue, StagingFifo, RingBuffer). 4. A sparse physical memory model.
Recommended: 5. A transaction memory pool for high-frequency allocation. 6. A dump manager with module-and-interface granularity. 7. An idle detection mechanism that stops simulation automatically when all modules are quiescent.
[industry-pattern]Three scheduling disciplines. They are mutually exclusive; pick one before the framework solidifies.
| Style | When to use | Example projects |
|---|---|---|
| Custom non-blocking advance | Default for new projects: each module exposes advance(int cycles) and idle(); the scheduler calls advance only for non-idle modules. No coroutines, no SC_THREAD. | Many in-house cycle-accurate models; aligns with invariant 6 (determinism). |
| SystemC SC_THREAD | Only when the workload is genuinely concurrent and modules naturally express as state machines with wait(). | Classic SystemC TLM examples; legacy IP integration. |
| Discrete-event | When events are sparse (e.g., NoC traffic generators) and time advances irregularly. | gem5 (SimObject + tick scheduling); SST (event-driven). |
Default recommendation: custom non-blocking advance. It's faster, easier to debug, and avoids stack switches. The detailed contract is below.
[abstract]Every module follows this lifecycle:
constructor → end_of_elaboration() → reset() → [idle()/advance() loop]
The base class for each module is generated from the hierarchy DSL
(see define-contracts). Hand-writing the base class is a smell.
[abstract]Every module implements two methods. The scheduler relies on this contract.
class ModuleBase {
virtual void advance(int cycles) = 0; // tick the module N cycles
virtual bool idle(); // is this module quiescent?
};
Rules:
1. idle() returns true only when no input queue holds work AND no internal pipeline stage is busy AND no output queue holds pending data.
2. advance(cycles) performs exactly cycles ticks of work. Implementation choice: a tight while(cycles--) loop for cycle-by-cycle precision, or a single bulk update for transaction-level precision — but be explicit about which.
3. The scheduler never calls advance() on a module whose idle() == true. This is the primary performance optimization in the framework, often 20–40% throughput
improvement on multi-instance configurations.
[abstract]The scheduler walks all modules in a fixed registered order, advances the non-idle ones, and detects collective quiescence.
void Scheduler::run_all() {
bool any_active = false;
for (auto* m : registered_modules) {
if (mask_filter(m)) continue; // optional debug filter
if (!m->idle()) {
m->advance(interval_ns);
any_active = true;
}
}
if (any_active) idle_count = 0;
else idle_count += interval_ns;
if (idle_count > idle_threshold) notify_simulation_done();
schedule_next_tick(interval_ns);
}
Key parameters live in the configuration file (see "Runtime configuration"):
sim_interval_ns — how many nanoseconds advance per tick.
idle_threshold_ns — how long all-idle must hold before stopping.
Why fixed-order, no priority queue: cycle-accurate workloads are dominated by the per-module compute, not by scheduling overhead. A linear walk is cache-friendly and reproducible.
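A minimal sketch of how the contract composes with the scheduler walk above. The `Drain1` module and `tick_all` helper are illustrative stand-ins, not framework API; the real base class is generated from the hierarchy DSL.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Minimal stand-in for the generated base class.
struct ModuleBase {
    virtual void advance(int cycles) = 0;
    virtual bool idle() = 0;
    virtual ~ModuleBase() = default;
};

// Toy module: drains one queued item per cycle.
struct Drain1 : ModuleBase {
    std::deque<int> in_q;
    std::vector<int> out;
    void advance(int cycles) override {
        while (cycles--)
            if (!in_q.empty()) { out.push_back(in_q.front()); in_q.pop_front(); }
    }
    bool idle() override { return in_q.empty(); }  // no queued work => quiescent
};

// Fixed-order walk, as in Scheduler::run_all(): idle modules are skipped entirely.
int tick_all(const std::vector<ModuleBase*>& mods, int cycles) {
    int active = 0;
    for (auto* m : mods)
        if (!m->idle()) { m->advance(cycles); ++active; }
    return active;
}
```

Once every module in the queue is idle, `tick_all` returns 0, which is exactly the condition the scheduler's idle_threshold counter accumulates on.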
[abstract]Three primitives cover almost every flop-pipeline / FIFO / ring-buffer modeling need. Implementing them once means modules don't reinvent them.
DelayQueue<T> — fixed-latency pipeline
Models a flop chain with a fixed cycle latency. Internally uses differential delay encoding: each entry stores its delay relative to the previous entry, so advancing by one cycle is O(1) regardless of queue depth.
DelayQueue<Packet> pipe;
pipe.insert(5, packet); // arrives 5 cycles later
pipe.advance(1, [this](Packet& p) { // tick 1 cycle, drain ready entries
send_output(p);
});
When to use: ALU latency modeling, fixed-latency NoC hops, configurable retry timers.
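A sketch of the differential encoding itself — the internals here are an assumption matching the description above, not the framework's actual code. The key point: only the head entry's delta is decremented per cycle, so depth never matters.

```cpp
#include <cassert>
#include <deque>
#include <utility>
#include <vector>

// Sketch of DelayQueue<T>: each entry stores its delay *relative to the
// previous entry*. Assumes inserts arrive in non-decreasing absolute-delay
// order, which holds for a fixed-latency pipeline.
template <class T>
class DelayQueue {
    std::deque<std::pair<int, T>> q;  // {delta from previous entry, payload}
    int tail_abs = 0;                 // remaining absolute delay of the last entry
public:
    void insert(int delay, const T& v) {
        q.emplace_back(delay - tail_abs, v);  // store the differential delay
        tail_abs = delay;
    }
    template <class Fn>
    void advance(int cycles, Fn drain) {      // tick, draining matured entries
        while (cycles--) {
            if (q.empty()) continue;
            --tail_abs;
            if (--q.front().first <= 0) {
                // drain every entry maturing this cycle (delta 0 = same cycle)
                while (!q.empty() && q.front().first <= 0) {
                    drain(q.front().second);
                    q.pop_front();
                }
            }
        }
        if (q.empty()) tail_abs = 0;
    }
};
```

Advancing touches at most the head delta plus the entries that actually mature, independent of how many entries sit behind them.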
StagingFifo<T> — bounded FIFO with backpressure
Models a hardware FIFO with a fixed depth. The producer-side proc_xxx callback returns false when full, which signals the upstream module to stall.
StagingFifo<RequestT> in_fifo(/*depth=*/4);
// Producer side
bool proc_request(RequestT* trans, int delay) {
return in_fifo.stage_transaction(trans); // false if full
}
// Consumer side, in advance():
auto* req = in_fifo.nb_read();
if (req) process(*req);
When to use: input port buffering, backpressure-driven pipelines.
Pair with set_callback() to notify the upstream Conn when a slot
becomes available — this drives the resume of the upstream protocol.
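A sketch of how stage_transaction / nb_read / set_callback could fit together — the internals are assumed, only the surface matches the usage above. The callback fires when a pop frees a slot in a previously-full FIFO, which is the hook the upstream Conn uses to resume.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Sketch of StagingFifo<T>: bounded depth, producer gets `false` when full,
// and a callback fires when a slot frees up so the upstream can resume.
template <class T>
class StagingFifo {
    std::deque<T*> q;
    size_t depth;
    std::function<void()> on_slot_free;  // upstream resume hook
public:
    explicit StagingFifo(size_t d) : depth(d) {}
    void set_callback(std::function<void()> cb) { on_slot_free = std::move(cb); }
    bool stage_transaction(T* t) {
        if (q.size() >= depth) return false;  // full -> upstream must stall
        q.push_back(t);
        return true;
    }
    T* nb_read() {                            // non-blocking consumer read
        if (q.empty()) return nullptr;
        T* t = q.front();
        q.pop_front();
        bool was_full = (q.size() + 1 == depth);
        if (was_full && on_slot_free) on_slot_free();  // wake the producer
        return t;
    }
    bool empty() const { return q.empty(); }
};
```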
RingBuffer<T> — hardware ring buffer
Models a circular hardware buffer with Push, Pop, and optional random EntryRemove(index) for out-of-order completion.
RingBuffer<CmdEntry> cmd_ring(/*size=*/0x1000, /*depth=*/256, "cmd_ring");
cmd_ring.Push(entry, /*slots=*/4);
cmd_ring.Front();
cmd_ring.Pop();
cmd_ring.EntryRemove(2); // out-of-order removal
When to use: command rings, scoreboard queues, reorder buffers.
[abstract]The memory model has two responsibilities and one optimization.
Responsibility 1: present a flat byte-addressable space to all modules,
with read(addr, size, buf) and write(addr, size, buf).
Responsibility 2: distinguish access source — driver / core / dump — so that monitor hooks and access-counters can be filtered.
Optimization: sparse allocation. Use std::map<page_addr, vector<uint8_t>>
keyed on aligned page addresses. Allocate a 4 MB (or chosen) page on first
touch. This supports terabyte-scale address spaces without committing
physical memory.
class PhysicalMem {
static constexpr size_t PAGE_SIZE = 4 * 1024 * 1024;
std::map<paddr_t, std::vector<uint8_t>> pages;
// read/write transparently allocate pages on first touch
};
Why this matters: chip simulators routinely use 128 TB+ address spaces (HBM + on-chip + MMIO). Pre-allocating is impossible. Sparse allocation is the only viable model.
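A runnable sketch of the sparse scheme, with read/write handling accesses that straddle page boundaries. PAGE_SIZE is shrunk to 4 KB here for illustration; the access-source tagging described above is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

using paddr_t = uint64_t;

// Sketch of sparse PhysicalMem: pages materialize on first touch, so a huge
// address space only costs what is actually accessed.
class PhysicalMem {
    static constexpr size_t PAGE_SIZE = 4096;
    std::map<paddr_t, std::vector<uint8_t>> pages;

    std::vector<uint8_t>& page_of(paddr_t addr) {
        paddr_t base = addr & ~(paddr_t)(PAGE_SIZE - 1);  // align down
        auto& pg = pages[base];                           // allocate on first touch
        if (pg.empty()) pg.assign(PAGE_SIZE, 0);
        return pg;
    }
public:
    void write(paddr_t addr, size_t size, const uint8_t* buf) {
        while (size) {
            size_t off = addr % PAGE_SIZE;
            size_t n = std::min(size, PAGE_SIZE - off);   // may straddle pages
            std::memcpy(page_of(addr).data() + off, buf, n);
            addr += n; buf += n; size -= n;
        }
    }
    void read(paddr_t addr, size_t size, uint8_t* buf) {
        while (size) {
            size_t off = addr % PAGE_SIZE;
            size_t n = std::min(size, PAGE_SIZE - off);
            std::memcpy(buf, page_of(addr).data() + off, n);
            addr += n; buf += n; size -= n;
        }
    }
    size_t resident_pages() const { return pages.size(); }
};
```

An 8-byte write near the top of a 128 TB space commits at most two pages, regardless of the nominal address-space size.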
[industry-pattern]Cycle-accurate simulators allocate millions of transaction objects per second. Routing every allocation through the system allocator costs 20–30% of simulation time.
A bucketed pool with size-class binning (1, 2, 4, 8, 16, 32, 64 bytes
for small classes; 16-byte aligned for larger) replaces new/delete:
template<class T>
T* allocate() {
constexpr size_t size = bucket_size(sizeof(T) + sizeof(Header));
void* mem = bucket_allocate<bucket_id(size), size>();
return new(mem) T();
}
Each allocation prefixes a small header (8 bytes typically) carrying the bucket ID and a magic value for sanity-checking on free. Free returns the block to the bucket's free-list; no system-call.
Performance impact: 30–50% reduction in transaction-allocation overhead in typical workloads.
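A sketch of the free-list side described above. The header layout, magic value, and two size classes here are illustrative assumptions; the point is that free() is a pointer push with no system call, and the magic word catches double-free and header corruption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Sketch of a bucketed pool: each block carries a header with its bucket id
// and a magic word; freed blocks are threaded onto a per-bucket free-list.
struct Header {
    uint32_t bucket;
    uint32_t magic;
};
constexpr uint32_t kMagic = 0xFEEDBEEF;

struct Bucket {
    void* free_list = nullptr;  // intrusive list threaded through freed blocks
    size_t block_size;
};

// Two illustrative size classes (the text describes 1..64-byte classes
// plus 16-byte-aligned larger ones).
static Bucket buckets[2] = {{nullptr, 32}, {nullptr, 64}};

void* pool_alloc(size_t n) {
    uint32_t id = (n + sizeof(Header) <= 32) ? 0 : 1;
    Bucket& b = buckets[id];
    void* mem;
    if (b.free_list) {                       // fast path: reuse a freed block
        mem = b.free_list;
        b.free_list = *static_cast<void**>(mem);
    } else {
        mem = ::operator new(b.block_size);  // slow path: grow the bucket
    }
    auto* h = static_cast<Header*>(mem);
    h->bucket = id;
    h->magic = kMagic;
    return h + 1;                            // caller sees the payload only
}

void pool_free(void* p) {
    auto* h = static_cast<Header*>(p) - 1;
    assert(h->magic == kMagic);              // catch double-free / corruption
    Bucket& b = buckets[h->bucket];
    *reinterpret_cast<void**>(h) = b.free_list;  // push onto the free-list
    b.free_list = h;
}
```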
[abstract]Three patterns cover the majority of modules. Pick by the module's role.
Composite — aggregates fine-grained sub-blocks. Advance walks sub-blocks in the right order; idle aggregates sub-block idle status.
void CompositeImpl::advance(int cycles) {
while (cycles--) {
sub_a->main();
sub_b->main();
sub_c->main();
drain_output_queue();
}
}
bool CompositeImpl::idle() {
m_is_idle = sub_a->idle() && sub_b->idle() && sub_c->idle()
&& output_queue.empty();
return ModuleBase::idle();
}
Pipeline — advances pipeline stages from output back to input within one cycle to prevent same-cycle data punch-through. Then arbitrates outputs.
void PipelineImpl::advance(int cycles) {
output_stage->advance();
middle_stage->advance();
input_stage->advance();
if (!output_stage->mOutFifo.empty() && !conn->busy()) {
auto* t = mempool()->allocate<Out_TData>();
*t = output_stage->mOutFifo.front();
conn->send_out(t, /*delay=*/1);
output_stage->mOutFifo.pop_front();
}
}
Service-style — has internal request queues fed by proc_* callbacks. Advance drains queues, dispatches work, and records pending completions.
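A sketch of that split, with hypothetical names: the proc_* callback only enqueues, and all timing — dispatch and completion tracking — lives in advance().

```cpp
#include <cassert>
#include <deque>
#include <utility>
#include <vector>

struct Req { int id; int latency; };

// Sketch of a service-style module: callbacks enqueue, advance() does timing.
class ServiceImpl {
    std::deque<Req> req_q;                     // fed by proc_request()
    std::vector<std::pair<int, int>> inflight; // {remaining_cycles, id}
public:
    std::vector<int> completed;

    bool proc_request(const Req& r) {  // Conn callback: enqueue only, no timing
        req_q.push_back(r);
        return true;
    }
    void advance(int cycles) {
        while (cycles--) {
            // 1. retire completions that mature this cycle
            for (auto it = inflight.begin(); it != inflight.end();) {
                if (--it->first <= 0) {
                    completed.push_back(it->second);
                    it = inflight.erase(it);
                } else ++it;
            }
            // 2. dispatch one queued request per cycle (issue width = 1 here)
            if (!req_q.empty()) {
                inflight.push_back({req_q.front().latency, req_q.front().id});
                req_q.pop_front();
            }
        }
    }
    bool idle() const { return req_q.empty() && inflight.empty(); }
};
```

Note that idle() checks both the request queue and the in-flight set — forgetting either is exactly the stale-idle bug called out in the anti-patterns.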
[abstract]Two valid patterns. Use both.
conn — cross-module
if (!conn->output_busy()) {
  auto* trans = mempool()->allocate<Output_TData>();
  *trans = data;
  conn->send_output(trans, /*delay=*/1);
}
The conn is generated; the mempool is from the framework. The handshake
type is determined by the protocol annotation in the IDF (see
define-contracts).
Parent-owned queue — intra-module
Sub-blocks of a composite module communicate through the parent's owned queue or a "Toplevel" interface object. Avoid having sub-blocks call each other directly.
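A minimal sketch of the parent-owned style, with hypothetical names: neither sub-block holds a pointer to the other; both see only the Toplevel object the parent owns, so the wiring stays a property of the composite.

```cpp
#include <cassert>
#include <deque>

// Queue owned by the parent, not by either sub-block.
struct Toplevel {
    std::deque<int> a_to_b;
};

struct SubA {
    Toplevel* top;
    void main() { top->a_to_b.push_back(42); }  // produce into parent's queue
};

struct SubB {
    Toplevel* top;
    int last = 0;
    void main() {  // consume from the parent's queue, never from SubA directly
        if (!top->a_to_b.empty()) {
            last = top->a_to_b.front();
            top->a_to_b.pop_front();
        }
    }
};

struct Parent {
    Toplevel top;
    SubA a{&top};
    SubB b{&top};
    void advance_one_cycle() { a.main(); b.main(); }  // fixed intra-cycle order
};
```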
[abstract]A JSON config file pinned next to the executable, layered:
{
"sim": { "sim_interval_ns": 100, "idle_threshold_ns": 1000 },
"mem": { "size": "0x2000000000000" },
"dump": { "blocks": ["cmd_processor"], "interfaces": [] },
"flags": { "multi_thread": false }
}
Use JSON Pointer or equivalent path syntax for module access. Never read
configuration files inside hot paths — cache values during
end_of_elaboration.
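A sketch of the caching discipline, with a plain path-to-value map standing in for the JSON loader (the real config object and its API are not shown here): the config is consulted exactly once, in end_of_elaboration, and hot-path code reads plain members.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for the JSON config with JSON-Pointer-style paths.
struct Config {
    std::map<std::string, long> values;
    long get(const std::string& path) const { return values.at(path); }
};

class SchedulerCfg {
    const Config* cfg;
    long sim_interval_ns = 0;    // cached copies, filled once
    long idle_threshold_ns = 0;
public:
    explicit SchedulerCfg(const Config* c) : cfg(c) {}
    void end_of_elaboration() {
        // the only place the config object is consulted
        sim_interval_ns   = cfg->get("/sim/sim_interval_ns");
        idle_threshold_ns = cfg->get("/sim/idle_threshold_ns");
    }
    long interval() const { return sim_interval_ns; }    // hot path: plain load
    long threshold() const { return idle_threshold_ns; } // no map lookup, no parse
};
```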
[abstract]Anti-patterns:
1. wait() or yield in advance(). Breaks the determinism guarantee; debugging concurrent traces is brutal. Always run to completion per advance.
2. idle() returns false because of a stale flag. Simulation never terminates because one queue forgot to clear. Audit every state transition for the matching idle update.
3. Reading configuration inside advance(). Tens of millions of cfg["x"] calls dominate the profile. Cache config in end_of_elaboration.
4. new/delete in the hot path. 20–30% perf loss. Use the mempool.
5. Duplicating functional behavior in the perf model (see build-behavior-model).
See also:
define-contracts — the contracts your performance model consumes.
build-behavior-model — the sister artifact, when it exists.
align-and-difftest — only if your perf model also serves as the difftest reference (rare; consider a behavior model instead).
references/scheduler-comparison.md — long-form scheduler comparison.
references/pipeline-primitives.md — long-form primitive APIs and idioms.
references/case-gem5.md, references/case-sst.md — alternative event-driven scheduler designs.