From 3commas
Diagnose production issues in running Ergo Framework nodes via MCP server. Covers performance bottlenecks, process leaks, memory issues, network problems, event fanout, restart loops, and stuck processes. Use when investigating issues on live Ergo nodes.
npx claudepluginhub 3commas-io/commas-claude --plugin 3commas
This skill uses the workspace's default tool permissions.
Diagnose production issues in running Ergo Framework distributed systems via MCP.
Every tool accepts a node parameter -- the framework establishes connections automatically. Never call network_connect before querying; just pass node=<name>.
All nodes are equal. The MCP endpoint is just a proxy, not a primary node.
claude mcp add --transport http ergo http://localhost:9922/mcp
States: Init -> Sleep <-> Running -> Terminated. WaitResponse during sync Call. Zombee on forced kill.
| Metric | Meaning |
|---|---|
| MessagesMailbox | Current queue depth |
| MailboxLatency | Age of oldest message (requires -tags=latency) |
| RunningTime/Uptime | Utilization ratio (NOT CPU) |
| Drain (MessagesIn/Wakeups) | ~1 = idle, >>1 = overloaded |
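The two ratios in the table can be computed from a single process snapshot. The sketch below is illustrative only: the function names are hypothetical, and it assumes RunningTime is reported in nanoseconds and Uptime in seconds, the same units the liveness formula uses.

```python
def utilization(running_time_ns: int, uptime_s: float) -> float:
    """Share of the process lifetime spent inside callbacks (not CPU time)."""
    if uptime_s <= 0:
        return 0.0
    # Convert uptime to nanoseconds so both terms share a unit.
    return running_time_ns / (uptime_s * 1e9)


def drain(messages_in: int, wakeups: int) -> float:
    """Average number of messages handled per wakeup."""
    if wakeups == 0:
        return 0.0  # never woken up, so nothing was drained
    return messages_in / wakeups


print(utilization(30_000_000_000, 60))  # 30s of callbacks in 60s of uptime
print(drain(5000, 100))                 # many messages per wakeup: overloaded
```

A drain value near 1 means each wakeup found a single message waiting; a value far above 1 means messages pile up between wakeups.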
Liveness detects blocked callbacks (a mutex, channel, or I/O wait inside a handler).
Formula: Liveness = RunningTime_ns / (Uptime_s * MailboxLatency_ns)
Collect:
process_list min_mailbox=1 min_uptime=60 min_messages_in=1 sort_by=mailbox_latency limit=10
The filters keep processes with pending messages (latency > 0), alive for more than 60s, and with at least one received message (not idle). Exclude zombees from the results, then compute liveness for each.
| Latency | Liveness | Diagnosis | Action |
|---|---|---|---|
| high | high (>0.01) | Overloaded but alive | Scale out |
| high | low (<0.001) | Callback blocked | Find blocking call (pprof) |
| 0 | n/a | Healthy, mailbox empty | Nothing |
| transient (ms) | any | Caught mid-callback | Normal |
Same latency, different cause: 18s latency + RunningTime 5000s = overloaded (liveness 0.027). 18s latency + RunningTime 82us = stuck (liveness ~0).
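The formula and the decision table can be sketched as a small helper. This is a minimal illustration, not part of the MCP server: the function names, the `high_latency_ns` cutoff (1 second), and the "inconclusive" band between 0.001 and 0.01 are assumptions. The uptime of 10288 s is also assumed, chosen so that the overloaded case reproduces the 0.027 quoted above.

```python
def liveness(running_time_ns, uptime_s, mailbox_latency_ns):
    """Liveness = RunningTime_ns / (Uptime_s * MailboxLatency_ns)."""
    if uptime_s <= 0 or mailbox_latency_ns <= 0:
        return None  # undefined: empty mailbox (0) or latency not measured (-1)
    return running_time_ns / (uptime_s * mailbox_latency_ns)


def diagnose(running_time_ns, uptime_s, mailbox_latency_ns,
             high_latency_ns=1_000_000_000):  # assumed cutoff for "high" latency
    if mailbox_latency_ns < 0:
        return "unmeasured"   # built without -tags=latency
    if mailbox_latency_ns == 0:
        return "healthy"      # mailbox empty
    if mailbox_latency_ns < high_latency_ns:
        return "transient"    # caught mid-callback
    value = liveness(running_time_ns, uptime_s, mailbox_latency_ns)
    if value is None:
        return "unmeasured"   # uptime not available
    if value > 0.01:
        return "overloaded"   # busy but making progress: scale out
    if value < 0.001:
        return "blocked"      # callback stuck: find the blocking call
    return "inconclusive"     # between the two thresholds in the table


SECOND_NS = 10**9
print(diagnose(5000 * SECOND_NS, 10288, 18 * SECOND_NS))  # RunningTime 5000s
print(diagnose(82_000, 10288, 18 * SECOND_NS))            # RunningTime 82us
```

Both calls use the same 18s latency; only RunningTime separates the overloaded process from the stuck one.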
MailboxSize = -1 means unlimited (default). Can accumulate messages indefinitely → memory growth. If > 0, ErrProcessMailboxFull on overflow.
Default Call timeout = 5 seconds. WaitResponse > 5s is abnormal.
LogMessages in node_info: [0]=Trace [1]=Debug [2]=Info [3]=Warning [4]=Error [5]=Panic. Watch [4] and [5].
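Reading the counters by position is error-prone, so a small labeling helper may be useful. The sketch assumes LogMessages arrives as a six-element list of integer counters in the order given above; the helper names are hypothetical.

```python
LOG_LEVELS = ("trace", "debug", "info", "warning", "error", "panic")


def label_log_messages(counters):
    """Map the positional LogMessages counters to level names."""
    if len(counters) != len(LOG_LEVELS):
        raise ValueError("expected one counter per log level")
    return dict(zip(LOG_LEVELS, counters))


def needs_attention(counters):
    """True when the error [4] or panic [5] counter is non-zero."""
    labeled = label_log_messages(counters)
    return labeled["error"] > 0 or labeled["panic"] > 0


print(label_log_messages([0, 12, 340, 7, 2, 0]))
print(needs_attention([0, 12, 340, 7, 2, 0]))
```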
Important Delivery errors: ErrProcessUnknown = doesn't exist, ErrProcessMailboxFull = overloaded, ErrTimeout = ACK lost (5s), ErrNoConnection = unreachable.
Supervisor restart intensity: Sliding window (default 5 restarts / 5 sec). Exceeded = supervisor dies with ErrSupervisorRestartsExceeded, cascades up. Inspect supervisor for restarts_count.
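To reason about when a restart loop trips the limit, a sliding window can be modeled as follows. This is a simplified sketch of the general technique, not Ergo's implementation; the class name is hypothetical, and it assumes "exceeded" means more than the allowed number of restarts inside the window.

```python
from collections import deque


class RestartWindow:
    """Counts restarts that fall inside a sliding time window."""

    def __init__(self, max_restarts=5, window_sec=5.0):
        self.max_restarts = max_restarts
        self.window_sec = window_sec
        self._times = deque()

    def record(self, now: float) -> bool:
        """Record a restart at time `now`; return True if the limit is exceeded."""
        self._times.append(now)
        cutoff = now - self.window_sec
        # Drop restarts that have slid out of the window.
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times) > self.max_restarts


window = RestartWindow()
for t in [0.0, 1.0, 2.0, 3.0, 4.0, 4.5]:
    print(t, window.record(t))
```

Under these assumptions a child that crashes every 2 seconds never trips the default limit, while six crashes within 5 seconds do, which is why restarts_count should be read together with the timing of the restarts.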
Pool backpressure: Capacity = PoolSize * WorkerMailboxSize. All full = messages dropped. Inspect pool for messages_unhandled.
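The capacity arithmetic and a drop check between two inspections can be sketched as below. The function names are hypothetical, and the sketch assumes a bounded (positive) worker mailbox size and that messages_unhandled is a monotonically growing counter.

```python
def pool_capacity(pool_size: int, worker_mailbox_size: int) -> int:
    """Capacity = PoolSize * WorkerMailboxSize."""
    return pool_size * worker_mailbox_size


def dropped_since(prev_unhandled: int, curr_unhandled: int) -> int:
    """Messages dropped between two readings of messages_unhandled."""
    return max(0, curr_unhandled - prev_unhandled)


print(pool_capacity(8, 100))   # 8 workers, 100 slots each
print(dropped_since(10, 25))   # counter grew between two inspections
```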
Connection pool: Default 3 TCP connections per peer. Same sender = same TCP (ordered). Different senders = parallel. Pool size determined by acceptor (receiving side) during handshake. PoolDSN non-null = this side dialed. Reconnections counter = instability.
Goroutine labels (with -tags=pprof): Actors: {"pid":"<ABC.0.1005>"}. Meta: {"meta":"Alias#...", "role":"reader"}.
Error counters in node_info: SendErrorsLocal (process unknown/terminated/mailbox full), SendErrorsRemote (connection failures), CallErrorsLocal, CallErrorsRemote. Non-zero = problems.
Connection counters in network_info: ConnectionsEstablished - ConnectionsLost = active. Growing ConnectionsLost = instability. HandshakeErrors = auth problems.
Shutdown: processes taking more than 5s per message are logged. state=running with queue=0 means stuck in a callback.
pprof_goroutines debug=1 filter="ProcessRun" exclude="toolPprof" limit=20 # actor goroutines
pprof_goroutines debug=2 filter="<behavior>" limit=5 # specific actor trace
pprof_cpu duration=5 exclude="runtime" limit=15 # CPU hotspots
pprof_heap limit=20 filter="myapp" # memory allocators
All pprof tools support filter (include matching) and exclude (remove matching) by substring.
Response header shows total N, matched M, showing K.
network_ping name=<node> # quick alive check with RTT
network_node_info node=A name=B # A's view of connection to B
network_node_info node=B name=A # B's view (always check BOTH sides)
Connection info includes Reconnections counter -- non-zero indicates instability.
sample_start tool=<tool> arguments={...} interval_ms=2000 duration_sec=60 linger_sec=30
sample_listen log_levels=["warning","error"] duration_sec=60 linger_sec=30
sample_read sampler_id=<id>
sample_list # shows "completed, lingering Ns"
sample_stop sampler_id=<id> # immediate terminate
linger_sec (default 30): sampler stays alive after completion for data retrieval. No rush to read before duration expires.
Proxy samplers: sample_start with node=X spawns the sampler on node X. The tool runs locally on X. To read/list/stop, sample_read/sample_list/sample_stop MUST also include node=X.
Passive samplers also capture messages and calls sent to them -- use as test receivers: send_message to=<sampler_id> type_name=X message={...}, then sample_read to see captured entries with from, type, message.
Stopping: read data first (sample_read), present to user, then sample_stop. Exception: user says "just kill it".
message_types filter="order" # discover types
message_type_info type_name="TestOrder" # inspect fields
send_message to=<target> type_name="TestOrder" message={...} # send typed struct
call_process to=<target> type_name="GetStatusRequest" request={...} # sync call
Short name works (e.g., TestOrder). Go→JSON: string→"v", number→42.5, bool→true, *T→value or null, []T→[...], map→{...}, struct→nested {...}.
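The mapping can be checked by building a payload and serializing it. The field names below are hypothetical and do not describe a real TestOrder type; use message_type_info to discover the actual fields.

```python
import json

# Hypothetical fields, one per Go kind listed above.
message = {
    "id": "ord-1",                 # string
    "amount": 42.5,                # number
    "express": True,               # bool
    "coupon": None,                # *T with no value -> null
    "items": ["a", "b"],           # []T
    "meta": {"source": "test"},    # map
    "customer": {"name": "Ada"},   # nested struct
}

payload = json.dumps(message)
print(payload)
```

The resulting JSON text is what goes into the message={...} or request={...} argument.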
All tools accept timeout parameter (seconds, default 30, max 120) for remote calls. Use higher timeout for heavy operations:
pprof_cpu node=X duration=10 timeout=60
pprof_goroutines node=X debug=2 limit=100 timeout=60
cluster_nodes -- discover all
node_info node=X + network_ping name=X for every node
process_list sort_by=mailbox limit=10 -- find suspects
pprof_cpu duration=5 exclude="runtime" -- where is CPU spent?
process_info on suspect -- confirm with queue sizes
Snapshot: pprof_cpu duration=5 limit=20 -- top functions
Application only: pprof_cpu duration=5 filter="myapp" exclude="runtime" -- filter noise
Continuous (goroutine sampling):
sample_start tool=pprof_goroutines arguments={"debug":1,"filter":"ProcessRun","exclude":"toolPprof","limit":20} interval_ms=500 duration_sec=30
pprof_heap limit=20 -- top allocators (inuse + cumulative alloc columns)
runtime_stats -- heap_alloc, num_gc
sample_start tool=runtime_stats interval_ms=5000 duration_sec=300
node_info -- Spawned vs Terminated
process_list max_uptime=60 limit=50 -- recent spawns
sample_start tool=node_info interval_ms=10000
process_list max_uptime=10 limit=50 -- very young processes
sample_listen log_levels=["error","panic"] duration_sec=60
process_inspect on supervisor -- restarts_count
process_list state=zombee
pprof_goroutines debug=2 filter="<behavior_name>" limit=5 -- find by behavior name
Always check both sides:
network_node_info node=A name=B # A's view
network_node_info node=B name=A # B's view
| Symptom | Root Cause |
|---|---|
| Ping fails or RTT > 1s | Connection problem or overloaded node |
| MessagesOut >> MessagesIn (cross-check) | Silent data loss |
| High Reconnections | Unstable connection |
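The MessagesOut versus MessagesIn cross-check compares the sender's view with the receiver's view of the same connection. A minimal sketch, assuming the two counters are read from the two network_node_info calls; the function name is hypothetical. Because the readings are taken at slightly different times, a small gap can simply be messages in flight.

```python
def delivery_gap(sender_messages_out: int, receiver_messages_in: int):
    """Return (missing message count, missing fraction) for one direction."""
    gap = max(0, sender_messages_out - receiver_messages_in)
    ratio = gap / sender_messages_out if sender_messages_out > 0 else 0.0
    return gap, ratio


# A reports 1000 messages sent to B; B reports 900 received from A.
print(delivery_gap(1000, 900))
```

A gap that keeps growing across repeated readings is the signal to act on; a gap that stays constant or shrinks is more likely timing skew.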
pprof_goroutines debug=1 filter="ProcessRun" exclude="toolPprof" limit=20 # actors only
pprof_goroutines debug=2 filter="waitResponse" limit=10 # stuck in Call
pprof_goroutines debug=1 exclude="runtime_pollWait" limit=30 # non-IO goroutines
event_list utilization_state=no_subscribers -- wasted publishing
event_list sort_by=subscribers limit=10 -- highest fanout
sample_listen event=<name> duration_sec=30 -- see data
| Tag | Enables | Without It |
|---|---|---|
| -tags=pprof | Per-process goroutine traces | pid param errors |
| -tags=latency | MailboxLatency measurement | Returns -1 |
Never call network_connect just to query; the framework connects automatically.
Use timeout=60 for heavy remote operations.