Profiles .NET applications for CPU performance, memory allocations, lock contention, exceptions, heap analysis, and JIT inlining using dotnet-trace and dotnet-gcdump. Useful for bottlenecks, leaks, high CPU, and GC pressure.
npx claudepluginhub asynkron/asynkron-skills --plugin asynkron-devtools

This skill uses the workspace's default tool permissions.
This tool requires .NET 10+ SDK (for `dnx` support).
Check if dnx is available:
dnx --help
The profiler also depends on dotnet-trace and dotnet-gcdump. Install them if missing:
dotnet tool install -g dotnet-trace
dotnet tool install -g dotnet-gcdump
A CLI profiler for .NET that outputs structured text — no GUI needed. Designed for both human inspection and AI-assisted analysis. It wraps dotnet-trace and dotnet-gcdump and presents call trees, hot functions, allocation tables, and contention rankings as plain text.
Results are written to profile-output/ in the current directory.
dnx asynkron-profiler [flags] -- [target]
On first run, dnx will prompt to download the package.
`--cpu`: Sampled CPU profiling. Shows call trees and hot function tables.
dnx asynkron-profiler --cpu -- ./MyApp.csproj
dnx asynkron-profiler --cpu -- ./bin/Release/net10.0/MyApp
`--memory`: Tracks GC allocation tick events. Shows per-type allocation call trees and allocation sources.
dnx asynkron-profiler --memory -- ./MyApp.csproj
`--contention`: Shows wait-time call trees and contended method rankings. Use for diagnosing lock congestion and thread starvation.
dnx asynkron-profiler --contention -- ./MyApp.csproj
`--exception`: Counts thrown exceptions and shows throw-site call trees. Filter by type with `--exception-type`.
dnx asynkron-profiler --exception -- ./MyApp.csproj
dnx asynkron-profiler --exception --exception-type InvalidOperationException -- ./MyApp.csproj
`--heap`: Takes a GC heap snapshot using dotnet-gcdump. Shows retained objects by type and size.
dnx asynkron-profiler --heap -- ./MyApp.csproj
The profiler can capture JIT-to-native compilation events, showing which methods get JIT-compiled and which calls get inlined. This is useful for understanding runtime code generation and verifying that hot paths are being optimized by the JIT.
You can analyze previously captured trace files without re-running the app:
dnx asynkron-profiler --input /path/to/trace.nettrace
dnx asynkron-profiler --input /path/to/trace.speedscope.json --cpu
dnx asynkron-profiler --input /path/to/heap.gcdump --heap
| Flag | Purpose |
|---|---|
| `--cpu` | CPU profiling |
| `--memory` | Memory allocation profiling |
| `--contention` | Lock contention analysis |
| `--exception` | Exception profiling |
| `--heap` | Heap snapshot |
| `--root <text>` | Root call tree at first matching method |
| `--filter <text>` | Filter function tables by substring |
| `--exception-type <text>` | Filter exceptions by type name |
| `--calltree-depth <n>` | Max call tree depth (default: 30) |
| `--calltree-width <n>` | Max children per node (default: 4) |
| `--calltree-self` | Include self-time tree |
| `--calltree-sibling-cutoff <n>` | Hide siblings below `<n>`% (default: 5) |
| `--include-runtime` | Include runtime/framework frames |
| `--input <path>` | Analyze existing trace file |
| `--tfm <tfm>` | Target framework for .csproj/.sln |
| Mode | Formats |
|---|---|
| CPU | .speedscope.json, .nettrace |
| Memory | .nettrace, .etlx |
| Exceptions | .nettrace, .etlx |
| Contention | .nettrace, .etlx |
| Heap | .gcdump |
Profiling is iterative. Follow this workflow to systematically identify and eliminate bottlenecks.
Always build Release before profiling. Debug builds have disabled optimizations, extra checks, and no inlining — profiling them gives misleading results.
dotnet build -c Release
| Symptom | Start with |
|---|---|
| High CPU / slow execution | --cpu |
| High memory / GC pressure | --memory |
| High latency but low CPU | --contention |
| Too many exceptions in logs | --exception |
| Memory keeps growing (leak) | --heap |
| Want to verify JIT optimization | JIT/inlining analysis |
The profiler outputs a hot function table showing where time or allocations are concentrated:
=== HOT FUNCTIONS ===
Time (ms) Calls Function
-------------------------------------------------
38805.39 19533 MyApp.Core.ProcessItem...
19769.23 9897 MyApp.Core.TransformData...
Focus on the top 3-5 entries. These are your optimization targets.
For memory profiling, the allocation call graph shows where allocations originate:
CreateEnvironment
Calls: 1048
Allocated by:
<- ProcessLoop (1048x, 100%)
<- RunMain (4x)
This traces allocations back to their source — the method that triggered them, not just where `new` was called.
Track progress across rounds:
Round 1: 322 MB, 172 ms
Round 2: 173 MB, 150 ms (pooling)
Round 3: 107 MB, 116 ms (fast paths)
When the profiler summary isn't enough, capture a detailed trace for manual analysis:
# Capture detailed GC trace
dotnet-trace collect \
--profile gc-verbose \
--format NetTrace \
-o trace.nettrace \
-- dotnet run -c Release --project ./MyApp
# Analyze with the profiler
dnx asynkron-profiler --input trace.nettrace --memory
# Or convert for external tools
dotnet-trace convert trace.nettrace --format Speedscope
Reduce allocations in hot loops:
- `Span<T>` / `stackalloc` for short-lived buffers

Reduce CPU in hot paths:
- `[MethodImpl(MethodImplOptions.AggressiveInlining)]` on small hot methods
- Split rare, complex cases into a separate method (`NoInlining`, rare case)

Reduce contention:
- Lock-free primitives (`Interlocked`, `ConcurrentDictionary`)
- `ReaderWriterLockSlim` for read-heavy workloads

Reduce exceptions:
- `TryParse` / `TryGet` patterns instead of catching exceptions

These patterns work with the .NET JIT compiler to produce faster native code. Use the profiler's JIT/inlining analysis to verify these optimizations take effect.
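The `Span<T>`/`stackalloc` bullet above can be sketched concretely. A minimal example, assuming a hypothetical `TryFormatHex` helper (the name, buffer size, and format are illustrative, not part of the profiler):

```csharp
using System;

// Caller: stackalloc keeps the short-lived scratch buffer off the GC heap entirely.
Span<char> buffer = stackalloc char[8];
if (HexFormat.TryFormatHex(0xBEEF, buffer, out var written))
{
    // Allocate only the final result, never the scratch space.
    string hex = new string(buffer[..written]); // "0000beef"
    Console.WriteLine(hex);
}

static class HexFormat
{
    // Format an int into a caller-supplied buffer instead of allocating temporaries.
    public static bool TryFormatHex(int value, Span<char> destination, out int written)
        => value.TryFormat(destination, out written, "x8");
}
```

The same shape applies to any short-lived buffer the profiler flags in a hot loop: rent or stack-allocate the scratch space, and allocate only the value that escapes.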
The most impactful pattern for hot methods. The JIT inlines small methods into their callers, eliminating call overhead and enabling further optimizations. But it won't inline large methods. The trick: keep the common case tiny and inlineable, push the rare case into a separate non-inlined method.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Result HandleHotPath(Data data)
{
// Fast path: ~20-30 lines max, handles the common case
if (data.IsSimpleCase)
{
// Direct, minimal work
return Result.From(data.Value);
}
// Rare/complex case — delegate to non-inlined method
return HandleHotPathSlow(data);
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static Result HandleHotPathSlow(Data data)
{
// Complex logic: type coercion, error handling, edge cases
// This can be as large as needed — it won't bloat the call site
return Result.From(data.Value); // placeholder: real slow-path work goes here
}
Why this works:
- The fast path stays under the JIT's inlining size budget, so the call overhead disappears at hot call sites and the JIT can optimize across the boundary.
- The slow path is marked `NoInlining`, so its size never counts against the caller's inlining budget.

When to apply:
- The profiler shows a hot method where one simple branch handles the overwhelming majority of calls.
How to verify: Use the profiler's JIT/inlining analysis to confirm the fast path is being inlined at the call site.
For hot dispatch (e.g., instruction interpreters, message handlers, event processors), a delegate array indexed by enum is faster than a switch:
// Define handler signature
delegate Result Handler(Context ctx, Instruction instr);
// Build dispatch table once (static constructor)
private static readonly Handler[] _dispatch = new Handler[64];
static MyRunner()
{
_dispatch[(int)Kind.Add] = HandleAdd;
_dispatch[(int)Kind.Call] = HandleCall;
_dispatch[(int)Kind.Branch] = HandleBranch;
// ...
}
// Hot loop — direct delegate invocation, no switch overhead
while (running)
{
var instr = instructions[pc];
var result = _dispatch[(int)instr.Kind](ctx, instr);
// ...
}
Why this works:
- A single array index plus a delegate invocation replaces a potentially long chain of branch comparisons.
- The table is built once in the static constructor, so the hot loop pays no per-iteration setup cost.

When to apply:
- The profiler shows a large `switch` or if/else chain dominating a hot dispatch loop over a dense enum.
When the profiler shows a type being allocated millions of times in a loop, pool it instead.
// 1. Define what poolable objects look like
interface IRentable
{
void Activate(); // Called when rented — initialize state
void Reset(); // Called when returned — clear state for reuse
}
// 2. Lock-free pool using Interlocked.CompareExchange
class ObjectPool<T> where T : class, IRentable
{
private readonly T?[] _items;
    private readonly Func<T> _factory;

    public ObjectPool(Func<T> factory, int size = 32)
    {
        _factory = factory;
        _items = new T?[size];
    }
public T Rent()
{
for (int i = 0; i < _items.Length; i++)
{
var item = Interlocked.Exchange(ref _items[i], null);
if (item is not null) { item.Activate(); return item; }
}
var created = _factory();
created.Activate();
return created;
}
public void Return(T item)
{
item.Reset();
for (int i = 0; i < _items.Length; i++)
{
if (Interlocked.CompareExchange(ref _items[i], item, null) == null)
return;
}
// Pool full — abandon to GC (graceful degradation)
}
}
// 3. RAII wrapper ensures objects are returned (a hypothetical handle type)
readonly struct PoolHandle<T> : IDisposable where T : class, IRentable
{
    private readonly ObjectPool<T> _pool;
    public T Value { get; }
    public PoolHandle(ObjectPool<T> pool, T value) { _pool = pool; Value = value; }
    public void Dispose() => _pool.Return(Value);
}

// Usage (MyPooledObject is illustrative):
using var handle = new PoolHandle<MyPooledObject>(pool, pool.Rent());
var obj = handle.Value;
// ... use obj ...
// Automatically returned on dispose
Impact: A tight loop creating 1M scoped objects goes from 1M allocations to ~32 (pool size). Dramatically reduces GC pressure.
When to apply:
- The profiler's `--memory` output shows one type allocated millions of times from a tight loop, and the objects have clear rent/return lifetimes.
For cached computed values that are expensive to create but read frequently:
static TCache GetOrCreate<TCache>(ref TCache? field, Func<TCache> factory)
where TCache : class
{
var existing = Volatile.Read(ref field);
if (existing is not null) return existing;
var created = factory();
var prior = Interlocked.CompareExchange(ref field, created, null);
return prior ?? created;
}
Why not Lazy<T>: This pattern avoids the Lazy<T> allocation itself, and the Volatile.Read fast path is a single instruction on x86/ARM. The worst case (two threads create simultaneously) wastes one creation but is still correct — no locks needed.
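A usage sketch of the helper above; the `Renderer` class and its glyph-table factory are illustrative, not part of the profiler:

```csharp
using System;
using System.Threading;

// Every read after the first returns the same cached instance.
Console.WriteLine(ReferenceEquals(Renderer.GlyphTable, Renderer.GlyphTable)); // prints True

class Renderer
{
    private static string? _glyphTable; // cache field, null until first use

    // Computed at most once per process; concurrent racers may build a spare
    // copy that is simply discarded (no locks, still correct).
    public static string GlyphTable
        => GetOrCreate(ref _glyphTable, static () => new string('x', 1024));

    static TCache GetOrCreate<TCache>(ref TCache? field, Func<TCache> factory)
        where TCache : class
    {
        var existing = Volatile.Read(ref field);
        if (existing is not null) return existing;
        var created = factory();
        var prior = Interlocked.CompareExchange(ref field, created, null);
        return prior ?? created;
    }
}
```

The `static` lambda keeps the factory allocation-free at the call site; the cache field itself is the only long-lived state.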
- Always build Release (`dotnet build -c Release`) before profiling — profiling Debug builds gives misleading results
- Profile the compiled binary rather than `dotnet run` for accurate measurements
- Use `--root` to focus on a specific call path when the tree is too broad
- Use `--filter` to narrow function tables to your own code, excluding framework noise
- Tune `--calltree-sibling-cutoff` to hide insignificant branches
- Use `--memory` to find allocation sources, then `--heap` for retained object analysis
- Start with `--cpu`, then drill into contention if CPU usage is low but latency is high
- Combine `--exception` with `--exception-type` to focus on specific exception categories
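These tips can be strung into a typical iterative session. A sketch using only the flags documented above (the project path, the `MyApp` filter, and the root method name are illustrative):

```shell
# Round 1: build Release and find CPU hot spots in our own code
dotnet build -c Release
dnx asynkron-profiler --cpu --filter MyApp -- ./bin/Release/net10.0/MyApp

# Round 2: chase allocations under the hottest method found in round 1
dnx asynkron-profiler --memory --root MyApp.Core.ProcessItem -- ./bin/Release/net10.0/MyApp

# Round 3: after a fix, confirm retained memory with a heap snapshot
dnx asynkron-profiler --heap -- ./bin/Release/net10.0/MyApp
```

Each round's output lands in `profile-output/`, so numbers can be compared across rounds as in the tracking example above.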