Gating CI on perf regressions. Automated threshold alerts, baseline tracking, trend reports.
Continuous benchmarking guidance for detecting performance regressions in CI pipelines. Covers baseline file management with BenchmarkDotNet JSON exporters, GitHub Actions workflows for artifact-based baseline comparison, regression detection patterns with configurable thresholds, and alerting strategies for performance degradation.
Version assumptions: BenchmarkDotNet v0.14+ for JSON export, GitHub Actions runner environment. Examples use actions/upload-artifact@v4 and actions/download-artifact@v4.
Out of scope: BenchmarkDotNet setup, benchmark class design, memory diagnosers, and common pitfalls are owned by this epic's companion skill -- see [skill:dotnet-benchmarkdotnet]. Performance-oriented architecture patterns are owned by [skill:dotnet-performance-patterns]. Profiling tools (dotnet-counters, dotnet-trace, dotnet-dump) are covered by dotnet-profiling. OpenTelemetry metrics collection and distributed tracing -- see [skill:dotnet-observability]. Composable CI/CD workflow design and matrix build strategies -- see [skill:dotnet-gha-patterns]. Architecture patterns (caching, resilience) -- see [skill:dotnet-architecture-patterns].
Cross-references: [skill:dotnet-benchmarkdotnet] for benchmark class setup and JSON exporter configuration, [skill:dotnet-observability] for correlating benchmark regressions with runtime metrics changes, [skill:dotnet-gha-patterns] for composable workflow patterns (reusable workflows, composite actions, matrix builds).
BenchmarkDotNet's JSON exporter produces machine-readable results for automated comparison. Configure the exporter in benchmark classes:
```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Exporters.Json;

[JsonExporterAttribute.Full]
[MemoryDiagnoser]
public class CriticalPathBenchmarks
{
    [Benchmark(Baseline = true)]
    public void ProcessOrder() { /* ... */ }

    [Benchmark]
    public void ProcessOrderOptimized() { /* ... */ }
}
```
Or apply the exporter to all benchmark classes via a custom config:
```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Exporters.Json;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

var config = ManualConfig.Create(DefaultConfig.Instance)
    .AddJob(Job.ShortRun) // fewer iterations for CI speed
    .AddExporter(JsonExporter.Full)
    .WithArtifactsPath("./benchmark-results");

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args, config);
```
The exported JSON file (`*-report-full.json`) contains structured benchmark results:

```json
{
  "Title": "CriticalPathBenchmarks",
  "Benchmarks": [
    {
      "FullName": "MyApp.Benchmarks.CriticalPathBenchmarks.ProcessOrder",
      "Statistics": {
        "Mean": 1234.5678,
        "Median": 1230.1234,
        "StandardDeviation": 15.234,
        "StandardError": 4.812
      },
      "Memory": {
        "BytesAllocatedPerOperation": 1024,
        "Gen0Collections": 0.0012,
        "Gen1Collections": 0,
        "Gen2Collections": 0
      }
    }
  ]
}
```
Key fields for regression comparison:
| Field | Purpose |
|---|---|
| `Statistics.Mean` | Average execution time (nanoseconds) |
| `Statistics.Median` | Middle execution time (more robust to outliers) |
| `Statistics.StandardDeviation` | Measurement variability |
| `Memory.BytesAllocatedPerOperation` | GC allocation per operation |
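As a worked example of the percentage comparison these fields feed, take the sample report's mean against a hypothetical current run (the current value and the 10% threshold are illustrative):

```python
baseline_mean = 1234.5678  # ns, Statistics.Mean from the sample report above
current_mean = 1400.0      # ns, hypothetical value from a new run
change_pct = (current_mean - baseline_mean) / baseline_mean * 100
print(f"{change_pct:+.1f}%")  # +13.4% -> exceeds a 10% threshold, flagged
```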
Baseline storage strategies compared:

| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Git-committed baseline file | Versioned, auditable, no external deps | Repo size grows; must update deliberately | Small benchmark suites, stable hardware |
| GitHub Actions artifacts | No repo bloat; automatic retention | 90-day default retention; cross-workflow access requires tokens | Large benchmark suites, shared runners |
| External storage (S3/Azure Blob) | Unlimited history; cross-repo sharing | Extra infrastructure; credential management | Multi-repo benchmark comparison |
This skill focuses on the GitHub Actions artifact strategy as the default. For composable workflow patterns and reusable actions, see [skill:dotnet-gha-patterns].
```yaml
name: Benchmarks

on:
  pull_request:
    paths:
      - 'src/**'
      - 'benchmarks/**'
  workflow_dispatch:

permissions:
  contents: read
  actions: read  # required for artifact download

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '8.0.x'

      - name: Run benchmarks
        run: dotnet run -c Release --project benchmarks/MyBenchmarks.csproj -- --exporters json

      - name: Upload benchmark results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results-${{ github.sha }}
          path: benchmarks/BenchmarkDotNet.Artifacts/results/
          retention-days: 90
```
This workflow downloads the baseline from a previous run and compares against current results:
```yaml
name: Benchmark Regression Check

on:
  pull_request:
    paths:
      - 'src/**'
      - 'benchmarks/**'
  push:  # main merges refresh the baseline (see upload step below)
    branches: [main]
    paths:
      - 'src/**'
      - 'benchmarks/**'

permissions:
  contents: read
  actions: read

env:
  BENCHMARK_PROJECT: benchmarks/MyBenchmarks.csproj
  RESULTS_DIR: benchmarks/BenchmarkDotNet.Artifacts/results

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '8.0.x'

      # Note: download-artifact@v4 only sees artifacts from the current workflow
      # run by default; fetching a baseline produced by an earlier run requires
      # the run-id and github-token inputs (or a lookup step to resolve the run id).
      - name: Download baseline results
        uses: actions/download-artifact@v4
        with:
          name: benchmark-baseline
          path: ./baseline-results
        continue-on-error: true
        id: download-baseline

      - name: Run benchmarks
        run: dotnet run -c Release --project ${{ env.BENCHMARK_PROJECT }} -- --exporters json

      - name: Compare with baseline
        if: steps.download-baseline.outcome == 'success'
        shell: bash
        run: |
          set -euo pipefail
          python3 scripts/compare-benchmarks.py \
            --baseline ./baseline-results \
            --current "${{ env.RESULTS_DIR }}" \
            --threshold 10 \
            --output benchmark-comparison.md

      - name: Upload current results as new baseline
        if: github.ref == 'refs/heads/main'
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-baseline
          path: ${{ env.RESULTS_DIR }}/
          retention-days: 90
          overwrite: true

      - name: Upload comparison report
        if: steps.download-baseline.outcome == 'success'
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-comparison-${{ github.sha }}
          path: benchmark-comparison.md
          retention-days: 30
```
Key design decisions:
- `continue-on-error: true` on the baseline download handles the first run, when no baseline exists yet
- The baseline upload runs only on `main` (the `push` trigger plus the `github.ref` check), preventing PR branches from polluting the baseline
- `overwrite: true` replaces the previous baseline artifact

For converting these inline workflows into reusable `workflow_call` patterns, see [skill:dotnet-gha-patterns].
Compare current benchmark results against baseline using percentage thresholds. A regression is flagged when the current mean exceeds the baseline mean by more than the configured threshold:
```python
#!/usr/bin/env python3
"""compare-benchmarks.py -- Detect benchmark regressions from BenchmarkDotNet JSON exports."""
import argparse
import json
import sys
from pathlib import Path


def load_benchmarks(results_dir: str) -> dict:
    """Load benchmark results from BenchmarkDotNet JSON export files."""
    benchmarks = {}
    for json_file in Path(results_dir).glob("*-report-full.json"):
        with open(json_file) as f:
            data = json.load(f)
        for bm in data.get("Benchmarks", []):
            name = bm["FullName"]
            benchmarks[name] = {
                "mean": bm["Statistics"]["Mean"],
                "median": bm["Statistics"]["Median"],
                "stddev": bm["Statistics"]["StandardDeviation"],
                "allocated": bm.get("Memory", {}).get("BytesAllocatedPerOperation", 0),
            }
    return benchmarks


def compare(baseline_dir: str, current_dir: str, threshold_pct: float) -> list:
    """Compare current results against baseline. Returns list of regressions."""
    baseline = load_benchmarks(baseline_dir)
    current = load_benchmarks(current_dir)
    regressions = []
    for name, curr in current.items():
        if name not in baseline:
            continue  # new benchmark, no comparison possible
        base = baseline[name]
        if base["mean"] == 0:
            continue  # avoid division by zero
        time_change_pct = ((curr["mean"] - base["mean"]) / base["mean"]) * 100
        alloc_change = curr["allocated"] - base["allocated"]
        if time_change_pct > threshold_pct:
            regressions.append({
                "name": name,
                "baseline_mean": base["mean"],
                "current_mean": curr["mean"],
                "change_pct": time_change_pct,
                "alloc_change": alloc_change,
            })
    return regressions


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Compare BenchmarkDotNet results")
    parser.add_argument("--baseline", required=True, help="Path to baseline results directory")
    parser.add_argument("--current", required=True, help="Path to current results directory")
    parser.add_argument("--threshold", type=float, default=10.0,
                        help="Regression threshold percentage (default: 10)")
    parser.add_argument("--output", default="comparison.md", help="Output markdown file")
    args = parser.parse_args()

    regressions = compare(args.baseline, args.current, args.threshold)

    with open(args.output, "w") as f:
        if regressions:
            f.write("## Benchmark Regressions Detected\n\n")
            f.write("| Benchmark | Baseline (ns) | Current (ns) | Change | Alloc Delta |\n")
            f.write("|-----------|--------------|-------------|--------|-------------|\n")
            for r in regressions:
                f.write(f"| `{r['name']}` | {r['baseline_mean']:.2f} | "
                        f"{r['current_mean']:.2f} | +{r['change_pct']:.1f}% | "
                        f"{r['alloc_change']:+d} B |\n")
            f.write(f"\nThreshold: {args.threshold}%\n")
        else:
            f.write("## Benchmark Results\n\nNo regressions detected ")
            f.write(f"(threshold: {args.threshold}%).\n")

    if regressions:
        print(f"REGRESSION: {len(regressions)} benchmark(s) exceeded "
              f"{args.threshold}% threshold", file=sys.stderr)
        sys.exit(1)
```
| Environment | Suggested Threshold | Rationale |
|---|---|---|
| Dedicated benchmark hardware | 5% | Low noise floor; small regressions are signal |
| GitHub Actions shared runners | 10-15% | Shared runners introduce 5-10% variance from noisy neighbors |
| Self-hosted runners | 5-10% | More stable than shared, but still monitor variance |
Calibrate thresholds empirically: Run the same benchmark suite 5-10 times on your CI environment without code changes. The maximum observed variance sets your noise floor. Set the threshold above this noise floor (typically 2x the observed variance).
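A minimal sketch of that calibration, assuming each repeated run exported its `*-report-full.json` files into a hypothetical `noise-runs/run-N` directory (the layout and script are illustrative, not part of the skill's scripts):

```python
"""Estimate the CI noise floor from repeated runs of an unchanged benchmark suite."""
import json
from pathlib import Path

def run_means(results_dir: Path) -> dict:
    """Mean execution time (ns) per benchmark from one run's JSON exports."""
    means = {}
    for json_file in results_dir.glob("*-report-full.json"):
        data = json.loads(json_file.read_text())
        for bm in data.get("Benchmarks", []):
            means[bm["FullName"]] = bm["Statistics"]["Mean"]
    return means

# Assumes the 5-10 identical runs exported to noise-runs/run-1, run-2, ...
runs = [run_means(d) for d in sorted(Path("noise-runs").glob("run-*"))]
names = set.intersection(*(set(r) for r in runs))

noise_floor = 0.0
for name in names:
    samples = [r[name] for r in runs]
    spread_pct = (max(samples) - min(samples)) / min(samples) * 100
    noise_floor = max(noise_floor, spread_pct)

print(f"Observed noise floor: {noise_floor:.1f}%")
print(f"Suggested threshold:  {2 * noise_floor:.1f}% (2x observed variance)")
```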
Memory allocation regressions are more reliable signals than timing regressions because allocations are deterministic (not affected by noisy neighbors):
```python
# Add to the compare() function, alongside the timing check:
if alloc_change > 0:
    regressions.append({
        "name": name,
        "type": "allocation",
        "baseline_alloc": base["allocated"],
        "current_alloc": curr["allocated"],
        "alloc_change": alloc_change,
    })
```
Use allocation changes as a hard gate (zero tolerance for new allocations in zero-alloc paths) and timing changes as a soft gate (warning with threshold).
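A sketch of that split, layered on the `regressions` list from `compare()` and the `type` field from the allocation snippet above (`::warning::`/`::error::` are GitHub Actions workflow commands that surface annotations; the exit code is what actually fails the job):

```python
# Hypothetical gating logic: allocation regressions fail the job (hard gate),
# timing regressions only produce warning annotations (soft gate).
import sys

hard = [r for r in regressions if r.get("type") == "allocation"]
soft = [r for r in regressions if r.get("type") != "allocation"]

for r in soft:
    # ::warning:: creates an annotation without failing the step
    print(f"::warning::{r['name']} slowed by {r['change_pct']:.1f}%")

for r in hard:
    print(f"::error::{r['name']} allocates {r['alloc_change']:+d} more bytes/op")

if hard:
    sys.exit(1)  # only allocation regressions gate the merge
```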
Post benchmark comparison results as a PR comment for reviewer visibility:
```yaml
# Posting the comment requires a token with the pull-requests: write permission.
- name: Comment PR with results
  if: steps.download-baseline.outcome == 'success' && github.event_name == 'pull_request'
  uses: actions/github-script@v7
  with:
    script: |
      const fs = require('fs');
      const body = fs.readFileSync('benchmark-comparison.md', 'utf8');
      await github.rest.issues.createComment({
        owner: context.repo.owner,
        repo: context.repo.repo,
        issue_number: context.issue.number,
        body: body
      });
```
Exit with non-zero status from the comparison script to fail the GitHub Actions job. This prevents merging PRs that introduce performance regressions:
```yaml
- name: Check for regressions
  if: steps.download-baseline.outcome == 'success'
  shell: bash
  run: |
    set -euo pipefail
    python3 scripts/compare-benchmarks.py \
      --baseline ./baseline-results \
      --current "${{ env.RESULTS_DIR }}" \
      --threshold 10
    # Script exits non-zero if regressions found -- fails the job
```
For required status checks and branch protection integration with benchmark gates, see [skill:dotnet-gha-patterns].
For long-term trend analysis beyond single-PR comparison, upload results to a persistent store and track metrics over time:
| Approach | Tool | Complexity |
|---|---|---|
| GitHub Actions artifacts | Built-in, 90-day retention | Low -- artifact download/upload only |
| GitHub Pages with benchmark-action | benchmark-action/github-action-benchmark@v1 | Medium -- auto-generates trend charts |
| External time-series DB | InfluxDB, Prometheus + Grafana | High -- full observability stack |
The simplest approach for most projects is the artifact-based baseline comparison shown in this skill. Graduate to trend tracking when you need historical regression analysis across many releases.
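For the middle option, a hedged sketch of a trend-tracking step (input names follow benchmark-action/github-action-benchmark's documented schema; the report path is illustrative):

```yaml
- name: Track benchmark trends
  uses: benchmark-action/github-action-benchmark@v1
  with:
    tool: 'benchmarkdotnet'
    output-file-path: benchmarks/BenchmarkDotNet.Artifacts/results/CriticalPathBenchmarks-report-full.json
    github-token: ${{ secrets.GITHUB_TOKEN }}
    auto-push: true          # publishes trend charts to the gh-pages branch
    alert-threshold: '150%'  # alert when 1.5x slower than the previous run
    comment-on-alert: true
```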
Full benchmark runs take 10-30+ minutes. Use Job.ShortRun in CI to reduce iteration counts while retaining regression detection capability:
```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

public class CiConfig : ManualConfig
{
    public CiConfig()
    {
        AddJob(Job.ShortRun
            .WithWarmupCount(3)
            .WithIterationCount(5)
            .WithInvocationCount(1)
            .WithUnrollFactor(1)); // InvocationCount must be a multiple of UnrollFactor (default 16)
        AddExporter(BenchmarkDotNet.Exporters.Json.JsonExporter.Full);
    }
}
```
Apply conditionally based on environment:
```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Running;

// Declared as IConfig: the ternary arms (CiConfig, DefaultConfig) share no
// best common type, so var inference would fail to compile.
IConfig config = Environment.GetEnvironmentVariable("CI") is not null
    ? new CiConfig()
    : DefaultConfig.Instance;

BenchmarkRunner.Run<CriticalPathBenchmarks>(config);
```
Run only critical-path benchmarks in CI to reduce pipeline duration:
```bash
# Run only benchmarks in the "Critical" category
# (--anyCategories filters by [BenchmarkCategory]; --filter matches by name glob)
dotnet run -c Release --project benchmarks/MyBenchmarks.csproj -- \
  --anyCategories Critical --exporters json
```
```csharp
using BenchmarkDotNet.Attributes;

[BenchmarkCategory("Critical")]
[MemoryDiagnoser]
[JsonExporterAttribute.Full]
public class CriticalPathBenchmarks
{
    [Benchmark]
    public void ProcessOrder() { /* ... */ }
}

[BenchmarkCategory("Extended")]
[MemoryDiagnoser]
public class ExtendedBenchmarks
{
    [Benchmark]
    public void RareCodePath() { /* ... */ }
}
```
Run Critical benchmarks on every PR; run Extended benchmarks on a nightly schedule.
```yaml
name: Nightly Benchmarks (Full Suite)

on:
  schedule:
    - cron: '0 3 * * *'  # 3 AM UTC daily
  workflow_dispatch:

jobs:
  benchmark-full:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '8.0.x'

      - name: Run full benchmark suite
        # No category filter: runs all benchmarks, including the Extended category
        run: dotnet run -c Release --project benchmarks/MyBenchmarks.csproj -- --exporters json

      - name: Upload full results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-full-${{ github.run_number }}
          path: benchmarks/BenchmarkDotNet.Artifacts/results/
          retention-days: 90
```
For scheduled workflow patterns and matrix builds across TFMs, see [skill:dotnet-gha-patterns].
- Use `Job.ShortRun` in CI, not `Job.Default` -- default benchmark jobs run many iterations for statistical precision, taking 10-30+ minutes per benchmark class. CI pipelines need faster feedback with ShortRun (3 warmup, 5 measured iterations, per the CiConfig above).
- Use `set -euo pipefail` in bash steps -- without `pipefail`, a regression detection script that exits non-zero in a pipeline (e.g., `script | tee`) does not fail the GitHub Actions step (illustrated below).
- Use `continue-on-error: true` on the baseline download step and skip comparison when no baseline exists.
- Enable JSON export with `[JsonExporterAttribute.Full]` or `JsonExporter.Full` in the config.
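A minimal illustration of the `pipefail` bullet (the results paths and log name are illustrative):

```bash
# Without pipefail, the pipeline's exit code is tee's (0), so the step passes
# even though the comparison script detected regressions and exited 1:
python3 scripts/compare-benchmarks.py --baseline ./baseline-results \
  --current ./current-results --threshold 10 | tee comparison.log

# With pipefail, the script's non-zero exit propagates and fails the step:
set -euo pipefail
python3 scripts/compare-benchmarks.py --baseline ./baseline-results \
  --current ./current-results --threshold 10 | tee comparison.log
```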