Gating CI on perf regressions. Automated threshold alerts, baseline tracking, trend reports.
Continuous benchmarking guidance for detecting performance regressions in CI pipelines. Covers baseline file management with BenchmarkDotNet JSON exporters, GitHub Actions workflows for artifact-based baseline comparison, regression detection patterns with configurable thresholds, and alerting strategies for performance degradation.
Version assumptions: BenchmarkDotNet v0.14+ for JSON export, GitHub Actions runner environment. Examples use actions/upload-artifact@v4 and actions/download-artifact@v4.
Out of scope: BenchmarkDotNet setup, benchmark class design, memory diagnosers, and common pitfalls are owned by this epic's companion skill -- see [skill:dotnet-benchmarkdotnet]. Performance-oriented architecture patterns are owned by [skill:dotnet-performance-patterns]. Profiling tools (dotnet-counters, dotnet-trace, dotnet-dump) are covered by dotnet-profiling. OpenTelemetry metrics collection and distributed tracing -- see [skill:dotnet-observability]. Composable CI/CD workflow design and matrix build strategies -- see [skill:dotnet-gha-patterns]. Architecture patterns (caching, resilience) -- see [skill:dotnet-architecture-patterns].
Cross-references: [skill:dotnet-benchmarkdotnet] for benchmark class setup and JSON exporter configuration, [skill:dotnet-observability] for correlating benchmark regressions with runtime metrics changes, [skill:dotnet-gha-patterns] for composable workflow patterns (reusable workflows, composite actions, matrix builds).
BenchmarkDotNet's JSON exporter produces machine-readable results for automated comparison. Configure the exporter in benchmark classes:
```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Exporters.Json;

[JsonExporterAttribute.Full]
[MemoryDiagnoser]
public class CriticalPathBenchmarks
{
    [Benchmark(Baseline = true)]
    public void ProcessOrder() { /* ... */ }

    [Benchmark]
    public void ProcessOrderOptimized() { /* ... */ }
}
```
Or apply the exporter to all benchmark classes via a custom config:
```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Exporters.Json;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

var config = ManualConfig.Create(DefaultConfig.Instance)
    .AddJob(Job.ShortRun) // fewer iterations for CI speed
    .AddExporter(JsonExporter.Full)
    .WithArtifactsPath("./benchmark-results");

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args, config);
```
The exported JSON file (`*-report-full.json`) contains structured benchmark results:

```json
{
  "Title": "CriticalPathBenchmarks",
  "Benchmarks": [
    {
      "FullName": "MyApp.Benchmarks.CriticalPathBenchmarks.ProcessOrder",
      "Statistics": {
        "Mean": 1234.5678,
        "Median": 1230.1234,
        "StandardDeviation": 15.234,
        "StandardError": 4.812
      },
      "Memory": {
        "BytesAllocatedPerOperation": 1024,
        "Gen0Collections": 0.0012,
        "Gen1Collections": 0,
        "Gen2Collections": 0
      }
    }
  ]
}
```
Key fields for regression comparison:
| Field | Purpose |
|---|---|
| `Statistics.Mean` | Average execution time (nanoseconds) |
| `Statistics.Median` | Middle execution time (more robust to outliers) |
| `Statistics.StandardDeviation` | Measurement variability |
| `Memory.BytesAllocatedPerOperation` | GC allocation per operation |
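As a worked example of the percentage comparison these fields feed, take the sample report's mean against a hypothetical current run (the current value and the 10% threshold are illustrative):

```python
baseline_mean = 1234.5678  # ns, Statistics.Mean from the sample report above
current_mean = 1400.0      # ns, hypothetical value from a new run
change_pct = (current_mean - baseline_mean) / baseline_mean * 100
print(f"{change_pct:+.1f}%")  # +13.4% -> exceeds a 10% threshold, flagged
```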
Baseline storage strategies compared:

| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Git-committed baseline file | Versioned, auditable, no external deps | Repo size grows; must update deliberately | Small benchmark suites, stable hardware |
| GitHub Actions artifacts | No repo bloat; automatic retention | 90-day default retention; cross-workflow access requires tokens | Large benchmark suites, shared runners |
| External storage (S3/Azure Blob) | Unlimited history; cross-repo sharing | Extra infrastructure; credential management | Multi-repo benchmark comparison |
This skill focuses on the GitHub Actions artifact strategy as the default. For composable workflow patterns and reusable actions, see [skill:dotnet-gha-patterns].
```yaml
name: Benchmarks

on:
  pull_request:
    paths:
      - 'src/**'
      - 'benchmarks/**'
  workflow_dispatch:

permissions:
  contents: read
  actions: read  # required for artifact download

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '8.0.x'

      - name: Run benchmarks
        run: dotnet run -c Release --project benchmarks/MyBenchmarks.csproj -- --exporters json

      - name: Upload benchmark results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results-${{ github.sha }}
          path: benchmarks/BenchmarkDotNet.Artifacts/results/
          retention-days: 90
```
This workflow downloads the baseline from a previous run and compares against current results:
```yaml
name: Benchmark Regression Check

on:
  pull_request:
    paths:
      - 'src/**'
      - 'benchmarks/**'
  push:  # main merges refresh the baseline (see upload step below)
    branches: [main]
    paths:
      - 'src/**'
      - 'benchmarks/**'

permissions:
  contents: read
  actions: read

env:
  BENCHMARK_PROJECT: benchmarks/MyBenchmarks.csproj
  RESULTS_DIR: benchmarks/BenchmarkDotNet.Artifacts/results

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '8.0.x'

      # Note: download-artifact@v4 only sees artifacts from the current workflow
      # run by default; fetching a baseline produced by an earlier run requires
      # the run-id and github-token inputs (or a lookup step to resolve the run id).
      - name: Download baseline results
        uses: actions/download-artifact@v4
        with:
          name: benchmark-baseline
          path: ./baseline-results
        continue-on-error: true
        id: download-baseline

      - name: Run benchmarks
        run: dotnet run -c Release --project ${{ env.BENCHMARK_PROJECT }} -- --exporters json

      - name: Compare with baseline
        if: steps.download-baseline.outcome == 'success'
        shell: bash
        run: |
          set -euo pipefail
          python3 scripts/compare-benchmarks.py \
            --baseline ./baseline-results \
            --current "${{ env.RESULTS_DIR }}" \
            --threshold 10 \
            --output benchmark-comparison.md

      - name: Upload current results as new baseline
        if: github.ref == 'refs/heads/main'
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-baseline
          path: ${{ env.RESULTS_DIR }}/
          retention-days: 90
          overwrite: true

      - name: Upload comparison report
        if: steps.download-baseline.outcome == 'success'
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-comparison-${{ github.sha }}
          path: benchmark-comparison.md
          retention-days: 30
```
Key design decisions:
- `continue-on-error: true` on the baseline download handles the first run, when no baseline exists yet
- The baseline upload runs only on `main` (the `push` trigger plus the `github.ref` check), preventing PR branches from polluting the baseline
- `overwrite: true` replaces the previous baseline artifact

For converting these inline workflows into reusable `workflow_call` patterns, see [skill:dotnet-gha-patterns].
Compare current benchmark results against baseline using percentage thresholds. A regression is flagged when the current mean exceeds the baseline mean by more than the configured threshold:
```python
#!/usr/bin/env python3
"""compare-benchmarks.py -- Detect benchmark regressions from BenchmarkDotNet JSON exports."""
import argparse
import json
import sys
from pathlib import Path


def load_benchmarks(results_dir: str) -> dict:
    """Load benchmark results from BenchmarkDotNet JSON export files."""
    benchmarks = {}
    for json_file in Path(results_dir).glob("*-report-full.json"):
        with open(json_file) as f:
            data = json.load(f)
        for bm in data.get("Benchmarks", []):
            name = bm["FullName"]
            benchmarks[name] = {
                "mean": bm["Statistics"]["Mean"],
                "median": bm["Statistics"]["Median"],
                "stddev": bm["Statistics"]["StandardDeviation"],
                "allocated": bm.get("Memory", {}).get("BytesAllocatedPerOperation", 0),
            }
    return benchmarks


def compare(baseline_dir: str, current_dir: str, threshold_pct: float) -> list:
    """Compare current results against baseline. Returns list of regressions."""
    baseline = load_benchmarks(baseline_dir)
    current = load_benchmarks(current_dir)
    regressions = []
    for name, curr in current.items():
        if name not in baseline:
            continue  # new benchmark, no comparison possible
        base = baseline[name]
        if base["mean"] == 0:
            continue  # avoid division by zero
        time_change_pct = ((curr["mean"] - base["mean"]) / base["mean"]) * 100
        alloc_change = curr["allocated"] - base["allocated"]
        if time_change_pct > threshold_pct:
            regressions.append({
                "name": name,
                "baseline_mean": base["mean"],
                "current_mean": curr["mean"],
                "change_pct": time_change_pct,
                "alloc_change": alloc_change,
            })
    return regressions


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Compare BenchmarkDotNet results")
    parser.add_argument("--baseline", required=True, help="Path to baseline results directory")
    parser.add_argument("--current", required=True, help="Path to current results directory")
    parser.add_argument("--threshold", type=float, default=10.0,
                        help="Regression threshold percentage (default: 10)")
    parser.add_argument("--output", default="comparison.md", help="Output markdown file")
    args = parser.parse_args()

    regressions = compare(args.baseline, args.current, args.threshold)

    with open(args.output, "w") as f:
        if regressions:
            f.write("## Benchmark Regressions Detected\n\n")
            f.write("| Benchmark | Baseline (ns) | Current (ns) | Change | Alloc Delta |\n")
            f.write("|-----------|--------------|-------------|--------|-------------|\n")
            for r in regressions:
                f.write(f"| `{r['name']}` | {r['baseline_mean']:.2f} | "
                        f"{r['current_mean']:.2f} | +{r['change_pct']:.1f}% | "
                        f"{r['alloc_change']:+d} B |\n")
            f.write(f"\nThreshold: {args.threshold}%\n")
        else:
            f.write("## Benchmark Results\n\nNo regressions detected ")
            f.write(f"(threshold: {args.threshold}%).\n")

    if regressions:
        print(f"REGRESSION: {len(regressions)} benchmark(s) exceeded "
              f"{args.threshold}% threshold", file=sys.stderr)
        sys.exit(1)
```
| Environment | Suggested Threshold | Rationale |
|---|---|---|
| Dedicated benchmark hardware | 5% | Low noise floor; small regressions are signal |
| GitHub Actions shared runners | 10-15% | Shared runners introduce 5-10% variance from noisy neighbors |
| Self-hosted runners | 5-10% | More stable than shared, but still monitor variance |
Calibrate thresholds empirically: Run the same benchmark suite 5-10 times on your CI environment without code changes. The maximum observed variance sets your noise floor. Set the threshold above this noise floor (typically 2x the observed variance).
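A minimal sketch of that calibration, assuming each repeated run exported its `*-report-full.json` files into a hypothetical `noise-runs/run-N` directory (the layout and script are illustrative, not part of the skill's scripts):

```python
"""Estimate the CI noise floor from repeated runs of an unchanged benchmark suite."""
import json
from pathlib import Path

def run_means(results_dir: Path) -> dict:
    """Mean execution time (ns) per benchmark from one run's JSON exports."""
    means = {}
    for json_file in results_dir.glob("*-report-full.json"):
        data = json.loads(json_file.read_text())
        for bm in data.get("Benchmarks", []):
            means[bm["FullName"]] = bm["Statistics"]["Mean"]
    return means

# Assumes the 5-10 identical runs exported to noise-runs/run-1, run-2, ...
runs = [run_means(d) for d in sorted(Path("noise-runs").glob("run-*"))]
names = set.intersection(*(set(r) for r in runs))

noise_floor = 0.0
for name in names:
    samples = [r[name] for r in runs]
    spread_pct = (max(samples) - min(samples)) / min(samples) * 100
    noise_floor = max(noise_floor, spread_pct)

print(f"Observed noise floor: {noise_floor:.1f}%")
print(f"Suggested threshold:  {2 * noise_floor:.1f}% (2x observed variance)")
```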
Memory allocation regressions are more reliable signals than timing regressions because allocations are deterministic (not affected by noisy neighbors):
```python
# Add to the compare() function, alongside the timing check:
if alloc_change > 0:
    regressions.append({
        "name": name,
        "type": "allocation",
        "baseline_alloc": base["allocated"],
        "current_alloc": curr["allocated"],
        "alloc_change": alloc_change,
    })
```
Use allocation changes as a hard gate (zero tolerance for new allocations in zero-alloc paths) and timing changes as a soft gate (warning with threshold).
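A sketch of that split, layered on the `regressions` list from `compare()` and the `type` field from the allocation snippet above (`::warning::`/`::error::` are GitHub Actions workflow commands that surface annotations; the exit code is what actually fails the job):

```python
# Hypothetical gating logic: allocation regressions fail the job (hard gate),
# timing regressions only produce warning annotations (soft gate).
import sys

hard = [r for r in regressions if r.get("type") == "allocation"]
soft = [r for r in regressions if r.get("type") != "allocation"]

for r in soft:
    # ::warning:: creates an annotation without failing the step
    print(f"::warning::{r['name']} slowed by {r['change_pct']:.1f}%")

for r in hard:
    print(f"::error::{r['name']} allocates {r['alloc_change']:+d} more bytes/op")

if hard:
    sys.exit(1)  # only allocation regressions gate the merge
```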
Post benchmark comparison results as a PR comment for reviewer visibility:
```yaml
# Posting the comment requires a token with the pull-requests: write permission.
- name: Comment PR with results
  if: steps.download-baseline.outcome == 'success' && github.event_name == 'pull_request'
  uses: actions/github-script@v7
  with:
    script: |
      const fs = require('fs');
      const body = fs.readFileSync('benchmark-comparison.md', 'utf8');
      await github.rest.issues.createComment({
        owner: context.repo.owner,
        repo: context.repo.repo,
        issue_number: context.issue.number,
        body: body
      });
```
Exit with non-zero status from the comparison script to fail the GitHub Actions job. This prevents merging PRs that introduce performance regressions:
```yaml
- name: Check for regressions
  if: steps.download-baseline.outcome == 'success'
  shell: bash
  run: |
    set -euo pipefail
    python3 scripts/compare-benchmarks.py \
      --baseline ./baseline-results \
      --current "${{ env.RESULTS_DIR }}" \
      --threshold 10
    # Script exits non-zero if regressions found -- fails the job
```
For required status checks and branch protection integration with benchmark gates, see [skill:dotnet-gha-patterns].
For long-term trend analysis beyond single-PR comparison, upload results to a persistent store and track metrics over time:
| Approach | Tool | Complexity |
|---|---|---|
| GitHub Actions artifacts | Built-in, 90-day retention | Low -- artifact download/upload only |
| GitHub Pages with benchmark-action | benchmark-action/github-action-benchmark@v1 | Medium -- auto-generates trend charts |
| External time-series DB | InfluxDB, Prometheus + Grafana | High -- full observability stack |
The simplest approach for most projects is the artifact-based baseline comparison shown in this skill. Graduate to trend tracking when you need historical regression analysis across many releases.
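For the middle option, a hedged sketch of a trend-tracking step (input names follow benchmark-action/github-action-benchmark's documented schema; the report path is illustrative):

```yaml
- name: Track benchmark trends
  uses: benchmark-action/github-action-benchmark@v1
  with:
    tool: 'benchmarkdotnet'
    output-file-path: benchmarks/BenchmarkDotNet.Artifacts/results/CriticalPathBenchmarks-report-full.json
    github-token: ${{ secrets.GITHUB_TOKEN }}
    auto-push: true          # publishes trend charts to the gh-pages branch
    alert-threshold: '150%'  # alert when 1.5x slower than the previous run
    comment-on-alert: true
```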
Full benchmark runs take 10-30+ minutes. Use Job.ShortRun in CI to reduce iteration counts while retaining regression detection capability:
```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

public class CiConfig : ManualConfig
{
    public CiConfig()
    {
        AddJob(Job.ShortRun
            .WithWarmupCount(3)
            .WithIterationCount(5)
            .WithInvocationCount(1)
            .WithUnrollFactor(1)); // InvocationCount must be a multiple of UnrollFactor (default 16)
        AddExporter(BenchmarkDotNet.Exporters.Json.JsonExporter.Full);
    }
}
```
Apply conditionally based on environment:
```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Running;

// Declared as IConfig: the ternary arms (CiConfig, DefaultConfig) share no
// best common type, so var inference would fail to compile.
IConfig config = Environment.GetEnvironmentVariable("CI") is not null
    ? new CiConfig()
    : DefaultConfig.Instance;

BenchmarkRunner.Run<CriticalPathBenchmarks>(config);
```
Run only critical-path benchmarks in CI to reduce pipeline duration:
```bash
# Run only benchmarks in the "Critical" category
# (--anyCategories filters by [BenchmarkCategory]; --filter matches by name glob)
dotnet run -c Release --project benchmarks/MyBenchmarks.csproj -- \
  --anyCategories Critical --exporters json
```
```csharp
using BenchmarkDotNet.Attributes;

[BenchmarkCategory("Critical")]
[MemoryDiagnoser]
[JsonExporterAttribute.Full]
public class CriticalPathBenchmarks
{
    [Benchmark]
    public void ProcessOrder() { /* ... */ }
}

[BenchmarkCategory("Extended")]
[MemoryDiagnoser]
public class ExtendedBenchmarks
{
    [Benchmark]
    public void RareCodePath() { /* ... */ }
}
```
Run Critical benchmarks on every PR; run Extended benchmarks on a nightly schedule.
```yaml
name: Nightly Benchmarks (Full Suite)

on:
  schedule:
    - cron: '0 3 * * *'  # 3 AM UTC daily
  workflow_dispatch:

jobs:
  benchmark-full:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '8.0.x'

      - name: Run full benchmark suite
        # No category filter: runs all benchmarks, including the Extended category
        run: dotnet run -c Release --project benchmarks/MyBenchmarks.csproj -- --exporters json

      - name: Upload full results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-full-${{ github.run_number }}
          path: benchmarks/BenchmarkDotNet.Artifacts/results/
          retention-days: 90
```
For scheduled workflow patterns and matrix builds across TFMs, see [skill:dotnet-gha-patterns].
- Use `Job.ShortRun` in CI, not `Job.Default` -- default benchmark jobs run many iterations for statistical precision, taking 10-30+ minutes per benchmark class. CI pipelines need faster feedback with ShortRun (3 warmup, 5 measured iterations, per the CiConfig above).
- Use `set -euo pipefail` in bash steps -- without `pipefail`, a regression detection script that exits non-zero in a pipeline (e.g., `script | tee`) does not fail the GitHub Actions step (illustrated below).
- Use `continue-on-error: true` on the baseline download step and skip comparison when no baseline exists.
- Enable JSON export with `[JsonExporterAttribute.Full]` or `JsonExporter.Full` in the config.
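A minimal illustration of the `pipefail` bullet (the results paths and log name are illustrative):

```bash
# Without pipefail, the pipeline's exit code is tee's (0), so the step passes
# even though the comparison script detected regressions and exited 1:
python3 scripts/compare-benchmarks.py --baseline ./baseline-results \
  --current ./current-results --threshold 10 | tee comparison.log

# With pipefail, the script's non-zero exit propagates and fails the step:
set -euo pipefail
python3 scripts/compare-benchmarks.py --baseline ./baseline-results \
  --current ./current-results --threshold 10 | tee comparison.log
```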