From everything-claude-trading
> Low-latency trading infrastructure — co-location, FPGA, kernel bypass, network optimization.
In competitive trading strategies (market making, statistical arbitrage, index arbitrage), speed determines profitability. A latency advantage means better queue position at the exchange, a higher fill rate on profitable quotes, and less adverse selection when the market moves.
A typical tick-to-trade path and budget:
Market Data Feed → NIC → Kernel → User Space → Strategy Logic → Order Generation → NIC → Network → Exchange
| Component | Typical Latency | Optimized Latency |
|---|---|---|
| Exchange → NIC | ~1 μs (co-located) | ~0.5 μs |
| NIC → Kernel | ~5-10 μs | bypassed (0) |
| Kernel → User Space | ~2-5 μs | bypassed (0) |
| Market Data Parse | ~1-5 μs | ~0.2 μs (FPGA) |
| Strategy Logic | ~1-10 μs | ~0.1 μs (FPGA) |
| Order Serialization | ~1-2 μs | ~0.1 μs (FPGA) |
| User Space → Kernel | ~2-5 μs | bypassed (0) |
| NIC → Exchange | ~1 μs (co-located) | ~0.5 μs |
| **Total** | **~15-40 μs** | **~1-3 μs (FPGA)** |
| Tier | Technology | Tick-to-Trade | Use Case | Annual Cost |
|---|---|---|---|---|
| 1 | FPGA + kernel bypass | 1-5 μs | Market making, HFT | $1-5M |
| 2 | C++ + kernel bypass | 5-20 μs | Stat arb, fast alpha | $200K-1M |
| 3 | C++ standard | 20-100 μs | Medium-frequency | $50-200K |
| 4 | Java/C# optimized | 100-500 μs | Slower systematic | $20-50K |
| 5 | Python | 1-10 ms | End-of-day strategies | <$20K |
Physical proximity to exchange matching engines:
- NYSE: Mahwah, NJ (NYSE's own data center)
- Nasdaq: Carteret, NJ
- CME: Aurora, IL (CyrusOne)
- CBOE Europe: Slough, UK (Equinix LD4)
- Tokyo Stock Exchange: Tokyo (JPX co-location)
Co-location provides:
- Sub-microsecond network latency to exchange
- Deterministic latency (minimal jitter)
- Access to exchange direct market data feeds
Cost: $5-20K/month per rack, plus cross-connects ($200-500/month each)
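The annual bill is easy to sketch from those figures. A quick back-of-envelope, using mid-range assumptions from the text (one rack at $10K/month, four cross-connects at $350/month each; illustrative, not a quote):

```python
# Rough annual co-location cost from the figures above.
# RACK_MONTHLY and CROSS_CONNECT_MONTHLY are assumed mid-range values.
RACK_MONTHLY = 10_000
CROSS_CONNECT_MONTHLY = 350
N_CROSS_CONNECTS = 4

annual_cost = 12 * (RACK_MONTHLY + N_CROSS_CONNECTS * CROSS_CONNECT_MONTHLY)
print(annual_cost)  # 136800
```

Roughly $137K/year for a single modest deployment, consistent with the ~$150K/year figure used in the ROI discussion below.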
Key consideration: the kernel networking stack accounts for most of the software latency, so the first optimization is to bypass it.
/*
* Standard path: NIC → Kernel TCP/IP stack → User space
* Kernel bypass: NIC → User space directly (via DPDK, Solarflare OpenOnload, Mellanox VMA)
*
* Savings: 5-15 microseconds per message
*/
// Solarflare OpenOnload example (Linux):
// Simply preload the library — application code unchanged
// LD_PRELOAD=libonload.so ./trading_app
// DPDK approach (more control, more effort):
// - Polls NIC directly from user space
// - No interrupts, no context switches
// - Requires dedicated NIC and CPU cores
// Key kernel bypass technologies:
// 1. Solarflare OpenOnload: easiest, socket-compatible, ~5 μs savings
// 2. DPDK (Data Plane Development Kit): most flexible, ~8-12 μs savings
// 3. ef_vi (Solarflare raw API): lowest latency, proprietary
// 4. Mellanox VMA: similar to OpenOnload for Mellanox NICs
// 5. ExaNIC: hardware timestamping + kernel bypass
FPGA (Field-Programmable Gate Array) advantages:
- Parallel processing: process multiple fields simultaneously
- Deterministic latency: no OS jitter, no garbage collection
- Pipeline architecture: overlap parsing, logic, and serialization
Typical FPGA trading pipeline:
┌──────────┐ ┌───────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────┐
│ NIC/PHY │→→│ MD Parser │→→│ Strategy │→→│ Order Build │→→│ NIC/PHY │
│ Receive │ │ (3 clocks)│ │ (5-10 clocks)│ │ (3 clocks) │ │ Transmit │
└──────────┘ └───────────┘ └──────────────┘ └──────────────┘ └──────────┘
Total: ~200-500 ns
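The clock-cycle figures above translate to wall time via the FPGA's fabric clock frequency. A minimal sketch, assuming a 250 MHz clock (a plausible figure for a trading design; the actual frequency is design-specific):

```python
def clocks_to_ns(clocks: int, clock_mhz: float) -> float:
    """Convert FPGA clock cycles to nanoseconds at a given clock frequency."""
    return clocks * 1_000.0 / clock_mhz  # one cycle = 1000/MHz ns

# Parser (3) + strategy (8) + order build (3) = 14 cycles of core logic
core_ns = clocks_to_ns(14, 250.0)
print(core_ns)  # 56.0
```

The core logic itself is tens of nanoseconds; MAC/PHY serialization and the wire account for the rest of the ~200-500 ns total.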
# FPGA vendor landscape for trading:
fpga_vendors = {
    'Xilinx_Alveo': {
        'model': 'U250/U280',
        'interface': 'PCIe Gen3/4',
        'network': '10/25/100 GbE',
        'toolchain': 'Vivado HLS / Vitis',
        'typical_latency': '200-800 ns',
    },
    'Intel_Stratix': {
        'model': 'Stratix 10',
        'interface': 'PCIe Gen3',
        'network': '10/25/100 GbE',
        'toolchain': 'Quartus / OneAPI',
        'typical_latency': '200-600 ns',
    },
    'Algo_Logic': {
        'note': 'Pre-built trading FPGA solutions',
        'market_data': 'Hardware MD parser for major exchanges',
        'order_entry': 'Hardware FIX/OUCH/ITCH engines',
    },
}
// Key C++ optimizations for low-latency trading:
// 1. Lock-free data structures
#include <array>
#include <atomic>
#include <cstddef>

// Single-producer single-consumer lock-free ring buffer
template<typename T, size_t N>
class SPSCQueue {
    std::array<T, N> buffer;
    alignas(64) std::atomic<size_t> head{0}; // consumer index, own cache line
    alignas(64) std::atomic<size_t> tail{0}; // producer index, own cache line
public:
    bool push(const T& v) {                  // producer thread only
        size_t t = tail.load(std::memory_order_relaxed);
        size_t next = (t + 1) % N;
        if (next == head.load(std::memory_order_acquire)) return false; // full
        buffer[t] = v;
        tail.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& v) {                         // consumer thread only
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;    // empty
        v = buffer[h];
        head.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
// 2. Memory pre-allocation (no malloc in hot path)
// Pre-allocate all objects at startup, use object pools
// 3. CPU pinning and isolation
// isolcpus=2,3,4,5 in kernel boot params
// pthread_setaffinity_np() to pin trading threads
// 4. NUMA awareness
// Ensure memory is allocated on the same NUMA node as the CPU
// numactl --cpunodebind=0 --membind=0 ./trading_app
// 5. Huge pages (reduce TLB misses)
// echo 1024 > /proc/sys/vm/nr_hugepages
// mmap with MAP_HUGETLB
// 6. Compiler optimizations
// -O3 -march=native -flto -fno-exceptions
// Profile-guided optimization (PGO): -fprofile-generate / -fprofile-use
// 7. Avoid branch mispredictions
// Use branchless code: result = (a > b) * a + (a <= b) * b
// __builtin_expect() for likely/unlikely branches
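The branchless max identity above is easy to sanity-check. A quick sketch in Python, where booleans multiply as 0/1, which is the same trick the C++ expression relies on:

```python
def branchless_max(a: int, b: int) -> int:
    # (a > b) and (a <= b) evaluate to 0/1, selecting exactly one operand
    return (a > b) * a + (a <= b) * b

print(branchless_max(3, 5), branchless_max(5, 3))  # 5 5
```

In C++ the same pattern compiles to conditional moves rather than branches, so there is nothing for the branch predictor to mispredict.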
def latency_measurement_framework():
    """
    Accurate latency measurement requires:
    1. Hardware timestamping (NIC or FPGA, not system clock)
    2. Nanosecond precision (TSC, RDTSC, or PTP)
    3. Measure at multiple points in the pipeline
    4. Statistical analysis of the distribution (not just mean)
    """
    # Key metrics:
    metrics = {
        'median_latency': 'P50 — typical case',
        'p99_latency': 'P99 — tail latency (critical for HFT)',
        'p999_latency': 'P99.9 — extreme tail',
        'jitter': 'P99 - P50 — consistency matters as much as speed',
        'max_latency': 'Worst case — often caused by OS jitter',
    }
    # Sources of jitter (latency variance):
    jitter_sources = {
        'context_switches': 'Solution: CPU isolation, RT kernel',
        'interrupts': 'Solution: IRQ affinity, move interrupts off trading cores',
        'TLB_misses': 'Solution: huge pages',
        'cache_misses': 'Solution: keep hot data in L1/L2, avoid false sharing',
        'GC_pauses': 'Solution: avoid GC languages, or tune GC (Zing JVM)',
        'page_faults': 'Solution: mlockall(), pre-touch memory',
        'NUMA_remote': 'Solution: NUMA-aware allocation',
        'power_management': 'Solution: disable C-states, set governor to performance',
    }
    return metrics, jitter_sources
# Hardware timestamp capture:
# On Linux with Solarflare NIC:
# - Use SO_TIMESTAMPING socket option for hardware RX/TX timestamps
# - Accuracy: ~5 nanoseconds
# - Compare hardware timestamps to measure true wire-to-wire latency
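A minimal sketch of the distribution analysis described above, using nearest-rank percentiles over a list of latency samples (the sample data here is synthetic):

```python
def percentile(sorted_samples, p):
    """Nearest-rank percentile of an already-sorted list (p in [0, 100])."""
    idx = int(p / 100 * (len(sorted_samples) - 1))
    return sorted_samples[idx]

samples = sorted(range(1, 1001))  # stand-in for measured latencies (ns)
p50 = percentile(samples, 50)     # 500
p99 = percentile(samples, 99)     # 990
jitter = p99 - p50                # 490 -- the P99 - P50 consistency metric
print(p50, p99, jitter)
```

In production you would feed this hardware timestamps rather than synthetic values, and track the full histogram, not just two quantiles.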
def market_data_architecture():
    """
    Market data feed handling is often the latency bottleneck.
    """
    approaches = {
        'consolidated_feed': {
            'source': 'SIP (US), CTA/UTP',
            'latency': '500 μs - 1 ms (too slow for HFT)',
            'use_case': 'Compliance, slow strategies',
        },
        'direct_feed': {
            'source': 'Exchange proprietary (ITCH, PITCH, XDP)',
            'latency': '1-10 μs (co-located)',
            'use_case': 'Market making, fast strategies',
            'formats': {
                'nasdaq': 'ITCH 5.0 (binary, efficient)',
                'nyse': 'XDP (binary)',
                'bats_cboe': 'PITCH (binary)',
                'cme': 'MDP 3.0 (binary, multicast)',
            },
        },
        'fpga_parsed': {
            'source': 'Direct feed parsed in FPGA',
            'latency': '100-500 ns from NIC to parsed message',
            'use_case': 'HFT, sub-microsecond strategies',
        },
    }
    # Feed handler optimization:
    # 1. Parse only fields you need (skip unused fields)
    # 2. Use fixed-offset parsing (binary protocols)
    # 3. Multicast: join specific groups, filter early
    # 4. Gap detection and recovery without blocking
    # 5. Book building: incremental updates, not full rebuilds
    return approaches
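Fixed-offset parsing of a binary protocol can be sketched with Python's struct module. The message layout here is hypothetical and ITCH-flavored, not any exchange's actual wire format: a 1-byte type, a uint16 stock locate, and a uint32 fixed-point price, all big-endian.

```python
import struct

# Hypothetical fixed layout: type(1) | stock_locate(2) | price(4), big-endian
ADD_ORDER = struct.Struct(">cHI")

def parse_add_order(buf: bytes):
    """Decode fields at fixed offsets -- no scanning, no delimiters."""
    msg_type, locate, price_raw = ADD_ORDER.unpack_from(buf, 0)
    return msg_type, locate, price_raw / 10_000  # price in 1e-4 fixed point

wire = struct.pack(">cHI", b"A", 42, 1_234_500)
print(parse_add_order(wire))  # (b'A', 42, 123.45)
```

Because every field sits at a compile-time-known offset, a C++ or FPGA version of the same parse is a handful of loads with no data-dependent control flow.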
budget = {
    'market_data_receive': 0.5,   # μs (co-located, direct feed)
    'md_parse_fpga': 0.3,         # μs (FPGA parsing)
    'strategy_logic': 1.0,        # μs (quote update calculation)
    'order_build': 0.2,           # μs
    'network_to_exchange': 0.5,   # μs (co-located)
    'exchange_processing': 5.0,   # μs (exchange matching engine)
}
total = sum(budget.values())      # 7.5 μs wire-to-wire
# Competitive for equity market making in 2024
# Is the latency investment worth it?
# Revenue model for market making:
spread_capture = 0.5 # cents per share (half the spread)
volume_per_day = 1_000_000 # shares
trading_days = 252
gross_revenue = spread_capture * volume_per_day * trading_days / 100 # $1.26M/year
# Faster latency → better queue position → higher fill rate → more revenue
# 10 μs improvement might increase fill rate by 5-15% → $60-190K additional revenue
# Co-location cost: ~$150K/year
# FPGA development: ~$500K one-time + $200K/year maintenance
# Breakeven: depends on strategy capacity and competition
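The back-of-envelope above fits in one sketch. All inputs are the illustrative figures from this section (spread capture, volume, and the assumed 5-15% fill-rate uplift), not market data:

```python
spread_capture_cents = 0.5   # per share (half the spread)
shares_per_day = 1_000_000
trading_days = 252

gross_revenue = spread_capture_cents * shares_per_day * trading_days / 100
# Assumed fill-rate uplift from a ~10 us latency improvement
uplift_low, uplift_high = gross_revenue * 0.05, gross_revenue * 0.15

print(f"gross ${gross_revenue:,.0f}, uplift ${uplift_low:,.0f}-${uplift_high:,.0f}")
# -> gross $1,260,000, uplift $63,000-$189,000
```

At these numbers the uplift roughly covers co-location (~$150K/year) at the high end, but not a year-one FPGA build (~$700K), which is why the breakeven depends on strategy capacity and how contested the queue is.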