From everything-claude-trading
> Low-latency trading infrastructure — co-location, FPGA, kernel bypass, network optimization.
In competitive trading strategies (market making, statistical arbitrage, index arbitrage), speed determines profitability. A latency advantage means better queue position at the exchange, a higher fill rate on profitable quotes, and less adverse selection when the market moves.
A typical tick-to-trade path and budget:
Market Data Feed → NIC → Kernel → User Space → Strategy Logic → Order Generation → NIC → Network → Exchange
| Component | Typical Latency | Optimized Latency |
|---|---|---|
| Exchange → NIC | ~1 μs (co-located) | ~0.5 μs |
| NIC → Kernel | ~5-10 μs | bypassed (0) |
| Kernel → User Space | ~2-5 μs | bypassed (0) |
| Market Data Parse | ~1-5 μs | ~0.2 μs (FPGA) |
| Strategy Logic | ~1-10 μs | ~0.1 μs (FPGA) |
| Order Serialization | ~1-2 μs | ~0.1 μs (FPGA) |
| User Space → Kernel | ~2-5 μs | bypassed (0) |
| NIC → Exchange | ~1 μs (co-located) | ~0.5 μs |
| **Total** | **~15-40 μs** | **~1-3 μs (FPGA)** |
| Tier | Technology | Tick-to-Trade | Use Case | Annual Cost |
|---|---|---|---|---|
| 1 | FPGA + kernel bypass | 1-5 μs | Market making, HFT | $1-5M |
| 2 | C++ + kernel bypass | 5-20 μs | Stat arb, fast alpha | $200K-1M |
| 3 | C++ standard | 20-100 μs | Medium-frequency | $50-200K |
| 4 | Java/C# optimized | 100-500 μs | Slower systematic | $20-50K |
| 5 | Python | 1-10 ms | End-of-day strategies | <$20K |
Physical proximity to exchange matching engines:
- NYSE: Mahwah, NJ (NYSE's own data center)
- Nasdaq: Carteret, NJ
- CME: Aurora, IL (CyrusOne)
- CBOE Europe: Slough, UK (Equinix LD4)
- Tokyo Stock Exchange: Tokyo (JPX co-location)
Co-location provides:
- Sub-microsecond network latency to exchange
- Deterministic latency (minimal jitter)
- Access to exchange direct market data feeds
Cost: $5-20K/month per rack, plus cross-connects ($200-500/month each)
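The annual bill is easy to sketch from those figures. A quick back-of-envelope, using mid-range assumptions from the text (one rack at $10K/month, four cross-connects at $350/month each; illustrative, not a quote):

```python
# Rough annual co-location cost from the figures above.
# RACK_MONTHLY and CROSS_CONNECT_MONTHLY are assumed mid-range values.
RACK_MONTHLY = 10_000
CROSS_CONNECT_MONTHLY = 350
N_CROSS_CONNECTS = 4

annual_cost = 12 * (RACK_MONTHLY + N_CROSS_CONNECTS * CROSS_CONNECT_MONTHLY)
print(annual_cost)  # 136800
```

Roughly $137K/year for a single modest deployment, consistent with the ~$150K/year figure used in the ROI discussion below.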
Key consideration: the kernel networking stack accounts for most of the software latency, so the first optimization is to bypass it.
/*
* Standard path: NIC → Kernel TCP/IP stack → User space
* Kernel bypass: NIC → User space directly (via DPDK, Solarflare OpenOnload, Mellanox VMA)
*
* Savings: 5-15 microseconds per message
*/
// Solarflare OpenOnload example (Linux):
// Simply preload the library — application code unchanged
// LD_PRELOAD=libonload.so ./trading_app
// DPDK approach (more control, more effort):
// - Polls NIC directly from user space
// - No interrupts, no context switches
// - Requires dedicated NIC and CPU cores
// Key kernel bypass technologies:
// 1. Solarflare OpenOnload: easiest, socket-compatible, ~5 μs savings
// 2. DPDK (Data Plane Development Kit): most flexible, ~8-12 μs savings
// 3. ef_vi (Solarflare raw API): lowest latency, proprietary
// 4. Mellanox VMA: similar to OpenOnload for Mellanox NICs
// 5. ExaNIC: hardware timestamping + kernel bypass
FPGA (Field-Programmable Gate Array) advantages:
- Parallel processing: process multiple fields simultaneously
- Deterministic latency: no OS jitter, no garbage collection
- Pipeline architecture: overlap parsing, logic, and serialization
Typical FPGA trading pipeline:
┌──────────┐ ┌───────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────┐
│ NIC/PHY │→→│ MD Parser │→→│ Strategy │→→│ Order Build │→→│ NIC/PHY │
│ Receive │ │ (3 clocks)│ │ (5-10 clocks)│ │ (3 clocks) │ │ Transmit │
└──────────┘ └───────────┘ └──────────────┘ └──────────────┘ └──────────┘
Total: ~200-500 ns
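The clock-cycle figures above translate to wall time via the FPGA's fabric clock frequency. A minimal sketch, assuming a 250 MHz clock (a plausible figure for a trading design; the actual frequency is design-specific):

```python
def clocks_to_ns(clocks: int, clock_mhz: float) -> float:
    """Convert FPGA clock cycles to nanoseconds at a given clock frequency."""
    return clocks * 1_000.0 / clock_mhz  # one cycle = 1000/MHz ns

# Parser (3) + strategy (8) + order build (3) = 14 cycles of core logic
core_ns = clocks_to_ns(14, 250.0)
print(core_ns)  # 56.0
```

The core logic itself is tens of nanoseconds; MAC/PHY serialization and the wire account for the rest of the ~200-500 ns total.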
# FPGA vendor landscape for trading:
fpga_vendors = {
    'Xilinx_Alveo': {
        'model': 'U250/U280',
        'interface': 'PCIe Gen3/4',
        'network': '10/25/100 GbE',
        'toolchain': 'Vivado HLS / Vitis',
        'typical_latency': '200-800 ns',
    },
    'Intel_Stratix': {
        'model': 'Stratix 10',
        'interface': 'PCIe Gen3',
        'network': '10/25/100 GbE',
        'toolchain': 'Quartus / OneAPI',
        'typical_latency': '200-600 ns',
    },
    'Algo_Logic': {
        'note': 'Pre-built trading FPGA solutions',
        'market_data': 'Hardware MD parser for major exchanges',
        'order_entry': 'Hardware FIX/OUCH/ITCH engines',
    },
}
// Key C++ optimizations for low-latency trading:
// 1. Lock-free data structures
#include <array>
#include <atomic>
#include <cstddef>

// Single-producer single-consumer lock-free ring buffer
template<typename T, size_t N>
class SPSCQueue {
    std::array<T, N> buffer;
    alignas(64) std::atomic<size_t> head{0}; // consumer index, own cache line
    alignas(64) std::atomic<size_t> tail{0}; // producer index, own cache line
public:
    bool push(const T& v) {                  // producer thread only
        size_t t = tail.load(std::memory_order_relaxed);
        size_t next = (t + 1) % N;
        if (next == head.load(std::memory_order_acquire)) return false; // full
        buffer[t] = v;
        tail.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& v) {                         // consumer thread only
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;    // empty
        v = buffer[h];
        head.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
// 2. Memory pre-allocation (no malloc in hot path)
// Pre-allocate all objects at startup, use object pools
// 3. CPU pinning and isolation
// isolcpus=2,3,4,5 in kernel boot params
// pthread_setaffinity_np() to pin trading threads
// 4. NUMA awareness
// Ensure memory is allocated on the same NUMA node as the CPU
// numactl --cpunodebind=0 --membind=0 ./trading_app
// 5. Huge pages (reduce TLB misses)
// echo 1024 > /proc/sys/vm/nr_hugepages
// mmap with MAP_HUGETLB
// 6. Compiler optimizations
// -O3 -march=native -flto -fno-exceptions
// Profile-guided optimization (PGO): -fprofile-generate / -fprofile-use
// 7. Avoid branch mispredictions
// Use branchless code: result = (a > b) * a + (a <= b) * b
// __builtin_expect() for likely/unlikely branches
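The branchless max identity above is easy to sanity-check. A quick sketch in Python, where booleans multiply as 0/1, which is the same trick the C++ expression relies on:

```python
def branchless_max(a: int, b: int) -> int:
    # (a > b) and (a <= b) evaluate to 0/1, selecting exactly one operand
    return (a > b) * a + (a <= b) * b

print(branchless_max(3, 5), branchless_max(5, 3))  # 5 5
```

In C++ the same pattern compiles to conditional moves rather than branches, so there is nothing for the branch predictor to mispredict.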
def latency_measurement_framework():
    """
    Accurate latency measurement requires:
    1. Hardware timestamping (NIC or FPGA, not system clock)
    2. Nanosecond precision (TSC, RDTSC, or PTP)
    3. Measure at multiple points in the pipeline
    4. Statistical analysis of the distribution (not just mean)
    """
    # Key metrics:
    metrics = {
        'median_latency': 'P50 — typical case',
        'p99_latency': 'P99 — tail latency (critical for HFT)',
        'p999_latency': 'P99.9 — extreme tail',
        'jitter': 'P99 - P50 — consistency matters as much as speed',
        'max_latency': 'Worst case — often caused by OS jitter',
    }
    # Sources of jitter (latency variance):
    jitter_sources = {
        'context_switches': 'Solution: CPU isolation, RT kernel',
        'interrupts': 'Solution: IRQ affinity, move interrupts off trading cores',
        'TLB_misses': 'Solution: huge pages',
        'cache_misses': 'Solution: keep hot data in L1/L2, avoid false sharing',
        'GC_pauses': 'Solution: avoid GC languages, or tune GC (Zing JVM)',
        'page_faults': 'Solution: mlockall(), pre-touch memory',
        'NUMA_remote': 'Solution: NUMA-aware allocation',
        'power_management': 'Solution: disable C-states, set governor to performance',
    }
    return metrics, jitter_sources
# Hardware timestamp capture:
# On Linux with Solarflare NIC:
# - Use SO_TIMESTAMPING socket option for hardware RX/TX timestamps
# - Accuracy: ~5 nanoseconds
# - Compare hardware timestamps to measure true wire-to-wire latency
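A minimal sketch of the distribution analysis described above, using nearest-rank percentiles over a list of latency samples (the sample data here is synthetic):

```python
def percentile(sorted_samples, p):
    """Nearest-rank percentile of an already-sorted list (p in [0, 100])."""
    idx = int(p / 100 * (len(sorted_samples) - 1))
    return sorted_samples[idx]

samples = sorted(range(1, 1001))  # stand-in for measured latencies (ns)
p50 = percentile(samples, 50)     # 500
p99 = percentile(samples, 99)     # 990
jitter = p99 - p50                # 490 -- the P99 - P50 consistency metric
print(p50, p99, jitter)
```

In production you would feed this hardware timestamps rather than synthetic values, and track the full histogram, not just two quantiles.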
def market_data_architecture():
    """
    Market data feed handling is often the latency bottleneck.
    """
    approaches = {
        'consolidated_feed': {
            'source': 'SIP (US), CTA/UTP',
            'latency': '500 μs - 1 ms (too slow for HFT)',
            'use_case': 'Compliance, slow strategies',
        },
        'direct_feed': {
            'source': 'Exchange proprietary (ITCH, PITCH, XDP)',
            'latency': '1-10 μs (co-located)',
            'use_case': 'Market making, fast strategies',
            'formats': {
                'nasdaq': 'ITCH 5.0 (binary, efficient)',
                'nyse': 'XDP (binary)',
                'bats_cboe': 'PITCH (binary)',
                'cme': 'MDP 3.0 (binary, multicast)',
            },
        },
        'fpga_parsed': {
            'source': 'Direct feed parsed in FPGA',
            'latency': '100-500 ns from NIC to parsed message',
            'use_case': 'HFT, sub-microsecond strategies',
        },
    }
    # Feed handler optimization:
    # 1. Parse only fields you need (skip unused fields)
    # 2. Use fixed-offset parsing (binary protocols)
    # 3. Multicast: join specific groups, filter early
    # 4. Gap detection and recovery without blocking
    # 5. Book building: incremental updates, not full rebuilds
    return approaches
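Fixed-offset parsing of a binary protocol can be sketched with Python's struct module. The message layout here is hypothetical and ITCH-flavored, not any exchange's actual wire format: a 1-byte type, a uint16 stock locate, and a uint32 fixed-point price, all big-endian.

```python
import struct

# Hypothetical fixed layout: type(1) | stock_locate(2) | price(4), big-endian
ADD_ORDER = struct.Struct(">cHI")

def parse_add_order(buf: bytes):
    """Decode fields at fixed offsets -- no scanning, no delimiters."""
    msg_type, locate, price_raw = ADD_ORDER.unpack_from(buf, 0)
    return msg_type, locate, price_raw / 10_000  # price in 1e-4 fixed point

wire = struct.pack(">cHI", b"A", 42, 1_234_500)
print(parse_add_order(wire))  # (b'A', 42, 123.45)
```

Because every field sits at a compile-time-known offset, a C++ or FPGA version of the same parse is a handful of loads with no data-dependent control flow.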
budget = {
    'market_data_receive': 0.5,   # μs (co-located, direct feed)
    'md_parse_fpga': 0.3,         # μs (FPGA parsing)
    'strategy_logic': 1.0,        # μs (quote update calculation)
    'order_build': 0.2,           # μs
    'network_to_exchange': 0.5,   # μs (co-located)
    'exchange_processing': 5.0,   # μs (exchange matching engine)
}
total = sum(budget.values())      # 7.5 μs wire-to-wire
# Competitive for equity market making in 2024
# Is the latency investment worth it?
# Revenue model for market making:
spread_capture = 0.5 # cents per share (half the spread)
volume_per_day = 1_000_000 # shares
trading_days = 252
gross_revenue = spread_capture * volume_per_day * trading_days / 100 # $1.26M/year
# Faster latency → better queue position → higher fill rate → more revenue
# 10 μs improvement might increase fill rate by 5-15% → $60-190K additional revenue
# Co-location cost: ~$150K/year
# FPGA development: ~$500K one-time + $200K/year maintenance
# Breakeven: depends on strategy capacity and competition
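The back-of-envelope above fits in one sketch. All inputs are the illustrative figures from this section (spread capture, volume, and the assumed 5-15% fill-rate uplift), not market data:

```python
spread_capture_cents = 0.5   # per share (half the spread)
shares_per_day = 1_000_000
trading_days = 252

gross_revenue = spread_capture_cents * shares_per_day * trading_days / 100
# Assumed fill-rate uplift from a ~10 us latency improvement
uplift_low, uplift_high = gross_revenue * 0.05, gross_revenue * 0.15

print(f"gross ${gross_revenue:,.0f}, uplift ${uplift_low:,.0f}-${uplift_high:,.0f}")
# -> gross $1,260,000, uplift $63,000-$189,000
```

At these numbers the uplift roughly covers co-location (~$150K/year) at the high end, but not a year-one FPGA build (~$700K), which is why the breakeven depends on strategy capacity and how contested the queue is.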