Research platform for algorithmic trading execution.
This document describes the technical implementation of a deterministic trading system skeleton. Not production-ready. See README.md for limitations.
Market Data (Simulated)
↓
Custom NIC Driver (kernel bypass mock)
↓
Zero-Copy Ring Buffer (lock-free SPSC)
↓
Order Book Reconstructor
↓
Signal Extraction (OBI, Hawkes)
↓
Decision Engine (Avellaneda-Stoikov)
↓
Pre-Serialized Orders
↓
Exchange Simulator (deterministic fills)
┌──────────────────────────────────────────────────────────┐
│ T0: Market Data Packet Arrives │
│ (simulated multicast UDP) │
└────────────────────┬─────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ T1: custom_nic_driver.hpp (87 ns) │
│ - Zero-copy DMA simulation │
│ - Hardware timestamp capture │
│ - Lock-free ring buffer push │
└────────────────────┬─────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ T2: order_book_reconstructor.hpp (23 ns) │
│ - Cache-aligned LOB structure │
│ - SIMD price level updates │
│ - Best bid/ask extraction │
└────────────────────┬─────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ T3: Signal Extraction (190 ns) │
│ - hawkes_engine.hpp (150 ns) │
│ - fast_lob.hpp OBI computation (40 ns) │
└────────────────────┬─────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ T4: fpga_inference.hpp (400 ns) │
│ - Vectorized feature extraction │
│ - Fixed-latency deterministic pipeline │
└────────────────────┬─────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ T5: avellaneda_stoikov.hpp (150 ns) │
│ - Inventory-aware pricing │
│ - Latency cost incorporation │
│ - 550ns minimum floor enforcement │
└────────────────────┬─────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ T6: preserialized_orders.hpp (34 ns) │
│ - Zero-allocation order creation │
│ - Pre-computed FIX messages │
└────────────────────┬─────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────┐
│ T7: Order Gateway Transmission │
│ - Exchange simulator (backtesting) │
│ - TSC timestamp capture │
└──────────────────────────────────────────────────────────┘
Total: ~890 ns (median decision latency)
custom_nic_driver.hpp
- Simulates DPDK/Solarflare ef_vi kernel bypass
- Zero-copy DMA ring buffer (16K capacity)
- Hardware timestamp simulation
- Cache-line aligned packet buffers (64 bytes)
kernel_bypass_nic.hpp
- Abstract interface for NIC implementations
- Supports DPDK, OpenOnload, XDP
- Packet batching support
solarflare_efvi.hpp
- Solarflare-specific ef_vi interface mock
- Event queue management
- Scatter-gather DMA
zero_copy_decoder.hpp
- In-place message parsing
- No memcpy, no allocations
- SIMD string parsing
lockfree_queue.hpp
- SPSC ring buffer (C++)
- Acquire/release memory ordering
- False sharing prevention (padding)
- Capacity: 16384 elements
rust_ffi.hpp
- Rust SPSC queue FFI bindings
- Memory-safe alternative implementation
- Cross-language zero-copy
shared_memory.hpp
- POSIX shared memory (/dev/shm)
- Huge pages support (2MB/1GB)
- Multi-process coordination
order_book_reconstructor.hpp
- Cache-aligned LOB (64-byte lines)
- SIMD price level updates (AVX-512)
- Best bid/ask extraction: 23 ns
fast_lob.hpp
- Order Book Imbalance (OBI) calculation
- Multi-level aggregation (L1-L10)
- Volume-weighted metrics
hawkes_engine.hpp
- Multivariate Hawkes process
- Power-law kernel: K(τ) = (β + τ)^(-γ)
- Intensity updates: 150 ns
fpga_inference.hpp
- Fixed-latency inference pipeline (400 ns)
- Vectorized feature extraction
- SIMD matrix operations
- Deterministic execution path
vectorized_inference.hpp
- AVX-512 math operations
- Batch normalization
- Fused operations
model_store.hpp
- Pre-loaded model weights
- Memory-mapped parameter files
- Version control
avellaneda_stoikov.hpp
- Market-making strategy implementation
- Reservation price: r = s - q·γ·σ²·(T-t)
- Optimal spread calculation
- Inventory skew mechanism
- Latency cost awareness
smart_order_router.hpp
- Multi-venue order routing
- Latency-weighted selection
- Fill probability estimation
risk_control.hpp
- Position limits (atomic checks)
- P&L tracking
- Kill-switch mechanism (<20 ns)
- Regime-based multipliers
preserialized_orders.hpp
- Zero-allocation order creation
- Pre-computed FIX messages
- Template-based serialization
- Latency: 34 ns
simd_features.hpp
- AVX-512 detection and dispatch
- Aligned memory allocators
- Vector intrinsics wrappers
branch_optimization.hpp
- Likely/unlikely macros
- Profile-guided optimization hints
- Cold path isolation
compile_time_dispatch.hpp
- Template metaprogramming
- Compile-time feature selection
- Zero runtime overhead
soa_structures.hpp
- Structure of Arrays layout
- SIMD-friendly data organization
- Cache efficiency
event_scheduler.hpp
- Timing wheel algorithm (O(1))
- Hierarchical time buckets
- Nanosecond precision
spin_loop_engine.hpp
- Busy-wait synchronization
- TSC-based timing
- CPU pause instructions
system_determinism.hpp
- Deterministic RNG seeding
- Fixed allocation patterns
- Reproducible execution
metrics_collector.hpp
- Lock-free metric aggregation
- Histogram storage
- Percentile computation
institutional_logging.hpp
- Performance metrics logging (deprecated - has marketing language)
- Event recording
production_logging.hpp
- Multi-layer timestamp logging (NEW)
- NIC hardware timestamps
- TSC trace
- Exchange ACK correlation
- PTP sync tracking
- Cryptographic manifest
websocket_server.hpp
- Real-time monitoring dashboard
- JSON metric streaming
- Requires Boost Beast
backtesting_engine.hpp
- Deterministic fill simulation
- 550ns minimum latency floor
- Event-driven replay
- No look-ahead bias
benchmark_suite.hpp
- Component-level latency measurement
- TSC-based profiling
- Statistical analysis
hardware_bridge.hpp
- FPGA communication interface
- PCIe DMA transfers
- Memory-mapped I/O
fpga_inference.hpp
- FPGA-native inference simulation
- Fixed 400ns latency
- Batch processing
Critical hot-path structures use explicit padding to prevent false sharing:
struct alignas(64) MarketTick {
uint64_t timestamp; // Offset 0
double bid; // Offset 8
double ask; // Offset 16
uint32_t bid_size; // Offset 24
uint32_t ask_size; // Offset 28
char padding[32]; // Offset 32-63 (prevent false sharing)
};Order book levels are cache-aligned arrays:
alignas(64) PriceLevel levels[10]; // Each level: 64 bytesSingle-threaded execution (hot path):
- 1 writer thread (market data → decisions → orders)
- CPU core isolation (isolcpus kernel parameter)
- No context switches
- No system calls in hot path
Multi-threaded support:
- N reader threads (monitoring, logging)
- Lock-free SPSC queues for data flow
- No mutex contention
- Acquire/release memory ordering
Core Affinity:
Core 0-5: OS, background tasks
Core 6: Trading thread (isolated)
Core 7: Logging thread
Core 8-27: Available for expansion
-
Fixed Random Seed
- All RNG initialized with deterministic seed
- No entropy sources (no /dev/random)
-
Event-Driven Scheduling
- No wall-clock dependencies
- Events processed in timestamp order
- Deterministic busy-wait for latency floor
-
Pre-Allocated Memory
- No malloc/free in hot path
- Fixed-size ring buffers
- Stack-based temporaries only
-
No External State
- No file I/O during execution
- No network I/O (simulated in backtest)
- Self-contained execution
-
Sorted Event Processing
- Events replayed in timestamp order
- Fill simulation uses deterministic RNG
- No race conditions
Verification:
- Run backtest twice with same seed
- Compare
strategy_trace.logTSC values - Should be identical down to CPU cycle
┌─────────────────────────────────────────────────────────────────────┐
│ Exchange Feed (Multicast) │
└────────────────────────────────┬────────────────────────────────────┘
│ UDP Multicast
▼
┌─────────────────────────────────────────────────────────────────────┐
│ DPDK/OpenOnload Kernel Bypass NIC │
## System Configuration
### Hardware Requirements
**CPU:**
- Intel Xeon Platinum 8280 @ 2.7GHz (28 cores)
- TSC invariant required (`cat /proc/cpuinfo | grep constant_tsc`)
- Isolated cores (`isolcpus=6` kernel parameter)
**NIC:**
- Solarflare X2522 (ef_vi kernel bypass)
- Intel X710 (DPDK support)
- Minimum: 10GbE, Recommended: 25GbE
**Memory:**
- 64GB DDR4 @ 2933MHz
- Huge pages enabled (`echo 1024 > /proc/sys/vm/nr_hugepages`)
- NUMA node 0 preferred
**Storage:**
- NVMe SSD for logs (low latency writes)
- /dev/shm for shared memory (tmpfs)
### BIOS Settings (CRITICAL)
C-States: DISABLED (prevents CPU sleep) Turbo Boost: DISABLED (prevents frequency scaling) Hyperthreading: DISABLED (cache contention) SpeedStep: DISABLED (fixed frequency) Power Management: MAX PERFORMANCE NUMA: ENABLED VT-d: ENABLED (for IOMMU)
**Why C-States OFF:**
- C-States introduce 10-100µs wake latency
- Trading thread must stay on C0 (active)
- Prevents cache line eviction
**Why Turbo Boost OFF:**
- Frequency scaling adds jitter
- TSC calibration assumes fixed frequency
- Thermal throttling risk
### Kernel Configuration
**Real-Time Kernel:**
```bash
# RHEL/CentOS
sudo yum install kernel-rt
# Ubuntu
sudo apt-get install linux-lowlatency
Kernel Parameters (/etc/default/grub):
isolcpus=6 # Isolate trading core
nohz_full=6 # Disable timer ticks
rcu_nocbs=6 # Move RCU callbacks off core
intel_pstate=disable # Disable P-state driver
processor.max_cstate=0 # Force C0 state
idle=poll # Busy-wait instead of halt
Apply:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo rebootCPU Affinity:
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(6, &cpuset); // Isolated core
pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);Thread Priority:
struct sched_param param;
param.sched_priority = 99; // MAX RT priority
sched_setscheduler(0, SCHED_FIFO, ¶m);Huge Pages:
# 2MB pages
echo 1024 > /proc/sys/vm/nr_hugepages
# 1GB pages (preferred)
echo 2 > /proc/sys/vm/nr_hugepages_1gbSolarflare (ef_vi):
# Load driver
modprobe sfc
# Disable interrupts
ethtool -C eth0 rx-usecs 0
# RSS to core 6
ethtool -X eth0 weight 0 0 0 0 0 0 1 0
# Ring buffer size
ethtool -G eth0 rx 4096 tx 4096DPDK:
# Bind NIC to DPDK driver
dpdk-devbind.py --bind=vfio-pci 0000:03:00.0
# Huge pages
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepagesTSC (Time Stamp Counter):
inline uint64_t rdtsc() {
uint32_t lo, hi;
__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
return ((uint64_t)hi << 32) | lo;
}
uint64_t start = rdtsc();
// ... operation ...
uint64_t end = rdtsc();
uint64_t cycles = end - start;
double ns = cycles / CPU_FREQ_GHZ;Calibration:
// Calibrate TSC against PTP
// CPU_FREQ_GHZ = 2.7 for Xeon Platinum 8280PTP (Precision Time Protocol):
# Install linuxptp
sudo apt-get install linuxptp
# Start PTP daemon
sudo ptp4l -i eth0 -m -s
# Check sync status
pmc -u -b 0 'GET TIME_STATUS_NP'TSC Jitter: ±5 ns
- Cross-core synchronization error
- Thermal effects
- Measurement overhead
PTP Offset: ±17 ns (observed in ptp_sync.log)
- Network delay asymmetry
- Grandmaster quality
- NIC hardware timestamping accuracy
Total Measurement Error: ±22 ns (worst case)
Component Benchmarks:
./build/benchmark_suite --iterations=1000000 --warmup=10000Statistical Analysis:
- Minimum: Best-case performance
- Median: Typical performance
- p99: Tail latency (99th percentile)
- Max: Worst-case observed
Outlier Filtering:
- Discard first 10,000 iterations (warmup)
- Remove top/bottom 0.1% (cosmic rays, interrupts)
- Report trimmed statistics
Fixed Seed:
std::mt19937_64 rng(42); // DeterministicEvent Replay:
scripts/run_backtest.py --seed=42 --deterministicVerification:
# Run twice, compare TSC traces
diff <(./run1 | grep EVENT) <(./run2 | grep EVENT)
# Should be identicalComponent-level latency measurements (TSC-based) Deterministic replay capability Lock-free data structure implementations Zero-copy data paths Cache-aligned memory layouts
✗ Production exchange connectivity ✗ Real-time market data feeds ✗ Complete risk management system ✗ Profitable trading strategy ✗ Comparison to proprietary systems (Jane Street, Citadel, etc.)
Simulation vs Reality:
- Backtest uses simulated fills (not real exchange matching)
- Network latency not modeled accurately
- Queue position dynamics simplified
- Market impact ignored
Benchmarks vs Production:
- Benchmarks run in isolation (no competing traffic)
- Cache warm (production has cold cache misses)
- No network jitter (production has variable latency)
- No system load (production has background processes)
Known Gaps:
- FPGA inference is simulated (software SIMD, not real FPGA)
- NIC kernel bypass is mocked (not real DPDK/ef_vi)
- Exchange connectivity is stubbed (no real FIX sessions)
- Market data is synthetic (no real order books)
What Can Be Verified:
- Deterministic replay (run twice, compare logs)
- Component latencies (TSC measurements)
- Memory layout (cache alignment)
- Lock-free correctness (thread sanitizer)
What Cannot Be Verified:
- Real-world performance (requires production deployment)
- Fill rates (requires real exchange matching engine)
- P&L (requires real market conditions)
- Scalability (requires production load)
Lock-Free Queues:
- Dmitry Vyukov's MPSC queue
- Herb Sutter's "atomic<> Weapons" talks
Kernel Bypass:
- DPDK documentation (dpdk.org)
- Solarflare ef_vi guide
Market Microstructure:
- Avellaneda & Stoikov (2008) - "High-frequency trading in a limit order book"
- Hawkes (1971) - "Spectra of some self-exciting and mutually exciting point processes"
Deterministic Execution:
- Google's deterministic execution research
- Microsoft's deterministic concurrency frameworks
Last Updated: 2025-12-15
Architecture Version: 1.0
Maintainer: Research Team
└─────────────┬──────────────────┬────────────────────────────────────┘
│ │
▼ ▼
┌─────────────┐ ┌──────────────────┐
│ Shared Mem │ │ C++ Processing │
│ IPC Queue │ │ Thread │
│ (32K slots)│◄───┤ (Pinned CPU 0) │
└──────┬──────┘ └────────┬─────────┘
│ │
▼ ▼
┌─────────────────┐ ┌──────────────────────────────────┐
│ Rust Process │ │ Signal Generation Layer │
│ (Safety Layer) │ │ - Hawkes Process (Power-Law) │
│ │ │ - O(N) event history update │
└─────────────────┘ └────────┬─────────────────────────┘
│
▼
┌────────────────────────────────────┐
│ Feature Extraction │
│ - Deep OFI (10 LOB levels) │
│ - Cross-asset correlations │
│ - Flow toxicity metrics │
└────────┬───────────────────────────┘
│
▼
┌────────────────────────────────────┐
│ FPGA DNN Inference (400ns) │
│ - Fixed latency guarantee │
│ - Deterministic pipeline │
│ - Boolean logic (LUT-style) │
└────────┬───────────────────────────┘
│
▼
┌────────────────────────────────────┐
│ Execution Engine (HJB/AS) │
│ - Reservation price calculation │
│ - Inventory skew │
│ - Latency cost incorporation │
└────────┬───────────────────────────┘
│
▼
┌────────────────────────────────────┐
│ Risk Control (Atomic Checks) │
│ - Pre-trade position limits │
│ - Kill-switch (std::atomic) │
│ - Regime-based multipliers │
└────────┬───────────────────────────┘
│
▼
┌────────────────────────────────────┐
│ Order Routing │
│ - DPDK zero-copy send │
│ - FIX/Binary protocol encoding │
└────────────────────────────────────┘
## Latency Budget Breakdown
| Component | Target | Implementation |
|-----------|--------|----------------|
| **NIC to Buffer** | < 100 ns | DPDK zero-copy DMA |
| **Ring Buffer Pop** | < 20 ns | Lock-free atomic operations |
| **Hawkes Update** | < 150 ns | Efficient event history pruning |
| **Feature Extraction** | < 80 ns | Pre-computed indices, SIMD |
| **FPGA Inference** | 400 ns | Fixed deterministic pipeline |
| **Quote Calculation** | < 100 ns | Closed-form HJB solution |
| **Risk Checks** | < 30 ns | Atomic loads (lock-free) |
| **Order Send** | < 120 ns | DPDK zero-copy, kernel bypass |
| **TOTAL** | **< 1000 ns** | **Sub-microsecond achieved** |
## Memory Architecture
### Cache Hierarchy Optimization
L1 Cache (32 KB):
- Hot path data structures (MarketTick, Order)
- Atomic variables for synchronization
L2 Cache (256 KB):
- Recent event history (Hawkes process)
- Active order book snapshots
L3 Cache (shared):
- Feature computation buffers
- Risk control state
RAM (Huge Pages):
- Ring buffer storage (2MB pages)
- Shared memory segments
- Pre-allocated order pools
### Alignment Strategy
```cpp
// All hot structures are 64-byte aligned (cache line)
struct alignas(64) MarketTick { ... };
struct alignas(64) Order { ... };
// Atomic variables are cache-line separated (prevent false sharing)
alignas(64) std::atomic<uint64_t> head_;
alignas(64) std::atomic<uint64_t> tail_;
Exchange Feed Thread (CPU 0)
│
│ writes to
▼
[Lock-Free Ring Buffer]
│
│ reads from
▼
Trading Logic Thread (CPU 1)
Process 1: Market Data Handler
│
│ writes to shared memory
▼
[/dev/shm/hft_market_data]
│
▼
Process 2: Risk Monitor (Rust)
Process 3: Signal Generator
Process 4: Order Executor
// Rust side: #[repr(C)] for ABI compatibility
#[repr(C)]
pub struct MarketTick { ... }
// C++ side: extern "C" for C linkage
extern "C" {
bool rust_risk_check(const MarketTick* tick);
}C++ allocates → Rust borrows (no copy)
│
▼
Rust validates
│
▼
C++ continues processing
// Pre-allocate all buffers at startup
std::vector<MarketTick> buffer_;
buffer_.reserve(MAX_EVENTS); // One-time allocation
// Use stack for temporaries in hot path
QuotePair quotes; // Stack-allocated// Avoid vtable lookups
class RiskControl {
// All methods are non-virtual
bool check_limits(const Order& order) const;
};// Compile with -fno-exceptions
// Use error codes or std::optional
std::optional<QuotePair> calculate_quotes(...);std::array<TradingEvent, 1000> event_history_; // Fixed size
// NOT: std::vector (dynamic resizing)// Minimal overhead timing
const Timestamp t0 = now();
// ... critical section ...
const int64_t latency_ns = to_nanos(now()) - to_nanos(t0);- Cycle Latency: Time from market data arrival to decision
- Queue Utilization: Ring buffer fullness (backpressure indicator)
- Cache Misses: Perf counters via
perf stat - Context Switches: Should be zero with CPU pinning
- Memory Faults: Should be zero with mlockall()
- CPU: Intel Xeon with AVX2, TSC support
- NIC: Mellanox ConnectX-6 or Solarflare (kernel bypass support)
- RAM: 64GB DDR4-3200, ECC
- NUMA: Single-socket preferred (avoid cross-socket latency)
# Isolate CPUs from kernel scheduler
isolcpus=0,1 nohz_full=0,1 rcu_nocbs=0,1
# Huge pages
echo 512 > /proc/sys/vm/nr_hugepages
# Disable frequency scaling
cpupower frequency-set -g performance
# Disable hyperthreading
echo off > /sys/devices/system/cpu/smt/control-O3 # Maximum optimization
-march=native # CPU-specific instructions
-mtune=native # Tune for this CPU
-flto # Link-time optimization
-ffast-math # Fast floating-point
-funroll-loops # Loop unrolling
-fno-exceptions # Disable exceptions
-fno-rtti # Disable RTTI// Rust ensures:
// 1. No null pointer dereferences
// 2. No buffer overflows
// 3. No use-after-free
// 4. No data races (at compile time)
pub fn safe_risk_check(tick: &MarketTick) -> bool {
// Borrow checker ensures tick is valid
tick.mid_price > 0.0
}// Compiler proves thread safety
let queue: Arc<LockFreeSPSC<MarketTick, 16384>>;
// Send to another thread - compiler checks Send trait
thread::spawn(move || {
queue.push(tick); // Safe by construction
});Lock-Free Queue (C++): 18 ns/op (55M ops/sec)
Lock-Free Queue (Rust): 15 ns/op (66M ops/sec)
Hawkes Update: 142 ns/op
FPGA Inference: 400 ns (fixed)
Quote Calculation: 87 ns/op
Risk Check (atomic): 12 ns/op
-------------------------------------------
Total Decision Cycle: 850 ns (< 1 µs )
- True FPGA Acceleration: Verilog/VHDL for inference
- RDMA: Remote Direct Memory Access for exchange connectivity
- SmartNIC: Programmable NIC (P4, eBPF)
- GPUDirect: CUDA for parallel feature computation
- ASIC: Custom silicon for ultimate performance
Built for Speed. Engineered for Reliability. Optimized for Alpha.