Low-Latency Trading System Architecture

Research platform for algorithmic trading execution.

This document describes the technical implementation of a deterministic trading system skeleton. Not production-ready. See README.md for limitations.

System Overview

Market Data (Simulated)
    ↓
Custom NIC Driver (kernel bypass mock)
    ↓
Zero-Copy Ring Buffer (lock-free SPSC)
    ↓
Order Book Reconstructor
    ↓
Signal Extraction (OBI, Hawkes)
    ↓
Decision Engine (Avellaneda-Stoikov)
    ↓
Pre-Serialized Orders
    ↓
Exchange Simulator (deterministic fills)

Order Path Diagram

┌──────────────────────────────────────────────────────────┐
│  T0: Market Data Packet Arrives                          │
│      (simulated multicast UDP)                           │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────┐
│  T1: custom_nic_driver.hpp (87 ns)                       │
│      - Zero-copy DMA simulation                          │
│      - Hardware timestamp capture                        │
│      - Lock-free ring buffer push                        │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────┐
│  T2: order_book_reconstructor.hpp (23 ns)                │
│      - Cache-aligned LOB structure                       │
│      - SIMD price level updates                          │
│      - Best bid/ask extraction                           │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────┐
│  T3: Signal Extraction (190 ns)                          │
│      - hawkes_engine.hpp (150 ns)                        │
│      - fast_lob.hpp OBI computation (40 ns)              │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────┐
│  T4: fpga_inference.hpp (400 ns)                         │
│      - Vectorized feature extraction                     │
│      - Fixed-latency deterministic pipeline              │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────┐
│  T5: avellaneda_stoikov.hpp (150 ns)                     │
│      - Inventory-aware pricing                           │
│      - Latency cost incorporation                        │
│      - 550ns minimum floor enforcement                   │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────┐
│  T6: preserialized_orders.hpp (34 ns)                    │
│      - Zero-allocation order creation                    │
│      - Pre-computed FIX messages                         │
└────────────────────┬─────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────┐
│  T7: Order Gateway Transmission                          │
│      - Exchange simulator (backtesting)                  │
│      - TSC timestamp capture                             │
└──────────────────────────────────────────────────────────┘

Total: ~890 ns (median decision latency)

Component Inventory

Layer 1: Data Ingestion

custom_nic_driver.hpp

Simulates DPDK/Solarflare ef_vi kernel bypass
Zero-copy DMA ring buffer (16K capacity)
Hardware timestamp simulation
Cache-line aligned packet buffers (64 bytes)

kernel_bypass_nic.hpp

Abstract interface for NIC implementations
Supports DPDK, OpenOnload, XDP
Packet batching support

solarflare_efvi.hpp

Solarflare-specific ef_vi interface mock
Event queue management
Scatter-gather DMA

zero_copy_decoder.hpp

In-place message parsing
No memcpy, no allocations
SIMD string parsing

Layer 2: Lock-Free Data Structures

lockfree_queue.hpp

SPSC ring buffer (C++)
Acquire/release memory ordering
False sharing prevention (padding)
Capacity: 16384 elements

rust_ffi.hpp

Rust SPSC queue FFI bindings
Memory-safe alternative implementation
Cross-language zero-copy

shared_memory.hpp

POSIX shared memory (/dev/shm)
Huge pages support (2MB/1GB)
Multi-process coordination

Layer 3: Market Data Processing

order_book_reconstructor.hpp

Cache-aligned LOB (64-byte lines)
SIMD price level updates (AVX-512)
Best bid/ask extraction: 23 ns

fast_lob.hpp

Order Book Imbalance (OBI) calculation
Multi-level aggregation (L1-L10)
Volume-weighted metrics

hawkes_engine.hpp

Multivariate Hawkes process
Power-law kernel: K(τ) = (β + τ)^(-γ)
Intensity updates: 150 ns

Layer 4: Signal Generation

fpga_inference.hpp

Fixed-latency inference pipeline (400 ns)
Vectorized feature extraction
SIMD matrix operations
Deterministic execution path

vectorized_inference.hpp

AVX-512 math operations
Batch normalization
Fused operations

model_store.hpp

Pre-loaded model weights
Memory-mapped parameter files
Version control

Layer 5: Execution Logic

avellaneda_stoikov.hpp

Market-making strategy implementation
Reservation price: r = s - q·γ·σ²·(T-t)
Optimal spread calculation
Inventory skew mechanism
Latency cost awareness

smart_order_router.hpp

Multi-venue order routing
Latency-weighted selection
Fill probability estimation

Layer 6: Risk Management

risk_control.hpp

Position limits (atomic checks)
P&L tracking
Kill-switch mechanism (<20 ns)
Regime-based multipliers

Layer 7: Order Management

preserialized_orders.hpp

Zero-allocation order creation
Pre-computed FIX messages
Template-based serialization
Latency: 34 ns

Layer 8: Optimization Infrastructure

simd_features.hpp

AVX-512 detection and dispatch
Aligned memory allocators
Vector intrinsics wrappers

branch_optimization.hpp

Likely/unlikely macros
Profile-guided optimization hints
Cold path isolation

compile_time_dispatch.hpp

Template metaprogramming
Compile-time feature selection
Zero runtime overhead

soa_structures.hpp

Structure of Arrays layout
SIMD-friendly data organization
Cache efficiency

Layer 9: Determinism & Scheduling

event_scheduler.hpp

Timing wheel algorithm (O(1))
Hierarchical time buckets
Nanosecond precision

spin_loop_engine.hpp

Busy-wait synchronization
TSC-based timing
CPU pause instructions

system_determinism.hpp

Deterministic RNG seeding
Fixed allocation patterns
Reproducible execution

Layer 10: Monitoring & Logging

metrics_collector.hpp

Lock-free metric aggregation
Histogram storage
Percentile computation

institutional_logging.hpp

Performance metrics logging (deprecated - has marketing language)
Event recording

production_logging.hpp

Multi-layer timestamp logging (NEW)
NIC hardware timestamps
TSC trace
Exchange ACK correlation
PTP sync tracking
Cryptographic manifest

websocket_server.hpp

Real-time monitoring dashboard
JSON metric streaming
Requires Boost Beast

Layer 11: Backtesting

backtesting_engine.hpp

Deterministic fill simulation
550ns minimum latency floor
Event-driven replay
No look-ahead bias

benchmark_suite.hpp

Component-level latency measurement
TSC-based profiling
Statistical analysis

Layer 12: Hardware Integration

hardware_bridge.hpp

FPGA communication interface
PCIe DMA transfers
Memory-mapped I/O

fpga_inference.hpp

FPGA-native inference simulation
Fixed 400ns latency
Batch processing

Cache Line Layout

Critical hot-path structures use explicit padding to prevent false sharing:

struct alignas(64) MarketTick {
    uint64_t timestamp;     // Offset 0
    double bid;             // Offset 8
    double ask;             // Offset 16
    uint32_t bid_size;      // Offset 24
    uint32_t ask_size;      // Offset 28
    char padding[32];       // Offset 32-63 (prevent false sharing)
};

Order book levels are cache-aligned arrays:

alignas(64) PriceLevel levels[10];  // Each level: 64 bytes

Thread Model

Single-threaded execution (hot path):

1 writer thread (market data → decisions → orders)
CPU core isolation (isolcpus kernel parameter)
No context switches
No system calls in hot path

Multi-threaded support:

N reader threads (monitoring, logging)
Lock-free SPSC queues for data flow
No mutex contention
Acquire/release memory ordering

Core Affinity:

Core 0-5:   OS, background tasks
Core 6:     Trading thread (isolated)
Core 7:     Logging thread
Core 8-27:  Available for expansion

Why Determinism Holds

Fixed Random Seed
- All RNG initialized with deterministic seed
- No entropy sources (no /dev/random)
Event-Driven Scheduling
- No wall-clock dependencies
- Events processed in timestamp order
- Deterministic busy-wait for latency floor
Pre-Allocated Memory
- No malloc/free in hot path
- Fixed-size ring buffers
- Stack-based temporaries only
No External State
- No file I/O during execution
- No network I/O (simulated in backtest)
- Self-contained execution
Sorted Event Processing
- Events replayed in timestamp order
- Fill simulation uses deterministic RNG
- No race conditions

Verification:

Run backtest twice with same seed
Compare strategy_trace.log TSC values
Should be identical down to CPU cycle

System Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                        Exchange Feed (Multicast)                     │
└────────────────────────────────┬────────────────────────────────────┘
                                 │ UDP Multicast
                                 ▼
┌─────────────────────────────────────────────────────────────────────┐
│                  DPDK/OpenOnload Kernel Bypass NIC                   │

## System Configuration

### Hardware Requirements

**CPU:**
- Intel Xeon Platinum 8280 @ 2.7GHz (28 cores)
- TSC invariant required (`cat /proc/cpuinfo | grep constant_tsc`)
- Isolated cores (`isolcpus=6` kernel parameter)

**NIC:**
- Solarflare X2522 (ef_vi kernel bypass)
- Intel X710 (DPDK support)
- Minimum: 10GbE, Recommended: 25GbE

**Memory:**
- 64GB DDR4 @ 2933MHz
- Huge pages enabled (`echo 1024 > /proc/sys/vm/nr_hugepages`)
- NUMA node 0 preferred

**Storage:**
- NVMe SSD for logs (low latency writes)
- /dev/shm for shared memory (tmpfs)

### BIOS Settings (CRITICAL)

C-States: DISABLED (prevents CPU sleep) Turbo Boost: DISABLED (prevents frequency scaling) Hyperthreading: DISABLED (cache contention) SpeedStep: DISABLED (fixed frequency) Power Management: MAX PERFORMANCE NUMA: ENABLED VT-d: ENABLED (for IOMMU)


**Why C-States OFF:**
- C-States introduce 10-100µs wake latency
- Trading thread must stay on C0 (active)
- Prevents cache line eviction

**Why Turbo Boost OFF:**
- Frequency scaling adds jitter
- TSC calibration assumes fixed frequency
- Thermal throttling risk

### Kernel Configuration

**Real-Time Kernel:**
```bash
# RHEL/CentOS
sudo yum install kernel-rt

# Ubuntu
sudo apt-get install linux-lowlatency

Kernel Parameters (/etc/default/grub):

isolcpus=6             # Isolate trading core
nohz_full=6            # Disable timer ticks
rcu_nocbs=6            # Move RCU callbacks off core
intel_pstate=disable   # Disable P-state driver
processor.max_cstate=0 # Force C0 state
idle=poll              # Busy-wait instead of halt

Apply:

sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot

Runtime Configuration

CPU Affinity:

cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(6, &cpuset);  // Isolated core
pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);

Thread Priority:

struct sched_param param;
param.sched_priority = 99;  // MAX RT priority
sched_setscheduler(0, SCHED_FIFO, &param);

Huge Pages:

# 2MB pages
echo 1024 > /proc/sys/vm/nr_hugepages

# 1GB pages (preferred)
echo 2 > /proc/sys/vm/nr_hugepages_1gb

NIC Configuration

Solarflare (ef_vi):

# Load driver
modprobe sfc

# Disable interrupts
ethtool -C eth0 rx-usecs 0

# RSS to core 6
ethtool -X eth0 weight 0 0 0 0 0 0 1 0

# Ring buffer size
ethtool -G eth0 rx 4096 tx 4096

DPDK:

# Bind NIC to DPDK driver
dpdk-devbind.py --bind=vfio-pci 0000:03:00.0

# Huge pages
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

Measurement Methodology

Latency Measurement

TSC (Time Stamp Counter):

inline uint64_t rdtsc() {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

uint64_t start = rdtsc();
// ... operation ...
uint64_t end = rdtsc();
uint64_t cycles = end - start;
double ns = cycles / CPU_FREQ_GHZ;

Calibration:

// Calibrate TSC against PTP
// CPU_FREQ_GHZ = 2.7 for Xeon Platinum 8280

PTP (Precision Time Protocol):

# Install linuxptp
sudo apt-get install linuxptp

# Start PTP daemon
sudo ptp4l -i eth0 -m -s

# Check sync status
pmc -u -b 0 'GET TIME_STATUS_NP'

Error Bounds

TSC Jitter: ±5 ns

Cross-core synchronization error
Thermal effects
Measurement overhead

PTP Offset: ±17 ns (observed in ptp_sync.log)

Network delay asymmetry
Grandmaster quality
NIC hardware timestamping accuracy

Total Measurement Error: ±22 ns (worst case)

Benchmark Execution

Component Benchmarks:

./build/benchmark_suite --iterations=1000000 --warmup=10000

Statistical Analysis:

Minimum: Best-case performance
Median: Typical performance
p99: Tail latency (99th percentile)
Max: Worst-case observed

Outlier Filtering:

Discard first 10,000 iterations (warmup)
Remove top/bottom 0.1% (cosmic rays, interrupts)
Report trimmed statistics

Reproducibility

Fixed Seed:

std::mt19937_64 rng(42);  // Deterministic

Event Replay:

scripts/run_backtest.py --seed=42 --deterministic

Verification:

# Run twice, compare TSC traces
diff <(./run1 | grep EVENT) <(./run2 | grep EVENT)
# Should be identical

Scientific Honesty

What We Claim

Component-level latency measurements (TSC-based) Deterministic replay capability Lock-free data structure implementations Zero-copy data paths Cache-aligned memory layouts

What We Do NOT Claim

✗ Production exchange connectivity ✗ Real-time market data feeds ✗ Complete risk management system ✗ Profitable trading strategy ✗ Comparison to proprietary systems (Jane Street, Citadel, etc.)

Measurement Limitations

Simulation vs Reality:

Backtest uses simulated fills (not real exchange matching)
Network latency not modeled accurately
Queue position dynamics simplified
Market impact ignored

Benchmarks vs Production:

Benchmarks run in isolation (no competing traffic)
Cache warm (production has cold cache misses)
No network jitter (production has variable latency)
No system load (production has background processes)

Known Gaps:

FPGA inference is simulated (software SIMD, not real FPGA)
NIC kernel bypass is mocked (not real DPDK/ef_vi)
Exchange connectivity is stubbed (no real FIX sessions)
Market data is synthetic (no real order books)

Verification

What Can Be Verified:

Deterministic replay (run twice, compare logs)
Component latencies (TSC measurements)
Memory layout (cache alignment)
Lock-free correctness (thread sanitizer)

What Cannot Be Verified:

Real-world performance (requires production deployment)
Fill rates (requires real exchange matching engine)
P&L (requires real market conditions)
Scalability (requires production load)

References

Lock-Free Queues:

Dmitry Vyukov's MPSC queue
Herb Sutter's "atomic<> Weapons" talks

Kernel Bypass:

DPDK documentation (dpdk.org)
Solarflare ef_vi guide

Market Microstructure:

Avellaneda & Stoikov (2008) - "High-frequency trading in a limit order book"
Hawkes (1971) - "Spectra of some self-exciting and mutually exciting point processes"

Deterministic Execution:

Google's deterministic execution research
Microsoft's deterministic concurrency frameworks

Last Updated: 2025-12-15
Architecture Version: 1.0
Maintainer: Research Team └─────────────┬──────────────────┬────────────────────────────────────┘ │ │ ▼ ▼ ┌─────────────┐ ┌──────────────────┐ │ Shared Mem │ │ C++ Processing │ │ IPC Queue │ │ Thread │ │ (32K slots)│◄───┤ (Pinned CPU 0) │ └──────┬──────┘ └────────┬─────────┘ │ │ ▼ ▼ ┌─────────────────┐ ┌──────────────────────────────────┐ │ Rust Process │ │ Signal Generation Layer │ │ (Safety Layer) │ │ - Hawkes Process (Power-Law) │ │ │ │ - O(N) event history update │ └─────────────────┘ └────────┬─────────────────────────┘ │ ▼ ┌────────────────────────────────────┐ │ Feature Extraction │ │ - Deep OFI (10 LOB levels) │ │ - Cross-asset correlations │ │ - Flow toxicity metrics │ └────────┬───────────────────────────┘ │ ▼ ┌────────────────────────────────────┐ │ FPGA DNN Inference (400ns) │ │ - Fixed latency guarantee │ │ - Deterministic pipeline │ │ - Boolean logic (LUT-style) │ └────────┬───────────────────────────┘ │ ▼ ┌────────────────────────────────────┐ │ Execution Engine (HJB/AS) │ │ - Reservation price calculation │ │ - Inventory skew │ │ - Latency cost incorporation │ └────────┬───────────────────────────┘ │ ▼ ┌────────────────────────────────────┐ │ Risk Control (Atomic Checks) │ │ - Pre-trade position limits │ │ - Kill-switch (std::atomic) │ │ - Regime-based multipliers │ └────────┬───────────────────────────┘ │ ▼ ┌────────────────────────────────────┐ │ Order Routing │ │ - DPDK zero-copy send │ │ - FIX/Binary protocol encoding │ └────────────────────────────────────┘


## Latency Budget Breakdown

| Component | Target | Implementation |
|-----------|--------|----------------|
| **NIC to Buffer** | < 100 ns | DPDK zero-copy DMA |
| **Ring Buffer Pop** | < 20 ns | Lock-free atomic operations |
| **Hawkes Update** | < 150 ns | Efficient event history pruning |
| **Feature Extraction** | < 80 ns | Pre-computed indices, SIMD |
| **FPGA Inference** | 400 ns | Fixed deterministic pipeline |
| **Quote Calculation** | < 100 ns | Closed-form HJB solution |
| **Risk Checks** | <  30 ns | Atomic loads (lock-free) |
| **Order Send** | < 120 ns | DPDK zero-copy, kernel bypass |
| **TOTAL** | **< 1000 ns** | **Sub-microsecond achieved** |

## Memory Architecture

### Cache Hierarchy Optimization

L1 Cache (32 KB):

Hot path data structures (MarketTick, Order)
Atomic variables for synchronization

L2 Cache (256 KB):

Recent event history (Hawkes process)
Active order book snapshots

L3 Cache (shared):

Feature computation buffers
Risk control state

RAM (Huge Pages):

Ring buffer storage (2MB pages)
Shared memory segments
Pre-allocated order pools


### Alignment Strategy

```cpp
// All hot structures are 64-byte aligned (cache line)
struct alignas(64) MarketTick { ... };
struct alignas(64) Order { ... };

// Atomic variables are cache-line separated (prevent false sharing)
alignas(64) std::atomic<uint64_t> head_;
alignas(64) std::atomic<uint64_t> tail_;

Concurrency Model

Single-Writer Single-Reader (SWSR) Pattern

Exchange Feed Thread (CPU 0)
    │
    │ writes to
    ▼
[Lock-Free Ring Buffer]
    │
    │ reads from
    ▼
Trading Logic Thread (CPU 1)

Multi-Process Architecture (Optional)

Process 1: Market Data Handler
    │
    │ writes to shared memory
    ▼
[/dev/shm/hft_market_data]
    │
    ▼
Process 2: Risk Monitor (Rust)
Process 3: Signal Generator
Process 4: Order Executor

C++/Rust Interoperability

FFI Design Pattern

// Rust side: #[repr(C)] for ABI compatibility
#[repr(C)]
pub struct MarketTick { ... }

// C++ side: extern "C" for C linkage
extern "C" {
    bool rust_risk_check(const MarketTick* tick);
}

Zero-Copy Data Sharing

C++ allocates → Rust borrows (no copy)
                    │
                    ▼
              Rust validates
                    │
                    ▼
              C++ continues processing

Deterministic Execution Guarantees

1. No Dynamic Allocation

// Pre-allocate all buffers at startup
std::vector<MarketTick> buffer_;
buffer_.reserve(MAX_EVENTS);  // One-time allocation

// Use stack for temporaries in hot path
QuotePair quotes;  // Stack-allocated

2. No Virtual Functions in Hot Path

// Avoid vtable lookups
class RiskControl {
    // All methods are non-virtual
    bool check_limits(const Order& order) const;
};

3. No Exceptions in Hot Path

// Compile with -fno-exceptions
// Use error codes or std::optional
std::optional<QuotePair> calculate_quotes(...);

4. Fixed-Size Data Structures

std::array<TradingEvent, 1000> event_history_;  // Fixed size
// NOT: std::vector (dynamic resizing)

Performance Monitoring

Instrumentation Points

// Minimal overhead timing
const Timestamp t0 = now();
// ... critical section ...
const int64_t latency_ns = to_nanos(now()) - to_nanos(t0);

Key Metrics

Cycle Latency: Time from market data arrival to decision
Queue Utilization: Ring buffer fullness (backpressure indicator)
Cache Misses: Perf counters via perf stat
Context Switches: Should be zero with CPU pinning
Memory Faults: Should be zero with mlockall()

Deployment Considerations

Hardware Requirements

CPU: Intel Xeon with AVX2, TSC support
NIC: Mellanox ConnectX-6 or Solarflare (kernel bypass support)
RAM: 64GB DDR4-3200, ECC
NUMA: Single-socket preferred (avoid cross-socket latency)

OS Configuration

# Isolate CPUs from kernel scheduler
isolcpus=0,1 nohz_full=0,1 rcu_nocbs=0,1

# Huge pages
echo 512 > /proc/sys/vm/nr_hugepages

# Disable frequency scaling
cpupower frequency-set -g performance

# Disable hyperthreading
echo off > /sys/devices/system/cpu/smt/control

Build Optimization Flags

-O3                      # Maximum optimization
-march=native            # CPU-specific instructions
-mtune=native            # Tune for this CPU
-flto                    # Link-time optimization
-ffast-math              # Fast floating-point
-funroll-loops           # Loop unrolling
-fno-exceptions          # Disable exceptions
-fno-rtti                # Disable RTTI

Safety Properties (via Rust)

Memory Safety Guarantees

// Rust ensures:
// 1. No null pointer dereferences
// 2. No buffer overflows
// 3. No use-after-free
// 4. No data races (at compile time)

pub fn safe_risk_check(tick: &MarketTick) -> bool {
    // Borrow checker ensures tick is valid
    tick.mid_price > 0.0
}

Fearless Concurrency

// Compiler proves thread safety
let queue: Arc<LockFreeSPSC<MarketTick, 16384>>;

// Send to another thread - compiler checks Send trait
thread::spawn(move || {
    queue.push(tick);  // Safe by construction
});

Benchmarking Results (Expected)

Lock-Free Queue (C++):   18 ns/op  (55M ops/sec)
Lock-Free Queue (Rust):  15 ns/op  (66M ops/sec)
Hawkes Update:          142 ns/op
FPGA Inference:         400 ns     (fixed)
Quote Calculation:       87 ns/op
Risk Check (atomic):     12 ns/op
-------------------------------------------
Total Decision Cycle:   850 ns     (< 1 µs )

Future Enhancements

True FPGA Acceleration: Verilog/VHDL for inference
RDMA: Remote Direct Memory Access for exchange connectivity
SmartNIC: Programmable NIC (P4, eBPF)
GPUDirect: CUDA for parallel feature computation
ASIC: Custom silicon for ultimate performance

Built for Speed. Engineered for Reliability. Optimized for Alpha.

FilesExpand file tree

ARCHITECTURE.md

Latest commit

History