Experimental Linux kernel fast-path research for SRAM-style AI inference servers, focused on io_uring submission latency, batching, registered buffers, tracing, and native attribution.
In deterministic AI inference (~20µs execution), Linux host overhead can match or exceed device latency, effectively doubling end-to-end request time.
In our synthetic SRAM-style workload (~20µs compute), baseline p99 reaches ~40–50µs, indicating host overhead comparable to device execution.
This repo isolates that overhead and prototypes the kernel fast paths required to close the gap.
Once inference becomes deterministic, the Linux control plane—not the model—dominates latency.
This project targets the post-compute bottleneck regime, where hardware execution is no longer the dominant source of latency.
- Deterministic compute does not eliminate latency variance: Even with zero-variance hardware execution, host-side effects drive significant jitter.
- Linux submission and completion paths remain significant: System call overhead and completion delivery pipelines contribute measurable microseconds.
- Tail latency (p99/p999) is driven by host-side effects: Scheduling and interrupt handling costs dominate the "tail" of the latency distribution.
- Existing io_uring fast paths reduce but do not eliminate this gap: They cut important parts of the path, and this repo measures what residual latency remains (the percentile sketch below shows how the tail figures are computed).
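One concrete note on the tail metrics above: p99/p999 here means a nearest-rank percentile over per-request latency samples. A minimal helper in C, illustrative rather than the harness's actual code:

```c
/* Nearest-rank percentile over n latency samples (nanoseconds).
   Sorts in place; illustrative only. */
#include <stdlib.h>

static int cmp_long(const void *a, const void *b) {
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

static long percentile(long *samples, size_t n, double p) {
    qsort(samples, n, sizeof *samples, cmp_long);
    size_t idx = (size_t)(p * (n - 1) + 0.5);   /* nearest rank */
    return samples[idx];
}

/* p99  = percentile(samples, n, 0.99);
   p999 = percentile(samples, n, 0.999); */
```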
To provide a comprehensive evaluation, the validation harness supports two distinct tracks:
- NOP mode: Measures raw io_uring overhead with minimal no-op requests (a minimal probe is sketched below).
- SRAM20 mode: Implements a deterministic AI inference model using a 20µs busy-wait to simulate predictable hardware execution.
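A minimal sketch of what NOP mode measures, assuming liburing is installed (link with `-luring`); the iteration count and names are illustrative, not the harness's actual code:

```c
/* NOP-mode probe: no I/O is performed, so the measured round trip is
   pure ring overhead (syscall, dispatch, completion delivery). */
#include <liburing.h>
#include <time.h>

static long ns_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000L + ts.tv_nsec;
}

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(64, &ring, 0) < 0)
        return 1;

    for (int i = 0; i < 100000; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_nop(sqe);          /* no-op request */

        long t0 = ns_now();
        io_uring_submit(&ring);          /* one syscall per request */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        long t1 = ns_now();
        io_uring_cqe_seen(&ring, cqe);

        /* t1 - t0 is the submit->complete round trip; feed it into a
           histogram to recover p50/p99/p999. */
        (void)(t1 - t0);
    }
    io_uring_queue_exit(&ring);
    return 0;
}
```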
SRAM-style AI inference does not eliminate latency—it exposes the Linux control plane as the dominant bottleneck. Closing this gap requires not faster accelerators, but faster kernel paths.
WSL results are used for harness validation only. They are NOT used to draw conclusions about:
- SQPOLL effectiveness
- Kernel scheduling behavior
- Completion latency
All definitive research conclusions require native Linux validation.
- Quickstart Guide
- Native Linux Validation Guide
- Existing io_uring Fast Paths and Remaining Gaps
- Maintainer FAQ
- Project Roadmap
Our latest research indicates that for microsecond-scale inference, batching is the most powerful optimization lever. It reduces per-request submission overhead by ~7× in synthetic SRAM-style workloads, bringing the effective submission tax from ~600ns to <100ns per request.
See Submission Path Analysis for the full technical breakdown.
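The mechanism is syscall amortization: io_uring lets you stage many SQEs and publish them all with a single io_uring_submit(). A hedged sketch, continuing from the ring initialized in the NOP probe above (BATCH and the NOP stand-in are illustrative; the harness's real request type differs):

```c
/* Stage BATCH requests, then pay the submission syscall once. */
#define BATCH 8

for (int i = 0; i < BATCH; i++) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_nop(sqe);       /* stand-in for one inference request */
}
io_uring_submit(&ring);           /* one io_uring_enter() covers all BATCH SQEs */

for (int i = 0; i < BATCH; i++) {
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);
}
```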
There exists a batch size range (8–16) that minimizes per-request overhead without significantly increasing base latency. Pushing beyond batch 16 yields diminishing returns and increases total end-to-end time.
See Batch Sweep Results for the optimization data.
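A simple amortization model (a back-of-envelope estimate on our part, not an output of the sweep) is consistent with that knee:

$$\text{cost}(B) \approx \frac{C_{\text{syscall}}}{B} + C_{\text{sqe}}$$

With the fixed syscall cost $C_{\text{syscall}} \approx 600$ns measured above, $B = 8$ already cuts the amortized share to ~75ns; doubling to $B = 16$ recovers only ~40ns more, while each extra slot adds a full execution slot (~20µs in SRAM20 mode, if requests run serially) to the last request's end-to-end latency.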
The following plots summarize the batch sweep experiments:
Real-world inference systems must balance throughput and latency in the presence of host-side jitter. Our adaptive batching experiment demonstrates that a simple latency-based heuristic can outperform static strategies, particularly in reducing p99 tail latency.
See Adaptive Batching Results for the performance breakdown.
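One plausible shape for such a heuristic (a sketch of the general idea, not the repo's actual controller): grow the batch while the measured p99 stays under a latency budget, and back off multiplicatively when the budget is violated, in the spirit of AIMD congestion control. All thresholds and bounds below are illustrative:

```c
/* Latency-feedback batch sizing: additive increase, multiplicative
   decrease. Caller supplies the current batch size, the most recent
   p99 measurement, and the latency budget, all in nanoseconds. */
static int next_batch(int cur, long p99_ns, long budget_ns) {
    if (p99_ns > budget_ns && cur > 1)
        return cur / 2;                      /* tail violated: back off fast */
    if (p99_ns < (budget_ns * 3) / 4 && cur < 16)
        return cur + 1;                      /* clear headroom: grow slowly */
    return cur;                              /* near budget: hold steady */
}
```

The asymmetry is deliberate: p99 violations are the expensive failure mode, so shrinking is immediate and aggressive while growth is cautious.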
Recent native-like validation data shows that for deterministic workloads, submission-side latency (submit → issue) is the primary bottleneck. Even after applying existing io_uring fast paths, the cost of the system call transition and request dispatch remains a significant contributor to tail latency.
Research has pivoted from completion-side polling to optimizing the submission plane to match the performance of microsecond-scale hardware.
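One existing fast path aimed squarely at this is IORING_SETUP_SQPOLL, which moves submission into a kernel-side polling thread so the hot path avoids the io_uring_enter() syscall entirely. A minimal setup sketch; the queue depth and idle timeout are illustrative:

```c
#include <liburing.h>

int main(void) {
    struct io_uring ring;
    struct io_uring_params p = {0};

    p.flags = IORING_SETUP_SQPOLL;   /* kernel thread polls the SQ ring */
    p.sq_thread_idle = 2000;         /* ms of idleness before the thread sleeps */

    if (io_uring_queue_init_params(256, &ring, &p) < 0)
        return 1;  /* SQPOLL may need privileges (CAP_SYS_NICE) on older kernels */

    /* With SQPOLL active, io_uring_submit() typically just publishes the
       SQ tail; no io_uring_enter() syscall happens unless the poller has
       gone idle and needs an IORING_ENTER_SQ_WAKEUP kick, which liburing
       issues automatically. */

    io_uring_queue_exit(&ring);
    return 0;
}
```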
Initial WSL-based validation indicates:
- p99 Dominance: Latency is currently dominated by submission-path overhead and hypervisor jitter.
- Completion Path: Residual host-side completion latency is sub-microsecond in synchronous modes.
- Decision: Experimental CQ polling is NOT yet justified on native hardware. Further bare-metal measurement is required to isolate kernel-specific completion bottlenecks.
See Native Latency Breakdown for detailed attribution data.
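The attribution split referenced above can be approximated from userspace alone with monotonic timestamps around the two syscall boundaries. A rough sketch, reusing ns_now() and the ring from the NOP probe; the repo's actual attribution may also use kernel tracing to split stages further:

```c
/* Coarse userspace attribution of one request's life cycle. */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_nop(sqe);

long t_submit = ns_now();
io_uring_submit(&ring);          /* syscall entry + dispatch: "submit -> issue" */
long t_issued = ns_now();

struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);  /* execution + completion delivery */
long t_done = ns_now();
io_uring_cqe_seen(&ring, cqe);

long submission_ns = t_issued - t_submit;  /* the submission-side cost */
long completion_ns = t_done - t_issued;    /* includes device/compute time */
```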
This is:
- Experimental Linux kernel fast-path research and prototyping.
- Reproducible latency modeling for deterministic AI workloads.
- A measurement-first effort to justify new kernel APIs.
This is NOT:
- A production-ready kernel patch (yet).
- A replacement for standard io_uring features.
- Performance theater using non-deterministic hardware.
GPL v2


