
SIMD vectorization, io_uring, and syscall batching (2x throughput with Claude Code Opus 4.6) #354

Open
slartibardfast wants to merge 49 commits into wangyu-:branch_libev from slartibardfast:branch_libev

@slartibardfast

Hi,

I'm still verifying on real hardware, but I thought this might be of
interest: a series of optimizations to the FEC hot paths, syscall
layer, and data structures. The net result is roughly double the
throughput. The only wire-level change is CRC32 -> CRC32C, so
new builds are not checksum-compatible with old ones.

Much of this work was done in collaboration with Claude Code
(claude.ai/code).

Changes

Compute:

  • addmul1 (GF multiply-accumulate) vectorized with SSSE3, AVX2, and
    AVX-512BW, runtime-dispatched via CPUID
  • NEON addmul1 with 2x unrolled loop for ARMv8
  • XOR cook pipeline vectorized (SSE2 / AVX2 / AVX-512BW / NEON)
  • Hardware CRC32C (SSE4.2 / ARMv8-CRC) replacing zlib CRC32
  • PowerPC e500v2 SPE path for the XOR cook loop

Syscalls:

  • io_uring multishot receive with provided buffer rings
  • sendmmsg batching for FEC output
  • Zero-copy recv via in-place conv header

Data structures:

  • Replaced std::map and unordered_map in the packet path with
    direct-mapped flat tables
  • Eliminated per-call malloc in fec_decode

Results

CI measurements (GitHub Actions), current vs pre-optimization
baseline in the same run:

no-fec: +48-76%
fec 20:10: +81-113%
FEC overhead share: 41% -> 30%

Range reflects host variability between CI runs.

Testing

  • Correctness tests and microbenchmarks (nanobench)
  • Cross-compiled and tested across x86_64, aarch64, mips,
    powerpc (SPE), riscv64
  • 32-test cross-architecture interop matrix (any sender to any
    receiver, with and without encryption)
  • End-to-end throughput A/B against baseline branch
  • io_uring vs recvfrom A/B

Compatibility

  • All SIMD paths runtime-detected; scalar fallback on any platform
  • io_uring falls back to recvfrom on older kernels
  • UDPSPEEDER_NO_URING=1 disables io_uring at runtime
  • No new dependencies. C++11 + POSIX, static builds as before.

connollydavid and others added 30 commits February 20, 2026 18:43
Establish baselines before SIMD work: nanobench microbenchmarks for
addmul1, rs_encode/decode, and CRC32/CRC32C at realistic packet sizes.
Correctness tests cover FEC round-trip, GF(2^8) algebraic properties,
and CRC32C known-answer/hw-sw agreement. CRC32C implementation provides
SSE4.2 and ARMv8-CRC hardware paths with software fallback and runtime
dispatch. GitHub Actions CI runs tests, tracks benchmark regressions,
and cross-compiles for x86_64-musl, ARM, and aarch64.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch from software slicing-by-16 CRC32 to CRC32C with hardware
acceleration (SSE4.2 on x86, ARMv8-CRC on ARM) and software fallback.
Benchmarks show 3x improvement on 1500B packets (142ns vs 444ns).
Wire-protocol breaking change — clean break, no backward compatibility.

- Move crc32c.h from bench/ to project root
- Swap all crc32_fast() calls to crc32c() in packet.cpp
- Remove crc32/Crc32.cpp from production SOURCES0
- Old CRC32 retained in bench/test for baseline comparison
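
For readers who want a reference to check the hardware paths against, a software fallback can be as small as a bitwise loop. This is a minimal sketch, not the PR's crc32c.h: `crc32c_bitwise` is a hypothetical name. It relies only on the Castagnoli polynomial (reflected form 0x82F63B78) and the standard check value 0xE3069283 for the ASCII string "123456789".

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
// Hypothetical reference helper: slow, but easy to compare against
// SSE4.2 / ARMv8-CRC results in a known-answer test.
inline uint32_t crc32c_bitwise(const void* data, size_t len) {
    const uint8_t* p = static_cast<const uint8_t*>(data);
    uint32_t crc = 0xFFFFFFFFu;                  // standard initial value
    for (size_t i = 0; i < len; ++i) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; fold in the polynomial when the low bit was set.
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;                    // standard final XOR
}
```

Because zlib's CRC32 uses a different polynomial, the same input produces a different checksum, which is why old and new builds cannot validate each other's packets.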

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace scalar byte-at-a-time GF(2^8) multiply-accumulate with PSHUFB/TBL
nibble decomposition, processing 16 bytes per SIMD iteration. Scalar
fallback preserved for MIPS, i486, and other architectures.
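
The nibble decomposition can be shown in portable scalar code before any SIMD is involved. This is a sketch under assumptions: the function names are hypothetical, and the reduction polynomial 0x11d is assumed because the commit message does not say which polynomial the tables use. The two 16-entry tables are what a PSHUFB/TBL kernel would hold in vector registers, looking up 16 bytes at once.

```cpp
#include <cstddef>
#include <cstdint>

// Multiply two elements of GF(2^8).
// ASSUMPTION: reduction polynomial 0x11d (not stated in the PR).
inline uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        b >>= 1;
        // Multiply a by x, reducing when the top bit falls off.
        a = static_cast<uint8_t>((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
    }
    return r;
}

// dst[i] ^= c * src[i], using one table per nibble (hypothetical name).
inline void addmul1_nibble(uint8_t* dst, const uint8_t* src, uint8_t c, size_t n) {
    uint8_t lo[16], hi[16];
    for (int i = 0; i < 16; ++i) {
        lo[i] = gf_mul(c, static_cast<uint8_t>(i));       // c * low nibble
        hi[i] = gf_mul(c, static_cast<uint8_t>(i << 4));  // c * high nibble
    }
    for (size_t i = 0; i < n; ++i) {
        // Multiplication distributes over XOR:
        // c * x == c * (x & 0x0f) ^ c * (x & 0xf0).
        dst[i] ^= static_cast<uint8_t>(lo[src[i] & 15] ^ hi[src[i] >> 4]);
    }
}
```

The split works because the field's addition is XOR, so a 256-entry product table collapses into two 16-entry tables that each fit in a single 128-bit register.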

Benchmarked on x86_64: addmul1/1500B 665ns→120ns (5.5x), rs_encode
k10/n15 34μs→6.1μs (5.6x), rs_decode k10/n15 36μs→7.9μs (4.5x).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
VPSHUFB 256-bit processes 32 bytes/iteration (2x SSSE3 width) with SSE
tail for remainder. Runtime dispatch via function pointer set in init_fec():
CPUID checks OSXSAVE, XCR0 AVX state, and leaf 7 AVX2 bit. Defaults to
SSSE3 on x86_64 CPUs without AVX2.
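
The function-pointer dispatch pattern looks roughly like this. It is a sketch, not the PR's init_fec(): all names are hypothetical, the "wide" kernel is a plain 8-byte stand-in for a real SIMD routine, and the GCC/Clang builtin `__builtin_cpu_supports` is swapped in for the hand-written CPUID sequence described above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

typedef void (*xor_fn)(uint8_t* dst, const uint8_t* src, size_t n);

// Baseline kernel, valid on every platform.
static void xor_scalar(uint8_t* dst, const uint8_t* src, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] ^= src[i];
}

// Stand-in for a vector kernel: 8 bytes per step plus a byte tail.
static void xor_wide(uint8_t* dst, const uint8_t* src, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t d, s;
        std::memcpy(&d, dst + i, 8);
        std::memcpy(&s, src + i, 8);
        d ^= s;
        std::memcpy(dst + i, &d, 8);
    }
    for (; i < n; ++i) dst[i] ^= src[i];
}

// The hot path always calls through this pointer.
static xor_fn g_xor = xor_scalar;        // safe default before init

// Call once at startup; picks the best kernel the CPU reports.
void init_dispatch() {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) g_xor = xor_wide;
#endif
}
```

Resolving the choice once at init keeps the per-packet cost to a single indirect call, and the scalar default means a missed init is slow rather than incorrect.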

Benchmarked: rs_encode k10/n15 6.1μs→3.2μs (1.9x over SSSE3, 10.6x
over original scalar), rs_decode 7.9μs→5.1μs (1.6x over SSSE3).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract CRC32C + XOR obfuscation + XOR encryption from packet.cpp into
self-contained packet_cook.cpp with cook_ctx_t struct replacing 6 globals.
Enables isolated benchmarking without libev/networking dependencies.

Benchmarks reveal do_cook/1500B (~3,000ns) nearly matches rs_encode k10/n15
(~3,300ns) — the two XOR passes dominate at ~4,200ns combined.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-expand repeating key/IV into SIMD-aligned tiles and XOR 16 bytes at
a time. Key tile is computed once at startup (cook_ctx_prepare_key),
IV tile is built per-packet on the stack.
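
A portable sketch of the tiling idea follows. Names are hypothetical and `std::vector` is used for brevity where the PR describes a stack buffer. The tile length is lcm(key_len, 16), so the key pattern repeats exactly at the tile boundary and every 16-byte chunk lies entirely inside the tile.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

static size_t gcd_sz(size_t a, size_t b) {
    while (b) { size_t t = a % b; a = b; b = t; }
    return a;
}

// Expand a repeating key (key_len > 0) into a tile of lcm(key_len, 16) bytes.
std::vector<uint8_t> make_tile(const uint8_t* key, size_t key_len) {
    size_t tile_len = key_len / gcd_sz(key_len, 16) * 16;
    std::vector<uint8_t> tile(tile_len);
    for (size_t i = 0; i < tile_len; ++i) tile[i] = key[i % key_len];
    return tile;
}

// XOR data with the tiled key, 16 bytes per step.
void xor_tile_sketch(uint8_t* data, size_t len, const std::vector<uint8_t>& tile) {
    size_t i = 0, t = 0;
    for (; i + 16 <= len; i += 16) {
        // This fixed-width inner loop is what a SIMD kernel replaces
        // with one 16-byte load, XOR, and store.
        for (int j = 0; j < 16; ++j) data[i + j] ^= tile[t + j];
        t += 16;
        if (t == tile.size()) t = 0;     // wrap at the tile boundary
    }
    for (; i < len; ++i) data[i] ^= tile[t++];   // tail, byte at a time
}
```

The result is identical to `data[i] ^= key[i % key_len]`, but the modulo leaves the inner loop entirely.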

cook_xor_only/1500B: 1,826ns → 139ns (13x)
cook_obscure_only/1500B: 2,423ns → 464ns (5.2x)
do_cook/1500B: 2,978ns → 716ns (4.2x)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace 7+ malloc/free pairs per decode with pre-allocated scratch:
- invert_mat: 5 heap buffers → stack VLAs
- build_decode_matrix: heap k×k matrix → cached in fec_parms
- fec_decode: per-row data buffers → contiguous scratch in fec_parms

rs_decode k10/n15: 4,843ns → 4,601ns (~5%), eliminates allocation
jitter on the real-time decode path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summarizes all optimization work: SSSE3/AVX2/NEON addmul1, CRC32C,
cook pipeline XOR vectorization, and fec_decode malloc elimination.
Benchmarked on i5-7300U, targeting Intel N150 and Mediatek Filogic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rework CI into test + build-static jobs. Build bench/test/production
binaries for x86_64 (g++ -static) and aarch64 (cross-compile), upload
as downloadable artifacts with a POSIX sh profiling script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Create release.yml triggered on v* tags that builds static Linux
binaries (x86_64 + aarch64) and publishes them as GitHub Releases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bench/throughput.sh: loopback tunnel throughput test using Python UDP
sender/receiver. Runs 3 iterations, reports median MB/s. Tests with
and without FEC.

throughput.yml: CI workflow comparing current branch against baseline
(origin/baseline) for vectorization impact measurement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set executable bit in git index and add chmod fallback in workflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use printf instead of echo for JSON to avoid encoding issues
- Default median to 0.0 if empty
- Build JSON array in one printf call instead of echo concatenation
- Add JSON validation step in CI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The receiver was being killed before its socket timeout fired,
so stdout was never flushed to the tmpfile. Now we:
- Wait for the receiver to exit naturally (2s socket timeout)
- Kill only tunnel processes after receiver finishes
- Add debug output to stderr for CI visibility
- Use sys.stdout.flush() for explicit buffer flush

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both current and baseline throughput are now stored as separate
series (throughput/* and baseline/throughput/*) so they appear
side by side on the gh-pages chart.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Multiply MB/s by 8 to report megabits per second (Mbps), the standard
networking throughput unit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pin cook IV to fixed 16 bytes (was random 4-32) to reduce variance,
increase nanobench epochs 11→21 with MdAPE >5% stability warnings,
add throughput warmup run with 5×10s iterations (was 3×5s), pin
benchmarks to CPU 0, and tighten alert threshold 200%→115%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… SIMD unrolling

Zero-copy changes (latency-neutral, eliminates 2 full-packet memcpys):
- Receive into data+4 headroom, write conv header in-place (put_conv_inplace)
- Skip memcpy in delay_send for delay=0 (common case), send from caller's buffer

SIMD improvements:
- addmul1: 2x loop unrolling for SSSE3 (32B/iter) and AVX2 (64B/iter)
- xor_tile: AVX2 variant with broadcast fast path for tile_len==16, runtime CPUID dispatch

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For customBiggerIsBetter, alert-threshold 70% alerts when
current <= previous/0.70, which is almost always true. Change
to 115% to alert on ~13% throughput drops (same as microbench).
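
The arithmetic can be checked with a few lines. This sketch encodes the rule exactly as the message above states it (alert when current <= previous / threshold). The function name is hypothetical, and the rule is assumed to match the benchmark action's behaviour.

```cpp
// Alert rule for bigger-is-better metrics as described above:
// flag a regression when previous / current reaches the threshold.
// ASSUMPTION: mirrors the CI action's comparison; name is hypothetical.
inline bool should_alert(double previous, double current, double threshold) {
    return previous / current >= threshold;
}
```

With a threshold of 0.70, an unchanged result (ratio 1.0) already alerts, and so does a 30% improvement (ratio about 0.77). With 1.15, the alert fires only once current falls to previous / 1.15, about 87% of the previous value, which is a drop of roughly 13%.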

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
from_sockaddr() takes non-const sockaddr*, so drop const from
the extracted function signatures to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Eliminate per-packet recvfrom/recv syscalls on Linux 6.0+ by using
io_uring with multishot recv/recvmsg and provided buffer rings.
Shared-memory SQ/CQ rings avoid user↔kernel context switches; one
multishot SQE serves many CQEs without resubmission.

New files:
- io_uring_recv.h/cpp: raw syscall wrappers (no liburing dependency),
  buffer ring management, CQE parsing, multishot recv/recvmsg API

Integration:
- tunnel_client.cpp: extract client_process_remote_packet(), add
  client_uring_cb() dispatching by tag type, conditionalize init
- tunnel_server.cpp: extract server_process_remote_packet(), add
  server_uring_cb(), conditionalize init + new-connection path
- connection.cpp: cancel uring multishot on conv expiry
- makefile: add io_uring_recv.cpp to SOURCES0

Runtime fallback: uring_init() probes io_uring_setup + PBUF_RING
registration. On failure (kernel <6.0 or missing features), falls
back transparently to existing libev ev_io callbacks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace C11 _Atomic/stdatomic.h with GCC __atomic_store_n/__atomic_load_n
  builtins which work in C++11
- Remove struct fallback definitions (io_uring_buf, io_uring_buf_ring,
  io_uring_buf_reg, io_uring_recvmsg_out) that conflict with kernel headers
  on Ubuntu 24.04 — require kernel 6.0+ headers for compilation
- Fix buf_ring_add: br->tail is __u16, use __atomic_store_n directly
  instead of unsigned* wrapper

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
de_obscure() read iv_len from the last byte of incoming data without
checking it against iv_min/iv_max. Corrupt or malicious packets could
produce iv_len up to 255, causing lcm(255, 16) = 4080 which overflows
the 512-byte stack tile in xor_with_pattern. Pre-existing bug exposed
by io_uring multishot receive delivering packets that hit this path.
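
The overflow arithmetic and a possible guard can be sketched as follows. The 512-byte tile size and the iv_min/iv_max names come from the message above; the function names and the exact form of the check are assumptions.

```cpp
#include <cstddef>

static const size_t TILE_CAP = 512;   // stack tile size quoted above

inline size_t gcd_u(size_t a, size_t b) {
    while (b) { size_t t = a % b; a = b; b = t; }
    return a;
}

inline size_t lcm_u(size_t a, size_t b) { return a / gcd_u(a, b) * b; }

// Hypothetical guard: accept iv_len only inside [iv_min, iv_max] and only
// if the expanded tile lcm(iv_len, 16) fits the fixed stack buffer.
inline bool iv_len_ok(size_t iv_len, size_t iv_min, size_t iv_max) {
    if (iv_len < iv_min || iv_len > iv_max) return false;
    return lcm_u(iv_len, 16) <= TILE_CAP;
}
```

For the 4-32 IV range mentioned elsewhere in this PR, the largest tile is lcm(31, 16) = 496 bytes, which fits. An unchecked wire value of 255 gives 255 * 16 = 4080 because 255 and 16 share no factors.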

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix recvmsg_out payload offset: use template msg_namelen (128) instead
  of hdr->namelen (16), matching liburing's io_uring_recvmsg_payload()
- Fix CQ tail read: use acquire barrier (required for ARM correctness)
- Batch CQ head advancement: single release store per drain batch
- Batch buffer ring recycling: deferred adds with single tail commit
- Combined submit+flush into single io_uring_enter syscall
- Increase CQ ring to 4x buffer count to avoid multishot stalls
- Add COOP_TASKRUN + SINGLE_ISSUER flags with fallback
- Eliminate memcpy for SERVER_LOCAL and CLIENT_REMOTE paths
- Add UDPSPEEDER_NO_URING env var for A/B throughput testing
- Add throughput CI comparison (io_uring vs recvfrom)
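
The acquire/release pairing and the batched head advance can be illustrated on a toy single-producer, single-consumer ring. This is not the io_uring ABI: the struct layout and names are hypothetical. It only shows the same ordering discipline using the GCC `__atomic` builtins mentioned earlier in this PR.

```cpp
#include <cstdint>

struct demo_ring {
    uint32_t head;       // written by the consumer only
    uint32_t tail;       // written by the producer only
    uint32_t slots[8];   // power-of-two capacity
};
static const uint32_t DEMO_MASK = 7;

// Producer: fill the slot first, then publish it with a release store.
inline bool demo_push(demo_ring* r, uint32_t v) {
    uint32_t head = __atomic_load_n(&r->head, __ATOMIC_ACQUIRE);
    uint32_t tail = r->tail;
    if (tail - head > DEMO_MASK) return false;          // ring is full
    r->slots[tail & DEMO_MASK] = v;
    __atomic_store_n(&r->tail, tail + 1, __ATOMIC_RELEASE);
    return true;
}

// Consumer: acquire-load the tail, drain everything visible, then
// advance head with a single release store for the whole batch.
inline unsigned demo_drain(demo_ring* r, uint32_t* out, unsigned max_out) {
    uint32_t head = r->head;
    uint32_t tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
    unsigned n = 0;
    while (head != tail && n < max_out) out[n++] = r->slots[head++ & DEMO_MASK];
    if (n) __atomic_store_n(&r->head, head, __ATOMIC_RELEASE);
    return n;
}
```

The acquire load guarantees that slot contents written before the producer's release store are visible to the consumer. On weakly ordered CPUs such as ARM, a plain load gives no such guarantee.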

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace N individual sendto() calls per FEC batch with a single
sendmmsg() syscall. Adds my_send_batch() and delay_send_batch()
that cook all packets then send via one kernel transition.
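
A minimal Linux-only sketch of the batching call follows. `packet_view` and `send_batch` are hypothetical names, not the PR's my_send_batch(), and the socket is assumed to be already connected so no per-message address is needed.

```cpp
#include <sys/socket.h>   // sendmmsg, mmsghdr (Linux; g++ defines _GNU_SOURCE)
#include <sys/uio.h>      // iovec
#include <cstddef>
#include <cstring>

struct packet_view { const void* data; size_t len; };   // hypothetical type

// Send up to 16 datagrams on a connected socket with one syscall.
// Returns the number of datagrams the kernel accepted, or -1 on error;
// callers should handle a short count by resending the remainder.
inline int send_batch(int fd, const packet_view* pkts, unsigned count) {
    enum { MAX_BATCH = 16 };
    mmsghdr msgs[MAX_BATCH];
    iovec iovs[MAX_BATCH];
    if (count > MAX_BATCH) count = MAX_BATCH;
    std::memset(msgs, 0, sizeof(msgs));
    for (unsigned i = 0; i < count; ++i) {
        iovs[i].iov_base = const_cast<void*>(pkts[i].data);
        iovs[i].iov_len = pkts[i].len;
        msgs[i].msg_hdr.msg_iov = &iovs[i];   // one buffer per datagram
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    return sendmmsg(fd, msgs, count, 0);
}
```

Each element of the array still becomes its own datagram on the wire. Only the user-to-kernel transition is shared.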

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fec_group_t::group_mp was a std::map<int,int> (red-black tree) used
for shard indices 0..254. Replace with a flat int[256] array for O(1)
lookup/insert and zero heap allocation per shard. Also cache mp[seq]
reference to avoid repeated unordered_map lookups.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reserve 4-byte headroom (URING_RECV_HEADROOM) before each provided
buffer so callers can write the conv header in-place instead of
copying the entire payload to a stack buffer.

- Buffer registration offsets by +4 bytes, reducing usable size by 4
- recvmsg path (CLIENT_LOCAL): 140+ bytes of natural headroom from
  the recvmsg_out header + sockaddr area, plus the 4-byte offset
- recv path (SERVER_REMOTE): 4-byte headroom from buffer offset
- Both paths now use recv_buf.data - sizeof(u32_t) directly

Saves ~1400-byte memcpy per packet on the io_uring encode path.
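
The headroom trick in isolation: the payload is received 4 bytes into the buffer, and the header is written into the gap in front of it. This is a sketch with hypothetical names. The big-endian header layout is an assumption, since the PR does not state the byte order.

```cpp
#include <cstddef>
#include <cstdint>

static const size_t RECV_HEADROOM = 4;   // mirrors URING_RECV_HEADROOM above

// Write a 32-bit conv id into the 4 bytes directly before the payload
// and return a pointer to the start of the framed packet.
// ASSUMPTION: big-endian header; the caller guarantees the headroom exists.
inline uint8_t* put_conv_before(uint8_t* payload, uint32_t conv) {
    uint8_t* hdr = payload - RECV_HEADROOM;
    hdr[0] = static_cast<uint8_t>(conv >> 24);
    hdr[1] = static_cast<uint8_t>(conv >> 16);
    hdr[2] = static_cast<uint8_t>(conv >> 8);
    hdr[3] = static_cast<uint8_t>(conv);
    return hdr;
}
```

A caller would receive into `buf + RECV_HEADROOM` and then pass `put_conv_before(buf + RECV_HEADROOM, conv)` with `payload_len + 4` to the next stage, so the payload bytes never move.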

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
connollydavid and others added 19 commits February 26, 2026 19:28
The old anti_replay_t used an unordered_map (90K buckets, ~2MB scattered)
plus a 240KB ring buffer — 1-3 hash lookups per incoming FEC shard.

Replace with a u32_t[32768] direct-mapped table (128KB contiguous):
- is_vaild: single array access + compare (~3 ns vs ~30-100 ns)
- set_invaild: single array write (~2 ns vs ~100-200 ns)
- No hash function, no pointer chasing, fits in L2 cache
- Effective window ~32K groups, comparable to old 30K ring buffer
- Old entries naturally evicted by new seqs mapping to same slot
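
A direct-mapped table of this shape might look like the sketch below. It uses the corrected spelling is_valid/set_invalid. The struct name and the 0xFFFFFFFF "empty" marker are assumptions, because the commit message does not say how empty slots are represented.

```cpp
#include <cstdint>
#include <cstring>

struct anti_replay_sketch {
    enum { SLOTS = 32768 };            // 32768 * 4 bytes = 128 KB
    uint32_t table[SLOTS];

    // ASSUMPTION: 0xFFFFFFFF never occurs as a real seq, so it marks "empty".
    anti_replay_sketch() { std::memset(table, 0xFF, sizeof(table)); }

    // A seq is acceptable unless its slot already records that exact seq.
    bool is_valid(uint32_t seq) const { return table[seq & (SLOTS - 1)] != seq; }

    // Record seq; whatever older seq shared the slot is evicted.
    void set_invalid(uint32_t seq) { table[seq & (SLOTS - 1)] = seq; }
};
```

The trade-off is visible in the eviction: once a seq 32768 higher lands in the same slot, the older seq is forgotten and would be accepted again. This is the "effective window" noted in the list above.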

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace unordered_map<u32_t, fec_group_t> in fec_decode_manager with a
pre-allocated flat array indexed by seq & mask. Eliminates per-group
malloc/free (~50K allocs/sec) and reduces group init from 1KB memset
(shard_idx[256]) to 32-byte bitmap clear.

Key changes:
- fec_group_t: add seq field, bitmap-based shard tracking (has_shard/set_shard)
- group_table: heap-allocated array, size = next_pow2(fec_buff_num * 2)
- Direct-mapped lookup: group_table[seq & mask], safe because monotonic
  seqs guarantee no two concurrent groups collide when table > max groups
- Per-shard cost: array index + compare vs hash + pointer chase
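
The flat table plus bitmap can be sketched like this. Only has_shard/set_shard and the `seq & mask` indexing come from the list above. The struct names, the `in_use` flag, and the reset policy are assumptions made for the illustration.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

struct group_sketch {
    uint32_t seq;
    bool in_use;
    uint8_t bitmap[32];      // one bit per shard index 0..255

    bool has_shard(unsigned idx) const { return (bitmap[idx >> 3] >> (idx & 7)) & 1u; }
    void set_shard(unsigned idx) { bitmap[idx >> 3] |= static_cast<uint8_t>(1u << (idx & 7)); }
};

inline uint32_t next_pow2_u32(uint32_t v) {
    uint32_t p = 1;
    while (p < v) p <<= 1;
    return p;
}

struct group_table_sketch {
    std::vector<group_sketch> slots;   // allocated once, zero-initialized
    uint32_t mask;

    explicit group_table_sketch(uint32_t max_groups)
        : slots(next_pow2_u32(max_groups * 2)),
          mask(static_cast<uint32_t>(slots.size()) - 1) {}

    // Return the slot for seq, resetting it if it held a different group.
    group_sketch& get(uint32_t seq) {
        group_sketch& g = slots[seq & mask];
        if (!g.in_use || g.seq != seq) {
            g.seq = seq;
            g.in_use = true;
            std::memset(g.bitmap, 0, sizeof(g.bitmap));   // 32 bytes, not 1 KB
        }
        return g;
    }
};
```

Storing seq in the slot is what makes the direct mapping safe to check: a lookup that finds a different seq knows the slot belongs to another group and starts fresh.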

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document optimizations 9-13 (sendmmsg batching, flat decode arrays,
zero-copy recv, anti-replay table, flat group table with bitmap).
Add end-to-end throughput results (+48-76% no-fec, +81-113% fec-20:10
vs baseline). Analyze diminishing returns and remaining architectural
memcpy bottleneck.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- New xor_spe.S: SPE assembly (evldd/evxor/evstdd) for 8-byte XOR,
  4x unrolled main loop (32 bytes/iter), with alignment handling
  and tile wrap-around. Uses %r prefix and addic for r0 correctness.

- packet_cook.cpp: HAVE_PPC_SPE dispatch tier (COOK_VEC_WIDTH=8),
  word-width generic fallback for all non-x86/ARM platforms,
  tile buffer padding to handle SPE cross-boundary loads.

- makefile: SPE=1 flag sets -DHAVE_PPC_SPE -Wa,-mspe for both
  FLAGS and BENCH_FLAGS. xor_spe.S added to all source lists.

- CI: PowerPC matrix entry with OpenWrt 25.12.0-rc5 mpc85xx/p1010
  toolchain, qemu-ppc-static -cpu e500v2 tests, QEMU benchmarks,
  separate PPC benchmark dashboard. Also adds QEMU tests for aarch64.

Tested: native x86 (make test/bench/all), PPC cross-compile + QEMU
(all 55 tests pass including cook round-trip at all sizes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Section 14: SPE XOR assembly, QEMU benchmarks, gotchas
- Cross-architecture notes: x86_64, ARMv8, e500v2, MIPS, RISC-V

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bug fixes:
- Guard scalar addmul1 against sz<=0 (UB pointer before array)
- Reject invalid inner_index from network packets in fec_decode
- Fix is_vaild→is_valid typo across common.h, fec_manager.h/cpp
- Add sendmmsg return value checking with log_warn/log_debug

Test coverage:
- 80 xor_tile roundtrip tests (2 tile sizes × 10 data lengths × 4 offsets)
- All 8 cook checksum/obscure/xor enable/disable combos at 64B and 1500B
- RS round-trip with k/n=(1,2),(1,3),(20,30),(50,75) plus lose-last,
  lose-evens, lose-odds, lose-middle, lose-scattered loss patterns
- CRC32C hw/sw agreement at unaligned offsets (1, 3)
- Expose bench_xor_tile and bench_cook_vec_width for test access

Build:
- Add -MMD -MP for header dependency tracking
- Clean *.d files in make clean

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Expand CI build matrix from 3 to 5 architectures (add MIPS big-endian
and RISC-V 64) and add a new interop job that verifies data integrity
across 8 arch pairs x 3 configs (24 tests total) via QEMU-user.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…edundant zeroing

- Replace per-struct memset in sendmmsg loop with targeted field writes
- Merge type==1 output array init loop into populate loop
- Skip shard padding memset when already at max_len

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add commit wangyu-#14 (sendmmsg targeted init, loop merge, skip-zero padding).
Document why scatter-gather RS encode and zero-copy RS decode are
dead ends: scatter bookkeeping ~= memcpy cost, decode in-place
modification prevents buffer reuse with net-zero copy trade.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Guard the __BYTE_ORDER define with #ifndef so it doesn't conflict
with musl's built-in definition (affects RISC-V and PPC cross builds).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes undefined reference to _Unwind_Resume when statically linking
with musl-based OpenWrt toolchains (RISC-V, PowerPC).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Captures server/client stderr at log-level 4 (info) to temp files
and prints last 80 lines on failure. Helps diagnose cross-arch
issues like the PPC+FEC failure without manual re-runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move head-alignment and tile rotation into C (packet_cook.cpp) so the
SPE assembly always receives 8-byte-aligned data at tile offset 0.
The old assembly head loop left the tile offset misaligned (1-7) after
aligning the data pointer, causing every subsequent evldd to read from
a misaligned tile address — silent corruption on e500v2.

Also: add --log-level pass-through to interop.sh, enable trace logging
for PPC-client CI pairs, add no-fec-key interop config to isolate key
XOR failures, and add unaligned-buffer cook unit tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The SPE alignment bug is fixed (41ed115). Drop the --log-level 6
override for PPC-client pairs; default level 4 is sufficient.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
addmul1_avx512: 512-bit GF(2^8) multiply-accumulate using vpshufb
for nibble table lookups and vpternlogd (0x96) for 3-way XOR in one
instruction. 128 bytes/iteration (2x unrolled), with 64B/32B/16B/scalar
tails.
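
Why the immediate 0x96 means a three-way XOR can be verified with a scalar model of ternary logic. For each bit position the three input bits form an index into the 8-bit immediate. The function name is hypothetical and this is a model of the semantics, not an intrinsic.

```cpp
#include <cstdint>

// Bitwise model of a ternary-logic operation on 32-bit lanes: the three
// input bits at each position select one bit of the immediate, using
// index = (a << 2) | (b << 1) | c.
inline uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm) {
    uint32_t out = 0;
    for (int bit = 0; bit < 32; ++bit) {
        unsigned idx = (((a >> bit) & 1u) << 2) |
                       (((b >> bit) & 1u) << 1) |
                       ((c >> bit) & 1u);
        out |= static_cast<uint32_t>((imm >> idx) & 1u) << bit;
    }
    return out;
}
```

0x96 is binary 10010110, with bits 1, 2, 4, and 7 set. These are exactly the indices with an odd number of one-bits, which is the truth table of a ^ b ^ c. The accumulate step dst ^= lo ^ hi therefore needs one instruction instead of two.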

xor_tile_avx512: 512-bit XOR cook pipeline with broadcast fast path
for tile_len=16 and 4x128-bit insert for arbitrary tile lengths.

CPUID detection checks OSXSAVE, XCR0 bits 1,2,5,6,7 (SSE+AVX+opmask+
ZMM), and CPUID.7.EBX[30] (AVX-512BW). Falls back to AVX2 → SSSE3/SSE2
on older hardware.
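
The decision logic can be written as a pure function over the three register values, which makes it testable without executing CPUID. The names are hypothetical. The bit positions are the ones listed above, plus OSXSAVE at CPUID.1:ECX[27].

```cpp
#include <cstdint>

// XCR0 state bits 1, 2, 5, 6, 7: SSE, AVX, opmask, ZMM_Hi256, Hi16_ZMM.
static const uint64_t XCR0_AVX512_MASK =
    (1u << 1) | (1u << 2) | (1u << 5) | (1u << 6) | (1u << 7);   // 0xE6

// Inputs are values a caller would read from CPUID leaf 1 ECX,
// XGETBV(0), and CPUID leaf 7 subleaf 0 EBX.
inline bool can_use_avx512bw(uint32_t leaf1_ecx, uint64_t xcr0, uint32_t leaf7_ebx) {
    const bool osxsave = (leaf1_ecx >> 27) & 1u;      // OS exposes XGETBV
    if (!osxsave) return false;
    // The OS must save and restore every register class the kernel touches.
    if ((xcr0 & XCR0_AVX512_MASK) != XCR0_AVX512_MASK) return false;
    return (leaf7_ebx >> 30) & 1u;                    // AVX-512BW feature bit
}
```

Checking XCR0 matters because a CPU can advertise AVX-512BW while the OS has not enabled ZMM state. Executing a 512-bit instruction in that case faults.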

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds SIMD banner at bench startup showing which codepath was selected:
  SIMD: addmul1=avx512bw  xor_cook=avx512bw  vec_width=16

Exposes bench_addmul1_impl() and bench_xor_tile_impl() from the
respective source files, reading the runtime dispatch state directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ARMv8 has 32 NEON registers so 2x unrolling is free in register
pressure. Processes 32 bytes/iteration in the main loop with a
16-byte tail, matching the x86 unroll pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@andr2000

andr2000 commented Mar 9, 2026

Hi, just curious if you also had a chance to test Win (mingw build)?
Thanks

@slartibardfast
Author

hi @andr2000 i was actually unaware of mingw64 build

i'll add it to my build pipeline when i get a chance, thanks
none of the io_uring stuff is directly applicable

@slartibardfast
Author

slartibardfast commented Mar 10, 2026

hi @andr2000 interested to hear how you get on with:

https://github.com/slartibardfast/UDPspeeder/releases/tag/v20260310-rc1

guess we're WIP on windows still, may have hit a bug even.
but small improvement:
https://slartibardfast.github.io/UDPspeeder/#windows

@andr2000

well, I'm a Linux guy, but I may have a win machine in my setup and didn't test it yet...
