SIMD vectorization, io_uring, and syscall batching (2x throughput with Claude Code Opus 4.6) #354
Open
slartibardfast wants to merge 49 commits into
Establish baselines before SIMD work: nanobench microbenchmarks for addmul1, rs_encode/decode, and CRC32/CRC32C at realistic packet sizes. Correctness tests cover FEC round-trip, GF(2^8) algebraic properties, and CRC32C known-answer/hw-sw agreement. CRC32C implementation provides SSE4.2 and ARMv8-CRC hardware paths with software fallback and runtime dispatch. GitHub Actions CI runs tests, tracks benchmark regressions, and cross-compiles for x86_64-musl, ARM, and aarch64. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
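A known-answer test of the kind mentioned above needs a reference implementation to compare the hardware path against. The following is a minimal bitwise sketch of CRC32C (Castagnoli), not the code from this PR; the name `crc32c_reference` is hypothetical. It relies only on the standard reflected polynomial 0x82F63B78 and the well-known check value for the ASCII string "123456789".

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
// Slow, but easy to audit; intended only as a reference for
// hardware/software agreement tests. The name is hypothetical.
inline uint32_t crc32c_reference(const void* data, size_t len) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    uint32_t crc = 0xFFFFFFFFu;                 // standard initial value
    for (size_t i = 0; i < len; ++i) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Conditionally XOR the polynomial when the low bit is set.
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;                   // standard final XOR
}
```

A hardware path (SSE4.2 or ARMv8-CRC) should return the same value for every input, including buffers at unaligned offsets.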
Switch from software slicing-by-16 CRC32 to CRC32C with hardware acceleration (SSE4.2 on x86, ARMv8-CRC on ARM) and software fallback. Benchmarks show 3x improvement on 1500B packets (142ns vs 444ns). Wire-protocol breaking change: clean break, no backward compatibility.
- Move crc32c.h from bench/ to project root
- Swap all crc32_fast() calls to crc32c() in packet.cpp
- Remove crc32/Crc32.cpp from production SOURCES0
- Old CRC32 retained in bench/test for baseline comparison
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace scalar byte-at-a-time GF(2^8) multiply-accumulate with PSHUFB/TBL nibble decomposition, processing 16 bytes per SIMD iteration. Scalar fallback preserved for MIPS, i486, and other architectures. Benchmarked on x86_64: addmul1/1500B 665ns→120ns (5.5x), rs_encode k10/n15 34μs→6.1μs (5.6x), rs_decode k10/n15 36μs→7.9μs (4.5x). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
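The nibble decomposition works because multiplication by a constant in GF(2^8) is linear over XOR, so c*x equals c*(low nibble) XOR c*(high nibble << 4). The scalar sketch below models what PSHUFB/TBL do 16 bytes at a time with two 16-entry tables. It is an illustration only: the function names are hypothetical, and the reduction polynomial 0x11D is an assumption, since the PR text does not state which one the library uses.

```cpp
#include <cstddef>
#include <cstdint>

// Shift-and-add multiply in GF(2^8).
// ASSUMPTION: reduction polynomial 0x11D (not stated in the PR).
inline uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a = static_cast<uint8_t>((a << 1) ^ ((a & 0x80) ? 0x1D : 0));
        b >>= 1;
    }
    return r;
}

// Two 16-entry tables per coefficient: products of c with every
// low nibble and with every high nibble.
struct NibbleTables {
    uint8_t lo[16];
    uint8_t hi[16];
};

inline NibbleTables make_nibble_tables(uint8_t c) {
    NibbleTables t;
    for (int i = 0; i < 16; ++i) {
        t.lo[i] = gf_mul(c, static_cast<uint8_t>(i));
        t.hi[i] = gf_mul(c, static_cast<uint8_t>(i << 4));
    }
    return t;
}

// dst[i] ^= c * src[i], one byte at a time. A SIMD version performs
// the two table lookups for 16 (or more) bytes with a single shuffle each.
inline void addmul_nibble(uint8_t* dst, const uint8_t* src, uint8_t c, size_t n) {
    if (c == 0) return;  // multiplying by zero contributes nothing
    const NibbleTables t = make_nibble_tables(c);
    for (size_t i = 0; i < n; ++i) {
        dst[i] ^= static_cast<uint8_t>(t.lo[src[i] & 0x0F] ^ t.hi[src[i] >> 4]);
    }
}
```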
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
VPSHUFB 256-bit processes 32 bytes/iteration (2x SSSE3 width) with SSE tail for remainder. Runtime dispatch via function pointer set in init_fec(): CPUID checks OSXSAVE, XCR0 AVX state, and leaf 7 AVX2 bit. Defaults to SSSE3 on x86_64 CPUs without AVX2. Benchmarked: rs_encode k10/n15 6.1μs→3.2μs (1.9x over SSSE3, 10.6x over original scalar), rs_decode 7.9μs→5.1μs (1.6x over SSSE3). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
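Runtime dispatch through a function pointer can be sketched as follows. This is not the PR's init_fec() code: it uses the GCC/Clang builtin `__builtin_cpu_supports` as a stand-in for the manual CPUID/XCR0 sequence described above, the kernel names are hypothetical, and the "wide" kernel is a placeholder that simply calls the scalar one.

```cpp
#include <cstddef>
#include <cstdint>

using xor_kernel = void (*)(uint8_t* dst, const uint8_t* src, size_t n);

// Portable baseline kernel.
inline void xor_bytes_scalar(uint8_t* dst, const uint8_t* src, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] ^= src[i];
}

// Placeholder for a wider kernel; a real one would use AVX2 intrinsics.
inline void xor_bytes_wide_standin(uint8_t* dst, const uint8_t* src, size_t n) {
    xor_bytes_scalar(dst, src, n);
}

// Feature probe. On non-x86 targets or other compilers it reports false,
// so the baseline is always selected there.
inline bool cpu_has_avx2() {
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

// Choose the kernel once at startup; hot paths then call through the pointer.
inline xor_kernel select_xor_kernel() {
    return cpu_has_avx2() ? xor_bytes_wide_standin : xor_bytes_scalar;
}
```

Selecting once and storing the pointer keeps the feature check out of the per-packet path.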
Extract CRC32C + XOR obfuscation + XOR encryption from packet.cpp into self-contained packet_cook.cpp with cook_ctx_t struct replacing 6 globals. Enables isolated benchmarking without libev/networking dependencies. Benchmarks reveal do_cook/1500B (~3,000ns) nearly matches rs_encode k10/n15 (~3,300ns) — the two XOR passes dominate at ~4,200ns combined. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-expand repeating key/IV into SIMD-aligned tiles and XOR 16 bytes at a time. Key tile is computed once at startup (cook_ctx_prepare_key), IV tile is built per-packet on the stack.
cook_xor_only/1500B: 1,826ns → 139ns (13x)
cook_obscure_only/1500B: 2,423ns → 464ns (5.2x)
do_cook/1500B: 2,978ns → 716ns (4.2x)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
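The tiling idea can be sketched in portable C++: repeat the key until the tile length is a common multiple of the key length and the vector width, then XOR whole 16-byte chunks against the tile. The names `make_key_tile` and `xor_with_tile` are hypothetical, and two 64-bit words stand in for one 128-bit SIMD register; the PR's actual cook_ctx_prepare_key may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t kTileVecWidth = 16;  // bytes per "vector" step

inline size_t tile_gcd(size_t a, size_t b) {
    while (b) { size_t t = a % b; a = b; b = t; }
    return a;
}

// Repeat the key until the length is lcm(key_len, 16), so every 16-byte
// chunk of the tile starts at a known key phase. Requires key_len > 0.
inline std::vector<uint8_t> make_key_tile(const uint8_t* key, size_t key_len) {
    const size_t tile_len = key_len / tile_gcd(key_len, kTileVecWidth) * kTileVecWidth;
    std::vector<uint8_t> tile(tile_len);
    for (size_t i = 0; i < tile_len; ++i) tile[i] = key[i % key_len];
    return tile;
}

// XOR data with the repeating key, 16 bytes per iteration.
inline void xor_with_tile(uint8_t* data, size_t len, const std::vector<uint8_t>& tile) {
    size_t i = 0;  // position in data
    size_t t = 0;  // position in tile, always a multiple of 16
    for (; i + kTileVecWidth <= len; i += kTileVecWidth) {
        uint64_t d[2], k[2];
        std::memcpy(d, data + i, kTileVecWidth);        // unaligned-safe load
        std::memcpy(k, tile.data() + t, kTileVecWidth);
        d[0] ^= k[0];
        d[1] ^= k[1];
        std::memcpy(data + i, d, kTileVecWidth);
        t += kTileVecWidth;
        if (t == tile.size()) t = 0;                    // wrap at tile end
    }
    // Tail: fewer than 16 bytes remain, and the tile always has 16 left at t.
    for (size_t j = 0; i + j < len; ++j) data[i + j] ^= tile[t + j];
}
```

The result is byte-for-byte identical to XORing with `key[i % key_len]`, but the inner loop has no modulo and no per-byte branch.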
Replace 7+ malloc/free pairs per decode with pre-allocated scratch:
- invert_mat: 5 heap buffers → stack VLAs
- build_decode_matrix: heap k×k matrix → cached in fec_parms
- fec_decode: per-row data buffers → contiguous scratch in fec_parms
rs_decode k10/n15: 4,843ns → 4,601ns (~5%), eliminates allocation jitter on the real-time decode path.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summarizes all optimization work: SSSE3/AVX2/NEON addmul1, CRC32C, cook pipeline XOR vectorization, and fec_decode malloc elimination. Benchmarked on i5-7300U, targeting Intel N150 and Mediatek Filogic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rework CI into test + build-static jobs. Build bench/test/production binaries for x86_64 (g++ -static) and aarch64 (cross-compile), upload as downloadable artifacts with a POSIX sh profiling script. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Create release.yml triggered on v* tags that builds static Linux binaries (x86_64 + aarch64) and publishes them as GitHub Releases. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bench/throughput.sh: loopback tunnel throughput test using Python UDP sender/receiver. Runs 3 iterations, reports median MB/s. Tests with and without FEC. throughput.yml: CI workflow comparing current branch against baseline (origin/baseline) for vectorization impact measurement. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set executable bit in git index and add chmod fallback in workflow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use printf instead of echo for JSON to avoid encoding issues
- Default median to 0.0 if empty
- Build JSON array in one printf call instead of echo concatenation
- Add JSON validation step in CI
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The receiver was being killed before its socket timeout fired, so stdout was never flushed to the tmpfile. Now we:
- Wait for the receiver to exit naturally (2s socket timeout)
- Kill only tunnel processes after receiver finishes
- Add debug output to stderr for CI visibility
- Use sys.stdout.flush() for explicit buffer flush
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both current and baseline throughput are now stored as separate series (throughput/* and baseline/throughput/*) so they appear side by side on the gh-pages chart. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Multiply bytes/sec by 8 for megabits per second, the standard networking throughput unit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pin cook IV to fixed 16 bytes (was random 4-32) to reduce variance, increase nanobench epochs 11→21 with MdAPE >5% stability warnings, add throughput warmup run with 5×10s iterations (was 3×5s), pin benchmarks to CPU 0, and tighten alert threshold 200%→115%. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… SIMD unrolling
Zero-copy changes (latency-neutral, eliminates 2 full-packet memcpys):
- Receive into data+4 headroom, write conv header in-place (put_conv_inplace)
- Skip memcpy in delay_send for delay=0 (common case), send from caller's buffer
SIMD improvements:
- addmul1: 2x loop unrolling for SSSE3 (32B/iter) and AVX2 (64B/iter)
- xor_tile: AVX2 variant with broadcast fast path for tile_len==16, runtime CPUID dispatch
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For customBiggerIsBetter, alert-threshold 70% alerts when current <= previous/0.70, which is almost always true. Change to 115% to alert on ~13% throughput drops (same as microbench). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
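The arithmetic behind the threshold change can be checked with a small model of the rule as this commit message states it (alert when current <= previous / threshold). The function name is hypothetical, and the sketch models only the stated rule, not the benchmark action's actual source.

```cpp
// Alert rule for "bigger is better" series, as described in the commit
// message: alert when current <= previous / threshold.
// The threshold is a ratio, e.g. 0.70 for "70%" and 1.15 for "115%".
inline bool should_alert_bigger_is_better(double previous, double current, double threshold) {
    return current <= previous / threshold;
}
```

With a threshold of 0.70 and a previous value of 100, the cutoff is about 142.9, so an unchanged result of 100 still alerts. With 1.15 the cutoff is about 87.0, which corresponds to a drop of roughly 13%.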
from_sockaddr() takes non-const sockaddr*, so drop const from the extracted function signatures to match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Eliminate per-packet recvfrom/recv syscalls on Linux 6.0+ by using io_uring with multishot recv/recvmsg and provided buffer rings. Shared-memory SQ/CQ rings avoid user↔kernel context switches; one multishot SQE serves many CQEs without resubmission.
New files:
- io_uring_recv.h/cpp: raw syscall wrappers (no liburing dependency), buffer ring management, CQE parsing, multishot recv/recvmsg API
Integration:
- tunnel_client.cpp: extract client_process_remote_packet(), add client_uring_cb() dispatching by tag type, conditionalize init
- tunnel_server.cpp: extract server_process_remote_packet(), add server_uring_cb(), conditionalize init + new-connection path
- connection.cpp: cancel uring multishot on conv expiry
- makefile: add io_uring_recv.cpp to SOURCES0
Runtime fallback: uring_init() probes io_uring_setup + PBUF_RING registration. On failure (kernel <6.0 or missing features), falls back transparently to existing libev ev_io callbacks.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace C11 _Atomic/stdatomic.h with GCC __atomic_store_n/__atomic_load_n builtins which work in C++11
- Remove struct fallback definitions (io_uring_buf, io_uring_buf_ring, io_uring_buf_reg, io_uring_recvmsg_out) that conflict with kernel headers on Ubuntu 24.04; require kernel 6.0+ headers for compilation
- Fix buf_ring_add: br->tail is __u16, use __atomic_store_n directly instead of unsigned* wrapper
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
de_obscure() read iv_len from the last byte of incoming data without checking it against iv_min/iv_max. Corrupt or malicious packets could produce iv_len up to 255, causing lcm(255, 16) = 4080 which overflows the 512-byte stack tile in xor_with_pattern. Pre-existing bug exposed by io_uring multishot receive delivering packets that hit this path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
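The overflow arithmetic and the missing check can be illustrated with a short sketch. The helper names are hypothetical; the 512-byte tile and 16-byte vector width come from the commit message, and the 4..32 range used in the usage note is taken from the earlier "random 4-32" IV description rather than from the actual iv_min/iv_max definitions.

```cpp
#include <cstddef>

constexpr size_t kIvVecWidth = 16;   // SIMD step, per the commit message
constexpr size_t kIvTileCap = 512;   // stack tile size, per the commit message

inline size_t iv_gcd(size_t a, size_t b) {
    while (b) { size_t t = a % b; a = b; b = t; }
    return a;
}

inline size_t iv_lcm(size_t a, size_t b) {
    return a / iv_gcd(a, b) * b;
}

// Reject an iv_len read from the wire unless it is in range AND its
// expanded tile fits the fixed-size stack buffer.
inline bool iv_len_ok(size_t iv_len, size_t iv_min, size_t iv_max) {
    return iv_len >= iv_min && iv_len <= iv_max &&
           iv_lcm(iv_len, kIvVecWidth) <= kIvTileCap;
}
```

For lengths 4..32 the largest tile is lcm(31, 16) = 496 bytes, which fits in 512. An unchecked wire value of 255 expands to 4080 bytes and does not.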
- Fix recvmsg_out payload offset: use template msg_namelen (128) instead of hdr->namelen (16), matching liburing's io_uring_recvmsg_payload()
- Fix CQ tail read: use acquire barrier (required for ARM correctness)
- Batch CQ head advancement: single release store per drain batch
- Batch buffer ring recycling: deferred adds with single tail commit
- Combined submit+flush into single io_uring_enter syscall
- Increase CQ ring to 4x buffer count to avoid multishot stalls
- Add COOP_TASKRUN + SINGLE_ISSUER flags with fallback
- Eliminate memcpy for SERVER_LOCAL and CLIENT_REMOTE paths
- Add UDPSPEEDER_NO_URING env var for A/B throughput testing
- Add throughput CI comparison (io_uring vs recvfrom)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace N individual sendto() calls per FEC batch with a single sendmmsg() syscall. Adds my_send_batch() and delay_send_batch() that cook all packets then send via one kernel transition. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
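As a rough illustration of the batching pattern (not the PR's my_send_batch(), whose signature is not shown here), the sketch below sends a vector of already-prepared packets on a connected datagram socket with one sendmmsg() call. It is Linux-specific, and `send_batch` is a hypothetical name.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // sendmmsg() and struct mmsghdr are GNU/Linux extensions
#endif
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#include <cstddef>
#include <string>
#include <vector>

// Send every packet on a connected datagram socket with one syscall.
// Returns the number of messages the kernel accepted (which may be fewer
// than requested), or -1 on error with errno set.
inline int send_batch(int fd, const std::vector<std::string>& packets) {
    if (packets.empty()) return 0;
    std::vector<iovec> iov(packets.size());
    std::vector<mmsghdr> msgs(packets.size());  // value-initialized to zero
    for (size_t i = 0; i < packets.size(); ++i) {
        iov[i].iov_base = const_cast<char*>(packets[i].data());
        iov[i].iov_len = packets[i].size();
        msgs[i].msg_hdr.msg_iov = &iov[i];      // one buffer per datagram
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    return sendmmsg(fd, msgs.data(), static_cast<unsigned int>(msgs.size()), 0);
}
```

Because the return value can be smaller than the batch size, a caller would normally compare it with `packets.size()` and retry or log the remainder.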
fec_group_t::group_mp was a std::map<int,int> (red-black tree) used for shard indices 0..254. Replace with a flat int[256] array for O(1) lookup/insert and zero heap allocation per shard. Also cache mp[seq] reference to avoid repeated unordered_map lookups. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reserve 4-byte headroom (URING_RECV_HEADROOM) before each provided buffer so callers can write the conv header in-place instead of copying the entire payload to a stack buffer.
- Buffer registration offsets by +4 bytes, reducing usable size by 4
- recvmsg path (CLIENT_LOCAL): 140+ bytes of natural headroom from the recvmsg_out header + sockaddr area, plus the 4-byte offset
- recv path (SERVER_REMOTE): 4-byte headroom from buffer offset
- Both paths now use recv_buf.data - sizeof(u32_t) directly
Saves ~1400-byte memcpy per packet on the io_uring encode path.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The old anti_replay_t used an unordered_map (90K buckets, ~2MB scattered) plus a 240KB ring buffer: 1-3 hash lookups per incoming FEC shard. Replace with a u32_t[32768] direct-mapped table (128KB contiguous):
- is_vaild: single array access + compare (~3 ns vs ~30-100 ns)
- set_invaild: single array write (~2 ns vs ~100-200 ns)
- No hash function, no pointer chasing, fits in L2 cache
- Effective window ~32K groups, comparable to old 30K ring buffer
- Old entries naturally evicted by new seqs mapping to same slot
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
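A direct-mapped replay table of this shape can be sketched in a few lines. This is an illustration under assumptions, not the PR's anti_replay_t: the class name is hypothetical, and the empty-slot initialization (storing the bitwise complement of the slot index, which can never map back to that slot) is one possible choice that the PR does not describe.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kReplaySlots = 32768;             // 128 KB of u32, per the commit message
constexpr uint32_t kReplayMask = kReplaySlots - 1;   // valid because the size is a power of two

class AntiReplaySketch {
public:
    AntiReplaySketch() : table_(kReplaySlots) {
        // ASSUMPTION: mark slots empty with ~i. Its low bits differ from i,
        // so it never equals a sequence number that maps to slot i.
        for (uint32_t i = 0; i < kReplaySlots; ++i) table_[i] = ~i;
    }

    // A sequence number is valid (not yet seen) if its slot holds another value.
    bool is_valid(uint32_t seq) const { return table_[seq & kReplayMask] != seq; }

    // Record the sequence number; whatever shared the slot is evicted.
    void set_invalid(uint32_t seq) { table_[seq & kReplayMask] = seq; }

private:
    std::vector<uint32_t> table_;
};
```

The eviction behaviour is the trade-off: a sequence number more than 32768 behind the newest one in its slot is forgotten and would be accepted again, which is what bounds the effective window.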
Replace unordered_map<u32_t, fec_group_t> in fec_decode_manager with a pre-allocated flat array indexed by seq & mask. Eliminates per-group malloc/free (~50K allocs/sec) and reduces group init from 1KB memset (shard_idx[256]) to 32-byte bitmap clear.
Key changes:
- fec_group_t: add seq field, bitmap-based shard tracking (has_shard/set_shard)
- group_table: heap-allocated array, size = next_pow2(fec_buff_num * 2)
- Direct-mapped lookup: group_table[seq & mask], safe because monotonic seqs guarantee no two concurrent groups collide when table > max groups
- Per-shard cost: array index + compare vs hash + pointer chase
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
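The flat table plus bitmap can be sketched as below. Type and member names other than has_shard/set_shard, seq, and next_pow2 are hypothetical, and the reuse policy (reset a slot whenever a different seq maps to it) is an assumption about behaviour the commit message only implies.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct GroupSketch {
    uint32_t seq = 0;
    bool in_use = false;
    uint8_t shard_bitmap[32] = {};  // 256 bits, one per shard index

    bool has_shard(uint8_t idx) const {
        return ((shard_bitmap[idx >> 3] >> (idx & 7)) & 1u) != 0;
    }
    void set_shard(uint8_t idx) {
        shard_bitmap[idx >> 3] |= static_cast<uint8_t>(1u << (idx & 7));
    }
    // Re-initialising a slot clears 32 bytes instead of a 1 KB index array.
    void reset(uint32_t new_seq) {
        seq = new_seq;
        in_use = true;
        std::memset(shard_bitmap, 0, sizeof shard_bitmap);
    }
};

inline uint32_t next_pow2(uint32_t v) {
    uint32_t p = 1;
    while (p < v) p <<= 1;
    return p;
}

class GroupTableSketch {
public:
    // Table size is the next power of two at or above 2 * max_groups.
    explicit GroupTableSketch(uint32_t max_groups)
        : groups_(next_pow2(max_groups * 2)),
          mask_(static_cast<uint32_t>(groups_.size()) - 1) {}

    // Direct-mapped lookup; a slot holding a different seq is recycled.
    GroupSketch& lookup(uint32_t seq) {
        GroupSketch& g = groups_[seq & mask_];
        if (!g.in_use || g.seq != seq) g.reset(seq);
        return g;
    }

    size_t size() const { return groups_.size(); }

private:
    std::vector<GroupSketch> groups_;  // declared first: mask_ depends on its size
    uint32_t mask_;
};
```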
Document optimizations 9-13 (sendmmsg batching, flat decode arrays, zero-copy recv, anti-replay table, flat group table with bitmap). Add end-to-end throughput results (+48-76% no-fec, +81-113% fec-20:10 vs baseline). Analyze diminishing returns and remaining architectural memcpy bottleneck. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- New xor_spe.S: SPE assembly (evldd/evxor/evstdd) for 8-byte XOR, 4x unrolled main loop (32 bytes/iter), with alignment handling and tile wrap-around. Uses %r prefix and addic for r0 correctness.
- packet_cook.cpp: HAVE_PPC_SPE dispatch tier (COOK_VEC_WIDTH=8), word-width generic fallback for all non-x86/ARM platforms, tile buffer padding to handle SPE cross-boundary loads.
- makefile: SPE=1 flag sets -DHAVE_PPC_SPE -Wa,-mspe for both FLAGS and BENCH_FLAGS. xor_spe.S added to all source lists.
- CI: PowerPC matrix entry with OpenWrt 25.12.0-rc5 mpc85xx/p1010 toolchain, qemu-ppc-static -cpu e500v2 tests, QEMU benchmarks, separate PPC benchmark dashboard. Also adds QEMU tests for aarch64.
Tested: native x86 (make test/bench/all), PPC cross-compile + QEMU (all 55 tests pass including cook round-trip at all sizes).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Section 14: SPE XOR assembly, QEMU benchmarks, gotchas
- Cross-architecture notes: x86_64, ARMv8, e500v2, MIPS, RISC-V
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bug fixes:
- Guard scalar addmul1 against sz<=0 (UB pointer before array)
- Reject invalid inner_index from network packets in fec_decode
- Fix is_vaild→is_valid typo across common.h, fec_manager.h/cpp
- Add sendmmsg return value checking with log_warn/log_debug
Test coverage:
- 80 xor_tile roundtrip tests (2 tile sizes × 10 data lengths × 4 offsets)
- All 8 cook checksum/obscure/xor enable/disable combos at 64B and 1500B
- RS round-trip with k/n=(1,2),(1,3),(20,30),(50,75) plus lose-last, lose-evens, lose-odds, lose-middle, lose-scattered loss patterns
- CRC32C hw/sw agreement at unaligned offsets (1, 3)
- Expose bench_xor_tile and bench_cook_vec_width for test access
Build:
- Add -MMD -MP for header dependency tracking
- Clean *.d files in make clean
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Expand CI build matrix from 3 to 5 architectures (add MIPS big-endian and RISC-V 64) and add a new interop job that verifies data integrity across 8 arch pairs x 3 configs (24 tests total) via QEMU-user. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…edundant zeroing
- Replace per-struct memset in sendmmsg loop with targeted field writes
- Merge type==1 output array init loop into populate loop
- Skip shard padding memset when already at max_len
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add commit wangyu-#14 (sendmmsg targeted init, loop merge, skip-zero padding). Document why scatter-gather RS encode and zero-copy RS decode are dead ends: scatter bookkeeping ~= memcpy cost, decode in-place modification prevents buffer reuse with net-zero copy trade. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Guard the __BYTE_ORDER define with #ifndef so it doesn't conflict with musl's built-in definition (affects RISC-V and PPC cross builds). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes undefined reference to _Unwind_Resume when statically linking with musl-based OpenWrt toolchains (RISC-V, PowerPC). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Captures server/client stderr at log-level 4 (info) to temp files and prints last 80 lines on failure. Helps diagnose cross-arch issues like the PPC+FEC failure without manual re-runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move head-alignment and tile rotation into C (packet_cook.cpp) so the SPE assembly always receives 8-byte-aligned data at tile offset 0. The old assembly head loop left the tile offset misaligned (1-7) after aligning the data pointer, causing every subsequent evldd to read from a misaligned tile address — silent corruption on e500v2. Also: add --log-level pass-through to interop.sh, enable trace logging for PPC-client CI pairs, add no-fec-key interop config to isolate key XOR failures, and add unaligned-buffer cook unit tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The SPE alignment bug is fixed (41ed115). Drop the --log-level 6 override for PPC-client pairs; default level 4 is sufficient. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
addmul1_avx512: 512-bit GF(2^8) multiply-accumulate using vpshufb for nibble table lookups and vpternlogd (0x96) for 3-way XOR in one instruction. 128 bytes/iteration (2x unrolled), with 64B/32B/16B/scalar tails.
xor_tile_avx512: 512-bit XOR cook pipeline with broadcast fast path for tile_len=16 and 4x128-bit insert for arbitrary tile lengths.
CPUID detection checks OSXSAVE, XCR0 bits 1,2,5,6,7 (SSE+AVX+opmask+ZMM), and CPUID.7.EBX[30] (AVX-512BW). Falls back to AVX2 → SSSE3/SSE2 on older hardware.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds SIMD banner at bench startup showing which codepath was selected:
SIMD: addmul1=avx512bw xor_cook=avx512bw vec_width=16
Exposes bench_addmul1_impl() and bench_xor_tile_impl() from the respective source files, reading the runtime dispatch state directly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
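The 0x96 immediate is simply the truth table of a three-input XOR. The sketch below derives it without any intrinsics, so it runs on any CPU; `ternlog_imm8` is a hypothetical helper, and the bit layout used (bit index formed from the three input bits) follows the documented vpternlog convention.

```cpp
#include <cstdint>

// Build the imm8 for a vpternlog-style 3-input boolean function:
// bit number (a<<2 | b<<1 | c) of the immediate holds f(a, b, c).
template <typename F>
inline uint8_t ternlog_imm8(F f) {
    uint8_t imm = 0;
    for (unsigned i = 0; i < 8; ++i) {
        const bool a = ((i >> 2) & 1u) != 0;
        const bool b = ((i >> 1) & 1u) != 0;
        const bool c = (i & 1u) != 0;
        if (f(a, b, c)) imm |= static_cast<uint8_t>(1u << i);
    }
    return imm;
}

// Three-way XOR: true when an odd number of inputs are set.
inline uint8_t xor3_imm8() {
    return ternlog_imm8([](bool a, bool b, bool c) { return (a ^ b ^ c) != 0; });
}
```

XOR is true for inputs 001, 010, 100, and 111, which sets bits 1, 2, 4, and 7 and gives 0b10010110 = 0x96.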
ARMv8 has 32 NEON registers so 2x unrolling is free in register pressure. Processes 32 bytes/iteration in the main loop with a 16-byte tail, matching the x86 unroll pattern. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hi, just curious if you also had a chance to test Win (mingw build)?
Author
Hi @andr2000, I was actually unaware of the mingw64 build. I'll add it to my build pipeline when I get a chance, thanks.
Author
Hi @andr2000, interested to hear how you get on with: https://github.com/slartibardfast/UDPspeeder/releases/tag/v20260310-rc1 I guess we're still WIP on Windows, and may even have hit a bug.
Well, I'm a Linux guy, but I may have a Win machine in my setup and haven't tested it yet...
Hi,
I'm still verifying on real hardware, but I thought this might be of
interest: a series of optimizations to the FEC hot paths, syscall
layer, and data structures. The net result is roughly double the
throughput. The only wire-level change is CRC32 -> CRC32C, so
new builds are not checksum-compatible with old ones.
Much of this work was done in collaboration with Claude Code
(claude.ai/code).
Changes
Compute:
AVX-512BW, runtime-dispatched via CPUID
Syscalls:
Data structures:
direct-mapped flat tables
Results
CI measurements (GitHub Actions), current vs pre-optimization
baseline in the same run:
no-fec: +48-76%
fec 20:10: +81-113%
FEC overhead share: 41% -> 30%
Range reflects host variability between CI runs.
Testing
powerpc (SPE), riscv64
receiver, with and without encryption)
Compatibility