feat(simd): add runtime CPU feature detection for x86_64 using multiversion #221

Open
utkarshgupta137 wants to merge 2 commits into
cloudwego:mainfrom
utkarshgupta137:feat/runtime-detection
@utkarshgupta137
Contributor

What type of PR is this?

feat: A new feature

Check the PR title.

  • This PR title matches the format: <type>(optional scope): <description>
  • The description of this PR title is user-oriented and clear enough for others to understand.
  • Attach the PR updating the user documentation if the current PR requires user awareness at the usage level.

(Optional) More detailed description for this PR(en: English/zh: Chinese).

Previously, all SIMD feature selection was done at compile time via #[cfg(target_feature = "...")]. This meant binaries compiled without -C target-cpu=native (e.g., for distribution) fell back to scalar code on x86_64, leaving significant performance on the table even on machines with AVX2/PCLMULQDQ support.

This PR adds runtime CPU feature detection for x86_64 using the multiversion crate. The changes are:

Runtime dispatch for get_nonspace_bits and prefix_xor (src/util/arch/):

  • When compiled with -C target-cpu=native or explicit feature flags: zero overhead — the runtime detection module is not compiled at all, and the optimized implementations are used directly (identical to before).
  • When compiled without feature flags on x86_64: multiversion detects AVX2/PCLMULQDQ support at runtime (once, cached in a static atomic) and dispatches to either the optimized x86_64 implementations or the scalar fallback.
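
The runtime path described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: it uses std::is_x86_feature_detected! in place of the multiversion crate, and the names has_pclmulqdq, prefix_xor_scalar and prefix_xor_clmul are hypothetical. It assumes prefix_xor maps a u64 bitmask to its running XOR (bit i of the result is the XOR of input bits 0..=i).

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical cache states for this sketch (not taken from the PR).
const UNKNOWN: u8 = 0;
const ABSENT: u8 = 1;
const PRESENT: u8 = 2;

static PCLMULQDQ: AtomicU8 = AtomicU8::new(UNKNOWN);

/// Queries the CPU on the first call and caches the answer in a static atomic.
fn has_pclmulqdq() -> bool {
    match PCLMULQDQ.load(Ordering::Relaxed) {
        PRESENT => true,
        ABSENT => false,
        _ => {
            #[cfg(target_arch = "x86_64")]
            let found = std::is_x86_feature_detected!("pclmulqdq");
            #[cfg(not(target_arch = "x86_64"))]
            let found = false;
            PCLMULQDQ.store(if found { PRESENT } else { ABSENT }, Ordering::Relaxed);
            found
        }
    }
}

/// Scalar fallback: a shift cascade that doubles the XOR window each step.
fn prefix_xor_scalar(mut x: u64) -> u64 {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    x
}

/// Carryless multiply by an all-ones operand; the low 64 bits are the prefix XOR.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "pclmulqdq")]
unsafe fn prefix_xor_clmul(x: u64) -> u64 {
    use std::arch::x86_64::*;
    let bits = _mm_set_epi64x(0, x as i64);
    let ones = _mm_set1_epi8(-1);
    _mm_cvtsi128_si64(_mm_clmulepi64_si128::<0>(bits, ones)) as u64
}

/// Dispatches to the accelerated version only when the CPU reports support.
fn prefix_xor(x: u64) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if has_pclmulqdq() {
            // SAFETY: PCLMULQDQ support was confirmed at runtime just above.
            return unsafe { prefix_xor_clmul(x) };
        }
    }
    prefix_xor_scalar(x)
}
```

Both paths return the same value for any input, so callers of prefix_xor do not need to know which one ran. For example, prefix_xor(0b1001) is 0b0111 on either path.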

Fix sonic-number SSE2 gating (sonic-number/src/arch/mod.rs):

  • The x86_64 SIMD number parsing (simd_str2int) only uses SSE2 intrinsics but was incorrectly gated behind target_feature = "avx2". Relaxed to target_feature = "sse2", which is baseline on all x86_64 CPUs. This means SIMD-accelerated number parsing is now always available on x86_64 regardless of compiler flags.
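
To illustrate the gating change, here is a small sketch of a function that needs only SSE2 and is therefore gated on target_feature = "sse2". The helper leading_digits is hypothetical and is not simd_str2int; it only shows that SSE2 intrinsics compile on a default x86_64 build with no extra compiler flags, while other targets get a scalar fallback.

```rust
/// Counts the leading ASCII digits in a 16-byte block using SSE2 only.
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn leading_digits(block: &[u8; 16]) -> u32 {
    use std::arch::x86_64::*;
    // SAFETY: SSE2 is statically enabled by the cfg above, and the unaligned
    // load reads exactly the 16 bytes of `block`.
    unsafe {
        let v = _mm_loadu_si128(block.as_ptr() as *const __m128i);
        // Signed byte compares: bytes >= 0x80 are negative and fail `ge0`.
        let ge0 = _mm_cmpgt_epi8(v, _mm_set1_epi8(b'0' as i8 - 1));
        let le9 = _mm_cmplt_epi8(v, _mm_set1_epi8(b'9' as i8 + 1));
        // Bit i of `mask` is set when byte i is an ASCII digit.
        let mask = _mm_movemask_epi8(_mm_and_si128(ge0, le9)) as u32;
        // The first zero bit marks the first non-digit; the upper 16 bits of
        // `!mask` are always set, so the result never exceeds 16.
        (!mask).trailing_zeros()
    }
}

/// Scalar fallback for targets without SSE2.
#[cfg(not(all(target_arch = "x86_64", target_feature = "sse2")))]
fn leading_digits(block: &[u8; 16]) -> u32 {
    block.iter().take_while(|b| b.is_ascii_digit()).count() as u32
}
```

Had the first function been gated on target_feature = "avx2" instead, a default x86_64 build would silently select the scalar fallback, which is the situation this fix removes.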

Performance impact (binaries compiled without -C target-cpu=native, running on AVX2 CPU):

| Function | Before | After |
| --- | --- | --- |
| get_nonspace_bits | Scalar byte-by-byte loop | AVX2 shuffle (runtime detected) |
| prefix_xor | Scalar shift cascade | PCLMULQDQ carryless multiply (runtime detected) |
| simd_str2int | Scalar loop | SSE2 SIMD (always on x86_64) |
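
For reference, the "scalar byte-by-byte loop" for get_nonspace_bits could look roughly like the sketch below. It assumes the function takes a 64-byte block and returns a u64 in which bit i is set when byte i is not JSON whitespace; the name nonspace_bits_scalar and that exact signature are assumptions, not copied from the crate.

```rust
/// Returns a bitmask with bit i set when `block[i]` is not JSON whitespace
/// (space, tab, line feed, carriage return).
fn nonspace_bits_scalar(block: &[u8; 64]) -> u64 {
    let mut bits = 0u64;
    for (i, &b) in block.iter().enumerate() {
        if !matches!(b, b' ' | b'\t' | b'\n' | b'\r') {
            bits |= 1u64 << i;
        }
    }
    bits
}
```

This loop inspects one byte per iteration, whereas a vectorized version classifies 32 bytes per instruction, which is where the runtime-detected AVX2 path gains its speed.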

The sonic-simd abstract types (u8x32, u8x64, StringBlock, etc.) continue to use compile-time selection. When AVX2 is not a compile-time feature, u8x32 is emulated as 2×SSE2 u8x16, which is still much faster than scalar. I intend to raise a follow-up PR to bring runtime detection to these types as well.

(Optional) Which issue(s) this PR fixes:

Addresses the "runtime CPU detection" item from ROADMAP.md.

(Optional) The PR that updates user documentation:

N/A — no user-facing API changes. The runtime detection is transparent and automatic.
