libdynemit leverages the ifunc resolver (supported by both GCC and Clang on Linux) to automatically select optimal SIMD implementations at program startup, delivering portable code without sacrificing performance. Thread-safe SIMD detection and dlopen-safe resolver utilities ensure robust operation in multi-threaded applications and dynamic library loading scenarios.
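The startup-time selection described above can be illustrated with a minimal sketch of ifunc dispatch. This is not libdynemit's actual code: the names `demo_max_u32`, `demo_max_u32_scalar`, `demo_max_u32_avx2`, and `resolve_demo_max_u32` are hypothetical, the function signature is an assumption, and only a scalar and an AVX2 variant are shown. It assumes GCC or Clang; where ifunc is unavailable, the sketch falls back to calling the resolver on every call.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature, assumed for illustration only. */
typedef uint32_t (*demo_max_fn)(const uint32_t *, size_t);

/* Portable baseline: plain loop, returns 0 for an empty array. */
static uint32_t demo_max_u32_scalar(const uint32_t *data, size_t n) {
    uint32_t best = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] > best) best = data[i];
    }
    return best;
}

#if defined(__x86_64__)
/* Same loop, but the compiler may emit AVX2 code for this function only. */
__attribute__((target("avx2")))
static uint32_t demo_max_u32_avx2(const uint32_t *data, size_t n) {
    uint32_t best = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] > best) best = data[i];
    }
    return best;
}
#endif

/* Resolver: picks one implementation based on what the CPU reports. */
static demo_max_fn resolve_demo_max_u32(void) {
#if defined(__x86_64__)
    /* Needed because an ifunc resolver can run before constructors. */
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return demo_max_u32_avx2;
#endif
    return demo_max_u32_scalar;
}

#if defined(__linux__) && defined(__ELF__) && \
    (defined(__x86_64__) || defined(__aarch64__))
/* The dynamic loader calls the resolver once and binds the symbol. */
uint32_t demo_max_u32(const uint32_t *data, size_t n)
    __attribute__((ifunc("resolve_demo_max_u32")));
#else
/* Fallback without ifunc: resolve on every call. */
uint32_t demo_max_u32(const uint32_t *data, size_t n) {
    return resolve_demo_max_u32()(data, n);
}
#endif
```

Callers simply invoke `demo_max_u32(data, n)`. With ifunc the choice is made once when the symbol is bound, so later calls carry no per-call dispatch branch.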
@@ -27,12 +28,25 @@ entropy_u32(data, n);

## Same build, best performance

-Benchmark charts are generated per-feature and per-CPU under `bench/`. After running benchmarks, you will find:
+*Benchmark comparing vector multiplication performance across different CPU architectures using the same build binary. The library automatically detected and utilized each CPU's highest supported SIMD instruction set (AVX-512F, AVX2, AVX or SSE4.2) at runtime. Lower execution time indicates better performance. Each data point represents the median of 10 trials, with error bars showing ±1 standard deviation.*
+
+## Forced SIMD instructions without dynamic dispatch
+<td align="center"><b>aarch64</b> — ARM Neoverse V2</td>
+</tr>
+<tr>
+<td><img src="bench/cpus/x86_64/amd_ryzen_9_9950x3d/features/max_u32/timing.png" alt="max_u32 SIMD timings on x86_64" width="100%"></td>
+<td><img src="bench/cpus/aarch64/arm_neoverse_v2/features/max_u32/timing.png" alt="max_u32 SIMD timings on aarch64" width="100%"></td>
+</tr>
+</table>
+
+*Performance scaling of `max_u32` across SIMD levels on two architectures: x86_64 (Scalar → SSE2 → SSE4.2 → AVX → AVX2 → AVX-512F) and aarch64 (Scalar → NEON → SVE → SVE2). Each implementation is compiled into the same binary, and the ifunc resolver selects the best one at startup. Lower execution time is better; each point is the median of 3 trials, with ±1 standard deviation error bars.*

-- **CPU comparison** charts at `bench/features/{variant}/timing.png` and `throughput.png`
-- **SIMD comparison** charts at `bench/cpus/{arch}/{cpu}/features/{variant}/timing.png` and `throughput.png`

-Run `sudo ./scripts/run_all_benchmarks.sh` to generate all data and charts. See [docs/BENCHMARKING.md](docs/BENCHMARKING.md) for details.

## Installation

@@ -222,7 +236,7 @@ sudo make install
Currently the library ships SIMD-accelerated features organized into four categories. Every function automatically dispatches to the best available instruction set at program startup.

-<details open>
+<details>
<summary><b>Vector Operations</b></summary>

Element-wise operations on `float` arrays.
@@ -256,7 +270,7 @@ Convenience header `<dynemit/stats.h>` includes all of the above.