feat(native-cpu): cross-arch CI matrix + MSVC/Clang portability (PR 4 of 5) by michalharakal · Pull Request #574 · SKaiNET-developers/SKaiNET

michalharakal · 2026-04-29T21:17:25Z

Summary

PR 4 of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc. Adds cross-arch CI coverage so consumers on Apple Silicon, ARM Linux, and Windows aren't silently broken when they pull a JAR built on x86_64 Linux. No new C kernels — current scalar C is already 4.17–5.87× faster than Panama, and gcc auto-vec emits AVX2/NEON without intrinsics.

What changed

.github/workflows/native-cpu-multiarch.yml — matrix of 4 hosts: ubuntu-latest, ubuntu-24.04-arm, macos-14 (Apple Silicon), windows-latest. Each runs :skainet-backends:skainet-backend-native-cpu:jvmTest + jvmJar, exercising the full CMake → build → bundle → FFM downcall pipeline. Path-filtered triggers (only fires when native module / API jvmMain / this workflow change) keep CI noise down. Per-arch native libs uploaded as artifacts for later fat-JAR aggregation.
fail-fast: false so one bad arch doesn't cancel the others — easier triage.
SKAINET_RESTRICT macro in skainet_kernels.h — __restrict__ on GCC/Clang, __restrict on MSVC, empty otherwise. q4k_matmul.c now uses the macro so Windows MSVC builds don't reject the function signature.
MSVC compile-flag branch in CMakeLists.txt — /O2 /fp:fast /W3 analogues of the existing -O3 -ffast-math -Wall. Visibility on Windows still handled by the __declspec(dllexport) macro.
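
A minimal sketch of what the two portability macros described above could look like. The three SKAINET_RESTRICT branches follow the wording of this PR. The non-Windows branch of SKAINET_API and the function `example_axpy` are assumptions added for illustration and are not copied from skainet_kernels.h or q4k_matmul.c.

```c
#include <stddef.h>

/* Restrict qualifier: __restrict__ on GCC/Clang, __restrict on MSVC, empty otherwise. */
#if defined(__GNUC__) || defined(__clang__)
#  define SKAINET_RESTRICT __restrict__
#elif defined(_MSC_VER)
#  define SKAINET_RESTRICT __restrict
#else
#  define SKAINET_RESTRICT
#endif

/* Export macro: __declspec(dllexport) on Windows, as the PR states.
 * The visibility attribute used elsewhere is an assumption of this sketch. */
#if defined(_WIN32)
#  define SKAINET_API __declspec(dllexport)
#elif defined(__GNUC__) || defined(__clang__)
#  define SKAINET_API __attribute__((visibility("default")))
#else
#  define SKAINET_API
#endif

/* Hypothetical kernel (not part of the PR): y[i] += a * x[i].
 * The restrict qualifiers promise the compiler that x and y do not overlap,
 * which lets it vectorize without emitting runtime alias checks. */
SKAINET_API void example_axpy(float *SKAINET_RESTRICT y,
                              const float *SKAINET_RESTRICT x,
                              float a,
                              size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

Keeping the qualifier behind a macro means a kernel signature is written once and compiles unchanged under all three toolchains in the matrix.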

Test plan

:skainet-backends:skainet-backend-native-cpu:jvmTest on linux-x86_64 — 17/17 pass after the __restrict__ → SKAINET_RESTRICT swap
CI verifies macos-arm64, linux-arm64, windows-x86_64 on this PR (the actual cross-arch check is the workflow run itself — locally I can't prove those hosts without them)
Per-arch libskainet_kernels.{so,dylib,dll} artifacts downloadable from this PR's check run
Existing single-host build.yml / verify-poms.yml / java-tests.yml workflows unaffected

Out of scope

Hand-written AVX2 / NEON intrinsics — current scalar C is already 4.17–5.87× faster than Panama; GCC/Clang auto-vectorize the 32-iter inner loop into vfmadd213ps / fmla already. No profile-shown kernel where the compiler is leaving FLOPs on the table.
Multi-threading — Panama's parallelChunks overhead dominates today; native single-threaded already wins. A focused multi-threaded variant becomes worthwhile once per-call time exceeds a few ms.
Maven Central native classifier publishing / fat-JAR aggregation — CI artifacts are downloadable per-arch but not yet bundled into a single publishable JAR. Belongs with the publishing infra plan.
PR 5 — native FP32 / Q6_K / Q8_0 kernels (one per format, template lifted from PR 2).
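
To illustrate the auto-vectorization argument above, here is a sketch of a scalar loop with a fixed 32-iteration inner trip count. Only the trip count comes from this PR; the function name `example_block_dot`, its signature, and the data layout are hypothetical and do not reproduce the Q4_K kernel. Note that a floating-point reduction like this is normally vectorized only when the compiler may reassociate additions, which -ffast-math permits.

```c
#include <stddef.h>

/* Inner-loop trip count mentioned in the PR; everything else is illustrative. */
#define EXAMPLE_BLOCK 32

/* Hypothetical kernel: dot product of two float arrays processed in
 * blocks of 32 elements. Both inputs are read-only and the accumulator
 * is a local scalar, so no aliasing hazard prevents vectorization. */
float example_block_dot(const float *w, const float *x, size_t nblocks)
{
    float acc = 0.0f;
    for (size_t b = 0; b < nblocks; ++b) {
        const float *wb = w + b * EXAMPLE_BLOCK;
        const float *xb = x + b * EXAMPLE_BLOCK;
        float partial = 0.0f;
        /* Fixed trip count: a candidate for FMA vector instructions. */
        for (int i = 0; i < EXAMPLE_BLOCK; ++i) {
            partial += wb[i] * xb[i];
        }
        acc += partial;
    }
    return acc;
}
```

One way to check what a given compiler emits for such a loop is to compile with -S and search the assembly for FMA instructions; GCC's -fopt-info-vec also reports which loops were vectorized.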

🤖 Generated with Claude Code

… of 5)

PR 4 of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc. The local Gradle build still produces only the host-arch shared library, but cross-arch coverage now ships via CI: every push and PR that touches the native module builds and runs jvmTest on linux-x86_64, linux-arm64, macos-arm64, and windows-x86_64 in parallel. This catches portability regressions (linker, alignment, compiler-specific syntax) at PR time instead of letting them silently ship to consumers on non-x86_64 hosts.

CI workflow (.github/workflows/native-cpu-multiarch.yml):
- Matrix of 4 hosts: ubuntu-latest, ubuntu-24.04-arm, macos-14 (Apple Silicon), windows-latest.
- Each runner: setup-java JDK 25 zulu, cmake --version sanity, then ./gradlew :skainet-backends:skainet-backend-native-cpu:jvmTest + jvmJar. Drives the full CMake configure → build → bundle → FFM downcall test pipeline.
- Path-filtered triggers: only runs when the native module, the jvmMain part of skainet-backend-api (where the MemSeg SPI lives), or this workflow file changes. Keeps unrelated CI noise low.
- fail-fast: false so one arch failure doesn't cancel the others.
- Uploads each arch's libskainet_kernels.{so,dylib,dll} as a named artifact (libskainet_kernels-<arch_label>) plus per-arch test reports for triage. The artifacts are the input a future "fat-JAR" aggregation step will combine into a single multi-arch publishable JAR (deferred to a follow-up; see Out of scope).

C portability fixes (native/):
- skainet_kernels.h adds a SKAINET_RESTRICT macro: __restrict__ on GCC/Clang, __restrict on MSVC, empty otherwise. q4k_matmul.c switches from raw __restrict__ to the macro so the Windows MSVC build no longer rejects the function signatures.
- CMakeLists.txt grows an MSVC branch alongside the existing GCC/Clang one: /O2 /fp:fast /W3 (analogues of -O3 -ffast-math -Wall). Visibility on Windows is handled by the existing SKAINET_API __declspec(dllexport) macro; no additional flags needed.

The Copy task in build.gradle.kts already includes both flat (skainet_kernels.dll) and Visual Studio multi-config (Release/skainet_kernels.dll) layouts, so the bundling step works regardless of generator.

What is NOT in this PR (intentional):
- Hand-written AVX2 / NEON intrinsics. The current scalar C kernel under -O3 -ffast-math is already 4.17–5.87× faster than Panama Vector at LLM-typical Q4_K shapes (PR 2 numbers); GCC and Clang auto-vectorize the 32-iteration inner loop into AVX2 / NEON fmla / vfmadd213ps already. Hand-tuned intrinsics would just match what the compiler emits. Adding them is busywork until a profile shows a specific kernel where the compiler is leaving FLOPs on the table.
- Multi-threading. Native is single-threaded today; Panama uses parallelChunks across all cores and still loses to scalar single-threaded native at every measured shape (the parallelChunks dispatch overhead dominates). A focused multi-threaded native variant becomes worthwhile once a real workload's per-call time exceeds a few ms; current Q4_K matmul at 4096² is ~6 ms single-threaded.
- Maven Central native classifier publishing / fat-JAR aggregation. The CI artifacts are downloadable per-arch but not yet bundled into a single publishable JAR with all four .so/.dylib/.dll files. That belongs in a separate plan with the publishing infrastructure (vanniktech.mavenPublish hooks, signing, classifier strategy).

Verification (linux-x86_64 host, JDK 21.0.10, cmake 3.28.3, gcc 13.3):
- :skainet-backends:skainet-backend-native-cpu:jvmTest: 17/17 pass after the SKAINET_RESTRICT swap (3 pipeline + 5 heap-parity + 7 memseg-parity + 1 microbench-gated + 1 incidental).
- The macOS, Linux ARM64, and Windows runners exercise everything the local build doesn't. CI status on this PR is the actual cross-arch verification; locally I can't prove macos-arm64 or windows-x86_64 work without those hosts.

Out of scope (deferred per asciidoc staging):
- PR 5: native FP32 / Q6_K / Q8_0 kernels (one PR per format, template lifted from PR 2).
- Maven Central native classifier publishing + fat-JAR aggregation.
- AVX2 / NEON intrinsic kernels, multi-threading, prefetch tuning.
- Build instructions for users who want to cross-compile all arches locally (currently CI is the path to a multi-arch artifact).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

michalharakal merged commit 87a5730 into develop Apr 29, 2026
9 of 10 checks passed

This was referenced Apr 29, 2026

feat(native-cpu): native FFM FP32 SGEMM kernel (PR 5 of 5) #575

Merged

ci(native-cpu): drop linux-arm64 runner from multiarch matrix #577

Merged

Prepare 0.22.0 #580

Merged

michalharakal deleted the feature/native-cpu-multiarch branch May 2, 2026 17:34

michalharakal merged 1 commit into develop from feature/native-cpu-multiarch
