feat(native-cpu): cross-arch CI matrix + MSVC/Clang portability (PR 4 of 5) #574
Merged
michalharakal merged 1 commit into develop on Apr 29, 2026
PR 4 of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc.
The local Gradle build still produces only the host-arch shared
library, but cross-arch coverage now ships via CI: every push and PR
that touches the native module builds and runs jvmTest on
linux-x86_64, linux-arm64, macos-arm64, and windows-x86_64 in
parallel. This catches portability regressions (linker, alignment,
compiler-specific syntax) at PR time instead of letting them silently
ship to consumers on non-x86_64 hosts.
CI workflow (.github/workflows/native-cpu-multiarch.yml):
- Matrix of 4 hosts: ubuntu-latest, ubuntu-24.04-arm, macos-14
(Apple Silicon), windows-latest.
- Each runner: setup-java JDK 25 zulu, cmake --version sanity, then
./gradlew :skainet-backends:skainet-backend-native-cpu:jvmTest +
jvmJar. Drives the full CMake configure → build → bundle → FFM
downcall test pipeline.
- Path-filtered triggers — only runs when the native module, the
jvmMain part of skainet-backend-api (where the MemSeg SPI lives),
or this workflow file changes. Keeps unrelated CI noise low.
- fail-fast: false so one arch failure doesn't cancel the others.
- Uploads each arch's libskainet_kernels.{so,dylib,dll} as a named
artifact (libskainet_kernels-<arch_label>) plus per-arch test
reports for triage. The artifacts are the input a future "fat-JAR"
aggregation step will combine into a single multi-arch publishable
JAR (deferred to a follow-up; see Out of scope).
C portability fixes (native/):
- skainet_kernels.h adds a SKAINET_RESTRICT macro: __restrict__ on
GCC/Clang, __restrict on MSVC, empty otherwise. q4k_matmul.c
switches from raw __restrict__ to the macro so the Windows MSVC
build no longer rejects the function signatures.
- CMakeLists.txt grows an MSVC branch alongside the existing
GCC/Clang one: /O2 /fp:fast /W3 (analogues of -O3 -ffast-math
-Wall). Visibility on Windows is handled by the existing
SKAINET_API __declspec(dllexport) macro; no additional flags
needed. The Copy task in build.gradle.kts already includes both
flat (skainet_kernels.dll) and Visual Studio multi-config
(Release/skainet_kernels.dll) layouts, so the bundling step
works regardless of generator.
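
As a rough illustration of the two portability macros described above, the header logic could look like the sketch below. This is a minimal reconstruction under assumptions, not the contents of skainet_kernels.h: the exact preprocessor conditions, the non-Windows branch of SKAINET_API, and the function example_axpy are hypothetical and exist only to show the macros in a signature.

```c
#include <stddef.h>

/* Assumed shape of the restrict macro: __restrict__ on GCC/Clang,
 * __restrict on MSVC, empty on any other compiler. */
#if defined(__GNUC__) || defined(__clang__)
#  define SKAINET_RESTRICT __restrict__
#elif defined(_MSC_VER)
#  define SKAINET_RESTRICT __restrict
#else
#  define SKAINET_RESTRICT
#endif

/* Assumed shape of the export macro: dllexport on Windows. The
 * visibility attribute on GCC/Clang is a guess at the non-Windows branch. */
#if defined(_WIN32)
#  define SKAINET_API __declspec(dllexport)
#elif defined(__GNUC__) || defined(__clang__)
#  define SKAINET_API __attribute__((visibility("default")))
#else
#  define SKAINET_API
#endif

/* Hypothetical kernel (not part of the PR): y[i] += a * x[i].
 * The restrict qualifiers promise the compiler that x and y do not alias. */
SKAINET_API void example_axpy(float a,
                              const float *SKAINET_RESTRICT x,
                              float *SKAINET_RESTRICT y,
                              size_t n) {
    for (size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

Routing the qualifier through a macro keeps the signatures identical across compilers, so the MSVC build only differs in what the macro expands to.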
What is NOT in this PR (intentional):
- Hand-written AVX2 / NEON intrinsics. The current scalar C kernel
under -O3 -ffast-math is already 4.17–5.87× faster than Panama
Vector at LLM-typical Q4_K shapes (PR 2 numbers); GCC and Clang
already auto-vectorize the 32-iteration inner loop into AVX2
vfmadd213ps / NEON fmla. Hand-tuned intrinsics would only match
what the compiler emits, so adding them is busywork until a
profile shows a specific kernel where the compiler is leaving
FLOPs on the table.
- Multi-threading. Native is single-threaded today; Panama uses
parallelChunks across all cores and still loses to scalar single-
threaded native at every measured shape (the parallelChunks
dispatch overhead dominates). A focused multi-threaded native
variant becomes worthwhile once a real workload's per-call time
exceeds a few ms; current Q4_K matmul at 4096² is ~6 ms single-
threaded.
- Maven Central native classifier publishing / fat-JAR aggregation.
The CI artifacts are downloadable per-arch but not yet bundled
into a single publishable JAR with all four .so/.dylib/.dll files.
That belongs in a separate plan with the publishing infrastructure
(vanniktech.mavenPublish hooks, signing, classifier strategy).
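
To make the auto-vectorization argument above concrete, here is a simplified sketch of the kind of fixed 32-iteration scalar loop a compiler can turn into SIMD fused multiply-adds. It is not the Q4_K kernel from q4k_matmul.c: there is no dequantization, and the names example_block_dot32 and example_dot_blocks are hypothetical. One relevant detail is that a float sum reduction like this is only vectorized when the compiler may reorder the additions, which -ffast-math permits.

```c
#include <stddef.h>

/* Plain restrict macro for this sketch; the real header may differ. */
#if defined(__GNUC__) || defined(__clang__)
#  define EXAMPLE_RESTRICT __restrict__
#elif defined(_MSC_VER)
#  define EXAMPLE_RESTRICT __restrict
#else
#  define EXAMPLE_RESTRICT
#endif

enum { EXAMPLE_BLOCK = 32 };  /* fixed trip count, known at compile time */

/* Dot product over one 32-element block, written as ordinary scalar C.
 * With a constant trip count and no aliasing, the loop is a candidate
 * for auto-vectorization under -O3 -ffast-math. */
float example_block_dot32(const float *EXAMPLE_RESTRICT w,
                          const float *EXAMPLE_RESTRICT x) {
    float acc = 0.0f;
    for (int i = 0; i < EXAMPLE_BLOCK; ++i) {
        acc += w[i] * x[i];
    }
    return acc;
}

/* Sums the block dot products over n_blocks consecutive blocks. */
float example_dot_blocks(const float *EXAMPLE_RESTRICT w,
                         const float *EXAMPLE_RESTRICT x,
                         size_t n_blocks) {
    float total = 0.0f;
    for (size_t b = 0; b < n_blocks; ++b) {
        total += example_block_dot32(w + b * EXAMPLE_BLOCK,
                                     x + b * EXAMPLE_BLOCK);
    }
    return total;
}
```

One way to check what the compiler actually did with a loop like this is to build with gcc -O3 -ffast-math -fopt-info-vec, or to inspect the assembly from -S. Whether a given kernel vectorizes depends on the compiler version and target flags.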
Verification (linux-x86_64 host, JDK 21.0.10, cmake 3.28.3, gcc 13.3):
- :skainet-backends:skainet-backend-native-cpu:jvmTest — 17/17 pass
after the SKAINET_RESTRICT swap (3 pipeline + 5 heap-parity +
7 memseg-parity + 1 microbench-gated + 1 incidental).
- The macOS, Linux ARM64, and Windows runners exercise everything
the local build doesn't. CI status on this PR is the actual
cross-arch verification — locally I can't prove macos-arm64 or
windows-x86_64 work without those hosts.
Out of scope (deferred per asciidoc staging):
- PR 5: native FP32 / Q6_K / Q8_0 kernels (one PR per format,
template lifted from PR 2).
- Maven Central native classifier publishing + fat-JAR aggregation.
- AVX2 / NEON intrinsic kernels, multi-threading, prefetch tuning.
- Build instructions for users who want to cross-compile all arches
locally (currently CI is the path to a multi-arch artifact).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
PR 4 of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc. Adds cross-arch CI coverage so consumers on Apple Silicon, ARM Linux, and Windows aren't silently broken when they pull a JAR built on x86_64 Linux. No new C kernels: the current scalar C is already 4.17–5.87× faster than Panama, and GCC auto-vectorization emits AVX2/NEON without intrinsics.

What changed
- .github/workflows/native-cpu-multiarch.yml: matrix of 4 hosts (ubuntu-latest, ubuntu-24.04-arm, macos-14 (Apple Silicon), windows-latest). Each runs :skainet-backends:skainet-backend-native-cpu:jvmTest + jvmJar, exercising the full CMake → build → bundle → FFM downcall pipeline. Path-filtered triggers (only fires when the native module, the API jvmMain, or this workflow changes) keep CI noise down. Per-arch native libs are uploaded as artifacts for later fat-JAR aggregation.
- fail-fast: false, so one bad arch doesn't cancel the others and triage is easier.
- SKAINET_RESTRICT macro in skainet_kernels.h: __restrict__ on GCC/Clang, __restrict on MSVC, empty otherwise. q4k_matmul.c now uses the macro so Windows MSVC builds don't reject the function signature.
- CMakeLists.txt: /O2 /fp:fast /W3 for MSVC, analogues of the existing -O3 -ffast-math -Wall. Visibility on Windows is still handled by the __declspec(dllexport) macro.

Test plan
- :skainet-backends:skainet-backend-native-cpu:jvmTest on linux-x86_64: 17/17 pass after the __restrict__ → SKAINET_RESTRICT swap
- libskainet_kernels.{so,dylib,dll} artifacts downloadable from this PR's check run
- build.yml / verify-poms.yml / java-tests.yml workflows unaffected

Out of scope
- AVX2 / NEON intrinsics: the compiler emits vfmadd213ps / fmla already. No profile-shown kernel where the compiler is leaving FLOPs on the table.
- Multi-threading: parallelChunks overhead dominates today; native single-threaded already wins. A focused multi-threaded variant becomes worthwhile once per-call time exceeds a few ms.

🤖 Generated with Claude Code