feat(native-cpu): cross-arch CI matrix + MSVC/Clang portability (PR 4 of 5)#574

Merged
michalharakal merged 1 commit into develop from feature/native-cpu-multiarch on Apr 29, 2026

@michalharakal
Contributor

Summary

PR 4 of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc. Adds cross-arch CI coverage so consumers on Apple Silicon, ARM Linux, and Windows aren't silently broken when they pull a JAR built on x86_64 Linux. No new C kernels — current scalar C is already 4.17–5.87× faster than Panama, and gcc auto-vec emits AVX2/NEON without intrinsics.

What changed

  • .github/workflows/native-cpu-multiarch.yml — matrix of 4 hosts: ubuntu-latest, ubuntu-24.04-arm, macos-14 (Apple Silicon), windows-latest. Each runs :skainet-backends:skainet-backend-native-cpu:jvmTest + jvmJar, exercising the full CMake → build → bundle → FFM downcall pipeline. Path-filtered triggers (only fires when native module / API jvmMain / this workflow change) keep CI noise down. Per-arch native libs uploaded as artifacts for later fat-JAR aggregation.
  • fail-fast: false so one bad arch doesn't cancel the others — easier triage.
  • SKAINET_RESTRICT macro in skainet_kernels.h: __restrict__ on GCC/Clang, __restrict on MSVC, empty otherwise. q4k_matmul.c now uses the macro so Windows MSVC builds don't reject the function signature.
  • MSVC compile-flag branch in CMakeLists.txt: /O2 /fp:fast /W3, analogues of the existing -O3 -ffast-math -Wall. Visibility on Windows is still handled by the __declspec(dllexport) macro.
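
For readers unfamiliar with the pattern, a portability macro of this kind might look like the sketch below. It is not the contents of skainet_kernels.h: the preprocessor tests are an assumption about how the three cases could be told apart, and scale_add is a hypothetical stand-in for a kernel signature.

```c
#include <stddef.h>

/* Assumed shape of the macro; the real header may test different symbols. */
#if defined(_MSC_VER)
#  define SKAINET_RESTRICT __restrict      /* MSVC spelling */
#elif defined(__GNUC__) || defined(__clang__)
#  define SKAINET_RESTRICT __restrict__    /* GCC / Clang spelling */
#else
#  define SKAINET_RESTRICT                 /* unknown compiler: no aliasing hint */
#endif

/* Hypothetical kernel: out[i] += a * in[i]. The qualifier promises the
 * compiler that out and in do not overlap. */
void scale_add(float *SKAINET_RESTRICT out,
               const float *SKAINET_RESTRICT in,
               float a, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] += a * in[i];
    }
}
```

Falling back to an empty definition only drops an optimization hint, so an unrecognized compiler still builds correct code.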

Test plan

  • :skainet-backends:skainet-backend-native-cpu:jvmTest on linux-x86_64: 17/17 pass after the __restrict__ → SKAINET_RESTRICT swap
  • CI verifies macos-arm64, linux-arm64, windows-x86_64 on this PR (the actual cross-arch check is the workflow run itself — locally I can't prove those hosts without them)
  • Per-arch libskainet_kernels.{so,dylib,dll} artifacts downloadable from this PR's check run
  • Existing single-host build.yml / verify-poms.yml / java-tests.yml workflows unaffected

Out of scope

  • Hand-written AVX2 / NEON intrinsics: current scalar C is already 4.17–5.87× faster than Panama, and GCC/Clang already auto-vectorize the 32-iteration inner loop into vfmadd213ps (AVX2) / fmla (NEON). No profile has yet shown a kernel where the compiler is leaving FLOPs on the table.
  • Multi-threading — Panama's parallelChunks overhead dominates today; native single-threaded already wins. A focused multi-threaded variant becomes worthwhile once per-call time exceeds a few ms.
  • Maven Central native classifier publishing / fat-JAR aggregation — CI artifacts are downloadable per-arch but not yet bundled into a single publishable JAR. Belongs with the publishing infra plan.
  • PR 5 — native FP32 / Q6_K / Q8_0 kernels (one per format, template lifted from PR 2).

🤖 Generated with Claude Code

PR 4 of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc.
The local Gradle build still produces only the host-arch shared
library, but cross-arch coverage now ships via CI: every push and PR
that touches the native module builds and runs jvmTest on
linux-x86_64, linux-arm64, macos-arm64, and windows-x86_64 in
parallel. This catches portability regressions (linker, alignment,
compiler-specific syntax) at PR time instead of letting them silently
ship to consumers on non-x86_64 hosts.

CI workflow (.github/workflows/native-cpu-multiarch.yml):

- Matrix of 4 hosts: ubuntu-latest, ubuntu-24.04-arm, macos-14
  (Apple Silicon), windows-latest.
- Each runner: setup-java JDK 25 zulu, cmake --version sanity, then
  ./gradlew :skainet-backends:skainet-backend-native-cpu:jvmTest +
  jvmJar. Drives the full CMake configure → build → bundle → FFM
  downcall test pipeline.
- Path-filtered triggers — only runs when the native module, the
  jvmMain part of skainet-backend-api (where the MemSeg SPI lives),
  or this workflow file changes. Keeps unrelated CI noise low.
- fail-fast: false so one arch failure doesn't cancel the others.
- Uploads each arch's libskainet_kernels.{so,dylib,dll} as a named
  artifact (libskainet_kernels-<arch_label>) plus per-arch test
  reports for triage. The artifacts are the input a future "fat-JAR"
  aggregation step will combine into a single multi-arch publishable
  JAR (deferred to a follow-up; see Out of scope).
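
The four arch labels and the artifact naming scheme can be illustrated from the C side. In the PR the labels come from the workflow matrix, not from C code, so the sketch below is only an assumption about how the same labels map onto predefined compiler macros; example_os, example_arch and example_artifact_name are hypothetical helpers.

```c
#include <stdio.h>

/* Hypothetical helper: OS half of the label, from predefined macros. */
static const char *example_os(void)
{
#if defined(_WIN32)
    return "windows";
#elif defined(__APPLE__)
    return "macos";
#elif defined(__linux__)
    return "linux";
#else
    return "unknown";
#endif
}

/* Hypothetical helper: CPU half of the label. */
static const char *example_arch(void)
{
#if defined(__x86_64__) || defined(_M_X64)
    return "x86_64";
#elif defined(__aarch64__) || defined(_M_ARM64)
    return "arm64";
#else
    return "unknown";
#endif
}

/* Writes "libskainet_kernels-<os>-<arch>" into buf, mirroring the
 * libskainet_kernels-<arch_label> artifact naming described above.
 * Returns the snprintf result (length that was or would be written). */
int example_artifact_name(char *buf, size_t cap)
{
    return snprintf(buf, cap, "libskainet_kernels-%s-%s",
                    example_os(), example_arch());
}
```

On the linux-x86_64 runner this would yield libskainet_kernels-linux-x86_64, matching the first of the four labels listed in the commit message.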

C portability fixes (native/):

- skainet_kernels.h adds a SKAINET_RESTRICT macro: __restrict__ on
  GCC/Clang, __restrict on MSVC, empty otherwise. q4k_matmul.c
  switches from raw __restrict__ to the macro so the Windows MSVC
  build no longer rejects the function signatures.

- CMakeLists.txt grows an MSVC branch alongside the existing
  GCC/Clang one: /O2 /fp:fast /W3 (analogues of -O3 -ffast-math
  -Wall). Visibility on Windows is handled by the existing
  SKAINET_API __declspec(dllexport) macro; no additional flags
  needed. The Copy task in build.gradle.kts already includes both
  flat (skainet_kernels.dll) and Visual Studio multi-config
  (Release/skainet_kernels.dll) layouts, so the bundling step
  works regardless of generator.
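
An export macro of the kind referred to above is typically written as follows. Only the Windows __declspec(dllexport) branch is stated in this PR; the GCC/Clang visibility branch and the function skainet_example_abi_version are assumptions added to make the sketch self-contained.

```c
/* Sketch of an export macro; only the Windows branch is confirmed by the PR text. */
#if defined(_WIN32)
#  define SKAINET_API __declspec(dllexport)
#elif defined(__GNUC__) || defined(__clang__)
#  define SKAINET_API __attribute__((visibility("default")))  /* assumed */
#else
#  define SKAINET_API
#endif

/* Hypothetical exported symbol that an FFM downcall could look up by name. */
SKAINET_API int skainet_example_abi_version(void)
{
    return 1;
}
```

Because the export is declared in the source, the MSVC branch in CMakeLists.txt needs no extra linker or .def-file configuration, which is consistent with the "no additional flags needed" note.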

What is NOT in this PR (intentional):

- Hand-written AVX2 / NEON intrinsics. The current scalar C kernel
  under -O3 -ffast-math is already 4.17–5.87× faster than Panama
  Vector at LLM-typical Q4_K shapes (PR 2 numbers); GCC and Clang
  auto-vectorize the 32-iteration inner loop into AVX2 vfmadd213ps /
  NEON fmla already. Hand-tuned intrinsics would just match
  what the compiler emits. Adding them is busywork until a profile
  shows a specific kernel where the compiler is leaving FLOPs on
  the table.

- Multi-threading. Native is single-threaded today; Panama uses
  parallelChunks across all cores and still loses to scalar single-
  threaded native at every measured shape (the parallelChunks
  dispatch overhead dominates). A focused multi-threaded native
  variant becomes worthwhile once a real workload's per-call time
  exceeds a few ms; current Q4_K matmul at 4096² is ~6 ms single-
  threaded.

- Maven Central native classifier publishing / fat-JAR aggregation.
  The CI artifacts are downloadable per-arch but not yet bundled
  into a single publishable JAR with all four .so/.dylib/.dll files.
  That belongs in a separate plan with the publishing infrastructure
  (vanniktech.mavenPublish hooks, signing, classifier strategy).
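
To make the auto-vectorization argument concrete, here is a minimal sketch of a fixed 32-iteration scalar loop of the kind compilers can vectorize. It is a plain float dot product, not the Q4_K kernel from q4k_matmul.c; EXAMPLE_RESTRICT and dot_block32 are hypothetical names. Because the loop is a floating-point reduction, GCC and Clang generally vectorize it only when they are allowed to reassociate the additions, which -ffast-math permits. Which instructions are actually emitted depends on the target flags in use.

```c
/* Local stand-in for a restrict macro so the sketch compiles on its own. */
#if defined(_MSC_VER)
#  define EXAMPLE_RESTRICT __restrict
#elif defined(__GNUC__) || defined(__clang__)
#  define EXAMPLE_RESTRICT __restrict__
#else
#  define EXAMPLE_RESTRICT
#endif

enum { EXAMPLE_BLOCK = 32 };  /* fixed trip count, known at compile time */

/* Scalar dot product over one 32-element block. No intrinsics: the
 * constant trip count and the no-alias hint leave vectorization to
 * the compiler. */
float dot_block32(const float *EXAMPLE_RESTRICT w,
                  const float *EXAMPLE_RESTRICT x)
{
    float acc = 0.0f;
    for (int i = 0; i < EXAMPLE_BLOCK; ++i) {
        acc += w[i] * x[i];
    }
    return acc;
}
```

Inspecting the generated assembly (for example with -S on GCC/Clang) is the way to confirm what a given compiler and flag set actually emit for such a loop.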

Verification (linux-x86_64 host, JDK 21.0.10, cmake 3.28.3, gcc 13.3):

- :skainet-backends:skainet-backend-native-cpu:jvmTest — 17/17 pass
  after the SKAINET_RESTRICT swap (3 pipeline + 5 heap-parity +
  7 memseg-parity + 1 microbench-gated + 1 incidental).

- The macOS, Linux ARM64, and Windows runners exercise everything
  the local build doesn't. CI status on this PR is the actual
  cross-arch verification — locally I can't prove macos-arm64 or
  windows-x86_64 work without those hosts.

Out of scope (deferred per asciidoc staging):

- PR 5: native FP32 / Q6_K / Q8_0 kernels (one PR per format,
  template lifted from PR 2).
- Maven Central native classifier publishing + fat-JAR aggregation.
- AVX2 / NEON intrinsic kernels, multi-threading, prefetch tuning.
- Build instructions for users who want to cross-compile all arches
  locally (currently CI is the path to a multi-arch artifact).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
michalharakal merged commit 87a5730 into develop on Apr 29, 2026
9 of 10 checks passed
michalharakal deleted the feature/native-cpu-multiarch branch on May 2, 2026 17:34