docs: SIMD kernels, quantized SIMD, native FFM plan; arc42 architecture #567
Merged
michalharakal merged 2 commits into develop on Apr 29, 2026
Three new explanation pages under docs/.../explanation/perf/ covering the M5 work that landed in 0.21.0:

- simd-kernels.adoc: kernel SPI overview, FloatVector + FMA pattern, tile blocking, ServiceLoader auto-discovery + factory wrappers, KernelMatmulBench numbers (8.6×–10.8× over scalar at 256/512/1024 on Apple Silicon NEON).
- quantized-simd-kernels.adoc: per-format pipelines (Q4_0, Q4_K, Q4_K MemSeg, Q6_K, Q8_0) built as ByteVector → AND/LSHR nibble extract → castShape(B2F) → fused FMA, plus the lazy-dmin trick and the per-format coverage matrix.
- native-ffm-plan.adoc: recovers the FFM PRD content from git history (61962de:NATIVE_FFM_KERNEL_PROVIDER.md, dropped from 0.21.0 release per #566). Module layout, FFM binding pattern, staged delivery, success metrics, risks, trigger conditions.

architecture.adoc grows from a 4-line stub to an arc42-style reference with focus on the 0.21.0 changes:

- Building Block View: module table + kernel SPI ASCII diagram (commonMain api + jvmMain auto-discovery + jvmMain providers).
- Runtime View: eager-execution flow from `ctx.ops.matmul` through `chooseQuantizedMatmul` / `chooseMatmul` to the SPI kernel, with the lazy provider resolution and fall-through pattern called out.
- Architecture decisions table, quality requirements, risks (Vector API still incubator, no native provider yet, prior reverts).

nav.adoc: register the three new explanation pages under .Explanation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
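For readers unfamiliar with the nibble-extract step named in the commit message above, here is a minimal scalar sketch of the same idea for a single Q4_0 block. It is not the project's kernel: the class and method names (`Q40Sketch`, `dotBlock`) are hypothetical, the block scale is passed as a plain `float` (ggml stores it as fp16), and plain `int` arithmetic stands in for the `ByteVector` lanes. It assumes the ggml Q4_0 layout, where byte `j` holds weight `j` in its low nibble and weight `j + 16` in its high nibble, both offset by 8.

```java
// Scalar stand-in for one Q4_0 block dot product (illustrative only).
final class Q40Sketch {
    static final int BLOCK = 32; // weights per Q4_0 block in the ggml layout

    /**
     * Dot product of one Q4_0 block with 32 activations starting at x[xOff].
     * qs holds 16 packed bytes; d is the block scale (assumed already widened to float).
     */
    static float dotBlock(byte[] qs, float d, float[] x, int xOff) {
        float acc = 0f;
        for (int j = 0; j < BLOCK / 2; j++) {
            int b = qs[j] & 0xFF;        // widen the byte without sign extension
            int lo = (b & 0x0F) - 8;     // AND: low nibble, re-centred to [-8, 7]
            int hi = (b >>> 4) - 8;      // LSHR: high nibble, re-centred to [-8, 7]
            // int -> float conversion followed by a fused multiply-add
            acc = Math.fma((float) lo, x[xOff + j], acc);
            acc = Math.fma((float) hi, x[xOff + j + BLOCK / 2], acc);
        }
        return d * acc;                  // apply the block scale once, after accumulation
    }

    public static void main(String[] args) {
        byte[] qs = new byte[16];
        java.util.Arrays.fill(qs, (byte) 0x9A); // low nibble 10 -> 2, high nibble 9 -> 1
        float[] x = new float[32];
        java.util.Arrays.fill(x, 1f);
        System.out.println(dotBlock(qs, 0.5f, x, 0)); // prints 24.0
    }
}
```

A vectorized version would presumably do the AND, shift, conversion, and FMA across whole lanes at once; the per-element arithmetic is the same.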
📖 Documentation Preview: the documentation has been built successfully for this PR.
Local Antora build (the same docker pipeline GitHub Actions runs) emitted:

    warn: native-ffm-plan.adoc:237: list item index: expected 1, got 22

The line started with "22." after a wrap, which the asciidoctor parser interpreted as a sibling numbered list item with an out-of-sequence index. Re-wrap so "22." stays mid-sentence. Site rebuild now warning-free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Antora docs reflecting the 0.21.0 M5 work: three new explanation pages under `docs/.../explanation/perf/`, plus an arc42-style expansion of the previously stubbed `reference/architecture.adoc`.

New pages
- `explanation/perf/simd-kernels.adoc`: How the kernel SPI is structured, why it exists, the four core JDK Vector API patterns (`SPECIES_PREFERRED`, `B^T` packing, FMA + `reduceLanes`, 8×8×128 tile blocking), and the ServiceLoader / factory-wrapper auto-discovery story. Includes the `KernelMatmulBench` numbers from "bench(kernel): KernelMatmulBench — scalar vs Panama (M5 evidence)" (#558).
- `explanation/perf/quantized-simd-kernels.adoc`: Per-format pipelines (Q4_0, Q4_K, Q4_K MemSeg, Q6_K, Q8_0). Walks through the `ByteVector` → AND/LSHR nibble extract → `castShape(B2F)` → fused FMA recipe, the lazy-`dmin` trick, the canonical Q4_K block layout (8 sub-blocks, ggml `get_scale_min_k4`), the Q6_K `ql + qh` 6-bit assembly, and the per-format coverage matrix.
- `explanation/perf/native-ffm-plan.adoc`: Recovers the FFM PRD content from git history (`61962def:NATIVE_FFM_KERNEL_PROVIDER.md`, dropped from the 0.21.0 release per "chore(release): prepare 0.21.0" (#566) to "ship the release first, keep the plan in docs"). Goals, non-goals, module layout, FFM binding pattern, staged delivery (5 PRs), success metrics, 6 risks/open questions, trigger conditions for un-deferring.

Architecture
`reference/architecture.adoc` goes from a 4-line stub to a full arc42-ordered reference, focused on the 0.21.0 changes:

- Runtime View: eager-execution flow from `ctx.ops.matmul` through `chooseQuantizedMatmul` / `chooseMatmul` to the resolved SPI kernel, with the lazy provider resolution and `null`-fall-through patterns called out.
- Architecture decisions: `null` accessor pattern, ServiceLoader-deferral rationale, FFM-not-JNI, Antora-not-Wiki.

Nav
`nav.adoc` registers the three new explanation pages under the `.Explanation` section, alongside the existing `jvm-cpu.adoc` and `java-25-cpu-backend.adoc`.

Files changed
- `.adoc` files
- `reference/architecture.adoc`: +283 / −6
- `nav.adoc`: +3

Test plan
- Build the site locally (`./gradlew :docs:antora` or whatever the local target is) and visually check the three new pages plus the architecture page render correctly with the new ASCII art and tables.
- Confirm the cross-references (`xref:explanation/perf/simd-kernels.adoc[]` etc.) resolve.
- … `nav.adoc`.

🤖 Generated with Claude Code
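As a reviewer aid, the lazy provider resolution and `null`-fall-through pattern mentioned under Architecture can be sketched roughly as follows. This is an assumption-laden illustration, not the project's code: only the name `chooseMatmul` is borrowed from the PR, and its signature here is invented. `MatmulKernel`, `KernelProvider`, `matmulOrNull`, and `LazyProviders` are hypothetical, and a plain `Supplier` stands in for the ServiceLoader lookup so the example runs without any registered services.

```java
import java.util.List;
import java.util.function.Supplier;

final class DispatchSketch {
    /** Hypothetical kernel contract: row-major C[m x n] = A[m x k] * B[k x n]. */
    interface MatmulKernel {
        float[] matmul(float[] a, float[] b, int m, int k, int n);
    }

    /** Hypothetical provider: returns null when it has no kernel for this platform. */
    interface KernelProvider {
        MatmulKernel matmulOrNull();
    }

    /** Resolves the provider list on first use only (stand-in for a ServiceLoader lookup). */
    static final class LazyProviders {
        private final Supplier<List<KernelProvider>> loader;
        private List<KernelProvider> cached;
        int loads = 0; // exposed so the laziness is observable

        LazyProviders(Supplier<List<KernelProvider>> loader) {
            this.loader = loader;
        }

        synchronized List<KernelProvider> get() {
            if (cached == null) {
                cached = loader.get();
                loads++;
            }
            return cached;
        }
    }

    /** Always-available scalar fallback. */
    static float[] scalarMatmul(float[] a, float[] b, int m, int k, int n) {
        float[] c = new float[m * n];
        for (int i = 0; i < m; i++) {
            for (int p = 0; p < k; p++) {
                float aip = a[i * k + p];
                for (int j = 0; j < n; j++) {
                    c[i * n + j] = Math.fma(aip, b[p * n + j], c[i * n + j]);
                }
            }
        }
        return c;
    }

    /** First provider with a non-null kernel wins; otherwise fall through to scalar. */
    static MatmulKernel chooseMatmul(LazyProviders providers) {
        for (KernelProvider p : providers.get()) {
            MatmulKernel kernel = p.matmulOrNull();
            if (kernel != null) {
                return kernel;
            }
        }
        return DispatchSketch::scalarMatmul;
    }

    public static void main(String[] args) {
        KernelProvider declines = () -> null; // e.g. a SIMD provider on an unsupported JVM
        LazyProviders providers = new LazyProviders(() -> List.of(declines));
        float[] c = chooseMatmul(providers)
                .matmul(new float[]{1, 2, 3, 4}, new float[]{5, 6, 7, 8}, 2, 2, 2);
        System.out.println(java.util.Arrays.toString(c)); // prints [19.0, 22.0, 43.0, 50.0]
    }
}
```

The point of the `null` return is that a provider can decline without throwing, so the caller needs no platform checks of its own and the scalar path stays the unconditional last resort.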