feat(dsl): lazy zero-init for parameter placeholders #588
Merged
michalharakal merged 1 commit into develop on May 2, 2026
Conversation
The DSL's createLinear / Conv1d / Conv2d / DenseImpl construction paths
called `tensorDataFactory.zeros<T, V>(shape, kClass)` to satisfy each
module's constructor whenever the user had not provided initial weights
or bias. The allocation was eager — a full `FloatArray(shape.volume)`
materialized at module-construction time. For real-world transformers
loaded via downstream weight loaders, the call sequence is always:
1. Build the empty network (Llama / Gemma / Apertus / Qwen / ...
`*NetworkLoader → *Network(metadata)`), eagerly allocating zeros
for every Linear's weights and bias.
2. Load weights from disk (~5 GB raw bytes for an 8B Q4_K_S model).
3. Substitute via `WeightMapper.applyWeights`, which sets
`parameter.value = loadedTensor`. The eager zeros are now garbage.
For Apertus-8B (32 layers, 4096 hidden, ~14k FFN, 131k vocab) the eager
zeros amount to ~27 GB of FP32 — peak heap ~32 GB just to construct +
populate the model. Anything under that OOMs at NetworkBuilder.kt:652
during step 1, before weights are even read.
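For concreteness, a minimal sketch of the eager pattern (the `Shape` type here is simplified; only `shape.volume` and the `FloatArray(shape.volume)` allocation are taken from the description above):

```kotlin
// Simplified sketch of the eager path being replaced. The real Shape / factory types
// live in skainet-lang; only shape.volume and the FloatArray allocation are literal.
class Shape(vararg val dims: Int) {
    val volume: Int get() = dims.fold(1) { acc, d -> acc * d }
}

fun eagerZeros(shape: Shape): FloatArray =
    FloatArray(shape.volume) // materialized immediately: 4 bytes per element

// A single ~14k x 4096 FFN projection is already ~230 MB of zeros; three such
// projections per layer across 32 layers is where most of the ~27 GB comes from.
```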
Fix: introduce `TensorDataFactory.placeholder(shape, dtype)`, returning
a `TensorData` whose underlying primitive array materializes lazily on
first read. The default interface implementation falls back to `zeros`
(any custom factory keeps existing behavior); `DenseTensorDataFactory`
overrides with `LazyZeroFloatArrayTensorData` / `LazyZeroIntArrayTensorData`
which back `FloatArrayTensorData<T>` / `IntArrayTensorData<T>` with a
`by lazy { ... }` delegate. Int8 falls back to `zeros` (eager byte
allocation is rarely the dominant cost on real models).
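A condensed sketch of that shape (the real `TensorData` / `FloatArrayTensorData` hierarchy carries more members and generics than shown; only the `by lazy` backing and the `zeros` fallback are meant literally):

```kotlin
// Condensed sketch, not the real skainet-lang interfaces. Shape is as in the sketch above.
enum class DType { FP32, FP16, INT32, INT8 }

interface TensorData { val shape: Shape }

interface TensorDataFactory {
    fun zeros(shape: Shape, dtype: DType): TensorData

    // Default implementation: identical to zeros(), so custom factories keep
    // their existing eager behavior without any code change.
    fun placeholder(shape: Shape, dtype: DType): TensorData = zeros(shape, dtype)
}

class LazyZeroFloatArrayTensorData(override val shape: Shape) : TensorData {
    // Nothing is allocated at construction; the first read materializes zeros
    // and the buffer is cached for all later reads and writes.
    val buffer: FloatArray by lazy { FloatArray(shape.volume) }
}
```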
Switch every eager-init call site in `NetworkBuilder.kt`
(`createLinear`, `DenseImpl.create`, `Conv1dImpl.create`, `Conv2dImpl.create`)
plus the matching `ExecutionContext.zeros(...)` paths to call
`placeholder(...)` instead. Behavior is strictly unchanged for any
caller that *reads* the tensor — the lazy materializes to zeros on
first access and is cached. For the WeightMapper substitution path,
the placeholder's lazy never fires because `parameter.value =` swaps
the entire `Tensor`, GC'ing the placeholder unread.
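The call-site change itself is mechanical; a hedged before/after sketch using the simplified factory above (the real `createLinear` carries `<T, V>` generics and a `kClass` argument that are elided here):

```kotlin
// Hedged before/after for one call site, against the simplified factory sketch above.
// The real createLinear passes <T, V> generics and a kClass that are omitted here.
fun linearWeightData(
    factory: TensorDataFactory,
    shape: Shape,
    userInit: TensorData?, // non-null when the DSL user supplied initial weights
): TensorData =
    userInit ?: factory.placeholder(shape, DType.FP32)
    // was: userInit ?: factory.zeros(shape, DType.FP32)
```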
Verified end-to-end against unsloth/Apertus-8B-Instruct-2509-GGUF
(Q4_K_S, 4.7 GB on disk) via the downstream
`SKaiNET-transformers/llm-inference/apertus/.../ApertusRealGgufLoadingTest.kt`:
`ApertusNetworkLoader.fromGguf().load<FP32, Float>(ctx)` now succeeds with a
12 GB heap (it previously OOMed at 12 GB) and constructs all 35 top-level
modules in 13 s.
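Not from this repo, but for reproducing the 12 GB ceiling: a hypothetical Gradle Kotlin DSL fragment that pins the heap for the downstream test:

```kotlin
// Hypothetical build.gradle.kts fragment (not part of this PR): pin the same
// 12 GB heap the verification run used and select only the downstream test.
tasks.withType<Test>().configureEach {
    maxHeapSize = "12g"
    filter { includeTestsMatching("*ApertusRealGgufLoadingTest*") }
}
```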
Tests:
- New `PlaceholderTensorDataTest` (8 cases) pins the contract:
shape-only access, materialize-to-zeros on first read, write-through,
buffer caching, instance independence, FP32 / FP16 / Int32 paths,
Int8 fallback.
- Full `:skainet-lang:skainet-lang-core:jvmTest` (614 tests) green.
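A hedged illustration of the kind of contract those cases pin, written against the simplified `LazyZeroFloatArrayTensorData` sketch above rather than the real test class:

```kotlin
import kotlin.test.Test
import kotlin.test.assertEquals
import kotlin.test.assertSame

// Illustrative only: exercises the simplified sketch above, not PlaceholderTensorDataTest.
class LazyZeroSketchTest {

    @Test
    fun materializesToZerosOnFirstReadAndCachesTheBuffer() {
        val data = LazyZeroFloatArrayTensorData(Shape(2, 3))
        val first = data.buffer                       // first read triggers allocation
        assertEquals(List(6) { 0f }, first.toList())  // parity with an eager zeros()
        assertSame(first, data.buffer)                // cached: no re-allocation
    }

    @Test
    fun instancesAreIndependentAndWriteThrough() {
        val a = LazyZeroFloatArrayTensorData(Shape(4))
        val b = LazyZeroFloatArrayTensorData(Shape(4))
        a.buffer[0] = 1f                              // write-through on a's buffer
        assertEquals(1f, a.buffer[0])
        assertEquals(0f, b.buffer[0])                 // b materializes its own zeros
    }
}
```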
Closes #587.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📖 Documentation Preview: The documentation has been built successfully for this PR. This comment will be updated automatically when the PR is updated.
michalharakal added a commit that referenced this pull request on May 2, 2026
- Quickstart import now pins skainet-bom:0.23.0.
- "What's New" rewritten for 0.23.0: placeholder API + DSL OOM fix (PR #588) and the K/N pread random-access fix (PR #591). Older 0.22.0 / 0.22.2 highlights moved out of the README; CHANGELOG.md remains the canonical full history (link already in place).
- BOM caveat about 0.22.2 being the first correctly-coordinated publish is retained — still actionable for anyone trying to import older BOMs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes #587.
The DSL eagerly allocates zero `FloatArray`s for every Linear / Conv1d / Conv2d / DenseImpl parameter at module construction time. Any downstream loader (`LlamaNetworkLoader`, `GemmaNetworkLoader`, `ApertusNetworkLoader`, …) builds the network first and only then substitutes weights via `WeightMapper.applyWeights`, so the eager zeros are always immediately discarded — but they determine the JVM's peak heap footprint. For Apertus-8B (32 layers × 6 projections × ~14k × 4k FP32 + 131k × 4k embed + …) that's ~27 GB of zeros allocated and thrown away — anything under 32 GB heap OOMs at `NetworkBuilder.kt:652` before a single weight is loaded.

Fix

Add `TensorDataFactory.placeholder(shape, dtype)`, returning a `TensorData` whose underlying primitive array materializes lazily on first read. `DenseTensorDataFactory` overrides with `LazyZeroFloatArrayTensorData` / `LazyZeroIntArrayTensorData`, which implement `FloatArrayTensorData<T>` / `IntArrayTensorData<T>` backed by a `by lazy { ... }` delegate. The default interface implementation falls back to `zeros`, preserving behavior for any custom factory.

Switch every eager-init call site in `NetworkBuilder.kt` (`createLinear`, `DenseImpl.create`, `Conv1dImpl.create`, `Conv2dImpl.create`) and the matching `ExecutionContext.zeros(...)` paths to call `placeholder(...)` instead. Behavior is strictly unchanged for any caller that reads the tensor — the lazy materializes to zeros on first access and is cached. For the WeightMapper substitution path, the placeholder's lazy never fires because `parameter.value =` swaps the entire `Tensor`, GC'ing the placeholder unread.

Verification

- `:skainet-lang:skainet-lang-core:jvmTest` — all 614 tests across 80 suites green.
- `PlaceholderTensorDataTest` (8 cases) pins the contract: shape-only access without materialization, materialize-to-zeros on first read with parity to `zeros()`, write-through, buffer caching, instance independence, FP32/FP16/Int32/Int8 paths.
- `unsloth/Apertus-8B-Instruct-2509-GGUF` (Q4_K_S, 4.7 GB on disk) via the downstream `ApertusRealGgufLoadingTest.kt`: `ApertusNetworkLoader.fromGguf().load<FP32, Float>(ctx)` now succeeds with a 12 GB heap (previously OOMed at 12 GB) and constructs all 35 top-level modules in 13 s.

Knock-on impact
Combined with the `SKaiNET-transformers` cleanup commit `8a7e0ff` (which removed `ApertusQuantizedRuntime`), there was previously no working memory-efficient path to run Apertus-8B Q4_K_S end-to-end on a normal-sized JVM. This PR unblocks `OptimizedLLMRuntime + apertusNetwork()` as the canonical quantized-Apertus path. The same fix applies transparently to Gemma, Llama, Qwen, Voxtral — every downstream model that uses the DSL benefits.

Test plan

- `placeholder` contract (`PlaceholderTensorDataTest`, 8 cases)
- `ApertusNetworkLoader.fromGguf().load()` no longer OOMs
- `WeightParameter.value = tensor` setter is the only path WeightMapper uses (i.e. the lazy never accidentally fires before substitution)

🤖 Generated with Claude Code