
feat(dsl): lazy zero-init for parameter placeholders #588

Merged
michalharakal merged 1 commit into develop from feature/dsl-lazy-zero-init on May 2, 2026

Conversation

@michalharakal
Contributor

Summary

Closes #587.

The DSL eagerly allocates zero FloatArrays for every Linear / Conv1d / Conv2d / DenseImpl parameter at module construction time. Any downstream loader (LlamaNetworkLoader, GemmaNetworkLoader, ApertusNetworkLoader, …) builds the network first and only then substitutes weights via WeightMapper.applyWeights, so the eager zeros are always immediately discarded — but they determine the JVM's peak heap footprint. For Apertus-8B (32 layers × 6 projections × ~14k × 4k FP32 + 131k × 4k embed + …) that's ~27 GB of zeros allocated and thrown away — anything under 32 GB heap OOMs at NetworkBuilder.kt:652 before a single weight is loaded.
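
For a sense of scale, here is a back-of-envelope sketch of what a single eagerly zero-filled FFN projection costs, assuming a 14336 × 4096 FP32 weight (illustrative arithmetic only):

```kotlin
fun main() {
    // One FFN projection weight (gate/up/down shape for this model family), FP32:
    val rows = 14_336
    val cols = 4_096
    val bytes = rows.toLong() * cols * Float.SIZE_BYTES
    println("one %dx%d FP32 weight = %.0f MiB of zeros".format(rows, cols, bytes / (1024.0 * 1024.0)))
    // ~224 MiB per projection; hundreds of such throwaway allocations across
    // 32 layers (plus the 131k x 4096 embedding) are what set the peak heap.
}
```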

Fix

Add TensorDataFactory.placeholder(shape, dtype), returning a TensorData whose underlying primitive array materializes lazily on first read. DenseTensorDataFactory overrides it with LazyZeroFloatArrayTensorData / LazyZeroIntArrayTensorData, which implement FloatArrayTensorData<T> / IntArrayTensorData<T> with the backing array held behind a by lazy { ... } delegate. The default interface implementation falls back to zeros, preserving behavior for any custom factory.
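
A minimal sketch of the mechanism, with simplified, hypothetical signatures (the real TensorDataFactory and TensorData interfaces carry dtype and generic parameters that are omitted here):

```kotlin
// Sketch only: names and signatures are simplified stand-ins for the real SKaiNET types.
interface TensorDataFactory {
    fun zeros(shape: IntArray): FloatTensorData

    // Default implementation falls back to eager zeros, so any custom factory
    // keeps its existing behavior without implementing anything new.
    fun placeholder(shape: IntArray): FloatTensorData = zeros(shape)
}

class FloatTensorData(val shape: IntArray, private val backing: Lazy<FloatArray>) {
    val volume: Int get() = shape.fold(1) { acc, dim -> acc * dim }
    val data: FloatArray get() = backing.value   // materializes (and caches) on first read
}

object DenseTensorDataFactory : TensorDataFactory {
    override fun zeros(shape: IntArray): FloatTensorData =
        FloatTensorData(shape, lazyOf(FloatArray(shape.fold(1) { acc, dim -> acc * dim })))

    // Lazy zero-init: no FloatArray exists until someone actually reads `data`.
    override fun placeholder(shape: IntArray): FloatTensorData =
        FloatTensorData(shape, lazy { FloatArray(shape.fold(1) { acc, dim -> acc * dim }) })
}
```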

Switch every eager-init call site in NetworkBuilder.kt (createLinear, DenseImpl.create, Conv1dImpl.create, Conv2dImpl.create) and the matching ExecutionContext.zeros(...) paths to call placeholder(...) instead. Behavior is strictly unchanged for any caller that reads the tensor — the lazy materializes to zeros on first access and is cached. For the WeightMapper substitution path, the placeholder's lazy never fires because parameter.value = swaps the entire Tensor, GC'ing the placeholder unread.
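
A simplified view of the call-site change and of why the substitution path never materializes anything, reusing the hypothetical types from the sketch above (WeightParameter, createLinear, and applyWeights here are stand-ins, not the real signatures):

```kotlin
// Hypothetical, condensed call site; the real createLinear / WeightMapper APIs differ.
class WeightParameter(var value: FloatTensorData)

fun createLinear(factory: TensorDataFactory, inFeatures: Int, outFeatures: Int): WeightParameter =
    // Before this change, factory.zeros(...) eagerly allocated outFeatures * inFeatures floats here.
    WeightParameter(factory.placeholder(intArrayOf(outFeatures, inFeatures)))

fun applyWeights(parameter: WeightParameter, loaded: FloatTensorData) {
    // The whole tensor is swapped out: the placeholder's lazy never fires and the
    // unread placeholder becomes garbage immediately.
    parameter.value = loaded
}
```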

Verification

  • ✅ :skainet-lang:skainet-lang-core:jvmTest — all 614 tests across 80 suites green.
  • ✅ New PlaceholderTensorDataTest (8 cases) pins the contract: shape-only access without materialization, materialize-to-zeros on first read with parity to zeros(), write-through, buffer caching, instance independence, FP32/FP16/Int32/Int8 paths.
  • ✅ End-to-end against unsloth/Apertus-8B-Instruct-2509-GGUF (Q4_K_S, 4.7 GB on disk) via the downstream ApertusRealGgufLoadingTest.kt: ApertusNetworkLoader.fromGguf().load<FP32, Float>(ctx) now succeeds in 12 GB heap (previously OOMed at 12 GB), constructs all 35 top-level modules in 13 s.

Knock-on impact

Since the SKaiNET-transformers cleanup commit 8a7e0ff removed ApertusQuantizedRuntime, there has been no working memory-efficient path to run Apertus-8B Q4_K_S end-to-end on a normal-sized JVM. This PR unblocks OptimizedLLMRuntime + apertusNetwork() as the canonical quantized-Apertus path.

The same fix applies transparently to Gemma, Llama, Qwen, and Voxtral — every downstream model that uses the DSL benefits.

Test plan

  • Unit-test contract for placeholder (PlaceholderTensorDataTest, 8 cases); a condensed sketch of the contract follows this list
  • Existing skainet-lang-core suite (614 tests, 0 regressions)
  • Real-model integration test against Apertus-8B Q4_K_S — ApertusNetworkLoader.fromGguf().load() no longer OOMs
  • Reviewer sanity-check: confirm WeightParameter.value = tensor setter is the only path WeightMapper uses (i.e. the lazy never accidentally fires before substitution)
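
A condensed, hypothetical sketch of the contract's core assertion, reusing the sketch types above (the real PlaceholderTensorDataTest also covers FP16/Int32/Int8, write-through, and buffer caching):

```kotlin
import kotlin.test.Test
import kotlin.test.assertFalse
import kotlin.test.assertTrue

class PlaceholderSketchTest {
    @Test
    fun shapeAccessDoesNotMaterializeAndFirstReadYieldsZeros() {
        var allocated = false
        val tensor = FloatTensorData(intArrayOf(4, 8), lazy { allocated = true; FloatArray(32) })

        assertTrue(tensor.volume == 32)   // shape-only access...
        assertFalse(allocated)            // ...has not allocated the backing array

        assertTrue(tensor.data.all { it == 0f })   // first read materializes zeros
        assertTrue(allocated)
    }
}
```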

🤖 Generated with Claude Code

The DSL's createLinear / Conv1d / Conv2d / DenseImpl construction paths
called `tensorDataFactory.zeros<T, V>(shape, kClass)` to satisfy each
module's constructor whenever the user had not provided initial weights
or bias. The allocation was eager — a full `FloatArray(shape.volume)`
materialized at module-construction time. For real-world transformers
loaded via downstream weight loaders the call sequence is always:

  1. Build the empty network (Llama / Gemma / Apertus / Qwen / ...
     `*NetworkLoader → *Network(metadata)`), eagerly allocating zeros
     for every Linear's weights and bias.
  2. Load weights from disk (~5 GB raw bytes for an 8B Q4_K_S model).
  3. Substitute via `WeightMapper.applyWeights`, which sets
     `parameter.value = loadedTensor`. The eager zeros are now garbage.

For Apertus-8B (32 layers, 4096 hidden, ~14k FFN, 131k vocab) the eager
zeros amount to ~27 GB of FP32 — peak heap ~32 GB just to construct +
populate the model. Anything under that OOMs at NetworkBuilder.kt:652
during step 1, before weights are even read.

Fix: introduce `TensorDataFactory.placeholder(shape, dtype)`, returning
a `TensorData` whose underlying primitive array materializes lazily on
first read. The default interface implementation falls back to `zeros`
(any custom factory keeps existing behavior); `DenseTensorDataFactory`
overrides with `LazyZeroFloatArrayTensorData` / `LazyZeroIntArrayTensorData`
which back `FloatArrayTensorData<T>` / `IntArrayTensorData<T>` with a
`by lazy { ... }` delegate. Int8 falls back to `zeros` (eager byte
allocation is rarely the dominant cost on real models).

Switch every eager-init call site in `NetworkBuilder.kt`
(`createLinear`, `DenseImpl.create`, `Conv1dImpl.create`, `Conv2dImpl.create`)
plus the matching `ExecutionContext.zeros(...)` paths to call
`placeholder(...)` instead. Behavior is strictly unchanged for any
caller that *reads* the tensor — the lazy materializes to zeros on
first access and is cached. For the WeightMapper substitution path,
the placeholder's lazy never fires because `parameter.value =` swaps
the entire `Tensor`, GC'ing the placeholder unread.

Verified end-to-end against unsloth/Apertus-8B-Instruct-2509-GGUF
(Q4_K_S, 4.7 GB on disk) via the downstream
`SKaiNET-transformers/llm-inference/apertus/.../ApertusRealGgufLoadingTest.kt`:
`ApertusNetworkLoader.fromGguf().load<FP32, Float>(ctx)` now succeeds
in 12 GB heap (previously OOMed at 12 GB), constructs all 35 top-level
modules in 13 s.

Tests:
- New `PlaceholderTensorDataTest` (8 cases) pins the contract:
  shape-only access, materialize-to-zeros on first read, write-through,
  buffer caching, instance independence, FP32 / FP16 / Int32 paths,
  Int8 fallback.
- Full `:skainet-lang:skainet-lang-core:jvmTest` (614 tests) green.

Closes #587.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 2, 2026

📖 Documentation Preview

The documentation has been built successfully for this PR.

Generated Files:

  • Operator documentation: docs/modules/operators/_generated_/
  • JSON schema output: operators.json

Artifacts:

  • Download the documentation-preview-588 artifact to view the complete documentation locally.

This comment will be updated automatically when the PR is updated.

michalharakal merged commit 75b82e2 into develop on May 2, 2026
10 checks passed
michalharakal deleted the feature/dsl-lazy-zero-init branch on May 2, 2026 at 16:35
michalharakal added a commit that referenced this pull request May 2, 2026
- Quickstart import now pins skainet-bom:0.23.0.
- "What's New" rewritten for 0.23.0: placeholder API + DSL OOM fix (PR #588)
  and the K/N pread random-access fix (PR #591). Older 0.22.0 / 0.22.2
  highlights moved out of the README; CHANGELOG.md remains the canonical
  full history (link already in place).
- BOM caveat about 0.22.2 being the first correctly-coordinated publish is
  retained — still actionable for anyone trying to import older BOMs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

DSL eagerly allocates zero tensors for every Linear, OOMs on real models
