Summary
After PR #102 the load → convert → fromWeights → forward chain runs to completion against unsloth/Apertus-8B-Instruct-2509-GGUF (Q4_K_S) — but the logit vector is all-NaN at the first forward pass. The propagation source is ApertusXIELU.xielu(buf, params) in layer 0.
Diagnosis
ApertusXIELU.kt evaluates:
alpha_p_eff = softplus(alpha_p)
alpha_n_eff = beta + softplus(alpha_n)
if x > 0: alpha_p_eff * x*x + beta * x
else: (expm1(min(x, eps)) - x) * alpha_n_eff + beta * x
with softplus(x) = if (x > 20) x else ln(1 + exp(x)).
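A minimal standalone Kotlin sketch of that formula, exactly as described above (names and signatures here are illustrative, not the actual ApertusXIELU.kt code):

```kotlin
import kotlin.math.exp
import kotlin.math.expm1
import kotlin.math.ln
import kotlin.math.min

// Numerically-guarded softplus, matching the description above.
fun softplus(x: Float): Float = if (x > 20f) x else ln(1f + exp(x))

// xIELU as described in this issue; the params are the per-layer scalars from the GGUF.
fun xielu(x: Float, alphaP: Float, alphaN: Float, beta: Float, eps: Float): Float {
    val alphaPEff = softplus(alphaP)
    val alphaNEff = beta + softplus(alphaN)
    return if (x > 0f) {
        alphaPEff * x * x + beta * x
    } else {
        (expm1(min(x, eps)) - x) * alphaNEff + beta * x
    }
}
```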
The unsloth GGUF stores per-layer xIELU params (verified via gguf Python lib):
| layer | alpha_p | alpha_n | beta | eps |
|---|---|---|---|---|
| 0 | 166.0 | 40.75 | 0.5 | -9.98e-7 |
| 1 | 174.0 | 31.625 | 0.5 | -9.98e-7 |
| 2 | 128.0 | 22.875 | 0.5 | -9.98e-7 |
| … | (decaying) | (decaying) | 0.5 | -9.98e-7 |
| 11 | -0.477 | … | 0.5 | -9.98e-7 |
| … | (small / negative) | … | 0.5 | -9.98e-7 |
For layer 0, softplus(166) ≈ 166 (the > 20 shortcut applies), so the activation collapses to 166·x² + 0.5·x. Any input with |x| larger than a few makes the FFN output blow up, and the post-FFN matmul + residual chain NaN-poisons the rest of the forward pass.
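To make the blow-up concrete, a quick worked check with the collapsed layer-0 form 166·x² + 0.5·x; the magnitudes are the point here, the exact forward path is not modeled:

```kotlin
fun main() {
    // Collapsed layer-0 positive branch from the diagnosis above.
    fun layer0(x: Float) = 166.0f * x * x + 0.5f * x

    // A handful of plausible pre-activation magnitudes.
    for (x in listOf(1f, 2f, 5f, 10f, 20f)) {
        println("x = $x -> ${layer0(x)}")
    }
    // x = 1  ->   166.5
    // x = 5  ->  4152.5
    // x = 10 -> 16605.0
    // x = 20 -> 66410.0  (already past the FP16 max of 65504 if any half-precision
    //                     buffer is involved); once one element hits Inf, the next
    //                     matmul/residual produces Inf - Inf = NaN and poisons the rest.
}
```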
Hypothesis
One of the following:
- The unsloth GGUF stores alpha_p / alpha_n in a different transform space than ApertusXIELU.kt expects: softplus may already have been applied, i.e. these are the post-softplus alpha_p_eff values, not the raw learned parameters. In that case the right fix is to drop the softplus calls in the activation and use the stored values directly, which would still leave layer 0 with alpha_p_eff = 166, so the activation would still overflow for any non-trivial input (see the sketch after this list).
- ApertusXIELU.kt's formula diverges from the Apertus-8B reference. Possible candidates:
  - The reference may scale alpha_p by some normalization factor.
  - The if x > 0 branch might use alpha_p_eff * (e^x - x - 1) (an ELU-like form) rather than alpha_p_eff * x².
  - beta, eps, etc. may participate in a different combination.
- The unsloth converter is upstream-buggy and writes the wrong values in the wrong fields.
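A small sketch for the first candidate: even if the stored values are treated as already-effective (softplus dropped), layer 0 keeps alpha_p_eff = 166; a raw parameter on the order of 1 would be needed to land at a small effective value. The comparison numbers are purely illustrative, not taken from any reference checkpoint:

```kotlin
import kotlin.math.exp
import kotlin.math.ln

fun softplus(x: Double): Double = if (x > 20.0) x else ln(1.0 + exp(x))

fun main() {
    // Interpreting the stored 166.0 as a raw parameter changes nothing:
    println(softplus(166.0))   // 166.0 (the > 20 shortcut applies)

    // For comparison, raw parameters near zero give small effective alphas,
    // the regime where a 166·x²-style blow-up would not occur:
    println(softplus(0.0))     // ~0.693
    println(softplus(0.8))     // ~1.171
}
```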
What to verify
- Pull the original HF checkpoint swiss-ai/Apertus-8B-Instruct-2509 and inspect the actual act_fn parameter values for layer 0; see whether they match the unsloth-stored values or are very different (e.g. small).
- Compare ApertusXIELU.xielu against the activation forward in the HF Apertus modeling code (modeling_apertus.py or similar).
- Replace ApertusXIELU with a no-op linear (or with silu) and re-run the forward pass; if logits become finite, the failure is fully inside xIELU (a silu fallback sketch follows this list).
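For the no-op / silu A/B test, the math is just silu(x) = x · sigmoid(x). How it gets wired in place of stage.xielu depends on the skainet-lang-core DSL, so this is only the standalone function to drop in:

```kotlin
import kotlin.math.exp

// SiLU (a.k.a. swish): x * sigmoid(x). Finite for all finite x, so if the
// forward pass produces finite logits with this in place of xIELU, the NaNs
// originate inside the xIELU op or its parameters.
fun silu(x: Float): Float = x / (1f + exp(-x))
```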
Where the formula lives
- Reference implementation: llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusXIELU.kt:27
- The DSL wiring is apertusNetwork() line 65: stage.xielu(id = "act_fn"). The DSL-emitted op lives in skainet-lang-core; need to confirm it computes the same thing as the ApertusXIELU.xielu reference.
- Per-layer params are loaded by extractXIELUParams / extractXIELUParamsFromStreamingMeta in ApertusWeightLoader.kt, then converted to scalar tensors named blk.N.mlp.act_fn.{alpha_p,alpha_n,beta,eps} by ApertusNetworkLoader.applyWeightsToNetwork (a sanity-check sketch follows this list).
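One cheap guard while this is being investigated: assert that the effective alphas are in a sane range after loading, before the first forward. The map-of-scalars shape below is hypothetical (the real loader produces scalar tensors, not a map); only the blk.N.mlp.act_fn.* names come from the loader described above, and the threshold is an arbitrary default:

```kotlin
import kotlin.math.exp
import kotlin.math.ln

fun softplus(x: Float): Float = if (x > 20f) x else ln(1f + exp(x))

// Hypothetical diagnostic: scalars keyed by the GGUF-style names the loader emits,
// e.g. "blk.0.mlp.act_fn.alpha_p". Flags layers whose effective alpha_p would make
// the quadratic branch explode for moderate inputs.
fun checkXieluParams(scalars: Map<String, Float>, numLayers: Int, maxAlphaPEff: Float = 10f) {
    for (layer in 0 until numLayers) {
        val alphaP = scalars["blk.$layer.mlp.act_fn.alpha_p"] ?: continue
        val alphaPEff = softplus(alphaP)
        check(alphaPEff <= maxAlphaPEff) {
            "layer $layer: alpha_p_eff = $alphaPEff (raw $alphaP) looks implausibly large"
        }
    }
}
```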
Out of scope
This issue is only about the activation. Loader / module construction / Q4_K dispatch are all correct (PR #102 verified loader + fromGguf().load() end-to-end on real Apertus-8B). When the activation is fixed, forward / generate / tool calling smoke tests in ApertusRealGgufLoadingTest should pass without further loader work.