
Apertus xIELU formula / param-space mismatch — forward-pass NaNs on real Apertus-8B Q4_K_S #103

@michalharakal

Description

Summary

After PR #102, the load → convert → fromWeights → forward chain runs to completion against unsloth/Apertus-8B-Instruct-2509-GGUF (Q4_K_S), but the logit vector is all-NaN after the first forward pass. The NaNs originate in ApertusXIELU.xielu(buf, params) in layer 0.

Diagnosis

ApertusXIELU.kt evaluates:

alpha_p_eff = softplus(alpha_p)
alpha_n_eff = beta + softplus(alpha_n)

if x > 0:  alpha_p_eff * x*x + beta * x
else:      (expm1(min(x, eps)) - x) * alpha_n_eff + beta * x

with softplus(x) = if (x > 20) x else ln(1 + exp(x)).
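
For reference, a minimal scalar transcription of the formula as described above (the function and parameter names here are mine, not the actual ApertusXIELU.kt signatures):

import kotlin.math.exp
import kotlin.math.expm1
import kotlin.math.ln
import kotlin.math.min

// Numerically guarded softplus: the linear shortcut avoids exp() overflow
// for large inputs (and is exact to float precision once x > 20).
fun softplus(x: Float): Float = if (x > 20f) x else ln(1f + exp(x))

// Scalar xIELU exactly as the diagnosis above describes it.
fun xielu(x: Float, alphaP: Float, alphaN: Float, beta: Float, eps: Float): Float {
    val alphaPEff = softplus(alphaP)
    val alphaNEff = beta + softplus(alphaN)
    return if (x > 0f) alphaPEff * x * x + beta * x
    else (expm1(min(x, eps)) - x) * alphaNEff + beta * x
}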

The unsloth GGUF stores per-layer xIELU params (verified with the gguf Python library):

layer  alpha_p              alpha_n      beta   eps
0      166.0                40.75        0.5    -9.98e-7
1      174.0                31.625       0.5    -9.98e-7
2      128.0                22.875       0.5    -9.98e-7
…      (decaying)           (decaying)   0.5    -9.98e-7
11     -0.477                            0.5    -9.98e-7
…      (small / negative)                0.5    -9.98e-7

For layer 0, softplus(166) ≈ 166 (the > 20 shortcut applies), so the activation collapses to 166·x² + 0.5·x on the positive branch. Any input with |x| beyond a few units overflows the FFN output, and the post-FFN matmul + residual chain NaN-poisons the rest of the forward pass.
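
Plugging layer 0's stored values into the sketch above makes the blow-up concrete (the fp16 bound is an assumption about the compute dtype; even in fp32, the repeated squaring compounds across layers and overflows within a handful of blocks):

// Layer 0: softplus(166) = 166 via the shortcut, so the positive branch
// is 166*x^2 + 0.5*x. At x = 20 that is already past the fp16 max of 65504.
fun main() {
    val y = xielu(20f, alphaP = 166f, alphaN = 40.75f, beta = 0.5f, eps = -9.98e-7f)
    println(y) // ~66410.0
}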

Hypothesis

One of the following:

  1. The unsloth GGUF stores alpha_p / alpha_n in a different transform space than ApertusXIELU.kt expects: softplus may already have been applied, i.e. these are the post-softplus alpha_p_eff values, not the raw learned parameters. In that case the right fix is to drop the softplus calls in the activation and use the stored values directly. Even so, layer 0 would still have alpha_p_eff = 166, and the activation would still overflow for any non-trivial input.
  2. ApertusXIELU.kt's formula diverges from the Apertus-8B reference. Possible candidates:
    • The reference may scale alpha_p by some normalization factor.
    • The if x > 0 branch might use alpha_p_eff * (e^x - x - 1), an ELU-like form, rather than alpha_p_eff * x² (see the sketch after this list).
    • beta, eps, etc. may participate in a different combination.
  3. The unsloth converter has an upstream bug and writes the wrong values into the wrong fields.
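
If candidate 2's ELU-like form turns out to be right, the positive branch would change roughly as sketched below (speculative until diffed against the HF modeling code; softplus and the kotlin.math imports come from the first sketch):

// Hypothetical candidate-2 positive branch: alpha_p_eff * (e^x - x - 1),
// i.e. expm1(x) - x, instead of alpha_p_eff * x^2. Unverified.
fun xieluPositiveAlt(x: Float, alphaP: Float, beta: Float): Float {
    val alphaPEff = softplus(alphaP)
    return alphaPEff * (expm1(x) - x) + beta * x
}

Note that with alpha_p_eff = 166 this form still explodes for moderate positive x, so candidate 2 alone would not explain layer 0 unless the parameter space is also different (candidate 1).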

What to verify

  • Pull the HF reference checkpoint swiss-ai/Apertus-8B-Instruct-2509 and inspect the actual act_fn parameter values for layer 0; check whether they match the unsloth-stored values or are very different (e.g. small).
  • Diff ApertusXIELU.xielu against the activation forward in the HF Apertus modeling code (modeling_apertus.py or similar).
  • Quick sanity probe: replace ApertusXIELU with a no-op linear (or with silu) and re-run the forward pass. If the logits become finite, the failure is fully inside xIELU (see the sketch after this list).
  • If a different unsloth quant of Apertus is available, reproduce under it to rule out an unsloth-conversion glitch specific to Q4_K_S.
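
For the silu probe in the third bullet, a minimal sketch of the pieces involved (how the activation gets swapped in the network DSL is not shown; silu and the finiteness check are standard, the helper names are mine):

import kotlin.math.exp

// Standard SiLU (x * sigmoid(x)): a bounded-growth stand-in for xIELU.
fun silu(x: Float): Float = x / (1f + exp(-x))

// After re-running the forward pass with silu swapped in:
// finite logits here would isolate the failure to the xIELU path.
fun allFinite(logits: FloatArray): Boolean = logits.all { it.isFinite() }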

Where the formula lives

  • llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusXIELU.kt:27
  • The DSL wiring is apertusNetwork() line 65: stage.xielu(id = "act_fn"). The DSL-emitted op lives in skainet-lang-core; confirm it computes the same thing as the ApertusXIELU.xielu reference.
  • Per-layer params are loaded by extractXIELUParams / extractXIELUParamsFromStreamingMeta in ApertusWeightLoader.kt, then converted to scalar tensors named blk.N.mlp.act_fn.{alpha_p,alpha_n,beta,eps} by ApertusNetworkLoader.applyWeightsToNetwork.
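
For orientation when cross-checking the loader output, the per-layer scalar tensor names follow this pattern (a sketch of the naming convention only):

// blk.N.mlp.act_fn.{alpha_p,alpha_n,beta,eps}, as described above.
fun xieluParamNames(n: Int): List<String> =
    listOf("alpha_p", "alpha_n", "beta", "eps").map { "blk.$n.mlp.act_fn.$it" }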

Out of scope

This issue is only about the activation. Loader, module construction, and Q4_K dispatch are all correct (PR #102 verified the loader and fromGguf().load() end-to-end on real Apertus-8B). Once the activation is fixed, the forward / generate / tool-calling smoke tests in ApertusRealGgufLoadingTest should pass without further loader work.
