Summary
After PR #102 the load → convert → fromWeights → forward chain runs to completion against unsloth/Apertus-8B-Instruct-2509-GGUF (Q4_K_S) — but the logit vector is all-NaN at the first forward pass. The propagation source is ApertusXIELU.xielu(buf, params) in layer 0.
Diagnosis
ApertusXIELU.kt evaluates:
alpha_p_eff = softplus(alpha_p)
alpha_n_eff = beta + softplus(alpha_n)
if x > 0: alpha_p_eff * x*x + beta * x
else: (expm1(min(x, eps)) - x) * alpha_n_eff + beta * x
with softplus(x) = if (x > 20) x else ln(1 + exp(x)).
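A minimal standalone Kotlin sketch of that formula, exactly as described above (names and signatures here are illustrative, not the actual ApertusXIELU.kt code):

```kotlin
import kotlin.math.exp
import kotlin.math.expm1
import kotlin.math.ln
import kotlin.math.min

// Numerically-guarded softplus, matching the description above.
fun softplus(x: Float): Float = if (x > 20f) x else ln(1f + exp(x))

// xIELU as described in this issue; the params are the per-layer scalars from the GGUF.
fun xielu(x: Float, alphaP: Float, alphaN: Float, beta: Float, eps: Float): Float {
    val alphaPEff = softplus(alphaP)
    val alphaNEff = beta + softplus(alphaN)
    return if (x > 0f) {
        alphaPEff * x * x + beta * x
    } else {
        (expm1(min(x, eps)) - x) * alphaNEff + beta * x
    }
}
```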
The unsloth GGUF stores per-layer xIELU params (verified via gguf Python lib):
| layer | alpha_p | alpha_n | beta | eps |
|---|---|---|---|---|
| 0 | 166.0 | 40.75 | 0.5 | -9.98e-7 |
| 1 | 174.0 | 31.625 | 0.5 | -9.98e-7 |
| 2 | 128.0 | 22.875 | 0.5 | -9.98e-7 |
| … | (decaying) | (decaying) | 0.5 | -9.98e-7 |
| 11 | -0.477 | … | 0.5 | -9.98e-7 |
| … | (small / negative) | … | 0.5 | -9.98e-7 |
For layer 0, softplus(166) ≈ 166 (the > 20 shortcut applies), so the activation collapses to 166·x² + 0.5·x. Any input with |x| larger than a few makes the FFN output blow up, and the post-FFN matmul + residual chain NaN-poisons the rest of the forward pass.
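To make the blow-up concrete, a quick worked check with the collapsed layer-0 form 166·x² + 0.5·x; the magnitudes are the point here, the exact forward path is not modeled:

```kotlin
fun main() {
    // Collapsed layer-0 positive branch from the diagnosis above.
    fun layer0(x: Float) = 166.0f * x * x + 0.5f * x

    // A handful of plausible pre-activation magnitudes.
    for (x in listOf(1f, 2f, 5f, 10f, 20f)) {
        println("x = $x -> ${layer0(x)}")
    }
    // x = 1  ->   166.5
    // x = 5  ->  4152.5
    // x = 10 -> 16605.0
    // x = 20 -> 66410.0  (already past the FP16 max of 65504 if any half-precision
    //                     buffer is involved); once one element hits Inf, the next
    //                     matmul/residual produces Inf - Inf = NaN and poisons the rest.
}
```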
Hypothesis
One of the following:
- The unsloth GGUF stores alpha_p / alpha_n in a different transform space than ApertusXIELU.kt expects: softplus may already have been applied, i.e. these are the post-softplus alpha_p_eff values, not the raw learned parameters. In that case the right fix is to drop the softplus calls in the activation and use the stored values directly, which would still leave layer 0 with alpha_p_eff = 166, so the activation would still overflow for any non-trivial input (see the sketch after this list).
- ApertusXIELU.kt's formula diverges from the Apertus-8B reference. Possible candidates:
  - The reference may scale alpha_p by some normalization factor.
  - The if x > 0 branch might use alpha_p_eff * (e^x - x - 1) (an ELU-like form) rather than alpha_p_eff * x².
  - beta, eps, etc. may participate in a different combination.
- The unsloth converter is upstream-buggy and writes the wrong values in the wrong fields.
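A small sketch for the first candidate: even if the stored values are treated as already-effective (softplus dropped), layer 0 keeps alpha_p_eff = 166; a raw parameter on the order of 1 would be needed to land at a small effective value. The comparison numbers are purely illustrative, not taken from any reference checkpoint:

```kotlin
import kotlin.math.exp
import kotlin.math.ln

fun softplus(x: Double): Double = if (x > 20.0) x else ln(1.0 + exp(x))

fun main() {
    // Interpreting the stored 166.0 as a raw parameter changes nothing:
    println(softplus(166.0))   // 166.0 (the > 20 shortcut applies)

    // For comparison, raw parameters near zero give small effective alphas,
    // the regime where a 166·x²-style blow-up would not occur:
    println(softplus(0.0))     // ~0.693
    println(softplus(0.8))     // ~1.171
}
```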
What to verify
- Pull the original HF checkpoint swiss-ai/Apertus-8B-Instruct-2509 and inspect the actual act_fn parameter values for layer 0; see whether they match the unsloth-stored values or are very different (e.g. small).
- Compare ApertusXIELU.xielu against the activation forward in the HF Apertus modeling code (modeling_apertus.py or similar).
- Replace ApertusXIELU with a no-op linear (or with silu) and re-run the forward pass; if logits become finite, the failure is fully inside xIELU (a silu fallback sketch follows this list).
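For the no-op / silu A/B test, the math is just silu(x) = x · sigmoid(x). How it gets wired in place of stage.xielu depends on the skainet-lang-core DSL, so this is only the standalone function to drop in:

```kotlin
import kotlin.math.exp

// SiLU (a.k.a. swish): x * sigmoid(x). Finite for all finite x, so if the
// forward pass produces finite logits with this in place of xIELU, the NaNs
// originate inside the xIELU op or its parameters.
fun silu(x: Float): Float = x / (1f + exp(-x))
```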
Where the formula lives
- Reference implementation: llm-inference/apertus/src/commonMain/kotlin/sk/ainet/models/apertus/ApertusXIELU.kt:27
- The DSL wiring is apertusNetwork() line 65: stage.xielu(id = "act_fn"). The DSL-emitted op lives in skainet-lang-core; need to confirm it computes the same thing as the ApertusXIELU.xielu reference.
- Per-layer params are loaded by extractXIELUParams / extractXIELUParamsFromStreamingMeta in ApertusWeightLoader.kt, then converted to scalar tensors named blk.N.mlp.act_fn.{alpha_p,alpha_n,beta,eps} by ApertusNetworkLoader.applyWeightsToNetwork (a sanity-check sketch follows this list).
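One cheap guard while this is being investigated: assert that the effective alphas are in a sane range after loading, before the first forward. The map-of-scalars shape below is hypothetical (the real loader produces scalar tensors, not a map); only the blk.N.mlp.act_fn.* names come from the loader described above, and the threshold is an arbitrary default:

```kotlin
import kotlin.math.exp
import kotlin.math.ln

fun softplus(x: Float): Float = if (x > 20f) x else ln(1f + exp(x))

// Hypothetical diagnostic: scalars keyed by the GGUF-style names the loader emits,
// e.g. "blk.0.mlp.act_fn.alpha_p". Flags layers whose effective alpha_p would make
// the quadratic branch explode for moderate inputs.
fun checkXieluParams(scalars: Map<String, Float>, numLayers: Int, maxAlphaPEff: Float = 10f) {
    for (layer in 0 until numLayers) {
        val alphaP = scalars["blk.$layer.mlp.act_fn.alpha_p"] ?: continue
        val alphaPEff = softplus(alphaP)
        check(alphaPEff <= maxAlphaPEff) {
            "layer $layer: alpha_p_eff = $alphaPEff (raw $alphaP) looks implausibly large"
        }
    }
}
```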
Out of scope
This issue is only about the activation. Loader / module construction / Q4_K dispatch are all correct (PR #102 verified loader + fromGguf().load() end-to-end on real Apertus-8B). When the activation is fixed, forward / generate / tool calling smoke tests in ApertusRealGgufLoadingTest should pass without further loader work.