| seed | vanilla loss | geodesic loss | delta | outcome |
|---|---|---|---|---|
| 42 | 2.464 | 2.713 | +10.1% | geodesic worse |
| 7 | 2.507 | 2.479 | −1.1% | geodesic better |
| 123 | 2.272 | 2.620 | +15.3% | geodesic worse |
| mean | 2.414 | 2.604 | +7.9% | 1/3 wins |
Verdict: inconclusive, leaning negative. The PyTorch result that won 3/3 seeds at -0.4% did NOT replicate in this Prometheus run.
While the A/B was training, a code review surfaced a real bug in
prom_attention_forward:
h k = tape_matmul(x_id, K_w); # K_w is a trainable param
h k_val = tape_value(k); # rip out the value
h kt_val = arr_transpose(k_val); # transpose in OMC space
h kt = tape_const(kt_val); # re-inject as a CONSTANT
h scores = tape_matmul(q, kt); # gradient flows ONLY to q
The tape_value → arr_transpose → tape_const sequence severs
gradient flow through K. K_w gets zero gradient from the attention
score path. K is effectively frozen at its random init throughout
training.
This means both arms (A and B) ran broken attention. The
geodesic bias was being added to scores q · K_random^T, not
scores from a learned K. We're testing whether the geodesic bias
helps when keys are random — an entirely different question from
the PyTorch experiment where K was trainable.
In the PyTorch experiment, K was trained alongside Q and V. The geodesic bias added a positional inductive prior on top of learned attention. The model could discover patterns like "attend to nearby positions for short-range dependencies" and the bias nudged it toward Fibonacci-coprime distance metrics specifically.
In our Prometheus run, K is fixed at random. The attention scores have no learned structure. Adding a positional bias to random scores either:
- adds random noise (no benefit) — most likely
- accidentally provides the ONLY structure → tiny effect either direction
The result is consistent with "broken attention plus a bias either hurts (overrides random noise that happened to work) or doesn't help much (random noise was already meaningless)."
- Add
tape_transposeRust builtin. Differentiable transpose so K trains through the score path. ~30 lines forward + backward. - Verify K's gradient is non-zero after one training step.
- Re-run the A/B with both arms having trainable K.
- If geodesic still loses 0/3 at this scale, then we have a real negative — substrate bias doesn't help when corpus is small + model is small + training is short. That's a legit honest finding.
- If geodesic wins or ties, the PyTorch result replicates.
We shipped a layer without testing its end-to-end gradient flow.
test_prometheus.omc has 10 tests covering every other layer and
zero touching attention. That's the regression-prevention gap to
close before any further A/B testing.
The fail-forward path:
- Fix K (add tape_transpose)
- Add
test_attention_backward_flows_to_QKVto lock it - Re-run this A/B
- Report whichever result lands (real win OR real null)