
Commit 5275c4d

Geodesic attention — derived from what we know
After three falsifications of HBit-tension attention gates (key, score, learned), the common failure pattern was clear: every variant applied substrate metric to a CONTINUOUS LEARNED quantity (key magnitudes or attention scores). Those quantities have no architectural reason to land on Fibonacci attractors.

What we know WORKS (CRT-PE, HBit OOD) applies substrate to INTEGER-VALUED quantities (positions, sample-aggregate signals): quantities that intrinsically live in the substrate's basis.

The derivation (full writeup: GEODESIC_ATTENTION_DERIVATION.md):

    scores[i, j] = (q_i · k_j) / sqrt(d) - alpha * geodesic(i, j)
    geodesic(i, j) = sum over CRT moduli {5, 8, 13, 21, 34, 55, 89, 144}
                     of circular_distance((i % m), (j % m)) / m

This is ALiBi-style additive position bias, but the position distance is computed in the same CRT-Fibonacci lattice that CRT-PE already lives in. Substrate signal applied to POSITIONS (integer, native basis) instead of activations.

Properties that distinguish this from the previous three failures:
- Substrate metric on integer quantities (vs continuous floats)
- Same lattice as CRT-PE (which is the validated substrate win)
- Additive pre-softmax bias (composes natively)
- Single learnable alpha per block, init 0 (must DISCOVER bias)
- Precomputed at construction (no per-batch substrate compute)
- Independent of token content (geometry only)

Training run kicked off; results in a separate commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 9c4c158 commit 5275c4d

3 files changed

Lines changed: 309 additions & 16 deletions


Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
# Geodesic attention: deriving from what we've measured

## What we actually know (not what we hoped)

After CRT-PE (2 wins) + HBit OOD (1 win) + three falsified attention
gates, the empirical map is:

| Where substrate applied | Basis | Result |
|---|---|---|
| Position → CRT-PE | integer position `i` | **WINS** −5.4% / −2.9% |
| Reference-free OOD score | per-sample HBit tension | **WINS** AUROC 1.0 |
| Attention KEY magnitude gate | learned float `\|k\|.mean(-1)` | FAILS 0/3 |
| Attention SCORE gate | learned float `q @ k^T / √d` | FAILS 0/3 |
| Same with learned threshold | same float quantity | FAILS 0/3 |

**The common failure pattern**: every failed gate applied
`attractor_distance(·)` to a *continuous, Gaussian-ish, learned*
quantity. Those quantities have no architectural reason to land
on Fibonacci attractors, because those attractors live in integer ID
space (the basis that CRT-PE actually uses).

**The wins share a pattern**: substrate signal applied to a
quantity that's *intrinsically integer-valued* (positions in
CRT-PE) or *aggregated cross-position* (HBit OOD over a sample).
The substrate's lattice lives in those bases.

## The right basis for attention bias

Attention has TWO sources of structure:
1. **The query/key activations** (continuous, learned, no substrate
   structure → the target of all three previous attempts)
2. **The query/key POSITIONS** (integer, indexed 0..T−1, and
   meaningful in substrate space, which is why CRT-PE works)

We've been adding the substrate signal to source #1. The right move
is to add it to source #2. Specifically: **attention bias should be
a function of geodesic distance between positions i and j in the
same CRT-Fibonacci-moduli space CRT-PE already uses.**

## The formula

For positions i, j and Fibonacci moduli M = {5, 8, 13, 21, 34, 55, 89, 144}:

```
d_circ(i, j, m) = min(|(i % m) − (j % m)|, m − |(i % m) − (j % m)|)
geodesic(i, j)  = Σ_{m ∈ M} d_circ(i, j, m) / m     # normalize to [0, ~|M|/2]
```

Each per-modulus term is a circular distance on a ring of size `m`
(positions sharing the same residue contribute 0; antipodal residues
contribute about `m/2`, i.e. at most 1/2 once divided by `m`). The
total is the L1 sum over moduli: the geodesic length in the
CRT-Fibonacci lattice.

Why circular: positions on a ring of size `m` should be treated as
adjacent at the wrap. This matches CRT-PE, which uses
`sin(2π·(pos % m)/m)` and therefore has the same circularity.

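The two formulas above can be sketched in plain Python for a single pair of positions. This is a minimal illustration, not the commit's implementation (which builds a whole `[T, T]` tensor with PyTorch); the names `MODULI`, `d_circ` and `geodesic` are chosen here for the example, and the result is the raw sum before any table-level normalization.

```python
# Moduli as listed in the formula section.
MODULI = (5, 8, 13, 21, 34, 55, 89, 144)


def d_circ(i: int, j: int, m: int) -> int:
    """Circular distance between residues i % m and j % m on a ring of size m."""
    d = abs(i % m - j % m)
    return min(d, m - d)


def geodesic(i: int, j: int, moduli=MODULI) -> float:
    """Raw geodesic: per-modulus circular distances, each divided by m, summed."""
    return sum(d_circ(i, j, m) / m for m in moduli)


# Wrap-around: on the m = 5 ring, positions 0 and 4 are adjacent.
print(d_circ(0, 4, 5))  # 1
# Positions 5 apart share a residue mod 5, so that modulus contributes 0.
print(d_circ(0, 5, 5))  # 0
# A position is at distance 0 from itself.
print(geodesic(3, 3))   # 0.0
```

Each term `d_circ / m` is at most 1/2, so the raw sum stays within `[0, len(MODULI) / 2]`, matching the range noted in the formula's comment.
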
## The attention modification

Pre-softmax additive bias (the form that works for ALiBi):

```
scores_ij = (q_i · k_j) / √d − α · geodesic(i, j)
attn      = softmax(scores)
```

α is a learned scalar per attention block (initialized to 0, so the
model starts without the substrate signal and can leave it disabled
if the loss says to; same fairness as `hybrid_learned`).

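As a sanity check on the α = 0 initialization, here is a small stdlib-only sketch of one attention row. The helper names `softmax` and `biased_attention_row` and the toy inputs are hypothetical, chosen only for illustration; the actual model operates on `[B, T, T]` tensors.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_row(scores_row, dist_row, alpha):
    """softmax(scores - alpha * distance) for a single query position."""
    return softmax([s - alpha * d for s, d in zip(scores_row, dist_row)])


# Hypothetical toy inputs: equal content scores, increasing distance.
scores_row = [0.0, 0.0, 0.0]
dist_row = [0.0, 1.0, 2.0]

# alpha = 0: the bias vanishes and the weights are those of plain softmax.
print(biased_attention_row(scores_row, dist_row, alpha=0.0) == softmax(scores_row))  # True

# alpha > 0: pairs at larger distance receive less attention.
weights = biased_attention_row(scores_row, dist_row, alpha=1.0)
print(weights[0] > weights[1] > weights[2])  # True
```

With α = 0 the biased row equals the unbiased one exactly, which is the sense in which the model starts as plain `crt_only`.
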
## Why this should work where the previous three failed

| Property | Previous gates | Geodesic |
|---|:-:|:-:|
| Substrate metric applied to integer quantities | no | yes |
| Same basis as CRT-PE (proven to work) | no | yes |
| Composes additively with softmax | partly | yes |
| Model can disable via a single learnable scalar | no | yes |
| Computable once at init (not per-batch) | no | yes |
| Independent of token content | no | yes |

The last two are important: the geodesic table is `[T, T]`,
precomputed at model construction. The forward pass adds the bias
without computing anything per-batch. This is essentially **ALiBi
with substrate-geodesic distances instead of plain |i − j|
distance**. ALiBi itself is known to work, so the prior on
this formulation is much stronger than for another activation gate.

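One way to see how this differs from a plain `|i − j|` bias is to compare the two distances directly. The sketch below is illustrative only (the function names are assumptions, and it uses the raw, unnormalized geodesic): unlike linear distance, the geodesic is not monotonic in the position gap, because a gap that is a multiple of a modulus contributes nothing for that modulus.

```python
MODULI = (5, 8, 13, 21, 34, 55, 89, 144)


def geodesic(i: int, j: int) -> float:
    """Raw CRT-Fibonacci geodesic between positions i and j."""
    total = 0.0
    for m in MODULI:
        d = abs(i % m - j % m)
        total += min(d, m - d) / m
    return total


def linear_distance(i: int, j: int) -> int:
    """Plain |i - j| distance, as used by an ALiBi-style bias."""
    return abs(i - j)


near, far = 30, 55

# Linear distance orders the two positions as expected.
print(linear_distance(0, near) < linear_distance(0, far))  # True
# The geodesic orders them the other way: 55 is 0 mod 5 and 0 mod 55.
print(geodesic(0, near) > geodesic(0, far))                # True
```

So with α > 0 the bias does not simply penalize far-away tokens; it penalizes tokens whose residues differ across the moduli.
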
## Falsifiable prediction

- If geodesic attention WINS vs crt_only on the distractor mix:
  substrate IS useful as an attention modulator, but the basis
  matters. The transformerless thesis gets a third architectural
  win.
- If geodesic attention LOSES: attention modulation in OMC's
  substrate is truly dead at this scale, regardless of basis.
  Honest pivot to tokenizer-layer substrate becomes the only
  remaining substrate-in-attention story.

Either way, this is the final attention-side experiment. After
this we're moving the substrate's role away from attention
unless this works.

## Init details (matters for fair comparison)

- α = 0.0 per attention block (bias disabled at init; the model has
  to *find* the bias useful from gradient signal alone)
- Geodesic table normalized so its mean over (i, j) for i ≠ j
  is approximately 1.0 (so α has interpretable units)
- All other hyperparameters identical to
  `train_gate_reformulation.py` (d_model=128, n_blocks=4,
  seq_len=128, 1500 steps, distractor_frac=0.20, 3 seeds)

The only architectural variable changed from `crt_only` is the
addition of the geodesic bias to attention scores. Everything else
is identical.
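
The off-diagonal normalization in the second bullet can be sketched without PyTorch. This is a minimal pure-Python illustration under the same definitions as the formula section; `raw_table` and `normalize_offdiag` are names invented for the example, and the small `seq_len` of 16 is only there to keep the loop cheap.

```python
MODULI = (5, 8, 13, 21, 34, 55, 89, 144)


def raw_table(seq_len: int):
    """[seq_len][seq_len] table of raw geodesic distances."""
    table = [[0.0] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            for m in MODULI:
                d = abs(i % m - j % m)
                table[i][j] += min(d, m - d) / m
    return table


def normalize_offdiag(table):
    """Rescale so the mean over entries with i != j is 1.0."""
    n = len(table)
    n_offdiag = n * n - n
    total = sum(table[i][j] for i in range(n) for j in range(n) if i != j)
    mean = total / max(n_offdiag, 1)
    if mean <= 0:
        return table  # degenerate case (e.g. seq_len == 1): leave unchanged
    return [[v / mean for v in row] for row in table]


table = normalize_offdiag(raw_table(16))
offdiag = [table[i][j] for i in range(16) for j in range(16) if i != j]
print(round(sum(offdiag) / len(offdiag), 6))  # 1.0
```

After this rescaling, α = 1 corresponds to subtracting about one logit from an average off-diagonal pair, which is what makes the scalar's magnitude readable.
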

experiments/transformerless_lm/models.py

Lines changed: 67 additions & 16 deletions
@@ -103,6 +103,39 @@ def hbit_tension_gate(keys: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
     return 1.0 / (1.0 + scale * attractor_distance(keys))
 
 
+# Same Fibonacci moduli as CRT-PE. The geodesic distance is computed
+# in the same lattice the positional encoding lives in — that's the
+# architectural coherence that the previous gate formulations lacked.
+_GEODESIC_MODULI = _FIB_MODULI
+
+
+def geodesic_distance_table(seq_len: int) -> torch.Tensor:
+    """Precompute a [seq_len, seq_len] table of CRT-Fibonacci
+    geodesic distances. For each pair (i, j) and each modulus m,
+    take the circular distance between residues (i % m) and (j % m)
+    — `min(d, m - d)` so positions on a ring of size m wrap.
+    Sum over moduli, normalize by m so each modulus contributes
+    bounded magnitude.
+
+    Returned table is normalized so its mean over i ≠ j is ≈ 1.0,
+    giving the learned α-bias scalar interpretable units.
+    """
+    table = torch.zeros(seq_len, seq_len, dtype=torch.float32)
+    pos = torch.arange(seq_len)
+    for m in _GEODESIC_MODULI:
+        ri = (pos % m).unsqueeze(1)  # [T, 1]
+        rj = (pos % m).unsqueeze(0)  # [1, T]
+        d = (ri - rj).abs() % m  # [T, T]
+        d_circ = torch.minimum(d, m - d)  # circular distance
+        table = table + d_circ.float() / float(m)
+    # Normalize so mean of off-diagonal ≈ 1.0.
+    n_offdiag = seq_len * seq_len - seq_len
+    mean_offdiag = (table.sum() - torch.diagonal(table).sum()) / max(n_offdiag, 1)
+    if mean_offdiag.item() > 0:
+        table = table / mean_offdiag
+    return table
+
+
 # ---------------------------------------------------------------------------
 # Attention block
 # ---------------------------------------------------------------------------
@@ -125,21 +158,32 @@ class Attention(nn.Module):
     substrate distance is a useful signal for the task.
     """
 
-    def __init__(self, d_model: int, gate_mode: str = "none", dropout: float = 0.0):
+    def __init__(self, d_model: int, gate_mode: str = "none",
+                 seq_len: int = 128, dropout: float = 0.0):
         super().__init__()
-        if gate_mode not in ("none", "key", "score", "learned"):
+        if gate_mode not in ("none", "key", "score", "learned", "geodesic"):
             raise ValueError(f"unknown gate_mode: {gate_mode}")
         self.d_model = d_model
         self.qkv = nn.Linear(d_model, 3 * d_model)
         self.out = nn.Linear(d_model, d_model)
         self.gate_mode = gate_mode
         self.dropout = dropout
         if gate_mode == "learned":
-            # Initialize so sigmoid(W*d + b) ≈ 1/(1 + d) near d ≈ 0:
-            # picking W = -1, b = 0 gives sigmoid(-d) ∈ (0, 0.5], a
-            # softer version of the falsified gate. Both are learnable.
             self.gate_w = nn.Parameter(torch.tensor(-1.0))
             self.gate_b = nn.Parameter(torch.tensor(0.0))
+        if gate_mode == "geodesic":
+            # ALiBi-style additive position bias, but using CRT-Fibonacci
+            # geodesic distance instead of plain |i-j|. Precomputed once
+            # at construction so the forward pass adds a [T,T] tensor
+            # to scores — no per-batch substrate compute.
+            self.register_buffer(
+                "geodesic_bias", geodesic_distance_table(seq_len)
+            )
+            # α scalar — initialized to 0 so the model starts as pure
+            # crt_only and must DISCOVER the bias is useful from
+            # gradient signal alone. Same fairness condition as
+            # gate_mode="learned".
+            self.alpha = nn.Parameter(torch.tensor(0.0))
 
     def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
         B, T, D = x.shape
@@ -149,17 +193,18 @@ def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
         scores = (q @ k.transpose(-2, -1)) * scale  # [B, T, T]
 
         if self.gate_mode == "score":
-            # Gate on score VALUES (pre-mask). attractor_distance of
-            # raw scores tells us whether the (q·k) magnitude lands
-            # on a substrate attractor. Off-attractor scores get
-            # additively penalized in log-space, so softmax handles
-            # normalization natively.
-            d = attractor_distance(scores * 10.0)  # [B, T, T]
+            d = attractor_distance(scores * 10.0)
             log_gate = -torch.log1p(d)
             scores = scores + log_gate
+        elif self.gate_mode == "geodesic":
+            # Subtract α * geodesic(i, j). Larger distance → more
+            # negative bias → softmax attenuates that pair. α<0 would
+            # invert (favor distant pairs), so the sign of α is
+            # itself a learnable architectural choice.
+            scores = scores - self.alpha * self.geodesic_bias[:T, :T].unsqueeze(0)
 
         scores = scores.masked_fill(mask == 0, float('-inf'))
-        attn = F.softmax(scores, dim=-1)  # [B, T, T]
+        attn = F.softmax(scores, dim=-1)
 
         if self.gate_mode == "key":
             key_mag = k.abs().mean(dim=-1)
@@ -168,7 +213,7 @@ def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
             attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)
         elif self.gate_mode == "learned":
             key_mag = k.abs().mean(dim=-1)
-            d = attractor_distance(key_mag * 100.0)  # [B, T]
+            d = attractor_distance(key_mag * 100.0)
             gate = torch.sigmoid(self.gate_w * d + self.gate_b)
             attn = attn * gate.unsqueeze(1)
             attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)
@@ -198,9 +243,9 @@ def forward(self, x):
 
 
 class Block(nn.Module):
-    def __init__(self, d_model: int, gate_mode: str = "none"):
+    def __init__(self, d_model: int, gate_mode: str = "none", seq_len: int = 128):
         super().__init__()
-        self.attn = Attention(d_model, gate_mode=gate_mode)
+        self.attn = Attention(d_model, gate_mode=gate_mode, seq_len=seq_len)
         self.ff = FeedForward(d_model)
         self.ln1 = nn.LayerNorm(d_model)
         self.ln2 = nn.LayerNorm(d_model)
@@ -234,7 +279,7 @@ def __init__(
             raise ValueError(f"unknown pe_kind: {pe_kind}")
         self.register_buffer("pe", pe)
         self.blocks = nn.ModuleList([
-            Block(d_model, gate_mode=gate_mode) for _ in range(n_blocks)
+            Block(d_model, gate_mode=gate_mode, seq_len=seq_len) for _ in range(n_blocks)
         ])
         self.ln_f = nn.LayerNorm(d_model)
         self.head = nn.Linear(d_model, vocab_size, bias=False)
@@ -282,4 +327,10 @@ def make_model(
         return TinyLM(**common, pe_kind="crt", gate_mode="score")
     if arch == "hybrid_learned":
         return TinyLM(**common, pe_kind="crt", gate_mode="learned")
+    if arch == "hybrid_geodesic":
+        # CRT-PE + ALiBi-style additive position bias in CRT-Fibonacci
+        # geodesic distance. Substrate signal applied to POSITIONS
+        # (integer, native to the substrate's basis) instead of
+        # activations (continuous, no substrate structure).
+        return TinyLM(**common, pe_kind="crt", gate_mode="geodesic")
     raise ValueError(f"unknown arch: {arch}")
Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
"""Geodesic attention vs crt_only on distractor-mix TinyShakespeare.

The LAST attempt at substrate-as-attention-modulator. See
GEODESIC_ATTENTION_DERIVATION.md for the derivation.

The change vs the three previously falsified gates: substrate metric
is applied to POSITION INDICES (integer, native to the substrate's
basis), not to learned float activations. Implemented as an
ALiBi-style additive pre-softmax bias:

    scores[i, j] = (q_i · k_j) / √d − α · geodesic(i, j)

where geodesic(i, j) is the CRT-Fibonacci geodesic distance using
the SAME moduli as CRT-PE (5, 8, 13, 21, 34, 55, 89, 144). The
table is precomputed at construction; α is one learnable scalar
per block, initialized to 0 (model has to discover the bias is
useful from loss gradient alone).
"""

import argparse
import json
import sys
import time
import statistics
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent))
from corpus import make_dataset
from models import make_model
from train_distractor_mix import (
    build_distractor_stream,
    train_one,
)


ARCHS = ["crt_only", "hybrid_geodesic"]


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--steps", type=int, default=1500)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--seq-len", type=int, default=128)
    parser.add_argument("--d-model", type=int, default=128)
    parser.add_argument("--n-blocks", type=int, default=4)
    parser.add_argument("--lr", type=float, default=3e-4)
    parser.add_argument("--eval-every", type=int, default=100)
    parser.add_argument("--seeds", type=str, default="42,7,123")
    parser.add_argument("--distractor-frac", type=float, default=0.20)
    parser.add_argument("--out", type=str, default="results_geodesic_attention.json")
    args = parser.parse_args()

    seeds = [int(s) for s in args.seeds.split(",")]

    chars, stoi, itos, encoded = make_dataset(
        seq_len=args.seq_len, source="tinyshakespeare",
    )
    vocab_size = len(chars)

    print(f"Geodesic attention — distractor_frac={args.distractor_frac:.2f}")
    print(f"Archs: {ARCHS}")
    print(f"Corpus: TinyShakespeare ({encoded.numel():,} chars, vocab {vocab_size})")
    print(f"Model: d_model={args.d_model}, n_blocks={args.n_blocks}, seq_len={args.seq_len}")
    print(f"Training: steps={args.steps}, batch={args.batch_size}, lr={args.lr}, seeds={seeds}",
          flush=True)

    all_results = {arch: [] for arch in ARCHS}
    per_seed_logs = []
    for seed in seeds:
        print(f"\n=========== seed {seed} ===========", flush=True)
        train_split, val_split = build_distractor_stream(
            encoded, args.distractor_frac, args.seq_len, seed,
        )
        seed_record = {"seed": seed, "archs": {}}
        for arch in ARCHS:
            r = train_one(arch, train_split, val_split, vocab_size, args, seed)
            all_results[arch].append(r["final_val"])
            seed_record["archs"][arch] = {
                "final_val": r["final_val"],
                "n_params": r["n_params"],
                "time": r["time"],
            }
            print(f" [seed {seed}] {arch}: final_val={r['final_val']:.4f}", flush=True)
        per_seed_logs.append(seed_record)

    print()
    print("=" * 70)
    print(f"{'arch':<18} {'mean_final_val':>16} {'std':>10} {'vs crt_only':>14}")
    print("-" * 70)
    base = all_results["crt_only"]
    base_mean = sum(base) / len(base)
    summary = {"distractor_frac": args.distractor_frac, "steps": args.steps,
               "seeds": seeds, "per_seed": per_seed_logs, "summary": {}}
    for arch in ARCHS:
        vals = all_results[arch]
        mean = sum(vals) / len(vals)
        std = statistics.stdev(vals) if len(vals) > 1 else 0.0
        if arch == "crt_only":
            tag = "—"
        else:
            wins = sum(1 for v, b in zip(vals, base) if v < b)
            rel = (mean - base_mean) / base_mean * 100
            tag = f"{rel:+.1f}% ({wins}/{len(vals)})"
        print(f"{arch:<18} {mean:>16.4f} {std:>10.4f} {tag:>14}")
        summary["summary"][arch] = {"mean": mean, "std": std, "vals": vals}

    print()
    print("Interpretation:")
    m_geo = sum(all_results["hybrid_geodesic"]) / len(all_results["hybrid_geodesic"])
    rel = (m_geo - base_mean) / base_mean * 100
    wins = sum(1 for v, b in zip(all_results["hybrid_geodesic"], base) if v < b)
    if m_geo < base_mean:
        verdict = "GEODESIC EARNS KEEP — substrate works on positions, not activations"
    else:
        verdict = "GEODESIC ALSO FAILS — substrate is exhausted as attention modulator"
    print(f" hybrid_geodesic vs crt_only: {rel:+.1f}%, wins {wins}/{len(base)}")
    print(f" → {verdict}")

    out_path = Path(__file__).parent / args.out
    with open(out_path, "w") as f:
        json.dump(summary, f, indent=2)
    print(f"\nWrote {out_path}")


if __name__ == "__main__":
    main()
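
The summary arithmetic in `main()` (relative change of the mean, per-seed win count, sample standard deviation) can be isolated as a small function. The sketch below mirrors those expressions; the function name `compare_to_baseline` is hypothetical, and the input numbers are placeholders for illustration only, not results from this experiment.

```python
import statistics


def compare_to_baseline(vals, base):
    """Summarize per-seed validation losses against a baseline (lower is better)."""
    mean = sum(vals) / len(vals)
    base_mean = sum(base) / len(base)
    std = statistics.stdev(vals) if len(vals) > 1 else 0.0
    # A seed counts as a win when the candidate's loss is strictly lower.
    wins = sum(1 for v, b in zip(vals, base) if v < b)
    rel = (mean - base_mean) / base_mean * 100
    return {"mean": mean, "std": std, "rel_pct": rel, "wins": wins}


# Placeholder numbers for illustration only; NOT measured results.
base = [2.0, 2.0, 2.0]
vals = [1.9, 2.1, 1.8]

out = compare_to_baseline(vals, base)
print(f"{out['rel_pct']:+.1f}% ({out['wins']}/{len(vals)})")  # -3.3% (2/3)
```

Note that the script's verdict string depends only on the comparison of means; the win count is printed alongside it but does not affect which verdict is chosen.
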
