Commit ccb1be7

transformerless_lm: Pareto bench results — composed wins, K-extension/scale lose
Full 1500-step bench at TinyShakespeare with lazy-loading. Three blocks each tested a follow-up direction:

  config                     params     val     vs dense  compression
  dense_crt_d128             801,664    2.5778  -         1.0x
  fibgen_K32_cross_d128      80,640     2.7731  +7.6%     10.0x  *
  fibgen_K48_cross_d128      172,800    3.3806  +31.1%    4.7x
  fibgen_K64_cross_d128      301,824    3.1486  +22.1%    2.7x
  dense_crt_d256             3,176,192  2.6098  -         1.0x
  fibgen_K32_cross_d256      87,552     3.3878  +29.8%    36.5x
  transformerless_K32_cross  213,752    2.7211  +5.6%     3.8x   **

  *  Pareto sweet spot for plain FibGen
  ** WINNER -- composed substrate primitives

VERDICTS:

TOSS Block 1 (K-extension): K=32 is the Pareto sweet spot for plain FibGen. K=48 and K=64 LOST at d=128 (+31% and +22% gap vs K=32's +7.6%). Higher Fibonacci frequencies wrap pseudo-randomly mod d and act as interference, not basis. "More K = more capacity" is falsified.

TOSS Block 2 (scale to d=256): FibGen DOES NOT scale to bigger dense models. Compression rises nicely (10x -> 36x) but loss penalty rises with it (+7.6% -> +29.8%). The K=32 basis isn't rich enough for the capacity dense models use at higher d_model. For the substrate to scale to LLM dimensions, K would need to grow roughly as sqrt(d_model). Direct path to "35B in 8GB" via plain FibGen alone is BLOCKED at this scale.

KEEP Block 3 (composed transformerless): The full substrate stack WINS:
- CRT-Fibonacci positional encoding
- FibGen embedding
- Fibonacci-offset sparse attention (only F(k)-distance pairs)
- FibGen QKV/out weights inside the sparse attention
- Zeckendorf-routed FFN (5 specialists, routed by token-id Zeckendorf)
- FibGen specialist weights
- FibGen LM head

Composed val 2.7211 BEATS plain FibGen K=32's 2.7731 -- the Fibonacci-offset attention + Zeckendorf-routed FFN do NOT interfere with FibGen weights; they make it BETTER. Compression is 3.8x (less than plain FibGen's 10x) but the loss/compression Pareto moves up and to the left -- this is the best operating point we have found.
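
To make two of the composed primitives concrete, here is a minimal Python sketch of a Zeckendorf decomposition and a Fibonacci-offset attention mask. It is an illustration under assumptions, not the code benchmarked in this commit: the helper names (`fibonacci_upto`, `zeckendorf`, `fib_offset_mask`) are hypothetical, the mask is assumed to be causal, and offset 0 (a position attending to itself) is assumed to be allowed. How a decomposition maps onto one of the 5 specialists is not stated in the commit message, so that step is left out.

```python
def fibonacci_upto(limit):
    """Return the distinct Fibonacci numbers 1, 2, 3, 5, 8, ... that are <= limit."""
    fibs = []
    a, b = 1, 2
    while a <= limit:
        fibs.append(a)
        a, b = b, a + b
    return fibs


def zeckendorf(n):
    """Greedy Zeckendorf decomposition of a non-negative integer.

    Returns the non-consecutive Fibonacci numbers (largest first) that sum to n.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    terms = []
    for f in reversed(fibonacci_upto(n)):
        if f <= n:
            terms.append(f)
            n -= f
    return terms


def fib_offset_mask(seq_len):
    """Boolean mask where mask[i][j] is True iff position i may attend to j.

    Assumption: causal attention, with i - j restricted to 0 or a Fibonacci number.
    """
    offsets = {0, *fibonacci_upto(seq_len - 1)}
    return [[(i - j) in offsets for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    print(zeckendorf(100))
    for row in fib_offset_mask(8):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions the number of attendable positions per row grows roughly logarithmically with sequence length, since only that many Fibonacci numbers fit below a given distance.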

NEXT QUESTIONS to investigate:
- Why does FibGen lose at d=256? Likely needs K to scale with d. Test FibGen K=64 at d=256 (matching K^2/d^2 ratio).
- Can the composed transformerless arch beat dense outright at longer training? At step 1500 the gap is only +5.6%.
- Does the composed arch hold its win at d=256?
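
The arithmetic behind the first question and the Block 1 verdict can be checked with a short Python sketch. It assumes that FibGen's frequencies are the Fibonacci numbers F(k) reduced mod d, which the commit message implies but does not show; the function names are hypothetical.

```python
def fib_residues(K, d):
    """Return F(1..K) mod d, using F(1) = F(2) = 1."""
    out = []
    a, b = 1, 1
    for _ in range(K):
        out.append(a % d)
        a, b = b, a + b
    return out


def first_wrap(d):
    """Smallest k with F(k) >= d, i.e. the first frequency that wraps mod d."""
    k, a, b = 1, 1, 1
    while a < d:
        a, b = b, a + b
        k += 1
    return k


def k_ratio(K, d):
    """The K^2 / d^2 ratio mentioned in the commit message."""
    return (K * K) / (d * d)


if __name__ == "__main__":
    for d in (128, 256):
        print(f"d={d}: first wrapped frequency at k={first_wrap(d)}")
    print("K=32, d=128:", k_ratio(32, 128))
    print("K=64, d=256:", k_ratio(64, 256))
    print(fib_residues(16, 128))
```

Under this assumption every frequency beyond about k=12 at d=128 (k=14 at d=256) is already wrapped, so going from K=32 to K=48 or K=64 only adds wrapped terms. K=64 at d=256 keeps K^2/d^2 equal to the K=32, d=128 setting.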
1 parent fd6d33b commit ccb1be7

1 file changed

Lines changed: 268 additions & 0 deletions

@@ -0,0 +1,268 @@
+{
+  "dense_crt_d128": {
+    "name": "dense_crt_d128",
+    "n_params": 801664,
+    "final_val": 2.5777537748217583,
+    "wall_time": 29.4541916847229,
+    "val_history": [
+      [
+        0,
+        92.43370771408081,
+        0.10480189323425293
+      ],
+      [
+        300,
+        2.98776376247406,
+        5.749638319015503
+      ],
+      [
+        600,
+        2.749406397342682,
+        11.443394899368286
+      ],
+      [
+        900,
+        2.6236911714076996,
+        17.755601406097412
+      ],
+      [
+        1200,
+        2.625833183526993,
+        23.647001028060913
+      ],
+      [
+        1499,
+        2.5720113068819046,
+        29.286940813064575
+      ]
+    ]
+  },
+  "fibgen_K32_cross_d128": {
+    "name": "fibgen_K32_cross_d128",
+    "n_params": 80640,
+    "final_val": 2.7731429114937782,
+    "wall_time": 68.29706740379333,
+    "val_history": [
+      [
+        0,
+        4.959010422229767,
+        0.2481372356414795
+      ],
+      [
+        300,
+        3.013837695121765,
+        12.847511768341064
+      ],
+      [
+        600,
+        2.8200577795505524,
+        26.4186954498291
+      ],
+      [
+        900,
+        2.7545041888952255,
+        40.29787755012512
+      ],
+      [
+        1200,
+        2.7303915172815323,
+        53.97985577583313
+      ],
+      [
+        1499,
+        2.7706514596939087,
+        67.88464403152466
+      ]
+    ]
+  },
+  "fibgen_K48_cross_d128": {
+    "name": "fibgen_K48_cross_d128",
+    "n_params": 172800,
+    "final_val": 3.380561262369156,
+    "wall_time": 67.12795400619507,
+    "val_history": [
+      [
+        0,
+        5.754797458648682,
+        0.25063228607177734
+      ],
+      [
+        300,
+        3.3955775797367096,
+        12.87597131729126
+      ],
+      [
+        600,
+        3.336452305316925,
+        26.21367383003235
+      ],
+      [
+        900,
+        3.382943570613861,
+        40.708420515060425
+      ],
+      [
+        1200,
+        3.315191552042961,
+        53.577661752700806
+      ],
+      [
+        1499,
+        3.368943616747856,
+        66.65013790130615
+      ]
+    ]
+  },
+  "fibgen_K64_cross_d128": {
+    "name": "fibgen_K64_cross_d128",
+    "n_params": 301824,
+    "final_val": 3.14856243878603,
+    "wall_time": 70.60638308525085,
+    "val_history": [
+      [
+        0,
+        6.773499310016632,
+        0.2776825428009033
+      ],
+      [
+        300,
+        3.4181367307901382,
+        14.125115394592285
+      ],
+      [
+        600,
+        3.462831512093544,
+        28.143824815750122
+      ],
+      [
+        900,
+        3.3200928270816803,
+        42.273157358169556
+      ],
+      [
+        1200,
+        3.213900566101074,
+        56.22510004043579
+      ],
+      [
+        1499,
+        3.1374357640743256,
+        70.1269679069519
+      ]
+    ]
+  },
+  "dense_crt_d256": {
+    "name": "dense_crt_d256",
+    "n_params": 3176192,
+    "final_val": 2.609795369207859,
+    "wall_time": 56.04413557052612,
+    "val_history": [
+      [
+        0,
+        161.71250438690186,
+        0.20412397384643555
+      ],
+      [
+        300,
+        2.849265456199646,
+        11.362962245941162
+      ],
+      [
+        600,
+        2.712693303823471,
+        22.455364227294922
+      ],
+      [
+        900,
+        2.6387286484241486,
+        33.4241201877594
+      ],
+      [
+        1200,
+        2.643457517027855,
+        44.815019369125366
+      ],
+      [
+        1499,
+        2.6101625114679337,
+        55.74950313568115
+      ]
+    ]
+  },
+  "fibgen_K32_cross_d256": {
+    "name": "fibgen_K32_cross_d256",
+    "n_params": 87552,
+    "final_val": 3.3878000900149345,
+    "wall_time": 106.72511529922485,
+    "val_history": [
+      [
+        0,
+        8.412225127220154,
+        0.37620043754577637
+      ],
+      [
+        300,
+        3.464435502886772,
+        21.080098867416382
+      ],
+      [
+        600,
+        3.4064883291721344,
+        41.436336278915405
+      ],
+      [
+        900,
+        3.3971774727106094,
+        62.56101322174072
+      ],
+      [
+        1200,
+        3.3972227871418,
+        84.17925453186035
+      ],
+      [
+        1499,
+        3.376964256167412,
+        106.05141234397888
+      ]
+    ]
+  },
+  "transformerless_K32_cross": {
+    "name": "transformerless_K32_cross",
+    "n_params": 213752,
+    "final_val": 2.7210510671138763,
+    "wall_time": 146.639413356781,
+    "val_history": [
+      [
+        0,
+        4.975131928920746,
+        0.5788888931274414
+      ],
+      [
+        300,
+        2.831327900290489,
+        28.722872734069824
+      ],
+      [
+        600,
+        2.726919010281563,
+        57.86561107635498
+      ],
+      [
+        900,
+        2.8697600066661835,
+        87.85960483551025
+      ],
+      [
+        1200,
+        2.6962520480155945,
+        116.93847155570984
+      ],
+      [
+        1499,
+        2.7141062021255493,
+        145.68027186393738
+      ]
+    ]
+  }
+}
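
The summary table in the commit message can be approximately re-derived from these results. The Python sketch below hardcodes the `n_params` and `final_val` fields copied from the JSON and recomputes the gap versus dense and the compression ratio. The baseline pairing is an assumption: configs ending in `d256` are compared with `dense_crt_d256`, and everything else (including `transformerless_K32_cross`) with `dense_crt_d128`, which is the pairing that reproduces the quoted +5.6%. The names `RESULTS`, `baseline_for`, and `summarize` are hypothetical.

```python
# (n_params, final_val) per config, copied from the results JSON in this commit.
RESULTS = {
    "dense_crt_d128": (801664, 2.5777537748217583),
    "fibgen_K32_cross_d128": (80640, 2.7731429114937782),
    "fibgen_K48_cross_d128": (172800, 3.380561262369156),
    "fibgen_K64_cross_d128": (301824, 3.14856243878603),
    "dense_crt_d256": (3176192, 2.609795369207859),
    "fibgen_K32_cross_d256": (87552, 3.3878000900149345),
    "transformerless_K32_cross": (213752, 2.7210510671138763),
}


def baseline_for(name):
    """Assumed pairing: d256 configs use the d256 dense run, all others use d128."""
    return "dense_crt_d256" if name.endswith("d256") else "dense_crt_d128"


def summarize(results):
    """Return rows of (name, n_params, final_val, gap_pct, compression)."""
    rows = []
    for name, (n_params, val) in results.items():
        base_params, base_val = results[baseline_for(name)]
        gap_pct = 100.0 * (val / base_val - 1.0)  # relative val-loss gap vs dense
        compression = base_params / n_params      # dense params / this config's params
        rows.append((name, n_params, val, gap_pct, compression))
    return rows


if __name__ == "__main__":
    for name, n_params, val, gap, comp in summarize(RESULTS):
        print(f"{name:28s} {n_params:>10,d} {val:7.4f} {gap:+6.1f}% {comp:5.1f}x")
```

Recomputed this way, the gap column matches the commit message to one decimal place, while a few compression ratios differ in the last digit (for example about 9.9x rather than 10.0x for `fibgen_K32_cross_d128`), so the quoted ratios may have been derived slightly differently.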
