Commit 1df4f4e

transformerless_lm: FSM long-T bench falsified speed claim + revealed NaN
Tested FSMLM vs dense_crt at T=128 and T=512 without lazy-data (so attention pays the full T^2 cost). The hypothesis was: FSM's O(T*d^2) sequential recurrence beats attention's O(T^2*d) at long T, giving a wall-clock win on CPU.

Smoke at 50 steps:

arch       T    val   wall   vs dense
dense_crt  128  6.88  5.6s   1.00x
fsm        128  NaN   17.6s  3.11x slower
dense_crt  512  6.84  26.8s  1.00x
fsm        512  NaN   76.8s  2.87x slower

Two simultaneous failures:

(1) FSM produces NaN. The recurrence h_t = A*h_{t-1} + B*h_{t-2} is numerically unstable: if the combined two-step update (the companion matrix [[A, B], [I, 0]]) has any eigenvalue with magnitude > 1, the state grows exponentially. A and B can each have eigenvalues below 1 and the recurrence can still diverge. Random init blows up immediately (val is already NaN at step 0).

(2) No asymptotic speed win. The Python for-loop over T iterations is dominated by per-timestep PyTorch overhead (tensor stacking, method dispatch): about 3 ms per timestep (76.8 s / 50 steps / 512 timesteps), or ~1.5 s per training step at T=512, completely eating the FLOP savings vs attention. Asymptotic wins from O(T*d^2) vs O(T^2*d) require a much larger T and/or a vectorized parallel-scan implementation.

Conclusion at our scale (CPU, T<=512, d<=128): no substrate operator implementation beats dense attention in wall-clock. The asymptotic substrate advantage is real but unreachable in pure PyTorch on CPU with reasonable test budgets.

Paths forward:

A. Fix FSM (eigenvalue clipping for stability, parallel scan via associative scan or FFT for speed). ~1-2 hours engineering.
B. Accept the speed framework as scale-bound. Pivot to the INFERENCE-time story, where 100x compression IS validated and wall-clock parity at d=4096+ is measurable.
C. Different arch entirely (S4-class state-space with stable parameterization).
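To make failure (1) concrete, here is a minimal sketch of the stability check for the scalar (one-channel) case of h_t = A*h_{t-1} + B*h_{t-2}. It is not the FSMLM code: the function names `spectral_radius`, `rescale` and `run` are hypothetical, and the scalar coefficients `a`, `b` stand in for the matrices A and B. The recurrence is stable only if both roots of lambda^2 - a*lambda - b = 0 have magnitude at most 1. One possible form of the "eigenvalue clipping" in path A is to shrink both roots by a common factor s, which corresponds to a -> a*s and b -> b*s^2.

```python
import cmath


def spectral_radius(a: float, b: float) -> float:
    """Largest |root| of lambda^2 - a*lambda - b = 0, the characteristic
    polynomial of the scalar recurrence h_t = a*h_{t-1} + b*h_{t-2}."""
    disc = cmath.sqrt(a * a + 4.0 * b)  # complex sqrt handles oscillating roots
    return max(abs((a + disc) / 2.0), abs((a - disc) / 2.0))


def rescale(a: float, b: float, rho_max: float = 0.99):
    """Hypothetical stabiliser: scale both roots by s so the largest has
    magnitude rho_max. Scaling roots by s maps a -> a*s and b -> b*s*s."""
    rho = spectral_radius(a, b)
    if rho <= rho_max:
        return a, b
    s = rho_max / rho
    return a * s, b * s * s


def run(a: float, b: float, steps: int, h0: float = 1.0, h1: float = 1.0) -> float:
    """Iterate the recurrence sequentially and return the final state."""
    prev2, prev1 = h0, h1
    for _ in range(steps):
        prev2, prev1 = prev1, a * prev1 + b * prev2
    return prev1


if __name__ == "__main__":
    a, b = 0.9, 0.3  # both coefficients are below 1, yet the recurrence diverges
    print("radius before:", spectral_radius(a, b))
    print("state after 512 steps:", run(a, b, 512))
    a2, b2 = rescale(a, b)
    print("radius after:", spectral_radius(a2, b2))
    print("state after 512 steps:", run(a2, b2, 512))
```

For the matrix recurrence the same test would apply to the eigenvalues of the 2d x 2d companion matrix rather than to a quadratic, and a per-root rescale is only one of several ways to keep the parameterization stable.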
1 parent 506a7e3 commit 1df4f4e

2 files changed

Lines changed: 416 additions & 0 deletions

Lines changed: 262 additions & 0 deletions
@@ -0,0 +1,262 @@
+[
+  {
+    "name": "dense_crt_T128",
+    "n_params": 801664,
+    "best_val": 6.878907322883606,
+    "best_step": 49,
+    "wall": 5.647505044937134,
+    "val_history": [
+      [
+        0,
+        90.00937795639038,
+        0.34468889236450195
+      ],
+      [
+        5,
+        59.70369911193848,
+        0.967252254486084
+      ],
+      [
+        10,
+        34.17832970619202,
+        1.5143651962280273
+      ],
+      [
+        15,
+        23.464447617530823,
+        2.059551954269409
+      ],
+      [
+        20,
+        17.501569867134094,
+        2.6580398082733154
+      ],
+      [
+        25,
+        13.698769271373749,
+        3.171241044998169
+      ],
+      [
+        30,
+        11.367690801620483,
+        3.704589366912842
+      ],
+      [
+        35,
+        9.730982899665833,
+        4.188647508621216
+      ],
+      [
+        40,
+        8.56847357749939,
+        4.708504676818848
+      ],
+      [
+        45,
+        7.571676880121231,
+        5.1946892738342285
+      ],
+      [
+        49,
+        6.878907322883606,
+        5.647430896759033
+      ]
+    ],
+    "seq_len": 128
+  },
+  {
+    "name": "fsm_T128",
+    "n_params": 95104,
+    "best_val": Infinity,
+    "best_step": -1,
+    "wall": 17.574199199676514,
+    "val_history": [
+      [
+        0,
+        NaN,
+        0.948652982711792
+      ],
+      [
+        5,
+        NaN,
+        2.655090808868408
+      ],
+      [
+        10,
+        NaN,
+        4.310461521148682
+      ],
+      [
+        15,
+        NaN,
+        5.981438875198364
+      ],
+      [
+        20,
+        NaN,
+        7.627954483032227
+      ],
+      [
+        25,
+        NaN,
+        9.317899942398071
+      ],
+      [
+        30,
+        NaN,
+        11.011768341064453
+      ],
+      [
+        35,
+        NaN,
+        12.707162618637085
+      ],
+      [
+        40,
+        NaN,
+        14.398524522781372
+      ],
+      [
+        45,
+        NaN,
+        16.040210485458374
+      ],
+      [
+        49,
+        NaN,
+        17.574132204055786
+      ]
+    ],
+    "seq_len": 128
+  },
+  {
+    "name": "dense_crt_T512",
+    "n_params": 801664,
+    "best_val": 6.841115415096283,
+    "best_step": 49,
+    "wall": 26.756900310516357,
+    "val_history": [
+      [
+        0,
+        94.26221895217896,
+        1.6044504642486572
+      ],
+      [
+        5,
+        63.47826862335205,
+        4.616751670837402
+      ],
+      [
+        10,
+        35.637489318847656,
+        7.149997234344482
+      ],
+      [
+        15,
+        23.514615058898926,
+        9.781777620315552
+      ],
+      [
+        20,
+        17.730465531349182,
+        12.402127981185913
+      ],
+      [
+        25,
+        13.968753099441528,
+        14.800205945968628
+      ],
+      [
+        30,
+        11.523635268211365,
+        17.179736852645874
+      ],
+      [
+        35,
+        9.875965654850006,
+        19.652122974395752
+      ],
+      [
+        40,
+        8.447864949703217,
+        22.10032606124878
+      ],
+      [
+        45,
+        7.448811292648315,
+        24.585896015167236
+      ],
+      [
+        49,
+        6.841115415096283,
+        26.7568097114563
+      ]
+    ],
+    "seq_len": 512
+  },
+  {
+    "name": "fsm_T512",
+    "n_params": 95104,
+    "best_val": Infinity,
+    "best_step": -1,
+    "wall": 76.79512739181519,
+    "val_history": [
+      [
+        0,
+        NaN,
+        3.547102212905884
+      ],
+      [
+        5,
+        NaN,
+        10.795931816101074
+      ],
+      [
+        10,
+        NaN,
+        18.114710330963135
+      ],
+      [
+        15,
+        NaN,
+        25.63209295272827
+      ],
+      [
+        20,
+        NaN,
+        32.88798904418945
+      ],
+      [
+        25,
+        NaN,
+        40.33910608291626
+      ],
+      [
+        30,
+        NaN,
+        47.816415309906006
+      ],
+      [
+        35,
+        NaN,
+        55.28891921043396
+      ],
+      [
+        40,
+        NaN,
+        62.8094916343689
+      ],
+      [
+        45,
+        NaN,
+        70.30472254753113
+      ],
+      [
+        49,
+        NaN,
+        76.79503655433655
+      ]
+    ],
+    "seq_len": 512
+  }
+]
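The slowdown figures in the commit message can be recomputed from this file. The sketch below is illustrative only: `RESULTS_JSON` is a condensed copy of the four records above with `val_history` and `best_step` omitted, and `slowdown_table` is a hypothetical helper, not part of the commit. It relies on Python's standard `json` parser, which accepts the non-standard `NaN` and `Infinity` tokens by default, so a stricter JSON parser may reject the file as written.

```python
import json
import math

# Condensed copy of the four records above (val_history and best_step omitted).
RESULTS_JSON = """
[
  {"name": "dense_crt_T128", "n_params": 801664,
   "best_val": 6.878907322883606, "wall": 5.647505044937134, "seq_len": 128},
  {"name": "fsm_T128", "n_params": 95104,
   "best_val": Infinity, "wall": 17.574199199676514, "seq_len": 128},
  {"name": "dense_crt_T512", "n_params": 801664,
   "best_val": 6.841115415096283, "wall": 26.756900310516357, "seq_len": 512},
  {"name": "fsm_T512", "n_params": 95104,
   "best_val": Infinity, "wall": 76.79512739181519, "seq_len": 512}
]
"""


def slowdown_table(records, baseline_prefix="dense_crt"):
    """For each run, report wall-clock relative to the baseline run with the
    same seq_len, and whether the run never produced a finite val loss."""
    baseline_wall = {
        r["seq_len"]: r["wall"]
        for r in records
        if r["name"].startswith(baseline_prefix)
    }
    rows = []
    for r in records:
        rows.append({
            "name": r["name"],
            "seq_len": r["seq_len"],
            "diverged": not math.isfinite(r["best_val"]),
            "vs_dense": r["wall"] / baseline_wall[r["seq_len"]],
        })
    return rows


if __name__ == "__main__":
    for row in slowdown_table(json.loads(RESULTS_JSON)):
        status = "NaN" if row["diverged"] else "ok"
        print(f'{row["name"]:<16} T={row["seq_len"]:<4} {status:<4} '
              f'{row["vs_dense"]:.2f}x')
```

On the unrounded wall times this gives roughly 3.11x at T=128 and 2.87x at T=512 for the FSM runs.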
