<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<title>article_draft</title>
<style>
html {
color: #1a1a1a;
background-color: #fdfdfd;
}
body {
margin: 0 auto;
max-width: 36em;
padding-left: 50px;
padding-right: 50px;
padding-top: 50px;
padding-bottom: 50px;
hyphens: auto;
overflow-wrap: break-word;
text-rendering: optimizeLegibility;
font-kerning: normal;
}
@media (max-width: 600px) {
body {
font-size: 0.9em;
padding: 12px;
}
h1 {
font-size: 1.8em;
}
}
@media print {
html {
background-color: white;
}
body {
background-color: transparent;
color: black;
font-size: 12pt;
}
p, h2, h3 {
orphans: 3;
widows: 3;
}
h2, h3, h4 {
page-break-after: avoid;
}
}
p {
margin: 1em 0;
}
a {
color: #1a1a1a;
}
a:visited {
color: #1a1a1a;
}
img {
max-width: 100%;
}
h1, h2, h3, h4, h5, h6 {
margin-top: 1.4em;
}
h5, h6 {
font-size: 1em;
font-style: italic;
}
h6 {
font-weight: normal;
}
ol, ul {
padding-left: 1.7em;
margin-top: 1em;
}
li > ol, li > ul {
margin-top: 0;
}
blockquote {
margin: 1em 0 1em 1.7em;
padding-left: 1em;
border-left: 2px solid #e6e6e6;
color: #606060;
}
code {
font-family: Menlo, Monaco, Consolas, 'Lucida Console', monospace;
font-size: 85%;
margin: 0;
hyphens: manual;
}
pre {
margin: 1em 0;
overflow: auto;
}
pre code {
padding: 0;
overflow: visible;
overflow-wrap: normal;
}
.sourceCode {
background-color: transparent;
overflow: visible;
}
hr {
background-color: #1a1a1a;
border: none;
height: 1px;
margin: 1em 0;
}
table {
margin: 1em 0;
border-collapse: collapse;
width: 100%;
overflow-x: auto;
display: block;
font-variant-numeric: lining-nums tabular-nums;
}
table caption {
margin-bottom: 0.75em;
}
tbody {
margin-top: 0.5em;
border-top: 1px solid #1a1a1a;
border-bottom: 1px solid #1a1a1a;
}
th {
border-top: 1px solid #1a1a1a;
padding: 0.25em 0.5em 0.25em 0.5em;
}
td {
padding: 0.125em 0.5em 0.25em 0.5em;
}
header {
margin-bottom: 4em;
text-align: center;
}
#TOC li {
list-style: none;
}
#TOC ul {
padding-left: 1.3em;
}
#TOC > ul {
padding-left: 0;
}
#TOC a:not(:hover) {
text-decoration: none;
}
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
div.columns{display: flex; gap: min(4vw, 1.5em);}
div.column{flex: auto; overflow-x: auto;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
/* The extra [class] is a hack that increases specificity enough to
override a similar rule in reveal.js */
ul.task-list[class]{list-style: none;}
ul.task-list li input[type="checkbox"] {
font-size: inherit;
width: 0.8em;
margin: 0 0.8em 0.2em -1.6em;
vertical-align: middle;
}
.display.math{display: block; text-align: center; margin: 0.5rem auto;}
/* CSS for syntax highlighting */
pre > code.sourceCode { white-space: pre; position: relative; }
pre > code.sourceCode > span { line-height: 1.25; }
pre > code.sourceCode > span:empty { height: 1.2em; }
.sourceCode { overflow: visible; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
pre > code.sourceCode { white-space: pre-wrap; }
pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
}
pre.numberSource code
{ counter-reset: source-line 0; }
pre.numberSource code > span
{ position: relative; left: -4em; counter-increment: source-line; }
pre.numberSource code > span > a:first-child::before
{ content: counter(source-line);
position: relative; left: -1em; text-align: right; vertical-align: baseline;
border: none; display: inline-block;
-webkit-touch-callout: none; -webkit-user-select: none;
-khtml-user-select: none; -moz-user-select: none;
-ms-user-select: none; user-select: none;
padding: 0 4px; width: 4em;
color: #aaaaaa;
}
pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; }
div.sourceCode
{ }
@media screen {
pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
}
code span.al { color: #ff0000; font-weight: bold; } /* Alert */
code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code span.at { color: #7d9029; } /* Attribute */
code span.bn { color: #40a070; } /* BaseN */
code span.bu { color: #008000; } /* BuiltIn */
code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code span.ch { color: #4070a0; } /* Char */
code span.cn { color: #880000; } /* Constant */
code span.co { color: #60a0b0; font-style: italic; } /* Comment */
code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code span.do { color: #ba2121; font-style: italic; } /* Documentation */
code span.dt { color: #902000; } /* DataType */
code span.dv { color: #40a070; } /* DecVal */
code span.er { color: #ff0000; font-weight: bold; } /* Error */
code span.ex { } /* Extension */
code span.fl { color: #40a070; } /* Float */
code span.fu { color: #06287e; } /* Function */
code span.im { color: #008000; font-weight: bold; } /* Import */
code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
code span.kw { color: #007020; font-weight: bold; } /* Keyword */
code span.op { color: #666666; } /* Operator */
code span.ot { color: #007020; } /* Other */
code span.pp { color: #bc7a00; } /* Preprocessor */
code span.sc { color: #4070a0; } /* SpecialChar */
code span.ss { color: #bb6688; } /* SpecialString */
code span.st { color: #4070a0; } /* String */
code span.va { color: #19177c; } /* Variable */
code span.vs { color: #4070a0; } /* VerbatimString */
code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
</style>
</head>
<body>
<h1 id="the-z80-compiler-that-never-guesses">The Z80 Compiler That Never
Guesses</h1>
<p><em>How GPU brute-force proved optimal Z80 code — and why 97.7% of
real functions can’t be register-allocated</em></p>
<hr />
<p>In 1997, a teenager named Dark from the demoscene group X-Trade
published an article about Z80 multiplication in his electronic magazine
<em>Spectrum Expert</em>. His shift-and-add loop ran in 196-204 T-states
and powered the wireframe rotations in his award-winning demo
<em>Illusion</em>. For nearly thirty years, this was the state of the
art.</p>
<p>In 2026, we pointed a GPU at the problem and found that for any
<em>specific</em> constant, the optimal sequence is 4-50× shorter than
Dark’s general loop. We found this not by cleverness, but by trying
every possible instruction sequence and proving which one is shortest.
The GPU doesn’t guess. It <em>knows</em>.</p>
<p>This is the story of what happens when you stop trying to be smart
and let the machine enumerate everything.</p>
<h2 id="part-1-the-brute-force-philosophy">Part 1: The Brute-Force
Philosophy</h2>
<h3 id="why-it-works-on-z80">Why It Works on Z80</h3>
<p>The Z80 CPU has an 11-byte state: seven 8-bit registers (A through
L), a flag register F, a stack pointer SP, and a virtual memory byte M.
That’s small enough to fit in a GPU register file. Each instruction
transforms this state deterministically. Two instruction sequences are
equivalent if and only if they produce identical 11-byte output for
<em>every</em> possible 11-byte input.</p>
<p>This is a finite, deterministic problem. There are roughly 2^88
possible input states — too many to enumerate. But a trick makes it
tractable: test a few carefully chosen inputs first.</p>
<h3 id="the-three-stage-pipeline">The Three-Stage Pipeline</h3>
<p><strong>QuickCheck</strong> tests 8 inputs and rejects 99.99% of
candidates. If two sequences disagree on <em>any</em> of these 8 inputs,
they’re definitely not equivalent. This costs microseconds per
candidate.</p>
<p><strong>MidCheck</strong> adds 24 more inputs targeting BIT/SET/RES
instructions whose effects are bit-position-specific. This catches
another 23% of false positives.</p>
<p><strong>ExhaustiveCheck</strong> sweeps all possible inputs for the
~0.01% of survivors. On GPU, 256 threads per candidate — thread
<em>i</em> handles A=<em>i</em> and loops over the remaining
registers.</p>
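<p>The staged filter can be sketched in Go, the language of the project’s shipped packages. This toy model keeps only A, B, and the carry flag instead of the full 11-byte state, and the probe values here are invented for illustration; the article does not list the real 8 inputs:</p>

```go
package main

import "fmt"

// Toy machine state: a small sliver of the real 11-byte Z80 state.
type State struct {
	A, B  uint8
	Carry bool
}

// An instruction is a deterministic state transformer.
type Op func(State) State

var (
	ldAB Op = func(s State) State { s.A = s.B; return s } // LD A,B
	ldBA Op = func(s State) State { s.B = s.A; return s } // LD B,A
)

func run(seq []Op, s State) State {
	for _, op := range seq {
		s = op(s)
	}
	return s
}

// quickCheck: reject cheaply on a handful of probe states.
var probes = []State{
	{0, 0, false}, {255, 255, true}, {1, 2, false}, {0x80, 0x7F, true},
	{0xAA, 0x55, false}, {3, 3, true}, {0xFF, 0, false}, {0, 0xFF, true},
}

func quickCheck(a, b []Op) bool {
	for _, p := range probes {
		if run(a, p) != run(b, p) {
			return false
		}
	}
	return true
}

// exhaustiveCheck: sweep every reachable toy state. On GPU this is
// 256 threads per candidate; here a plain nested loop suffices.
func exhaustiveCheck(a, b []Op) bool {
	for av := 0; av < 256; av++ {
		for bv := 0; bv < 256; bv++ {
			for _, c := range []bool{false, true} {
				s := State{uint8(av), uint8(bv), c}
				if run(a, s) != run(b, s) {
					return false
				}
			}
		}
	}
	return true
}

// equivalent is the whole pipeline: cheap filter first, proof last.
func equivalent(a, b []Op) bool {
	return quickCheck(a, b) && exhaustiveCheck(a, b)
}

func main() {
	// LD A,B : LD B,A collapses to LD A,B, the "obvious" rule below.
	fmt.Println(equivalent([]Op{ldAB, ldBA}, []Op{ldAB})) // true
	fmt.Println(equivalent([]Op{ldBA}, []Op{ldAB}))       // false
}
```

<p>The point of the staging: the cheap filter runs on millions of candidates, and the exhaustive sweep is reserved for the rare survivors.</p>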
<p>Running on a pair of RTX 4060 Ti GPUs, this pipeline processes 17.8
million instruction pairs in about six minutes. The result: 739,575
provably correct peephole optimization rules. Each one is a mathematical
theorem: “these two instruction sequences produce identical output for
all possible inputs.”</p>
<p>Some are obvious: <code>LD A, B : LD B, A → LD A, B</code> (the
second instruction is redundant). Others are deeply surprising:</p>
<pre><code>SLA A : RR A → OR A (save 12 T-states, 3 bytes)</code></pre>
<p>Shift left then rotate right equals… OR with self? It works because
the shift pushes the old bit 7 into carry and clears bit 0; the rotate
then shifts every bit back and pulls that bit 7 out of carry into bit 7,
leaving A unchanged with carry clear. The flag effect matches OR A
exactly. No human would design this rule. The GPU doesn’t need to
understand <em>why</em> — it just proves <em>that</em>.</p>
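<p>A minimal check of this rule in Go, modelling only the accumulator and the carry flag (the GPU proof covers the full flag register, which this sketch does not):</p>

```go
package main

import "fmt"

// Accumulator-plus-carry model; the rest of the Z80 flag register
// is out of scope for this sketch.
type st struct {
	a     uint8
	carry bool
}

// SLA A: bit 7 goes to carry, bit 0 becomes 0.
func slaA(s st) st {
	s.carry = s.a&0x80 != 0
	s.a <<= 1
	return s
}

// RR A: rotate right through carry (old carry enters bit 7,
// old bit 0 becomes the new carry).
func rrA(s st) st {
	oldCarry := s.carry
	s.carry = s.a&1 != 0
	s.a >>= 1
	if oldCarry {
		s.a |= 0x80
	}
	return s
}

// OR A: A is unchanged, carry is cleared.
func orA(s st) st {
	s.carry = false
	return s
}

// holds reports whether SLA A : RR A matches OR A on every (A, carry) pair.
func holds() bool {
	for a := 0; a < 256; a++ {
		for _, c := range []bool{false, true} {
			in := st{uint8(a), c}
			if rrA(slaA(in)) != orA(in) {
				return false
			}
		}
	}
	return true
}

func main() {
	fmt.Println(holds()) // true
}
```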
<h2 id="part-2-constant-multiplication-when-3-ops-suffice">Part 2:
Constant Multiplication — When 3 Ops Suffice</h2>
<h3 id="the-14-op-discovery">The 14-Op Discovery</h3>
<p>For 8-bit multiply (<code>A × K → A</code>), we started with 21
candidate instructions: shifts, rotates, adds, subtracts, NEG, carry
manipulations. The GPU found optimal sequences for 103 constants at
length ≤8.</p>
<p>Then we analyzed which instructions actually <em>appeared</em> in any
optimal solution. Seven never did:</p>
<ul>
<li><code>SLA A</code> — identical to <code>ADD A,A</code> but costs 8T
instead of 4T (strictly dominated)</li>
<li><code>RLC A</code>, <code>RRC A</code> — CB-prefix versions of
RLCA/RRCA (8T vs 4T)</li>
<li><code>OR A</code>, <code>SCF</code>, <code>EX AF,AF'</code> —
theoretically useful, never optimal</li>
</ul>
<p>Removing them shrinks the search from 21^N to 14^N candidates. At
length 9, that’s 38× fewer. At length 10, 57× fewer. This single insight
— <strong>empirical pool reduction</strong> — is the most powerful
technique in the entire project.</p>
<p>With 14 ops, the GPU found 164 constants at length ≤9, 249 at length
≤10, and the final 5 (170, 171, 173, 179, 181) at length 11. <strong>All
254 non-trivial constants have provably optimal DIRECT sequences — no
composition needed.</strong> Every entry in the table was found by GPU
brute-force search, and each is proven optimal at its length.</p>
<p>The most beautiful: <code>×255 = NEG</code> — one instruction, 8
T-states. Because 255 = -1 mod 256, and NEG computes <code>-A</code>.
The GPU doesn’t know number theory. It just found that NEG produces the
right output for all 256 inputs.</p>
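<p>The ×255 identity is easy to spot-check in Go, where <code>neg</code> mirrors Z80 NEG (two’s-complement negate, wrapping mod 256):</p>

```go
package main

import "fmt"

// neg models Z80 NEG: A becomes 0 - A, wrapping mod 256.
func neg(a uint8) uint8 { return -a }

// x255Holds reports whether NEG equals multiply-by-255 for every byte.
func x255Holds() bool {
	for a := 0; a < 256; a++ {
		if neg(uint8(a)) != uint8(a*255) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(x255Holds()) // true: 255 = -1 mod 256
}
```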
<h3 id="the-3-op-miracle">The 3-Op Miracle</h3>
<p>For 16-bit multiply (<code>A × K → HL</code>), the story is even more
dramatic. We started with 23 ops. The GPU found all 254 constants… using
only <strong>three instructions</strong>:</p>
<ul>
<li><code>ADD HL,HL</code> — double the 16-bit accumulator</li>
<li><code>ADD HL,BC</code> — add the original input</li>
<li><code>LD C,A</code> — save the input to C (so BC = input with
B=0)</li>
</ul>
<p>That’s it. Every 16-bit constant multiply on the Z80 is a sequence of
doubles and adds. The search space collapsed from 23^N to 3^N — a factor
of <strong>13,600× at length 8</strong>. The entire table took 30
seconds.</p>
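<p>A sketch of how such a chain evaluates, in Go. The ×10 chain shown (1 → 2 → 4 → 5 → 10) is one valid double-and-add decomposition chosen for illustration; whether the shipped table uses exactly this chain is not stated in the article, and the seeding of HL and BC is assumed:</p>

```go
package main

import "fmt"

// Evaluate an abstract {dbl, add} chain on a 16-bit accumulator,
// mirroring ADD HL,HL / ADD HL,BC after LD C,A seeds BC with the input.
func runChain(input uint8, chain string) uint16 {
	hl := uint16(input) // accumulator assumed to start at 1x input
	bc := uint16(input) // BC = input with B = 0
	for _, op := range chain {
		switch op {
		case 'd': // ADD HL,HL
			hl += hl
		case 'a': // ADD HL,BC
			hl += bc
		}
	}
	return hl
}

func main() {
	// x10 as a chain: 1 -> 2 -> 4 -> 5 -> 10, i.e. dbl, dbl, add, dbl.
	ok := true
	for a := 0; a < 256; a++ {
		if runChain(uint8(a), "ddad") != uint16(a)*10 {
			ok = false
		}
	}
	fmt.Println(ok) // true
}
```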
<p>We later added <code>SWAP_HL</code> (byte swap = ×256) and
<code>SUB HL,BC</code> (subtract). These improved 88 of the 254
constants (35%), with ×255 dropping from 15 instructions to 3:</p>
<div class="sourceCode" id="cb2"><pre
class="sourceCode asm"><code class="sourceCode fasm"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>SWAP_HL <span class="co">; HL = input × 256</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>LD C<span class="op">,</span>A <span class="co">; save input</span></span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="bu">SUB</span> HL<span class="op">,</span>BC <span class="co">; HL = input×256 - input = input×255</span></span></code></pre></div>
<p>The byte swap trick: multiply by 256 then subtract. Three virtual
operations, five real Z80 instructions, 30 T-states.</p>
<h2 id="part-3-division-when-abstract-chains-guide-the-gpu">Part 3:
Division — When Abstract Chains Guide the GPU</h2>
<h3 id="the-reciprocal-trick">The Reciprocal Trick</h3>
<p>Division by a constant K uses the identity
<code>n/K = (n × M) >> S</code> where M ≈ 2^S/K. This converts
division into multiplication by a “magic constant” followed by a right
shift. The technique is well-known from Hacker’s Delight.</p>
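<p>A worked instance in Go for K=10 over 8-bit inputs. The pair M=205, S=11 (205 = ⌈2048/10⌉) is chosen here for illustration; the magic constants behind the article’s Z80 sequences may differ:</p>

```go
package main

import "fmt"

// Reciprocal division: n/K == (n*M) >> S with M ~ 2^S/K.
// For K=10 over 8-bit n, M=205 and S=11 work, since 205 = ceil(2048/10)
// and the rounding error stays below 1 for all n <= 255.
func div10(n uint8) uint8 {
	return uint8((uint16(n) * 205) >> 11)
}

func main() {
	ok := true
	for n := 0; n < 256; n++ {
		if div10(uint8(n)) != uint8(n/10) {
			ok = false
		}
	}
	fmt.Println(ok) // true for all 256 inputs
}
```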
<p>What’s new: we automate the entire process. An abstract chain solver
(running on CPU, 8 seconds for all 254 constants) finds the shortest
multiply-then-shift pattern. Then a focused GPU kernel — with only 6 ops
instead of 37 — searches the Z80-specific materialization.</p>
<p>The result: <strong>all 247 non-power-of-2 divisors</strong> found.
The GPU’s div10 sequence (14 instructions, 124 T-states) exactly matches
the hand-optimized Hacker’s Delight solution. But the GPU found it
automatically. Division is now 247/247 COMPLETE.</p>
<p>The fastest: <code>÷171 = 4 instructions, 27 T-states</code>. The
most useful: <code>÷10 = 124T, ÷100 = 105T, ÷3 = 130T</code>.</p>
<h3 id="guided-brute-force">Guided Brute-Force</h3>
<p>The key architectural insight: <strong>separate the mathematical
structure from the ISA-specific implementation</strong>. The abstract
chain says <em>what</em> to compute (multiply by 171, shift right 9).
The GPU finds <em>how</em> to compute it on Z80 (which registers, which
order).</p>
<p>This “guided brute-force” reduces the search space by millions. Pure
brute-force at 37 ops would need 37^14 ≈ 10^22 evaluations — years on
any GPU. Guided search with 6 ops needs 6^16 ≈ 10^12 — eleven
seconds.</p>
<h2 id="part-4-the-idiom-zoo">Part 4: The Idiom Zoo</h2>
<p>Fifteen branchless idioms were discovered by GPU exhaustive search
with a 37-op pool; six highlights:</p>
<table>
<colgroup>
<col style="width: 23%" />
<col style="width: 33%" />
<col style="width: 20%" />
<col style="width: 23%" />
</colgroup>
<thead>
<tr class="header">
<th>Idiom</th>
<th>Sequence</th>
<th>Cost</th>
<th>Trick</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>bool(A)</td>
<td><code>LD B,A : NEG : ADC A,B</code></td>
<td>16T</td>
<td>NEG sets carry if nonzero; ADC adds carry to cancelled result</td>
</tr>
<tr class="even">
<td>NOT(A)</td>
<td><code>NEG : SBC A,A : INC A</code></td>
<td>16T</td>
<td>Carry-to-mask (0xFF) then increment (0xFF→0, 0→1)</td>
</tr>
<tr class="odd">
<td>ABS(A)</td>
<td><code>LD B,A : RLCA : SBC A,A : XOR B : SBC A,B : ADC A,B</code></td>
<td>24T</td>
<td>Sign→carry→mask→conditional complement</td>
</tr>
<tr class="even">
<td>sign-extend A→HL</td>
<td><code>ADC A,L : SBC A,A : LD H,A</code></td>
<td>12T</td>
<td>Double overflows if ≥128 → carry → 0xFF mask</td>
</tr>
<tr class="odd">
<td>half(A)</td>
<td><code>RRA</code></td>
<td>4T</td>
<td>Not SRL A (8T) — RRA is non-CB prefix, faster!</td>
</tr>
<tr class="even">
<td>complement(A)</td>
<td><code>CPL</code></td>
<td>4T</td>
<td>We forgot to include CPL in the initial pool…</td>
</tr>
</tbody>
</table>
<p>The <code>SBC A,A</code> carry-to-mask trick appears everywhere. It
converts a single carry bit into a full-byte mask: 0x00 if carry clear,
0xFF if carry set. Combined with <code>RLCA</code> (which puts the sign
bit into carry), this enables instant branchless sign detection.</p>
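<p>To see the carry plumbing concretely, here is the bool(A) row traced in Go, modelling just the register and carry bit the idiom touches (a sketch of the dataflow, not the shipped code):</p>

```go
package main

import "fmt"

// bool(A) via LD B,A : NEG : ADC A,B.
// NEG sets carry iff A was nonzero; ADC then computes (-A) + A + carry,
// so the +A and -A cancel and only the carry bit survives.
func boolA(a uint8) uint8 {
	b := a          // LD B,A
	carry := a != 0 // NEG: carry set if A != 0
	a = -a          // NEG: A = 0 - A (mod 256)
	a = a + b       // ADC A,B: add B ...
	if carry {      // ... plus the carry bit
		a++
	}
	return a
}

func main() {
	fmt.Println(boolA(0), boolA(1), boolA(200)) // 0 1 1
}
```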
<p>The GPU also found that <code>CPL</code> (complement, 1 instruction,
4 T-states) is optimal for bitwise NOT — but only after we realized we’d
forgotten to include it in the instruction pool. Pool design
matters.</p>
<h2 id="part-5-register-allocation-the-feasibility-cliff">Part 5:
Register Allocation — The Feasibility Cliff</h2>
<h3 id="million-exhaustive-solutions">83.6 Million Exhaustive
Solutions</h3>
<p>For every theoretically possible combination of virtual register
count, widths, location constraints, and interference graph up to 6
virtual registers, we computed the provably optimal physical register
assignment — or proved that no valid assignment exists.</p>
<p>The results reveal a dramatic phase transition:</p>
<pre><code>2 vregs: 95.9% feasible
3 vregs: 88.5% feasible
4 vregs: 78.7% feasible
5 vregs: 67.7% feasible
6 vregs: 0.9% feasible ← cliff</code></pre>
<p>At 6 virtual registers, 99.1% of all possible constraint shapes have
NO valid Z80 register assignment. The register file “fills up” — there
simply aren’t enough physical registers for most configurations.</p>
<h3 id="infeasibility-for-real-programs">97.7% Infeasibility for Real
Programs</h3>
<p>We tested 129 unique function signatures from a real compiler corpus
(8 programming languages, 1,567 functions). For functions with 7-15
virtual registers:</p>
<ul>
<li><strong>3 feasible</strong> — provably optimal assignments
found</li>
<li><strong>126 infeasible</strong> — proven that no valid assignment
exists</li>
<li><strong>97.7% infeasibility rate</strong></li>
</ul>
<p>This means the Z80 compiler MUST decompose almost every non-trivial
function into smaller pieces (“islands”) that fit in the register file.
Island decomposition isn’t an optimization — it’s a mathematical
necessity.</p>
<h3 id="the-five-level-pipeline">The Five-Level Pipeline</h3>
<p>No single method handles everything:</p>
<ol type="1">
<li><strong>Table lookup</strong> (83.6M entries, O(1)) — covers
≤6v</li>
<li><strong>Graph decomposition</strong> at cut vertices — compose from
smaller solved pieces</li>
<li><strong>GPU brute-force</strong> — for ≤12v shapes</li>
<li><strong>CPU backtracking</strong> with pattern-aware pruning
(1000-4000× faster) — ≤15v</li>
<li><strong>Island decomposition</strong> + Z3 solver — for >15v
functions</li>
</ol>
<p>Every function hits one of these levels. The pipeline is
complete.</p>
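<p>A simplified selector for the cascade, keyed on virtual-register count alone. This is a sketch: the real pipeline also tries graph decomposition at cut vertices (level 2) regardless of size, which a pure count threshold cannot express:</p>

```go
package main

import "fmt"

// allocStrategy picks a level by vreg count; graph decomposition
// (level 2) is structural rather than size-gated and is omitted here.
func allocStrategy(vregs int) string {
	switch {
	case vregs <= 6:
		return "table lookup (83.6M entries, O(1))"
	case vregs <= 12:
		return "GPU brute-force"
	case vregs <= 15:
		return "CPU backtracking with pattern-aware pruning"
	default:
		return "island decomposition + Z3"
	}
}

func main() {
	for _, v := range []int{4, 9, 14, 20} {
		fmt.Printf("%2d vregs -> %s\n", v, allocStrategy(v))
	}
}
```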
<h2 id="part-6-the-packed-arithmetic-library">Part 6: The Packed
Arithmetic Library</h2>
<h3 id="multi-entry-fall-through-code">Multi-Entry Fall-Through
Code</h3>
<p>All 254 multiply sequences share instruction prefixes. By arranging
them in reverse order with multiple entry labels, we eliminate all
redundancy:</p>
<div class="sourceCode" id="cb4"><pre
class="sourceCode asm"><code class="sourceCode fasm"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="fu">mul104:</span> <span class="co">; enter here for ×104</span></span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="fu">mul52:</span> <span class="bu">ADD</span> A<span class="op">,</span>A <span class="co">; enter here for ×52 (×104 = ×52 × 2)</span></span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="fu">mul26:</span> <span class="bu">ADD</span> A<span class="op">,</span>A <span class="co">; enter here for ×26</span></span>
<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a><span class="fu">mul24:</span> <span class="bu">ADD</span> A<span class="op">,</span>B <span class="co">; enter here for ×24</span></span>
<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a><span class="fu">mul12:</span> <span class="bu">ADD</span> A<span class="op">,</span>A <span class="co">; enter here for ×12</span></span>
<span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a><span class="fu">mul6:</span> LD B<span class="op">,</span>A <span class="co">; enter here for ×6</span></span>
<span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a> <span class="bu">ADD</span> A<span class="op">,</span>B</span>
<span id="cb4-8"><a href="#cb4-8" aria-hidden="true" tabindex="-1"></a> <span class="bu">ADD</span> A<span class="op">,</span>B</span>
<span id="cb4-9"><a href="#cb4-9" aria-hidden="true" tabindex="-1"></a><span class="fu">mul2:</span> RLA <span class="co">; enter here for ×2</span></span>
<span id="cb4-10"><a href="#cb4-10" aria-hidden="true" tabindex="-1"></a> <span class="cf">RET</span> <span class="co">; shared return — 7 constants, 9 instructions</span></span></code></pre></div>
<p>For rotations, the same principle yields an instruction sled:</p>
<div class="sourceCode" id="cb5"><pre
class="sourceCode asm"><code class="sourceCode fasm"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="fu">rot7:</span> RLCA <span class="co">; 7 rotations left (8 bytes total, 7 entry points)</span></span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a><span class="fu">rot6:</span> RLCA <span class="co">; 6 rotations</span></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="fu">rot5:</span> RLCA <span class="co">; 5</span></span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a><span class="fu">rot4:</span> RLCA <span class="co">; = nibble swap!</span></span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a><span class="fu">rot3:</span> RLCA <span class="co">; 3</span></span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a><span class="fu">rot2:</span> RLCA <span class="co">; 2</span></span>
<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a><span class="fu">rot1:</span> RLCA <span class="co">; 1</span></span>
<span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a> <span class="cf">RET</span></span></code></pre></div>
<p>Combined: 254 multiplies (all direct, no composition) + 247 divisions
(all complete) + rotation/shift sleds = approximately
<strong>2KB</strong> of Z80 code. For a ZX Spectrum with 48KB RAM,
that’s 4%. For a ROM-based system, it fits alongside any program.</p>
<p>Runtime dispatch: a page-aligned 256-byte jump table.
<code>LD H, page / LD L, K / JP (HL)</code> — three short instructions,
with the final <code>JP (HL)</code> landing on the provably optimal
sequence for constant K.</p>
<h2 id="part-7-universal-computation-chains">Part 7: Universal
Computation Chains</h2>
<h3 id="isa-independent-search">ISA-Independent Search</h3>
<p>The deepest insight: the mathematical structure of multiplication and
division is ISA-independent. An “addition-subtraction chain” — a
sequence of {double, add, subtract, save, negate} — finds the shortest
path from 1 to K for any processor.</p>
<p>We built a chain solver that finds all 254 multiply chains in 8
seconds on CPU. These chains then materialize to any ISA:</p>
<table>
<thead>
<tr class="header">
<th>Chain op</th>
<th>Z80</th>
<th>6502</th>
<th>RISC-V</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>dbl</td>
<td>ADD A,A (4T)</td>
<td>ASL A (2cy)</td>
<td>SLLI rd,1 (1cy)</td>
</tr>
<tr class="even">
<td>add</td>
<td>ADD A,B (4T)</td>
<td>CLC:ADC zp (5cy)</td>
<td>ADD rd,rs (1cy)</td>
</tr>
<tr class="odd">
<td>save</td>
<td>LD B,A (4T)</td>
<td>TAX (2cy)</td>
<td>MV rd,rs (1cy)</td>
</tr>
<tr class="even">
<td>neg</td>
<td>NEG (8T)</td>
<td>EOR#FF:ADC#1 (4cy)</td>
<td>NEG rd (1cy)</td>
</tr>
</tbody>
</table>
<p>One search, every processor. The same chain that gives ×10 on Z80
also gives ×10 on 6502 — just with different instruction encodings and
cycle counts.</p>
<h2 id="part-8-cross-platform-verification">Part 8: Cross-Platform
Verification</h2>
<p>We verified every result across five platforms:</p>
<ul>
<li>NVIDIA RTX 4060 Ti × 2 (CUDA)</li>
<li>NVIDIA RTX 2070 (CUDA)</li>
<li>AMD Radeon RX 580 (OpenCL via Mesa rusticl + Vulkan via RADV)</li>
<li>Apple M2 MacBook Air (Metal + OpenCL)</li>
<li>CPU (Python, all 256 inputs)</li>
</ul>
<p>All platforms produce identical results. Three GPU vendors, four
APIs, one truth.</p>
<p>The AMD verification deserves special mention: ROCm 6.x dropped
support for the RX 580’s gfx803 architecture. ROCm 5.7 installs but the
kernel fusion driver doesn’t register the GPU. The solution: Mesa
provides OpenCL 3.0 and Vulkan through the open-source RADV driver — no
proprietary runtime needed.</p>
<h2 id="part-9-for-the-compiler">Part 9: For the Compiler</h2>
<p>Three Go packages ship ready for integration:</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode go"><code class="sourceCode go"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co">// Multiply: O(1) lookup → inline optimal sequence</span></span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>seq <span class="op">:=</span> mulopt<span class="op">.</span>Emit8<span class="op">(</span><span class="dv">42</span><span class="op">,</span> bPreserve<span class="op">)</span></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="co">// → ["RLA", "LD B,A", "ADD A,B", "ADD A,A", "ADD A,B", ...]</span></span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a><span class="co">// Register allocation: O(1) lookup from 83.6M table</span></span>
<span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a>entry <span class="op">:=</span> table<span class="op">.</span>Lookup<span class="op">(</span>regalloc<span class="op">.</span>IndexOf<span class="op">(</span>shape<span class="op">))</span></span>
<span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a><span class="co">// → {Cost: 46, Assignment: [A, HL, DE, B]}</span></span>
<span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a><span class="co">// Peephole: proven-correct replacement</span></span>
<span id="cb6-10"><a href="#cb6-10" aria-hidden="true" tabindex="-1"></a>rule <span class="op">:=</span> peephole<span class="op">.</span>Lookup<span class="op">(</span><span class="st">"SLA A : RR A"</span><span class="op">)</span></span>
<span id="cb6-11"><a href="#cb6-11" aria-hidden="true" tabindex="-1"></a><span class="co">// → {Replacement: "OR A", CyclesSaved: 12, BytesSaved: 3}</span></span></code></pre></div>
<p>MinZ v0.23.0 ships with 498 inline arithmetic sequences from our
tables. Every constant multiply and divide produces provably optimal
code with zero runtime overhead.</p>
<h2 id="conclusion-the-compiler-that-never-guesses">Conclusion: The
Compiler That Never Guesses</h2>
<p>Dark wrote the best general multiply in 1997. The GPU proved what’s
optimal for each specific case in 2026 — and it’s 8× faster on average.
Not because the GPU is smarter than Dark, but because it doesn’t need to
be. It just tries everything.</p>
<p>The surprising findings aren’t the individual optimizations; they’re
the structural results:</p>
<ul>
<li>Only 3 of 23 instructions matter for 16-bit multiply</li>
<li>97.7% of real functions above 6 variables can’t be
register-allocated on Z80</li>
<li>Abstract computation chains are ISA-independent</li>
<li>2KB of packed code covers ALL optimal arithmetic</li>
</ul>
<p>The Z80 was designed in 1976 for serial I/O controllers. Fifty years
later, a cluster of GPUs exhaustively enumerated its optimal instruction
sequences. Every result is a mathematical proof. The compiler that uses
these tables doesn’t guess, doesn’t heuristic, doesn’t approximate. It
looks up the GPU-proven answer.</p>
<p>That’s what brute force buys you: certainty.</p>
<hr />
<p><em>Repository: https://github.com/oisee/z80-optimizer (v1.0.0)</em>
<em>Built during a birthday marathon, March 26, 2026. 🎂</em> <em>5
machines, 4 GPU APIs, 3 vendors, ~60 commits, one cake.</em></p>
</body>
</html>