Phase 5 Results

Generated: 2026-04-19T17:58:43.345Z

Auto-generated by scripts/analyze.js from docs/PHASE5_PART1_RESULTS.jsonl. Statistical methodology pinned in METHODOLOGY.md (Mann–Whitney U, Cohen's d, seeded 1000-resample percentile bootstrap).
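For readers unfamiliar with the interval notation in the tables below, here is a minimal sketch of a seeded percentile bootstrap for a cell mean. It is not the code in scripts/analyze.js: the `mulberry32` generator, the default seed, and the function names are assumptions chosen for illustration, and METHODOLOGY.md remains the authority on the pinned procedure.

```javascript
// Small seeded PRNG (mulberry32). Assumption: stands in for whatever
// generator METHODOLOGY.md actually pins.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

// Percentile bootstrap CI for the mean of `samples` (hypothetical helper).
function bootstrapMeanCI(samples, { resamples = 1000, seed = 1, level = 0.95 } = {}) {
  const rand = mulberry32(seed);
  const n = samples.length;
  const means = new Array(resamples);
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    // Resample n values with replacement and record the resample mean.
    for (let i = 0; i < n; i++) sum += samples[Math.floor(rand() * n)];
    means[r] = sum / n;
  }
  means.sort((x, y) => x - y);
  const tail = (1 - level) / 2;
  const lo = means[Math.floor(tail * (resamples - 1))];
  const hi = means[Math.ceil((1 - tail) * (resamples - 1))];
  const mean = samples.reduce((s, v) => s + v, 0) / n;
  return { mean, lo, hi };
}
```

Because the generator is seeded, calling `bootstrapMeanCI` twice with the same samples and seed returns the same interval, which is what makes a generated report reproducible.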

Run provenance

Total runs recorded: 120
Successful: 120
Failed: 0 (0.00%)
Reps per (workload × scheduler) cell: 10

Per-workload × scheduler summary

constant

Scheduler	n	Jank mean [95% CI]	P95 mean [95% CI]	Mean dt [95% CI]
B0	10	0.00% [0.00%, 0.00%]	5.10 ms [5.10 ms, 5.10 ms]	5.03 ms [5.03 ms, 5.03 ms]
B1	10	0.00% [0.00%, 0.00%]	5.10 ms [5.10 ms, 5.10 ms]	5.03 ms [5.03 ms, 5.03 ms]
Predictor	10	0.00% [0.00%, 0.00%]	5.00 ms [5.00 ms, 5.00 ms]	4.17 ms [4.17 ms, 4.17 ms]

sawtooth

Scheduler	n	Jank mean [95% CI]	P95 mean [95% CI]	Mean dt [95% CI]
B0	10	11.69% [11.68%, 11.71%]	18.88 ms [18.83 ms, 18.93 ms]	10.34 ms [10.34 ms, 10.34 ms]
B1	10	1.66% [1.66%, 1.66%]	15.89 ms [15.85 ms, 15.93 ms]	9.38 ms [9.38 ms, 9.38 ms]
Predictor	10	6.62% [6.45%, 6.77%]	18.03 ms [18.00 ms, 18.06 ms]	9.56 ms [9.55 ms, 9.57 ms]

burst

Scheduler	n	Jank mean [95% CI]	P95 mean [95% CI]	Mean dt [95% CI]
B0	10	5.55% [5.55%, 5.56%]	30.00 ms [30.00 ms, 30.00 ms]	5.59 ms [5.59 ms, 5.59 ms]
B1	10	5.56% [5.56%, 5.56%]	21.00 ms [21.00 ms, 21.00 ms]	5.28 ms [5.28 ms, 5.29 ms]
Predictor	10	5.45% [5.44%, 5.47%]	21.00 ms [21.00 ms, 21.00 ms]	5.23 ms [5.23 ms, 5.24 ms]

scroll

Scheduler	n	Jank mean [95% CI]	P95 mean [95% CI]	Mean dt [95% CI]
B0	10	14.68% [14.63%, 14.76%]	18.00 ms [18.00 ms, 18.00 ms]	11.83 ms [11.83 ms, 11.83 ms]
B1	10	3.38% [3.37%, 3.39%]	17.29 ms [17.27 ms, 17.30 ms]	10.47 ms [10.47 ms, 10.47 ms]
Predictor	10	7.08% [6.98%, 7.19%]	17.80 ms [17.80 ms, 17.80 ms]	10.01 ms [9.98 ms, 10.04 ms]

Go/No-Go — Predictor vs B1 (primary) / Predictor vs B0 (secondary)

Decision gate: Go requires p < 0.05 AND |d| ≥ 0.5 on jankRate, versus B1.

Budget-exceeding workloads (sawtooth / burst / scroll) are primary; constant is sanity-only.
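The gate above can be sketched as a small function. This is an illustrative reading of the rule, not the code in scripts/analyze.js: the function name and return shape are hypothetical, and the sign convention (positive d means Predictor has the higher jank rate) is inferred from the "Pred higher" / "Pred lower" labels in the table that follows.

```javascript
// Hypothetical sketch of the Go/No-Go gate: GO requires p < alpha AND |d| >= minAbsD.
// A GO only says the difference is real and large; `improves` adds the direction.
function gateVerdict(p, d, { alpha = 0.05, minAbsD = 0.5 } = {}) {
  const pOk = p < alpha;
  const dOk = Math.abs(d) >= minAbsD;
  const go = pOk && dOk;
  const direction = d > 0 ? "Pred higher" : d < 0 ? "Pred lower" : "no difference";
  return { go, pOk, dOk, direction, improves: go && d < 0 };
}

// sawtooth vs B1: passes the gate, but in the wrong direction.
console.log(gateVerdict(1.69e-4, 25.1831));
// burst vs B1: passes the gate with Predictor lower.
console.log(gateVerdict(1.80e-4, -6.9659));
```

Keeping the direction separate from the gate is what lets a row read "GO" while still counting against the Predictor in the summary verdict.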

Workload	Vs B1: U	Vs B1: p	Vs B1: d	Vs B1: verdict	Vs B0: U	Vs B0: p	Vs B0: d	Vs B0: verdict
constant	50	1.0000	0.0000	NO-GO (p✗ d✗)	50	1.0000	0.0000	NO-GO (p✗ d✗)
sawtooth	0	1.69e-4	25.1831	GO (p✓ d✓) (Pred higher)	0	1.78e-4	-25.6168	GO (p✓ d✓) (Pred lower)
burst	0	1.80e-4	-6.9659	GO (p✓ d✓) (Pred lower)	0	1.82e-4	-6.3979	GO (p✓ d✓) (Pred lower)
scroll	0	1.79e-4	27.9833	GO (p✓ d✓) (Pred higher)	0	1.82e-4	-49.9915	GO (p✓ d✓) (Pred lower)
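To see why U = 0 with n = 10 per cell yields p on the order of 1.8e-4, here is a hedged sketch of a two-sided Mann–Whitney U test using the normal approximation with tie and continuity corrections. Whether analyze.js uses exactly this variant is an assumption; the `erfc` helper uses the Abramowitz–Stegun 7.1.26 polynomial because JavaScript has no built-in error function.

```javascript
// Complementary error function for x >= 0 (Abramowitz-Stegun 7.1.26, abs. error ~1.5e-7).
function erfc(x) {
  const t = 1 / (1 + 0.3275911 * x);
  const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  return poly * Math.exp(-x * x);
}

// Two-sided Mann-Whitney U via normal approximation (sketch, not the project's code).
function mannWhitneyU(a, b) {
  const pooled = [...a.map((v) => ({ v, g: 0 })), ...b.map((v) => ({ v, g: 1 }))]
    .sort((x, y) => x.v - y.v);
  const n1 = a.length, n2 = b.length, n = n1 + n2;
  let rankSumA = 0;
  let tieTerm = 0;
  for (let i = 0; i < n; ) {
    let j = i;
    while (j < n && pooled[j].v === pooled[i].v) j++;
    const midrank = (i + 1 + j) / 2; // average of ranks i+1 .. j
    for (let k = i; k < j; k++) if (pooled[k].g === 0) rankSumA += midrank;
    const t = j - i;
    tieTerm += t * t * t - t;
    i = j;
  }
  const u1 = rankSumA - (n1 * (n1 + 1)) / 2;
  const u = Math.min(u1, n1 * n2 - u1);
  const mu = (n1 * n2) / 2;
  const sigma = Math.sqrt(((n1 * n2) / 12) * (n + 1 - tieTerm / (n * (n - 1))));
  if (!(sigma > 0)) return { u, z: 0, p: 1 }; // every value tied: no evidence either way
  const z = Math.max(0, Math.abs(u - mu) - 0.5) / sigma; // continuity correction
  return { u, z, p: Math.min(1, erfc(z / Math.SQRT2)) };
}
```

With two fully separated groups of ten, this gives U = 0 and a p-value of roughly 1.8e-4; with all twenty values identical (the constant workload) it gives U = 50 and p = 1. Under this approximation, p ≈ 1.8e-4 is the floor for n = 10 per cell, so the primary rows all sit at the smallest p the test can report.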

Summary verdict

Workload	Category	Predictor vs B1
constant	sanity	NO-GO
sawtooth	primary	GO (opposite direction!)
burst	primary	GO (Predictor reduces jank)
scroll	primary	GO (opposite direction!)

Primary workloads with GO verdict (Predictor lower than B1): 1/3

Overall: NO-GO — Predictor does not consistently outperform B1 on primary workloads.

Secondary — Shadow prediction quality

Truth: the active scheduler's executed dt crossed the jank threshold. Prediction (per shadow scheduler): decision ∈ {reduce, degrade} counts as positive; full counts as negative.

constant

Scheduler	FP	TN	Precision	Recall	F1
B0	0	381521	—	—	—
B1	0	381521	—	—	—
Predictor	376292	5229	0.000	—	—

sawtooth

Scheduler	TP	FP	TN	FN	Precision	Recall	F1
B0	0	0	172306	11976	—	0.000	—
B1	11958	32278	140028	18	0.270	0.998	0.425
Predictor	7403	123838	48468	4573	0.056	0.618	0.103

burst

Scheduler	TP	FP	TN	FN	Precision	Recall	F1
B0	0	0	316378	18491	—	0.000	—
B1	14509	4892	311486	3982	0.748	0.785	0.766
Predictor	14437	253492	62886	4054	0.054	0.781	0.101

scroll

Scheduler	TP	FP	TN	FN	Precision	Recall	F1
B0	0	0	153894	13585	—	0.000	—
B1	13559	39253	114641	26	0.257	0.998	0.408
Predictor	8470	115227	38667	5115	0.068	0.623	0.123
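The Precision, Recall, and F1 columns above follow from the confusion counts using the standard definitions. A minimal sketch (the function name is hypothetical, and returning `null` for an undefined ratio is an assumption that mirrors the "—" cells):

```javascript
// Precision, recall and F1 from confusion counts; null where the ratio is undefined.
function classificationMetrics({ tp, fp, fn }) {
  const precision = tp + fp > 0 ? tp / (tp + fp) : null;
  const recall = tp + fn > 0 ? tp / (tp + fn) : null;
  const f1 =
    precision !== null && recall !== null && precision + recall > 0
      ? (2 * precision * recall) / (precision + recall)
      : null;
  return { precision, recall, f1 };
}

// B1 on sawtooth, counts taken from the table above.
const b1Sawtooth = classificationMetrics({ tp: 11958, fp: 32278, fn: 18 });
console.log(
  b1Sawtooth.precision.toFixed(3),
  b1Sawtooth.recall.toFixed(3),
  b1Sawtooth.f1.toFixed(3)
); // 0.270 0.998 0.425
```

B0 never predicts a positive, so its precision and F1 are undefined while its recall is 0.000. The Predictor's low precision comes from its large FP counts rather than from missed janks.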

Phase 5 Part 2 — Pretrained vs Scratch

Added 2026-04-20. Benchmark: 88 runs (40 Pretrained+Online, 40 Pretrained+Frozen, 8 B1 drift-check), 1h42m38s, 0 errors.

Full machine-generated tables in PHASE5_PART2_COMPARE.md. Drift report in PHASE5_PART2_DRIFT.md.

Part 2 question

Part 1 found that Predictor loses to B1 on the ramping workloads (sawtooth, scroll) and is effectively tied on burst (5.45% vs 5.56%). The question Part 2 asks:

Is that deficit a cold-start artifact (fresh He-init Predictor hasn't seen enough samples), or is it structural (the 353-parameter MLP cannot encode B1's decision surface in this data distribution)?

We test this by initializing a Predictor from pretrained weights (334,510 samples from Part 1's B0-active shadow log), running it both with online learning and with the weights frozen, and comparing against Part 1's scratch+online cell and against B1's frozen hand-crafted EMA prior.

B1 drift check — PASS

Aggregate: PASS. All four workloads: 0 outliers on 2 drift runs, |mean shift| ≤ 0.05pp (well inside the 1.0pp STOP threshold). Part 1's B1 cell and Part 2's 8 drift runs are statistically indistinguishable: the environment did not drift across the two days between Part 1 (2026-04-18) and Part 2 (2026-04-20).

Workload	Part 1 B1 μ	Part 2 B1 μ	Shift
constant	0.00%	0.00%	+0.00pp
sawtooth	1.66%	1.67%	+0.01pp
burst	5.56%	5.56%	−0.00pp
scroll	3.38%	3.41%	+0.03pp
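The mean-shift half of the drift check can be sketched as below. The function name and row shape are hypothetical, treating the 1.0pp STOP threshold as a strict bound is an assumption, and the outlier rule is left out because its definition lives in PHASE5_PART2_DRIFT.md rather than here.

```javascript
// Sketch of the mean-shift drift check: PASS when every workload's
// |Part 2 mean - Part 1 mean| stays under the STOP threshold (percentage points).
function driftCheck(rows, stopThresholdPp = 1.0) {
  const shifts = rows.map((r) => ({
    workload: r.workload,
    shiftPp: r.part2Mean - r.part1Mean,
  }));
  const pass = shifts.every((s) => Math.abs(s.shiftPp) < stopThresholdPp);
  return { pass, shifts };
}

// B1 jank means from the table above.
const b1Drift = driftCheck([
  { workload: "constant", part1Mean: 0.0, part2Mean: 0.0 },
  { workload: "sawtooth", part1Mean: 1.66, part2Mean: 1.67 },
  { workload: "burst", part1Mean: 5.56, part2Mean: 5.56 },
  { workload: "scroll", part1Mean: 3.38, part2Mean: 3.41 },
]);
console.log(b1Drift.pass); // true
```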

Three comparisons

(a) Scratch (Part 1) vs Pretrained + Online (Part 2) — Init-quality contribution

Workload	Scratch	Pretrained+Online	Δ (pp)	p	Verdict
constant	0.00%	0.00%	0.00	0.37	NO-GO
sawtooth	6.62%	4.59%	−2.03	1.8e-4	GO (init helps)
burst	5.45%	5.51%	+0.06	0.006	borderline
scroll	7.08%	6.78%	−0.30	0.04	GO (init helps)

Pretrained init lowers sawtooth jank by 2pp and scroll by 0.3pp. Burst moves slightly the other way — scratch+online had already converged to the workload's structural floor (B0 and B1 also sit at ≈5.5% on burst), so pretrained init offers no room to improve.

(b) Pretrained+Online vs Pretrained+Frozen — Online-learning marginal value

Workload	Pretrained+Online	Pretrained+Frozen	Δ (pp)	p	Verdict
constant	0.00%	0.00%	0.00	1.00	NO-GO
sawtooth	4.59%	11.68%	+7.09	1.8e-4	GO (online helps)
burst	5.51%	5.52%	+0.01	0.79	NO-GO
scroll	6.78%	14.61%	+7.83	1.8e-4	GO (online helps)

Freezing the pretrained weights costs 7–8 percentage points of jank on the ramping workloads (d ≈ −34 on scroll, d ≈ −80 on sawtooth). The pretrained prior alone is not enough — online adaptation is where most of the gap closes.

(c) B1 (hand-crafted frozen prior) vs Pretrained+Frozen (data-learned frozen prior) — Blog headline match

Workload	B1	Pretrained+Frozen	Δ (pp)	p	Winner
constant	0.00%	0.00%	0.00	0.32	tie
sawtooth	1.66%	11.68%	+10.02	8.2e-5	B1 wins by 10pp
burst	5.56%	5.52%	−0.04	6.7e-4	tie (effect < 0.05pp)
scroll	3.38%	14.61%	+11.23	8.5e-5	B1 wins by 11pp

Both are frozen priors. B1 is 3 EMA thresholds; Pretrained+Frozen is a 353-parameter MLP trained on 334,510 samples. On the two ramping workloads B1 was originally hand-crafted for, B1 destroys the learned prior (d ≈ −114 on scroll, d ≈ −649 on sawtooth — effect sizes so large they mainly reflect how tight B1's deterministic variance is).

On effect sizes

The d values above reach magnitudes (up to −649) that would be implausible under Cohen's conventional "large = 0.8" calibration. Those conventions come from human-subjects research where run-to-run SD is a sizable fraction of the mean; this benchmark's controlled headless environment produces SDs in the 0.001–0.3 pp range because B1 and Pretrained+Frozen are both deterministic at the chosen seed. The manual recomputation in PHASE5_COHENS_D_VALIDATION.md matches analyze.js bit-for-bit — the d values are mathematically correct, but the practical magnitude of each comparison is the |Δ| in percentage points, not the d. Treat d here as a "not-measurement-noise" qualifier and read the pp figures for the real story.
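The point about tiny SDs can be made concrete with a pooled-SD Cohen's d sketch. The per-run values below are hypothetical and chosen only so the two cell means match sawtooth in table (c) (11.68% vs 1.66%). The function is a textbook formulation rather than a copy of analyze.js, and returning 0 when both groups are identical constants is an assumption.

```javascript
function mean(xs) {
  return xs.reduce((s, v) => s + v, 0) / xs.length;
}

// Unbiased sample variance (n - 1 denominator).
function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((s, v) => s + (v - m) * (v - m), 0) / (xs.length - 1);
}

// Cohen's d with pooled SD: (mean(a) - mean(b)) / s_pooled.
function cohensD(a, b) {
  const diff = mean(a) - mean(b);
  const pooledVar =
    ((a.length - 1) * sampleVariance(a) + (b.length - 1) * sampleVariance(b)) /
    (a.length + b.length - 2);
  if (pooledVar === 0) return diff === 0 ? 0 : Math.sign(diff) * Infinity;
  return diff / Math.sqrt(pooledVar);
}

// Hypothetical runs: same 10.02pp gap, SD of 0.01pp versus SD of 1pp.
const tightD = cohensD([11.67, 11.68, 11.69], [1.65, 1.66, 1.67]);
const looseD = cohensD([10.68, 11.68, 12.68], [0.66, 1.66, 2.66]);
console.log(tightD.toFixed(0), looseD.toFixed(2)); // 1002 10.02
```

The gap in percentage points is identical in both calls. Only the run-to-run SD changes, and d scales inversely with it, which is why the pp column is the better guide to practical magnitude.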

Structural conclusion

Part 1's deficit is not a cold-start artifact. It is structural.

Cold start matters some, but less than the residual gap. Pretrained init closes 2pp on sawtooth and 0.3pp on scroll (table a). That is real progress — but Part 1's Predictor-vs-B1 gap on sawtooth was 4.96pp, and pretrained+online (4.59%) still sits 2.93pp above B1's 1.66%. Init is a factor; it is not THE factor.
Online learning is still essential. Freezing the pretrained prior loses 7–8pp to online on ramping workloads (table b). The 334k-sample offline pass does not produce a prior that stands on its own.
B1's inductive bias beats the MLP's capacity. With both priors frozen (table c), B1's 3-parameter EMA beats the 353-parameter MLP by 10pp on sawtooth and 11pp on scroll. B1 encodes something about "normalized dt trend → degrade/reduce threshold" that the MLP cannot extract from this data distribution.
Burst remains Predictor's niche. All approaches converge to 5.45%–5.56% on burst. This is the workload where B1's recent-trend heuristic gets no advantage (spikes are independent of recent history) and the MLP is at parity with B1 regardless of init.
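The gap arithmetic behind the first point can be checked with a few lines. The helper name is hypothetical; the inputs are the jank means from table (a) and the Part 1 B1 cell.

```javascript
// How much of the Predictor-vs-B1 gap does pretrained init close? (values in %)
function gapClosed({ scratch, pretrained, baseline }) {
  const beforePp = scratch - baseline; // Part 1 gap to B1
  const afterPp = pretrained - baseline; // residual gap with pretrained+online
  return { beforePp, afterPp, closedFraction: (beforePp - afterPp) / beforePp };
}

const sawtoothGap = gapClosed({ scratch: 6.62, pretrained: 4.59, baseline: 1.66 });
const scrollGap = gapClosed({ scratch: 7.08, pretrained: 6.78, baseline: 3.38 });
console.log(sawtoothGap.beforePp.toFixed(2), sawtoothGap.afterPp.toFixed(2)); // 4.96 2.93
console.log(scrollGap.beforePp.toFixed(2), scrollGap.afterPp.toFixed(2)); // 3.70 3.40
```

On these numbers pretrained init closes roughly 41% of the sawtooth gap and roughly 8% of the scroll gap, so most of the deficit remains on both ramping workloads.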

Implications for Phase 6

The blog post should reframe the Part 1 "NO-GO" as a structural scheduler dichotomy, not a failure. The Predictor is not beating B1 on ramping workloads even with ideal pretraining — but B1 cannot be pretrained, tuned per user, or generalize to unseen workload structures. They live on different branches of a design tree.
Ablation suggestions that are now DEAD ENDS: "more training data", "longer online exposure", "better initialization". None of these close the ramping-workload gap to B1 at this architecture.
Ablation suggestions that remain LIVE: different architecture (recurrent, trend-aware inputs), explicit trend features injected into x, or hybrid (B1-style EMA as a feature fed to the MLP).
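As a starting point for the hybrid ablation, a B1-style EMA could be exposed to the MLP as extra inputs. The sketch below is purely illustrative: the function name, the smoothing factor, and the choice of `[dt, ema, dt - ema]` as features are all assumptions, and nothing here reflects B1's actual thresholds or the Predictor's current input vector.

```javascript
// Hypothetical feature builder: tracks an exponential moving average of frame dt
// and returns [dt, ema, dt - ema] as candidate trend-aware inputs for the MLP.
function makeEmaTrendFeature(alpha = 0.2) {
  let ema = null;
  return function update(dtMs) {
    // The first sample seeds the EMA; later samples blend in with weight alpha.
    ema = ema === null ? dtMs : alpha * dtMs + (1 - alpha) * ema;
    return [dtMs, ema, dtMs - ema];
  };
}

const trendFeature = makeEmaTrendFeature(0.2);
console.log(trendFeature(5)); // [5, 5, 0]
console.log(trendFeature(10)); // trend component is positive while dt ramps up
```

A positive third component signals that the current frame is running above its recent average, which is the kind of ramp information the structural conclusion suggests the MLP is not extracting on its own.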
