Phase 5 Results

Generated: 2026-04-19T17:58:43.345Z

Auto-generated by scripts/analyze.js from docs/PHASE5_PART1_RESULTS.jsonl. Statistical methodology pinned in METHODOLOGY.md (Mann–Whitney U, Cohen's d, seeded 1000-resample percentile bootstrap).
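For readers without METHODOLOGY.md at hand, a seeded percentile bootstrap of a mean can be sketched as follows. This is a minimal illustration, not the code in scripts/analyze.js: the PRNG (mulberry32), the default seed, and the function names are assumptions chosen for the example. It also shows why cells whose 10 reps are identical report a zero-width interval such as [5.10 ms, 5.10 ms]: every resample has the same mean.

```javascript
// Small seeded PRNG (mulberry32). Assumption: the generator used by analyze.js is not shown here.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

// Percentile bootstrap confidence interval for the mean of `values`.
// Hypothetical helper: resamples with replacement, then reads the
// alpha/2 and 1 - alpha/2 percentiles of the resampled means.
function bootstrapMeanCI(values, { resamples = 1000, alpha = 0.05, seed = 42 } = {}) {
  const rand = mulberry32(seed);
  const n = values.length;
  const means = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += values[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor((alpha / 2) * (resamples - 1))];
  const hi = means[Math.ceil((1 - alpha / 2) * (resamples - 1))];
  return { lo, hi };
}

// Synthetic jank rates (percent) for one cell of 10 reps; not taken from the results file.
const exampleCI = bootstrapMeanCI([6.4, 6.5, 6.6, 6.6, 6.6, 6.7, 6.7, 6.7, 6.8, 6.8]);
console.log(exampleCI);
```

Because the generator is seeded, rerunning the analysis on the same JSONL input reproduces the same interval bounds.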

Run provenance

  • Total runs recorded: 120
  • Successful: 120
  • Failed: 0 (0.00%)
  • Reps per (workload × scheduler) cell: 10

Per-workload × scheduler summary

constant

| Scheduler | n | Jank mean [95% CI] | P95 mean [95% CI] | Mean dt [95% CI] |
| --- | --- | --- | --- | --- |
| B0 | 10 | 0.00% [0.00%, 0.00%] | 5.10 ms [5.10 ms, 5.10 ms] | 5.03 ms [5.03 ms, 5.03 ms] |
| B1 | 10 | 0.00% [0.00%, 0.00%] | 5.10 ms [5.10 ms, 5.10 ms] | 5.03 ms [5.03 ms, 5.03 ms] |
| Predictor | 10 | 0.00% [0.00%, 0.00%] | 5.00 ms [5.00 ms, 5.00 ms] | 4.17 ms [4.17 ms, 4.17 ms] |

sawtooth

| Scheduler | n | Jank mean [95% CI] | P95 mean [95% CI] | Mean dt [95% CI] |
| --- | --- | --- | --- | --- |
| B0 | 10 | 11.69% [11.68%, 11.71%] | 18.88 ms [18.83 ms, 18.93 ms] | 10.34 ms [10.34 ms, 10.34 ms] |
| B1 | 10 | 1.66% [1.66%, 1.66%] | 15.89 ms [15.85 ms, 15.93 ms] | 9.38 ms [9.38 ms, 9.38 ms] |
| Predictor | 10 | 6.62% [6.45%, 6.77%] | 18.03 ms [18.00 ms, 18.06 ms] | 9.56 ms [9.55 ms, 9.57 ms] |

burst

| Scheduler | n | Jank mean [95% CI] | P95 mean [95% CI] | Mean dt [95% CI] |
| --- | --- | --- | --- | --- |
| B0 | 10 | 5.55% [5.55%, 5.56%] | 30.00 ms [30.00 ms, 30.00 ms] | 5.59 ms [5.59 ms, 5.59 ms] |
| B1 | 10 | 5.56% [5.56%, 5.56%] | 21.00 ms [21.00 ms, 21.00 ms] | 5.28 ms [5.28 ms, 5.29 ms] |
| Predictor | 10 | 5.45% [5.44%, 5.47%] | 21.00 ms [21.00 ms, 21.00 ms] | 5.23 ms [5.23 ms, 5.24 ms] |

scroll

| Scheduler | n | Jank mean [95% CI] | P95 mean [95% CI] | Mean dt [95% CI] |
| --- | --- | --- | --- | --- |
| B0 | 10 | 14.68% [14.63%, 14.76%] | 18.00 ms [18.00 ms, 18.00 ms] | 11.83 ms [11.83 ms, 11.83 ms] |
| B1 | 10 | 3.38% [3.37%, 3.39%] | 17.29 ms [17.27 ms, 17.30 ms] | 10.47 ms [10.47 ms, 10.47 ms] |
| Predictor | 10 | 7.08% [6.98%, 7.19%] | 17.80 ms [17.80 ms, 17.80 ms] | 10.01 ms [9.98 ms, 10.04 ms] |

Go/No-Go — Predictor vs B1 (primary) / Predictor vs B0 (secondary)

Decision gate: Go requires p < 0.05 AND |d| ≥ 0.5 on jankRate, versus B1.

Budget-exceeding workloads (sawtooth / burst / scroll) are primary; constant is sanity-only.
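The gate can be read as a two-part test plus a direction check. The sketch below is a hypothetical helper, not code from analyze.js: the function and field names are invented for illustration, and it assumes the sign convention visible in the table below (positive d means the Predictor's jank rate is higher than the baseline's).

```javascript
// Hypothetical gate helper. Assumption: d > 0 means "Pred higher", d < 0 means "Pred lower".
function gateVerdict({ p, d }, { alpha = 0.05, minAbsD = 0.5 } = {}) {
  const pOk = p < alpha;
  const dOk = Math.abs(d) >= minAbsD;
  const significant = pOk && dOk;
  const direction = d < 0 ? "Pred lower" : d > 0 ? "Pred higher" : "no difference";
  return {
    verdict: significant ? "GO" : "NO-GO",
    direction,
    // A significant difference only supports the Predictor when its jank rate is lower.
    countsTowardGo: significant && d < 0,
  };
}

// Predictor vs B1 values for the primary workloads, copied from the Go/No-Go table.
const primaryVsB1 = [
  { workload: "sawtooth", p: 1.69e-4, d: 25.1831 },
  { workload: "burst", p: 1.8e-4, d: -6.9659 },
  { workload: "scroll", p: 1.79e-4, d: 27.9833 },
];
const favorable = primaryVsB1.filter((row) => gateVerdict(row).countsTowardGo).length;
console.log(`${favorable}/${primaryVsB1.length} primary workloads favor the Predictor`);
```

Separating the statistical verdict from the direction is what lets a row pass the gate while still counting against the Predictor.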

| Workload | Vs B1: U | Vs B1: p | Vs B1: d | Vs B1: verdict | Vs B0: U | Vs B0: p | Vs B0: d | Vs B0: verdict |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| constant | 50 | 1.0000 | 0.0000 | NO-GO (p✗ d✗) | 50 | 1.0000 | 0.0000 | NO-GO (p✗ d✗) |
| sawtooth | 0 | 1.69e-4 | 25.1831 | GO (p✓ d✓) (Pred higher) | 0 | 1.78e-4 | -25.6168 | GO (p✓ d✓) (Pred lower) |
| burst | 0 | 1.80e-4 | -6.9659 | GO (p✓ d✓) (Pred lower) | 0 | 1.82e-4 | -6.3979 | GO (p✓ d✓) (Pred lower) |
| scroll | 0 | 1.79e-4 | 27.9833 | GO (p✓ d✓) (Pred higher) | 0 | 1.82e-4 | -49.9915 | GO (p✓ d✓) (Pred lower) |
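The U column takes only two values here, 0 and 50, and both follow from how the statistic is counted. The sketch below is an illustrative pairwise-count version of Mann–Whitney U, assuming ties score 0.5 and the smaller of the two U values is reported; the implementation in analyze.js may differ in form. With 10 reps per cell there are 100 pairs, so complete separation gives U = 0 and all-tied samples give U = 50.

```javascript
// Illustrative Mann-Whitney U by direct pair counting (O(n*m), fine for n = m = 10).
// Assumptions: ties score 0.5, and the reported statistic is min(U_a, U_b).
function mannWhitneyU(a, b) {
  let uA = 0;
  for (const x of a) {
    for (const y of b) {
      if (x > y) uA += 1;
      else if (x === y) uA += 0.5;
    }
  }
  const uB = a.length * b.length - uA;
  return Math.min(uA, uB);
}

// Synthetic samples: every value in one group exceeds every value in the other.
const separated = mannWhitneyU(
  [6.5, 6.6, 6.6, 6.7, 6.8, 6.4, 6.6, 6.7, 6.5, 6.8],
  Array(10).fill(1.66)
);
// All 100 pairs tied, as on the constant workload where every rep reports 0.00% jank.
const allTied = mannWhitneyU(Array(10).fill(0), Array(10).fill(0));
console.log({ separated, allTied });
```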

Summary verdict

| Workload | Category | Predictor vs B1 |
| --- | --- | --- |
| constant | sanity | NO-GO |
| sawtooth | primary | GO (opposite direction!) |
| burst | primary | GO (Predictor reduces jank) |
| scroll | primary | GO (opposite direction!) |

Primary workloads with GO verdict (Predictor lower than B1): 1/3

Overall: NO-GO — Predictor does not consistently outperform B1 on primary workloads.

Secondary — Shadow prediction quality

Truth: the active scheduler's executed dt crossed the jank threshold. Prediction (per shadow scheduler): decision ∈ {reduce, degrade} counts as positive; full counts as negative.
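The definitions above map onto a standard confusion matrix. The following sketch shows one way to tally shadow decisions and derive the three metrics. The sample field names (`decision`, `janked`) are hypothetical, and returning `null` for a zero denominator is an assumption about why some metric cells in the tables are blank.

```javascript
// Tally shadow decisions against the observed outcome.
// Hypothetical record shape: { decision: "full" | "reduce" | "degrade", janked: boolean }.
function tally(samples) {
  const counts = { tp: 0, fp: 0, tn: 0, fn: 0 };
  for (const { decision, janked } of samples) {
    const positive = decision === "reduce" || decision === "degrade";
    if (positive && janked) counts.tp++;
    else if (positive) counts.fp++;
    else if (janked) counts.fn++;
    else counts.tn++;
  }
  return counts;
}

// Precision, recall and F1; null marks a metric whose denominator is zero.
function shadowMetrics({ tp, fp, fn }) {
  const precision = tp + fp > 0 ? tp / (tp + fp) : null;
  const recall = tp + fn > 0 ? tp / (tp + fn) : null;
  const f1 =
    precision !== null && recall !== null && precision + recall > 0
      ? (2 * precision * recall) / (precision + recall)
      : null;
  return { precision, recall, f1 };
}

// B1 on sawtooth, using the counts from the table below.
const b1Sawtooth = shadowMetrics({ tp: 11958, fp: 32278, tn: 140028, fn: 18 });
console.log(
  b1Sawtooth.precision.toFixed(3),
  b1Sawtooth.recall.toFixed(3),
  b1Sawtooth.f1.toFixed(3)
); // prints "0.270 0.998 0.425"
```

Read this way, a scheduler that never predicts jank (B0) has no defined precision, and a workload with no jank events (constant) has no defined recall.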

constant

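| Scheduler | TP | FP | TN | FN | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| B0 | 0 | 0 | 381521 | 0 | n/a | n/a | n/a |
| B1 | 0 | 0 | 381521 | 0 | n/a | n/a | n/a |
| Predictor | 0 | 376292 | 5229 | 0 | 0.000 | n/a | n/a |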

sawtooth

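| Scheduler | TP | FP | TN | FN | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| B0 | 0 | 0 | 172306 | 11976 | n/a | 0.000 | n/a |
| B1 | 11958 | 32278 | 140028 | 18 | 0.270 | 0.998 | 0.425 |
| Predictor | 7403 | 123838 | 48468 | 4573 | 0.056 | 0.618 | 0.103 |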

burst

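| Scheduler | TP | FP | TN | FN | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| B0 | 0 | 0 | 316378 | 18491 | n/a | 0.000 | n/a |
| B1 | 14509 | 4892 | 311486 | 3982 | 0.748 | 0.785 | 0.766 |
| Predictor | 14437 | 253492 | 62886 | 4054 | 0.054 | 0.781 | 0.101 |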

scroll

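| Scheduler | TP | FP | TN | FN | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| B0 | 0 | 0 | 153894 | 13585 | n/a | 0.000 | n/a |
| B1 | 13559 | 39253 | 114641 | 26 | 0.257 | 0.998 | 0.408 |
| Predictor | 8470 | 115227 | 38667 | 5115 | 0.068 | 0.623 | 0.123 |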

Phase 5 Part 2 — Pretrained vs Scratch

Added 2026-04-20. Benchmark: 88 runs (40 Pretrained+Online, 40 Pretrained+Frozen, 8 B1 drift-check), 1h42m38s, 0 errors.

Full machine-generated tables in PHASE5_PART2_COMPARE.md. Drift report in PHASE5_PART2_DRIFT.md.

Part 2 question

Part 1 found that the Predictor loses to B1 on the ramping workloads (sawtooth, scroll) and lands within 0.11pp of B1 on burst. The question Part 2 asks:

Is that deficit a cold-start artifact (fresh He-init Predictor hasn't seen enough samples), or is it structural (the 353-parameter MLP cannot encode B1's decision surface in this data distribution)?

We test this by initializing a Predictor from pretrained weights (334,510 samples from Part 1's B0-active shadow log), running it both with online learning enabled and with the weights frozen, and comparing against Part 1's scratch+online cell and against B1's frozen hand-crafted EMA prior.

B1 drift check — PASS

Aggregate: PASS. All four workloads: 0 outliers on 2 drift runs each, |mean shift| ≤ 0.05pp (well inside the 1.0pp STOP threshold). Part 1's B1 cell and Part 2's 8 drift runs are statistically indistinguishable: the environment did not drift across the 2 days between Part 1 (2026-04-18) and Part 2 (2026-04-20).

| Workload | Part 1 B1 μ | Part 2 B1 μ | Shift |
| --- | --- | --- | --- |
| constant | 0.00% | 0.00% | +0.00pp |
| sawtooth | 1.66% | 1.67% | +0.01pp |
| burst | 5.56% | 5.56% | −0.00pp |
| scroll | 3.38% | 3.41% | +0.03pp |
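The mean-shift part of the drift check reduces to a threshold comparison per workload. The sketch below is a hypothetical reconstruction that uses the rounded means from the table; the outlier test and the exact boundary handling (`<=` versus `<`) are not described in this file and are assumptions.

```javascript
// Hypothetical drift check on jank-rate means, both given in percent.
// Assumption: a shift exactly at the STOP threshold still passes.
function driftCheck(part1MeanPct, part2MeanPct, stopThresholdPp = 1.0) {
  const shiftPp = part2MeanPct - part1MeanPct;
  return { shiftPp, pass: Math.abs(shiftPp) <= stopThresholdPp };
}

// Rounded B1 means from the drift table.
const driftRows = [
  { workload: "constant", part1: 0.0, part2: 0.0 },
  { workload: "sawtooth", part1: 1.66, part2: 1.67 },
  { workload: "burst", part1: 5.56, part2: 5.56 },
  { workload: "scroll", part1: 3.38, part2: 3.41 },
];
const aggregatePass = driftRows.every((row) => driftCheck(row.part1, row.part2).pass);
console.log(aggregatePass ? "PASS" : "STOP");
```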

Three comparisons

(a) Scratch (Part 1) vs Pretrained + Online (Part 2) — Init-quality contribution

| Workload | Scratch | Pretrained+Online | Δ (pp) | p | Verdict |
| --- | --- | --- | --- | --- | --- |
| constant | 0.00% | 0.00% | 0.00 | 0.37 | NO-GO |
| sawtooth | 6.62% | 4.59% | −2.03 | 1.8e-4 | GO (init helps) |
| burst | 5.45% | 5.51% | +0.06 | 0.006 | borderline |
| scroll | 7.08% | 6.78% | −0.30 | 0.04 | GO (init helps) |

Pretrained init lowers sawtooth jank by 2pp and scroll by 0.3pp. Burst moves slightly the other way — scratch+online had already converged to the workload's structural floor (B0 and B1 also sit at ≈5.5% on burst), so pretrained init offers no room to improve.

(b) Pretrained+Online vs Pretrained+Frozen — Online-learning marginal value

| Workload | Pretrained+Online | Pretrained+Frozen | Δ (pp) | p | Verdict |
| --- | --- | --- | --- | --- | --- |
| constant | 0.00% | 0.00% | 0.00 | 1.00 | NO-GO |
| sawtooth | 4.59% | 11.68% | +7.09 | 1.8e-4 | GO (online helps) |
| burst | 5.51% | 5.52% | +0.01 | 0.79 | NO-GO |
| scroll | 6.78% | 14.61% | +7.83 | 1.8e-4 | GO (online helps) |

Freezing the pretrained weights costs 7–8 percentage points of jank on the ramping workloads (d ≈ −34 on scroll, d ≈ −80 on sawtooth). The pretrained prior alone is not enough — online adaptation is where most of the gap closes.

(c) B1 (hand-crafted frozen prior) vs Pretrained+Frozen (data-learned frozen prior) — Blog headline match

| Workload | B1 | Pretrained+Frozen | Δ (pp) | p | Winner |
| --- | --- | --- | --- | --- | --- |
| constant | 0.00% | 0.00% | 0.00 | 0.32 | tie |
| sawtooth | 1.66% | 11.68% | +10.02 | 8.2e-5 | B1 wins by 10pp |
| burst | 5.56% | 5.52% | −0.04 | 6.7e-4 | tie (effect < 0.05pp) |
| scroll | 3.38% | 14.61% | +11.23 | 8.5e-5 | B1 wins by 11pp |

Both are frozen priors. B1 is 3 EMA thresholds; Pretrained+Frozen is a 353-parameter MLP trained on 334,510 samples. On the two ramping workloads B1 was originally hand-crafted for, B1 destroys the learned prior (d ≈ −114 on scroll, d ≈ −649 on sawtooth — effect sizes so large they mainly reflect how tight B1's deterministic variance is).

On effect sizes

The d values above reach magnitudes (up to −649) that would be implausible under Cohen's conventional "large = 0.8" calibration. Those conventions come from human-subjects research where run-to-run SD is a sizable fraction of the mean; this benchmark's controlled headless environment produces SDs in the 0.001–0.3 pp range because B1 and Pretrained+Frozen are both deterministic at the chosen seed. The manual recomputation in PHASE5_COHENS_D_VALIDATION.md matches analyze.js bit-for-bit — the d values are mathematically correct, but the practical magnitude of each comparison is the |Δ| in percentage points, not the d. Treat d here as a "not-measurement-noise" qualifier and read the pp figures for the real story.
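The sensitivity of d to run-to-run spread is easy to reproduce. The sketch below computes Cohen's d with a pooled sample standard deviation, which is the textbook definition; whether analyze.js uses exactly this pooling is an assumption. The two data sets are synthetic: both have the same 10-point difference in means, but one has an SD of 5 and the other an SD of 0.02.

```javascript
function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
}

// Cohen's d with pooled SD. Returns +/-Infinity (or NaN) if both samples have zero variance.
function cohensD(a, b) {
  const pooledVar =
    ((a.length - 1) * sampleVariance(a) + (b.length - 1) * sampleVariance(b)) /
    (a.length + b.length - 2);
  return (mean(a) - mean(b)) / Math.sqrt(pooledVar);
}

// Synthetic data: identical 10-point gap, very different spread.
const noisyD = cohensD([20, 25, 30], [10, 15, 20]); // SD 5, so d is 2
const tightD = cohensD([24.98, 25, 25.02], [14.98, 15, 15.02]); // SD 0.02, so d is about 500
console.log({ noisyD, tightD });
```

The gap is the same in both cases; only the denominator changes, which is why the pp difference carries the practical meaning.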

Structural conclusion

Part 1's deficit is not a cold-start artifact. It is structural.

  1. Cold start matters some, but less than the residual gap. Pretrained init closes 2pp on sawtooth and 0.3pp on scroll (table a). That is real progress — but Part 1's Predictor-vs-B1 gap on sawtooth was 4.96pp, and pretrained+online (4.59%) still sits 2.93pp above B1's 1.66%. Init is a factor; it is not THE factor.
  2. Online learning is still essential. Freezing the pretrained prior loses 7–8pp to online on ramping workloads (table b). The 334k-sample offline pass does not produce a prior that stands on its own.
  3. B1's inductive bias beats the MLP's capacity. With both priors frozen (table c), B1's 3-parameter EMA beats the 353-parameter MLP by 10pp on sawtooth and 11pp on scroll. B1 encodes something about "normalized dt trend → degrade/reduce threshold" that the MLP cannot extract from this data distribution.
  4. Burst remains Predictor's niche. All approaches converge to 5.45%–5.56% on burst. This is the workload where B1's recent-trend heuristic gets no advantage (spikes are independent of recent history) and the MLP is at parity with B1 regardless of init.
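The arithmetic behind point 1 can be checked directly from the jank means reported above. The helper below is purely illustrative (its name and fields are invented for this example); it expresses how much of the Part 1 Predictor-vs-B1 gap pretrained initialization closes.

```javascript
// Illustrative helper: all inputs are jank-rate means in percent.
function gapClosed({ b1, scratch, pretrainedOnline }) {
  const gapBefore = scratch - b1; // Part 1 gap to B1
  const gapAfter = pretrainedOnline - b1; // gap remaining with pretrained init + online learning
  return { gapBefore, gapAfter, fractionClosed: (gapBefore - gapAfter) / gapBefore };
}

// Means taken from the Part 1 summary and table (a).
const sawtoothGap = gapClosed({ b1: 1.66, scratch: 6.62, pretrainedOnline: 4.59 });
const scrollGap = gapClosed({ b1: 3.38, scratch: 7.08, pretrainedOnline: 6.78 });
console.log(sawtoothGap, scrollGap);
```

On these figures pretrained init closes roughly 41% of the sawtooth gap (4.96pp down to 2.93pp) and roughly 8% of the scroll gap (3.70pp down to 3.40pp), which is the sense in which init is a factor but not the dominant one.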

Implications for Phase 6

  • The blog post should reframe the Part 1 "NO-GO" as a structural scheduler dichotomy, not a failure. The Predictor is not beating B1 on ramping workloads even with ideal pretraining — but B1 cannot be pretrained, tuned per user, or generalize to unseen workload structures. They live on different branches of a design tree.
  • Ablation suggestions that are now DEAD ENDS: "more training data", "longer online exposure", "better initialization". None of these close the ramping-workload gap to B1 at this architecture.
  • Ablation suggestions that remain LIVE: different architecture (recurrent, trend-aware inputs), explicit trend features injected into x, or hybrid (B1-style EMA as a feature fed to the MLP).
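As a starting point for the hybrid option, the sketch below shows one possible shape for trend-aware inputs: an exponential moving average of normalized dt plus its step-to-step change, returned as extra features for x. Everything here is hypothetical. The smoothing factor, the 16.67 ms frame budget, and the feature layout are illustrative values, not settings taken from B1 or the Predictor.

```javascript
// Hypothetical feature builder for an EMA-augmented input vector.
// Assumptions: alpha = 0.2 and budgetMs = 16.67 are placeholder values.
function makeTrendFeatures({ alpha = 0.2, budgetMs = 16.67 } = {}) {
  let ema = null;
  return function next(dtMs) {
    const norm = dtMs / budgetMs; // frame time as a fraction of the budget
    const prev = ema;
    ema = ema === null ? norm : alpha * norm + (1 - alpha) * ema;
    const trend = prev === null ? 0 : ema - prev; // positive while frame times ramp up
    return [norm, ema, trend];
  };
}

// Feed a short ramp of frame times (ms) and print the resulting feature rows.
const nextFeatures = makeTrendFeatures();
for (const dt of [8, 10, 12, 14, 16, 18]) {
  console.log(nextFeatures(dt).map((v) => v.toFixed(3)).join(" "));
}
```

On a steadily ramping input the trend feature stays positive, which is the kind of signal the ramping workloads reward and the current inputs may not expose.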