revert(security): restore Phase H live_regs_bitmap branch

scc-tw · scc-tw · commit c766e613ddfd · 2026-04-05T00:21:02.000+08:00
Branchless Phase H (always FPE_Decode all 16 registers) was attempted
to eliminate the live_regs_bitmap timing leak (L5), but worsened
HighSecPolicy ANOVA from p=0.015 to p=2.2e-25.

Root cause: always-decode-all-16 added ~14 extra Speck64 decryptions
per instruction.  While each individual decryption is constant-time,
the aggregate of 64 FPE_Decode calls per DU (N=4 × 16 regs) amplified
micro-architectural timing variance: different register file contents
across opcode benchmarks produced different cache retirement patterns
at the pipeline level.  This created a new between-group signal
(~42K ns σ, up from ~34K) that the reduced within-group noise
(~123K ns σ, down 14% from 143K) could no longer mask.
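
To make the decode counts above concrete, here is a minimal sketch of
the arithmetic.  It assumes live_regs_bitmap is a 16-bit mask with one
bit per register and that N is the number of instruction slots per DU;
the helper names are hypothetical and do not come from vm_engine.cpp.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

constexpr int kNumRegs = 16;  // register count stated in the commit message

// Branching Phase H: one FPE_Decode per bit set in the bitmap
// (assumed 16-bit layout, one bit per register).
inline int decodes_with_bitmap(std::uint16_t live_regs_bitmap) {
    return static_cast<int>(std::bitset<kNumRegs>(live_regs_bitmap).count());
}

// Branchless Phase H: every register is decoded regardless of liveness.
inline int decodes_branchless() { return kNumRegs; }

// Decodes per DU, assuming n_slots instruction slots per DU.
inline int decodes_per_du(int per_instruction, int n_slots) {
    assert(n_slots > 0);
    return per_instruction * n_slots;
}
```

With two live registers (bitmap 0x0003), the branchless variant costs
16 - 2 = 14 extra decodes per instruction, and 16 × 4 = 64 per DU at N=4.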

Empirical comparison (HighSecPolicy, 110 iter, 125 DUs):

  Metric            Before L5 fix    After L5 fix    Verdict
  ─────────────────────────────────────────────────────────
  Opcode Δ spread   ±12 ns           ±1034 ns        ~86× worse
  ANOVA F           1.53             5.19            3.4× worse
  ANOVA p           0.015            2.2e-25         regression
  within_σ          142,543 ns       123,141 ns      ↓14% (good)
  between_σ         ~34K ns          41,970 ns       ↑24% (bad)
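
For reference, a sketch of how an F statistic of this kind can be
computed, assuming the reported value is a standard one-way ANOVA with
one group of timing samples per opcode.  The function name and data
layout are illustrative, not taken from the benchmark harness.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One-way ANOVA F statistic: F = MS_between / MS_within.
// Each inner vector holds the timing samples (ns) for one opcode group.
inline double anova_f(const std::vector<std::vector<double>>& groups) {
    const std::size_t k = groups.size();
    assert(k >= 2);

    std::size_t n_total = 0;
    double grand_sum = 0.0;
    for (const auto& g : groups) {
        assert(!g.empty());
        n_total += g.size();
        for (double x : g) grand_sum += x;
    }
    assert(n_total > k);  // need at least one within-group degree of freedom
    const double grand_mean = grand_sum / static_cast<double>(n_total);

    double ss_between = 0.0;
    double ss_within = 0.0;
    for (const auto& g : groups) {
        double sum = 0.0;
        for (double x : g) sum += x;
        const double mean = sum / static_cast<double>(g.size());
        const double d = mean - grand_mean;
        ss_between += static_cast<double>(g.size()) * d * d;
        for (double x : g) ss_within += (x - mean) * (x - mean);
    }

    const double ms_between = ss_between / static_cast<double>(k - 1);
    const double ms_within = ss_within / static_cast<double>(n_total - k);
    return ms_between / ms_within;
}
```

F grows when group means drift apart or when within-group scatter
shrinks, which is why a lower within_σ alone does not imply a better
result.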

StandardPolicy improved (F: 530→15.5) because the bimodal pattern from
live_regs_bitmap was the dominant signal at N=2.  But HighSecPolicy
regressed because the new micro-architectural signal exceeded what
N=4 noise padding could mask.

The correct fix for L5 is to normalize live_regs_bitmap at blob
creation time (serializer/linker), so all BBs declare the same live
register set — eliminating the timing signal without runtime cost.
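
A minimal sketch of what such a normalization pass could look like.
BasicBlockMeta and normalize_live_regs are hypothetical names, the
16-bit bitmap layout is an assumption, and taking the union of all
declared sets is only one candidate for the common set; the serializer
may need a different choice.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-BB metadata written by the serializer.
struct BasicBlockMeta {
    std::uint16_t live_regs_bitmap = 0;  // bit i set => register i is live
};

// Make every BB declare the same live set: the union of all declared sets.
// Every register that was live in any BB stays live in all of them.
inline std::uint16_t normalize_live_regs(std::vector<BasicBlockMeta>& bbs) {
    std::uint16_t merged = 0;
    for (const auto& bb : bbs) {
        merged = static_cast<std::uint16_t>(merged | bb.live_regs_bitmap);
    }
    for (auto& bb : bbs) {
        bb.live_regs_bitmap = merged;
    }
    return merged;
}
```

After this pass the Phase H branch takes the same path in every BB, so
the number of FPE_Decode calls no longer varies between blocks.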
diff --git a/runtime/src/vm_engine.cpp b/runtime/src/vm_engine.cpp
@@ -447,6 +447,20 @@ execute_one_instruction(VmExecution& exec, VmEpoch& epoch,
         }
 
         // ── Phase H: Re-encode all 16 registers (old key → new key) ────
+        //
+        // NOTE (Shannon branch): branchless Phase H was attempted but
+        // reverted: always-decode-all-16 added ~14 extra FPE_Decode calls
+        // per instruction, creating micro-architectural timing variance
+        // that worsened HighSecPolicy ANOVA from p=0.015 to p=2.2e-25.
+        //
+        // The live_regs_bitmap branch remains.  It leaks the number of
+        // live registers per BB (~150 ns per extra FPE_Decode), visible
+        // as a ~300 ns bimodal pattern in StandardPolicy.  For HighSecPolicy
+        // (N=4), the crypto pipeline noise masks this signal adequately.
+        //
+        // Future fix: normalize live_regs_bitmap at blob creation time
+        // (serializer/linker) so all BBs declare the same set of live
+        // registers, eliminating the timing signal without runtime cost.
         {
             SecureLocal<Speck64_RoundKeys> new_rk;
             Speck64_KeySchedule(next_key.val, new_rk.val);