HanSur94 · HanSur94 · Jun 24, 2026
diff --git a/benchmarks/.reports/coverage.md b/benchmarks/.reports/coverage.md
@@ -0,0 +1,133 @@
+# Benchmark coverage map
+
+Maintained by `/bench-evolve`. Each run adds **one** new benchmark targeting the
+most important performance-critical path that still lacks coverage, then records
+it here so the next run can see history and pick the next-biggest gap.
+
+## Covered paths
+
+| Path / hot spot | Benchmark(s) |
+|---|---|
+| FastSense vs `plot()`, point reduction | `benchmark.m` |
+| Feature overhead (theme/band/shaded/fill/marker/grid) | `benchmark_features.m` |
+| Memory: in-RAM vs disk-backed | `benchmark_memory.m` |
+| Zoom/pan per-frame latency | `benchmark_zoom.m` |
+| DataStore write / range query / disk render | `benchmark_datastore.m`, `profile_datastore.m` |
+| Dashboard create / render / live tick | `bench_dashboard.m`, `bench_dashboard_live.m`, `bench_dashboard_load.m` |
+| Legacy-vs-Tag consumer tick parity | `bench_consumer_migration_tick.m` |
+| 1000-tag live pipeline tick | `bench_tag_pipeline_1k.m` |
+| MonitorTag append vs recompute / tick | `bench_monitortag_append.m`, `bench_monitortag_tick.m` |
+| SensorTag.getXY zero-copy | `bench_sensortag_getxy.m` |
+| CompositeTag k-way merge | `bench_compositetag_merge.m` |
+| Event-marker rendering overhead | `bench_event_marker_regression.m` |
+| **Downsample kernels (LTTB + MinMax) throughput** | **`bench_downsample_kernels.m`** ← 2026-06-24 |
+| **Threshold violation detect + pixel cull (constant & step)** | **`bench_violation_cull.m`** ← 2026-06-24 |
+| **Binary-search (viewport-clip / bucket-boundary) per-call latency** | **`bench_binary_search.m`** ← 2026-06-24 |
+| **Multi-line live-update scaling (lines-per-axes, refresh Hz)** | **`bench_fastsense_multiline.m`** ← 2026-06-24 |
+| **Detached-mirror refresh overhead (headline constraint)** | **`bench_detached_mirror_refresh.m`** ← 2026-06-24 |
+| **DerivedTag resolve-chain recompute vs depth** | **`bench_derived_resolve_chain.m`** ← 2026-06-24 |
+
+## Run log
+
+- **2026-06-24** — Added `bench_downsample_kernels.m`. Gap closed: the core
+  `resolve → downsample → render` kernels (`lttb_downsample`, `minmax_downsample`)
+  had **no direct microbenchmark** — `benchmark.m` called MinMax once at a single
+  size and LTTB was never timed. New bench sweeps 10K→20M points, times both
+  kernels (MEX path when available), reports Mpts/s throughput, and checks the
+  O(N) coefficient for super-linear creep. First MATLAB R2025b (MEX) run:
+  LTTB 42→344 Mpts/s, MinMax 133→1100 Mpts/s, scaling linear (0.91x / 0.74x).
+- **2026-06-24** — Added `bench_violation_cull.m`. Gap closed: the fused
+  threshold violation-detection + pixel-culling kernel (`violation_cull`, MEX:
+  `violation_cull_mex`) — the threshold-marker counterpart of the downsamplers,
+  run on every render/zoom when a threshold is attached — had no direct bench.
+  New bench sweeps 10K→20M points across BOTH the constant and step/ZOH threshold
+  branches, reports Mpts/s + culled-marker counts, checks O(N) scaling. First
+  MATLAB R2025b (MEX) run: constant 118→504 Mpts/s, step 37→457 Mpts/s, markers
+  cap at the 1000-pixel budget by 5M, scaling linear (0.93x / 0.85x).
+- **2026-06-24** — Added `bench_binary_search.m`. Gap closed: the O(log N)
+  `binary_search` kernel (MEX: `binary_search_mex`) — the most ubiquitous hot-path
+  call (every viewport clip / zoom does left+right edge searches; downsamplers map
+  bucket boundaries through it) — had no direct bench. New bench times per-call
+  latency over 1e5 scattered queries across 10K→50M-element arrays, both 'left'
+  and 'right', and checks growth vs O(log N). First MATLAB R2025b (MEX) run:
+  ~918→1610 ns/call, 1M→50M growth 1.45x (vs pure-log 1.28x — log + cache), well
+  short of any O(N) degradation.
+- **2026-06-24** — Added `bench_fastsense_multiline.m` (FIRST composite-path
+  bench; kernel surface complete, see "Per-kernel MEX surface"). Sweeps line count
+  (1→64) on one FastSense axes at 100K pts/line, timing updateData() (the live
+  path, which re-downsamples *all* lines per call) with SkipViewMode. First
+  MATLAB R2025b run: updateData 1.8→26 ms, us/line falls 1775→406 (sublinear —
+  fixed per-call overhead amortizes), effective refresh 564 Hz (1 line) → 39 Hz
+  (64 lines). Setup render ~1.1 s (figure-realization dominated, informational).
+  Verified portable (Octave API smoke passed).
+- **2026-06-24** — Added `bench_detached_mirror_refresh.m` (the project's HEADLINE
+  constraint: "detached live-mirrored widgets must not degrade refresh rate").
+  Holds total widgets constant (8), detaches K=0/1/2/4 as mirrors, measures
+  amortized active onLiveTick(). **Key finding: detaching DOES add real refresh
+  overhead** — baseline ~18-20 ms (≈55 Hz), rising to +35% (1 mirror) … +120-200%
+  (4 mirrors). Mirrors tick in-line on the shared refresh path, so this is
+  expected; the bench baselines it so regressions (overhead GROWING) are caught.
+  Two methodology lessons baked in: (a) the path needs ~15-20 warmup ticks before
+  it settles — a short warmup made the first scenario eat all the JIT/cache noise
+  and falsely read "-61%"; (b) onLiveTick is BIMODAL under drawnow('limitrate'),
+  so the stable metric is the amortized average over a tick batch, not a per-tick
+  median. Verified MATLAB R2025b; **Octave-skipped** (see Notes).
+- **2026-06-24** — Added `bench_derived_resolve_chain.m`. Gap closed: DerivedTag
+  (lazy-memoized derived-tag resolve) had NO bench — `bench_compositetag_merge`
+  covers the CompositeTag merge only. Builds a sensor->T1->...->TD chain, sweeps
+  depth 1→32, and times (a) COLD getXY after invalidating the whole chain (full
+  recompute = live cost) vs (b) WARM getXY (memoized cache). First MATLAB R2025b
+  run (1e6 pts/node): cold 0.69→14.2 ms ~linear in depth, warm flat ~0.02 ms
+  (memo saves the whole chain walk). Two tuning notes: 1e5 pts was noise-bound
+  (per-node <0.1 ms) so bumped to 1e6; cold recompute allocates a fresh array
+  per node (bursty GC) so the estimator is MIN-over-reps, not median. Verified
+  portable (full Octave run passed: cold 30 ms / warm 0.02 ms at depth 32).
+
+## Per-kernel MEX surface: COMPLETE
+
+Every MEX kernel **actually wired into production** now has direct coverage:
+
+| Kernel | Production caller | Bench |
+|---|---|---|
+| `lttb_core_mex`, `minmax_core_mex` | `lttb_downsample`, `minmax_downsample` | `bench_downsample_kernels.m` |
+| `violation_cull_mex` | `violation_cull` | `bench_violation_cull.m` |
+| `binary_search_mex` | `binary_search` (public + private) | `bench_binary_search.m` |
+| `build_store_mex` | `FastSenseDataStore.m:672` | `benchmark_datastore.m` (write timing) |
+
+**Three kernels are test-only / unwired** — they ship and have parity tests but
+no production `.m` call site (verified by grepping `libs/**/*.m` for `<name>(`):
+`compute_violations_mex`, `to_step_function_mex`, `resolve_disk_mex`. They are
+**deliberately NOT benched** (a bench would not reflect any real hot path). This
+is worth flagging to maintainers: either wire them in or treat them as staged.
+Do **not** keep re-picking them in future runs.
+
+## Remaining gaps (ranked — composite paths)
+
+The loop has now pivoted from kernels to composite hot paths. Candidates:
+
+1. **Pyramid-cache rebuild cost** — `FastSenseDataStore` pyramid level rebuild on
+   live append; touched by `benchmark_datastore.m` but not isolated.
+   **Top next gap** — deterministic (disk), no figures.
+2. **Widget-count refresh sweep** — `bench_dashboard_live.m` fixes 8 widgets; a
+   widget-count sweep (no detach) would baseline the in-grid refresh scaling.
+3. **DerivedTag fan-OUT / CompositeTag fan-IN width** — `bench_derived_resolve_chain.m`
+   covers chain DEPTH; the complementary axis is one sensor feeding many derived
+   tags (fan-out) and many sensors merged by one CompositeTag (fan-in width).
+
+## Notes / corrections
+- `compute_violations` + `compute_violations_dynamic` + `downsample_violations`
+  are covered **indirectly** by `bench_violation_cull.m`: the fused `violation_cull`
+  kernel dispatches to them in its pure-MATLAB fallback, and the bench drives both
+  the constant (`compute_violations`) and step/ZOH (`compute_violations_dynamic`)
+  branches. `compute_violations.m` standalone is a trivial one-line vectorized mask
+  (no MEX dispatch) — not worth an isolated bench.
+- **Octave-compat finding (for maintainers; spun off as a separate task):** the
+  real culprit is `DashboardWidgetRegistry.fromStruct` at
+  `libs/Dashboard/DashboardWidgetRegistry.m:92` — `w = feval([className '.fromStruct'], s)`.
+  Octave does not resolve that dotted-static-method feval form ("function
+  'FastSenseWidget.fromStruct' not found"). This breaks BOTH serialized dashboard
+  load (`DashboardEngine.load`) and the detach feature (`detachWidget` →
+  `DetachedMirror.cloneWidget:178` → the registry call) under Octave, despite
+  Octave being "fully supported." `bench_detached_mirror_refresh.m` guards with an
+  Octave skip. Fix candidate: `fn = str2func([className '.fromStruct']); w = fn(s);`.
+  Not changed here (additive-only remit).
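The run log above settles on three different estimators depending on the noise shape: median over reps (one-off spikes), MIN over reps (bursty GC on fresh allocations), and an amortized batch average (bimodal per-tick timings). The sketch below shows the three side by side on a stand-in workload. It is a minimal illustration, not the benches' actual code: `work`, `nWarm`, and `nReps` are hypothetical values chosen for the example.

```matlab
% Hypothetical workload: allocates a fresh array on every call.
work  = @() sum(zeros(1, 1e5) + 1);
nWarm = 20;   % warmup ticks before timing (illustrative value)
nReps = 30;   % timed repetitions (illustrative value)

% Warm up so JIT / cache effects do not land in the first timed rep.
for w = 1:nWarm
    work();
end

% Per-rep timings: basis for the min and median estimators.
ts = zeros(1, nReps);
for r = 1:nReps
    t0 = tic;
    work();
    ts(r) = toc(t0);
end

% Amortized estimator: time the whole batch once, divide by the call count.
tBatch = tic;
for r = 1:nReps
    work();
end
amortized = toc(tBatch) / nReps;

fprintf('min %.3g s | median %.3g s | amortized %.3g s\n', ...
    min(ts), median(ts), amortized);
```

The min is the least sensitive to allocation or GC bursts, the median tolerates a few outliers, and the batch average is the only one of the three that stays meaningful when individual calls alternate between a fast and a slow mode.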
diff --git a/benchmarks/bench_binary_search.m b/benchmarks/bench_binary_search.m
@@ -0,0 +1,195 @@
+function result = bench_binary_search()
+%BENCH_BINARY_SEARCH Per-call latency microbenchmark for the binary-search kernel.
+%
+%   Times binary_search (MEX: binary_search_mex), the O(log N) lower/upper
+%   bound search over a sorted time array. It is the most ubiquitously called
+%   kernel on the hot path: every viewport clip and every zoom/pan locates
+%   the visible window by two binary searches (left edge + right edge) into
+%   the full sorted X array, and the downsamplers call it to map bucket
+%   boundaries to indices. The private copy's own header calls it "used
+%   extensively by the downsampling and viewport-clipping routines."
+%
+%   Unlike the downsample/violation kernels (O(N), tens of ms), a single
+%   binary search costs on the order of 1 us, so throughput-per-point is the wrong
+%   metric. This bench instead measures PER-CALL latency over a large batch
+%   of scattered queries, and reports how it scales with log2(N). Because
+%   the dominant cost is the MATLAB->MEX dispatch the caller actually pays
+%   (the bench times the binary_search wrapper, not binary_search_mex
+%   directly), the measured number is the real per-query cost on the hot
+%   path. No bench_*.m exercised it before.
+%
+%   What it measures (deterministic, no RNG):
+%     - Per-call latency (ns) for both 'left' and 'right' searches, median
+%       of nReps after warmup, over a size sweep (10K -> 50M element arrays).
+%     - Throughput in millions of calls/second.
+%     - Whether latency tracks O(log N) (each doubling of the array costs the
+%       search one extra comparison) rather than degrading to O(N).
+%
+%   Queries are a deterministic low-discrepancy scramble across [min, max]
+%   plus a 5% out-of-range margin (exercising the clamp branches), so the
+%   access pattern into the searched array is scattered — realistic for
+%   viewport clipping, not an artificially cache-friendly monotone sweep.
+%
+%   Latency bench, not a pass/fail gate: it PRINTS results and a soft
+%   scaling advisory. The next /bench-guard (or /perf-watch) run baselines
+%   the absolute numbers. The active path (compiled MEX vs pure-MATLAB
+%   fallback) is detected and labelled — the MEX speedup is what it protects.
+%
+%   Run:
+%     octave --no-gui --eval "install(); bench_binary_search();"
+%     % or in MATLAB:
+%     bench_binary_search
+%
+%   Returns a struct (sizes, per-direction ns/call + Mcalls/s, MEX flag,
+%   scaling drift) for programmatic baselining.
+%
+%   See also binary_search, binary_search_mex, lttb_downsample,
+%   minmax_downsample, bench_downsample_kernels.
+
+    here = fileparts(mfilename('fullpath'));
+    addpath(fullfile(here, '..'));
+    install();
+
+    % binary_search has a public copy on the path, but binary_search_mex is a
+    % PRIVATE helper. We cd into the private folder for the bench: the private
+    % binary_search.m is callable from cwd and sees the MEX (which lives there
+    % too), so this exercises the real MEX path AND lets us detect it reliably
+    % via exist(). onCleanup restores cwd even on error. Portable MATLAB/Octave.
+    privDir = fullfile(here, '..', 'libs', 'FastSense', 'private');
+    origDir = pwd;
+    restoreCwd = onCleanup(@() cd(origDir)); %#ok<NASGU>
+    cd(privDir);
+
+    % ---- Configuration ----
+    sizes  = [1e4, 1e5, 1e6, 1e7, 5e7];
+    labels = {'10K', '100K', '1M', '10M', '50M'};
+
+    nQueries = 1e5;      % calls per timed run (per direction)
+    nWarm    = 2;        % warmup runs absorb JIT / first-call MEX load cost
+    nReps    = 5;        % median over nReps suppresses one-off spikes
+    scramble = 31337;    % prime, coprime to nQueries -> low-discrepancy order
+
+    bsMex = (exist('binary_search_mex', 'file') == 3);
+
+    fprintf('\n================================================================\n');
+    fprintf('  FastSense Binary-Search Kernel Latency Benchmark\n');
+    fprintf('  O(log N) viewport-clip / bucket-boundary search (per-call cost)\n');
+    fprintf('================================================================\n');
+    fprintf('  queries/run = %d   path: %s   warmup = %d   reps = %d (median)\n', ...
+        nQueries, pathLabel_(bsMex), nWarm, nReps);
+    fprintf('  %s\n', repmat('-', 1, 72));
+    fprintf('  %-6s | %7s | %-21s | %-21s\n', 'N', 'log2 N', 'left search', 'right search');
+    fprintf('  %-6s | %7s | %11s %9s | %11s %9s\n', ...
+        '', '(cmp)', 'ns/call', 'Mcall/s', 'ns/call', 'Mcall/s');
+    fprintf('  %s\n', repmat('-', 1, 72));
+
+    nS = numel(sizes);
+    nsL = zeros(1, nS); tputL = zeros(1, nS);
+    nsR = zeros(1, nS); tputR = zeros(1, nS);
+
+    % Deterministic scrambled query order (computed once; same length each size)
+    perm = mod((0:nQueries - 1) * scramble, nQueries) + 1;
+
+    for c = 1:nS
+        n = sizes(c);
+
+        % Sorted ascending array to search. Only X is needed.
+        x = linspace(0, n / 100, n);
+        xmin = x(1);
+        xmax = x(end);
+        span = xmax - xmin;
+
+        % Query values span the range + 5% margin on each side (clamp paths),
+        % visited in a scattered, deterministic order (no RNG).
+        valsSorted = linspace(xmin - 0.05 * span, xmax + 0.05 * span, nQueries);
+        vals = valsSorted(perm);
+
+        nsL(c)   = perCallNs_(@() runQueries_(x, vals, 'left'),  nWarm, nReps, nQueries);
+        tputL(c) = 1e3 / nsL(c);   % Mcalls/s = (1e9 / ns) / 1e6 = 1e3 / ns
+        nsR(c)   = perCallNs_(@() runQueries_(x, vals, 'right'), nWarm, nReps, nQueries);
+        tputR(c) = 1e3 / nsR(c);
+
+        fprintf('  %-6s | %7.1f | %11.1f %9.1f | %11.1f %9.1f\n', ...
+            labels{c}, log2(n), nsL(c), tputL(c), nsR(c), tputR(c));
+    end
+
+    fprintf('  %s\n', repmat('-', 1, 72));
+
+    % ---- Soft scaling advisory ----
+    % Pure O(log N): latency from 1M to 50M should rise by at most
+    % log2(50M)/log2(1M) ~= 1.3x (or stay flat if dispatch-bound). A >3x rise
+    % means the search is degrading worse than logarithmic. Advisory only.
+    refIdx = 3;  % 1M
+    lDrift = nsL(end) / nsL(refIdx);
+    rDrift = nsR(end) / nsR(refIdx);
+    logRatio = log2(sizes(end)) / log2(sizes(refIdx));
+
+    fprintf('  Scaling (ns/call, %s -> %s; pure O(log N) ~= %.2fx):\n', ...
+        labels{refIdx}, labels{end}, logRatio);
+    fprintf('    left   : %.1f -> %.1f  (%.2fx)  %s\n', ...
+        nsL(refIdx), nsL(end), lDrift, logDriftLabel_(lDrift));
+    fprintf('    right  : %.1f -> %.1f  (%.2fx)  %s\n', ...
+        nsR(refIdx), nsR(end), rDrift, logDriftLabel_(rDrift));
+    fprintf('  %s\n', repmat('-', 1, 72));
+    fprintf('  Note: latency bench (no time gate). /bench-guard baselines these numbers.\n\n');
+
+    result = struct( ...
+        'sizes',       sizes, ...
+        'labels',      {labels}, ...
+        'nQueries',    nQueries, ...
+        'binarySearchMex', bsMex, ...
+        'leftNsPerCall',  nsL, ...
+        'leftMcallsPerS', tputL, ...
+        'rightNsPerCall', nsR, ...
+        'rightMcallsPerS', tputR, ...
+        'leftDrift',   lDrift, ...
+        'rightDrift',  rDrift, ...
+        'logRatio',    logRatio);
+end
+
+function acc = runQueries_(x, vals, direction)
+    %RUNQUERIES_ Issue numel(vals) binary searches; accumulate idx as a sink
+    %   so the loop body cannot be optimized away.
+    acc = 0;
+    for q = 1:numel(vals)
+        acc = acc + binary_search(x, vals(q), direction);
+    end
+end
+
+function ns = perCallNs_(fn, nWarm, nReps, nCalls)
+    %PERCALLNS_ Median per-call latency (ns) of fn (which issues nCalls calls).
+    for w = 1:nWarm
+        fn();
+    end
+    ts = zeros(1, nReps);
+    for r = 1:nReps
+        tic;
+        fn();
+        ts(r) = toc;
+    end
+    ns = (median(ts) / nCalls) * 1e9;
+end
+
+function s = pathLabel_(useMex)
+    %PATHLABEL_ Human label for the active kernel implementation.
+    if useMex
+        s = 'MEX (compiled)';
+    else
+        s = 'pure-MATLAB fallback';
+    end
+end
+
+function s = logDriftLabel_(drift)
+    %LOGDRIFTLABEL_ Soft verdict on latency growth vs O(log N).
+    %   Reference: under pure O(log N), growing the array 50x (1M -> 50M)
+    %   costs only ~1.3x. Mild excess above that is cache-miss cost on the
+    %   larger array, still logarithmic in comparison count; a large multiple
+    %   means the search has degraded toward O(N).
+    if drift > 3.0
+        s = '<< WATCH: growth exceeds O(log N)';
+    elseif drift > 1.35
+        s = '(grows ~O(log N) + cache)';
+    else
+        s = '(within O(log N); flat if dispatch-bound)';
+    end
+end
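Two constants in the bench above carry reasoning that is easy to re-check in isolation: the `scramble` multiplier must be coprime to `nQueries` for `perm` to visit every query exactly once, and the "pure O(log N)" reference ratio is just `log2(50M) / log2(1M)`. The standalone sketch below reproduces both checks using the same values as the bench; it needs neither the FastSense libraries nor the MEX build, and the variable `isPermutation` is introduced here only for illustration.

```matlab
% Same constants as bench_binary_search.m.
nQueries = 1e5;
scramble = 31337;

% Multiplying 0..n-1 by a unit modulo n permutes the residues, so perm
% is a bijection on 1..nQueries exactly when gcd(scramble, nQueries) == 1.
perm = mod((0:nQueries - 1) * scramble, nQueries) + 1;
isPermutation = isequal(sort(perm), 1:nQueries);

% Expected latency growth from 1M to 50M elements if cost is pure O(log N).
logRatio = log2(5e7) / log2(1e6);

fprintf('gcd = %d, permutation = %d, log ratio = %.2fx\n', ...
    gcd(scramble, nQueries), double(isPermutation), logRatio);
```

If `nQueries` is ever changed to a value that shares a factor with `scramble`, `perm` would repeat some indices and skip others, so part of the query range would go unsampled; a check like this one would catch that before any timing is taken.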