diff --git a/benchmarks/.reports/coverage.md b/benchmarks/.reports/coverage.md
new file mode 100644
index 00000000..fc24fe84
--- /dev/null
+++ b/benchmarks/.reports/coverage.md
@@ -0,0 +1,133 @@
+# Benchmark coverage map
+
+Maintained by `/bench-evolve`. Each run adds **one** new benchmark targeting the
+most important performance-critical path that still lacks coverage, then records
+it here so the next run can see history and pick the next-biggest gap.
+
+## Covered paths
+
+| Path / hot spot | Benchmark(s) |
+|---|---|
+| FastSense vs `plot()`, point reduction | `benchmark.m` |
+| Feature overhead (theme/band/shaded/fill/marker/grid) | `benchmark_features.m` |
+| Memory: in-RAM vs disk-backed | `benchmark_memory.m` |
+| Zoom/pan per-frame latency | `benchmark_zoom.m` |
+| DataStore write / range query / disk render | `benchmark_datastore.m`, `profile_datastore.m` |
+| Dashboard create / render / live tick | `bench_dashboard.m`, `bench_dashboard_live.m`, `bench_dashboard_load.m` |
+| Legacy-vs-Tag consumer tick parity | `bench_consumer_migration_tick.m` |
+| 1000-tag live pipeline tick | `bench_tag_pipeline_1k.m` |
+| MonitorTag append vs recompute / tick | `bench_monitortag_append.m`, `bench_monitortag_tick.m` |
+| SensorTag.getXY zero-copy | `bench_sensortag_getxy.m` |
+| CompositeTag k-way merge | `bench_compositetag_merge.m` |
+| Event-marker rendering overhead | `bench_event_marker_regression.m` |
+| **Downsample kernels (LTTB + MinMax) throughput** | **`bench_downsample_kernels.m`** ← 2026-06-24 |
+| **Threshold violation detect + pixel cull (constant & step)** | **`bench_violation_cull.m`** ← 2026-06-24 |
+| **Binary-search (viewport-clip / bucket-boundary) per-call latency** | **`bench_binary_search.m`** ← 2026-06-24 |
+| **Multi-line live-update scaling (lines-per-axes, refresh Hz)** | **`bench_fastsense_multiline.m`** ← 2026-06-24 |
+| **Detached-mirror refresh overhead (headline constraint)** | **`bench_detached_mirror_refresh.m`** ← 2026-06-24 |
+| **DerivedTag resolve-chain recompute vs depth** | **`bench_derived_resolve_chain.m`** ← 2026-06-24 |
+
+## Run log
+
+- **2026-06-24** — Added `bench_downsample_kernels.m`. Gap closed: the core
+  `resolve → downsample → render` kernels (`lttb_downsample`, `minmax_downsample`)
+  had **no direct microbenchmark** — `benchmark.m` called MinMax once at a single
+  size and LTTB was never timed. New bench sweeps 10K→20M points, times both
+  kernels (MEX path when available), reports Mpts/s throughput, and checks the
+  O(N) coefficient for super-linear creep. First MATLAB R2025b (MEX) run:
+  LTTB 42→344 Mpts/s, MinMax 133→1100 Mpts/s, scaling linear (0.91x / 0.74x).
+- **2026-06-24** — Added `bench_violation_cull.m`. Gap closed: the fused
+  threshold violation-detection + pixel-culling kernel (`violation_cull`, MEX:
+  `violation_cull_mex`) — the threshold-marker counterpart of the downsamplers,
+  run on every render/zoom when a threshold is attached — had no direct bench.
+  New bench sweeps 10K→20M points across BOTH the constant and step/ZOH threshold
+  branches, reports Mpts/s + culled-marker counts, checks O(N) scaling. First
+  MATLAB R2025b (MEX) run: constant 118→504 Mpts/s, step 37→457 Mpts/s, markers
+  cap at the 1000-pixel budget by 5M, scaling linear (0.93x / 0.85x).
+- **2026-06-24** — Added `bench_binary_search.m`. Gap closed: the O(log N)
+  `binary_search` kernel (MEX: `binary_search_mex`) — the most ubiquitous hot-path
+  call (every viewport clip / zoom does left+right edge searches; downsamplers map
+  bucket boundaries through it) — had no direct bench. New bench times per-call
+  latency over 1e5 scattered queries across 10K→50M-element arrays, both 'left'
+  and 'right', and checks growth vs O(log N). First MATLAB R2025b (MEX) run:
+  ~918→1610 ns/call, 1M→50M growth 1.45x (vs pure-log 1.28x — log + cache), well
+  short of any O(N) degradation.
+- **2026-06-24** — Added `bench_fastsense_multiline.m` (FIRST composite-path
+  bench; per-kernel surface now complete — see Notes). Sweeps line count
+  (1→64) on one FastSense axes at 100K pts/line, timing updateData() (the live
+  path, which re-downsamples *all* lines per call) with SkipViewMode. First
+  MATLAB R2025b run: updateData 1.8→26 ms, us/line falls 1775→406 (sublinear —
+  fixed per-call overhead amortizes), effective refresh 564 Hz (1 line) → 39 Hz
+  (64 lines). Setup render ~1.1 s (figure-realization dominated, informational).
+  Verified portable (Octave API smoke passed).
+- **2026-06-24** — Added `bench_detached_mirror_refresh.m` (the project's HEADLINE
+  constraint: "detached live-mirrored widgets must not degrade refresh rate").
+  Holds total widgets constant (8), detaches K=0/1/2/4 as mirrors, measures
+  amortized active onLiveTick(). **Key finding: detaching DOES add real refresh
+  overhead**: baseline ~18-20 ms (≈50-55 Hz), rising to +35% (1 mirror) … +120-200%
+  (4 mirrors). Mirrors tick in-line on the shared refresh path, so this is
+  expected; the bench baselines it so regressions (overhead GROWING) are caught.
+  Two methodology lessons baked in: (a) the path needs ~15-20 warmup ticks before
+  it settles — a short warmup made the first scenario eat all the JIT/cache noise
+  and falsely read "-61%"; (b) onLiveTick is BIMODAL under drawnow('limitrate'),
+  so the stable metric is the amortized average over a tick batch, not a per-tick
+  median. Verified MATLAB R2025b; **Octave-skipped** (see Notes).
+- **2026-06-24** — Added `bench_derived_resolve_chain.m`. Gap closed: DerivedTag
+  (lazy-memoized derived-tag resolve) had NO bench — `bench_compositetag_merge`
+  covers the CompositeTag merge only. Builds a sensor->T1->...->TD chain, sweeps
+  depth 1→32, and times (a) COLD getXY after invalidating the whole chain (full
+  recompute = live cost) vs (b) WARM getXY (memoized cache). First MATLAB R2025b
+  run (1e6 pts/node): cold 0.69→14.2 ms ~linear in depth, warm flat ~0.02 ms
+  (memo saves the whole chain walk). Two tuning notes: 1e5 pts was noise-bound
+  (per-node <0.1 ms) so bumped to 1e6; cold recompute allocates a fresh array
+  per node (bursty GC) so the estimator is MIN-over-reps, not median. Verified
+  portable (full Octave run passed: cold 30 ms / warm 0.02 ms at depth 32).
+
+## Per-kernel MEX surface: COMPLETE
+
+Every MEX kernel **actually wired into production** now has direct coverage:
+
+| Kernel | Production caller | Bench |
+|---|---|---|
+| `lttb_core_mex`, `minmax_core_mex` | `lttb_downsample`, `minmax_downsample` | `bench_downsample_kernels.m` |
+| `violation_cull_mex` | `violation_cull` | `bench_violation_cull.m` |
+| `binary_search_mex` | `binary_search` (public + private) | `bench_binary_search.m` |
+| `build_store_mex` | `FastSenseDataStore.m:672` | `benchmark_datastore.m` (write timing) |
+
+**Three kernels are test-only / unwired** — they ship and have parity tests but
+no production `.m` call site (verified by grepping `libs/**/*.m` for `<name>(`):
+`compute_violations_mex`, `to_step_function_mex`, `resolve_disk_mex`. They are
+**deliberately NOT benched** (a bench would not reflect any real hot path). This
+is worth flagging to maintainers: either wire them in or treat them as staged.
+Do **not** keep re-picking them in future runs.
+
+## Remaining gaps (ranked — composite paths)
+
+The loop has now pivoted from kernels to composite hot paths. Candidates:
+
+1. **Pyramid-cache rebuild cost** — `FastSenseDataStore` pyramid level rebuild on
+   live append; touched by `benchmark_datastore.m` but not isolated.
+   **Top next gap** — deterministic (disk), no figures.
+2. **Widget-count refresh sweep** — `bench_dashboard_live.m` fixes 8 widgets; a
+   widget-count sweep (no detach) would baseline the in-grid refresh scaling.
+3. **DerivedTag fan-OUT / CompositeTag fan-IN width** — `bench_derived_resolve_chain.m`
+   covers chain DEPTH; the complementary axis is one sensor feeding many derived
+   tags (fan-out) and many sensors merged by one CompositeTag (fan-in width).
+
+## Notes / corrections
+- `compute_violations` + `compute_violations_dynamic` + `downsample_violations`
+  are covered **indirectly** by `bench_violation_cull.m`: the fused `violation_cull`
+  kernel dispatches to them in its pure-MATLAB fallback, and the bench drives both
+  the constant (`compute_violations`) and step/ZOH (`compute_violations_dynamic`)
+  branches. `compute_violations.m` standalone is a trivial one-line vectorized mask
+  (no MEX dispatch) — not worth an isolated bench.
+- **Octave-compat finding (for maintainers; spun off as a separate task):** the
+  real culprit is `DashboardWidgetRegistry.fromStruct` at
+  `libs/Dashboard/DashboardWidgetRegistry.m:92` — `w = feval([className '.fromStruct'], s)`.
+  Octave does not resolve that dotted-static-method feval form ("function
+  'FastSenseWidget.fromStruct' not found"). This breaks BOTH serialized dashboard
+  load (`DashboardEngine.load`) and the detach feature (`detachWidget` →
+  `DetachedMirror.cloneWidget:178` → the registry call) under Octave, despite
+  Octave being "fully supported." `bench_detached_mirror_refresh.m` guards with an
+  Octave skip. Fix candidate: `fn = str2func([className '.fromStruct']); w = fn(s);`.
+  Not changed here (additive-only remit).
diff --git a/benchmarks/bench_binary_search.m b/benchmarks/bench_binary_search.m
new file mode 100644
index 00000000..87849a22
--- /dev/null
+++ b/benchmarks/bench_binary_search.m
@@ -0,0 +1,195 @@
+function result = bench_binary_search()
+%BENCH_BINARY_SEARCH Per-call latency microbenchmark for the binary-search kernel.
+%
+%   Times binary_search (MEX: binary_search_mex), the O(log N) lower/upper
+%   bound search over a sorted time array. It is the most ubiquitously called
+%   kernel on the hot path: every viewport clip and every zoom/pan locates
+%   the visible window by two binary searches (left edge + right edge) into
+%   the full sorted X array, and the downsamplers call it to map bucket
+%   boundaries to indices. The private copy's own header calls it "used
+%   extensively by the downsampling and viewport-clipping routines."
+%
+%   Unlike the downsample/violation kernels (O(N), tens of ms), a single
+%   binary search takes ~1 us, so throughput-per-point is the wrong
+%   metric. This bench instead measures PER-CALL latency over a large batch
+%   of scattered queries, and reports how it scales with log2(N). Because
+%   the dominant cost is the MATLAB->MEX dispatch the caller actually pays
+%   (the bench times the binary_search wrapper, not binary_search_mex
+%   directly), the measured number is the real per-query cost on the hot
+%   path. No bench_*.m exercised it before.
+%
+%   What it measures (deterministic, no RNG):
+%     - Per-call latency (ns) for both 'left' and 'right' searches, median
+%       of nReps after warmup, over a size sweep (10K -> 50M element arrays).
+%     - Throughput in millions of calls/second.
+%     - Whether latency tracks O(log N) (each doubling of the array costs the
+%       search one extra comparison) rather than degrading to O(N).
+%
+%   Queries are a deterministic low-discrepancy scramble across [min, max]
+%   plus a 5% out-of-range margin (exercising the clamp branches), so the
+%   access pattern into the searched array is scattered — realistic for
+%   viewport clipping, not an artificially cache-friendly monotone sweep.
+%
+%   Latency bench, not a pass/fail gate: it PRINTS results and a soft
+%   scaling advisory. The next /bench-guard (or /perf-watch) run baselines
+%   the absolute numbers. The active path (compiled MEX vs pure-MATLAB
+%   fallback) is detected and labelled — the MEX speedup is what it protects.
+%
+%   Run:
+%     octave --no-gui --eval "install(); bench_binary_search();"
+%     % or in MATLAB:
+%     bench_binary_search
+%
+%   Returns a struct (sizes, per-direction ns/call + Mcalls/s, MEX flag,
+%   scaling drift) for programmatic baselining.
+%
+%   See also binary_search, binary_search_mex, lttb_downsample,
+%   minmax_downsample, bench_downsample_kernels.
+
+    here = fileparts(mfilename('fullpath'));
+    addpath(fullfile(here, '..'));
+    install();
+
+    % binary_search has a public copy on the path, but binary_search_mex is a
+    % PRIVATE helper. We cd into the private folder for the bench: the private
+    % binary_search.m is callable from cwd and sees the MEX (which lives there
+    % too), so this exercises the real MEX path AND lets us detect it reliably
+    % via exist(). onCleanup restores cwd even on error. Portable MATLAB/Octave.
+    privDir = fullfile(here, '..', 'libs', 'FastSense', 'private');
+    origDir = pwd;
+    restoreCwd = onCleanup(@() cd(origDir)); %#ok<NASGU>
+    cd(privDir);
+
+    % ---- Configuration ----
+    sizes  = [1e4, 1e5, 1e6, 1e7, 5e7];
+    labels = {'10K', '100K', '1M', '10M', '50M'};
+
+    nQueries = 1e5;      % calls per timed run (per direction)
+    nWarm    = 2;        % warmup runs dissolve JIT / first-call MEX load
+    nReps    = 5;        % median over nReps defuses one-off spikes
+    scramble = 31337;    % prime, coprime to nQueries -> low-discrepancy order
+
+    bsMex = (exist('binary_search_mex', 'file') == 3);
+
+    fprintf('\n================================================================\n');
+    fprintf('  FastSense Binary-Search Kernel Latency Benchmark\n');
+    fprintf('  O(log N) viewport-clip / bucket-boundary search (per-call cost)\n');
+    fprintf('================================================================\n');
+    fprintf('  queries/run = %d   path: %s   warmup = %d   reps = %d (median)\n', ...
+        nQueries, pathLabel_(bsMex), nWarm, nReps);
+    fprintf('  %s\n', repmat('-', 1, 72));
+    fprintf('  %-6s | %7s | %-21s | %-21s\n', 'N', 'log2 N', 'left search', 'right search');
+    fprintf('  %-6s | %7s | %11s %9s | %11s %9s\n', ...
+        '', '(cmp)', 'ns/call', 'Mcall/s', 'ns/call', 'Mcall/s');
+    fprintf('  %s\n', repmat('-', 1, 72));
+
+    nS = numel(sizes);
+    nsL = zeros(1, nS); tputL = zeros(1, nS);
+    nsR = zeros(1, nS); tputR = zeros(1, nS);
+
+    % Deterministic scrambled query order (computed once; same length each size)
+    perm = mod((0:nQueries - 1) * scramble, nQueries) + 1;
+
+    for c = 1:nS
+        n = sizes(c);
+
+        % Sorted ascending array to search. Only X is needed.
+        x = linspace(0, n / 100, n);
+        xmin = x(1);
+        xmax = x(end);
+        span = xmax - xmin;
+
+        % Query values span the range + 5% margin on each side (clamp paths),
+        % visited in a scattered, deterministic order (no RNG).
+        valsSorted = linspace(xmin - 0.05 * span, xmax + 0.05 * span, nQueries);
+        vals = valsSorted(perm);
+
+        nsL(c)   = perCallNs_(@() runQueries_(x, vals, 'left'),  nWarm, nReps, nQueries);
+        tputL(c) = 1e3 / nsL(c);   % Mcalls/s = (1e9 / ns per call) / 1e6
+        nsR(c)   = perCallNs_(@() runQueries_(x, vals, 'right'), nWarm, nReps, nQueries);
+        tputR(c) = 1e3 / nsR(c);
+
+        fprintf('  %-6s | %7.1f | %11.1f %9.1f | %11.1f %9.1f\n', ...
+            labels{c}, log2(n), nsL(c), tputL(c), nsR(c), tputR(c));
+    end
+
+    fprintf('  %s\n', repmat('-', 1, 72));
+
+    % ---- Soft scaling advisory ----
+    % Pure O(log N): latency from 1M to 50M should rise by at most
+    % log2(50M)/log2(1M) ~= 1.3x (or stay flat if dispatch-bound). A >3x rise
+    % means the search is degrading worse than logarithmic. Advisory only.
+    refIdx = 3;  % 1M
+    lDrift = nsL(end) / nsL(refIdx);
+    rDrift = nsR(end) / nsR(refIdx);
+    logRatio = log2(sizes(end)) / log2(sizes(refIdx));
+
+    fprintf('  Scaling (ns/call, %s -> %s; pure O(log N) ~= %.2fx):\n', ...
+        labels{refIdx}, labels{end}, logRatio);
+    fprintf('    left   : %.1f -> %.1f  (%.2fx)  %s\n', ...
+        nsL(refIdx), nsL(end), lDrift, logDriftLabel_(lDrift));
+    fprintf('    right  : %.1f -> %.1f  (%.2fx)  %s\n', ...
+        nsR(refIdx), nsR(end), rDrift, logDriftLabel_(rDrift));
+    fprintf('  %s\n', repmat('-', 1, 72));
+    fprintf('  Note: latency bench (no time gate). /bench-guard baselines these numbers.\n\n');
+
+    result = struct( ...
+        'sizes',       sizes, ...
+        'labels',      {labels}, ...
+        'nQueries',    nQueries, ...
+        'binarySearchMex', bsMex, ...
+        'leftNsPerCall',  nsL, ...
+        'leftMcallsPerS', tputL, ...
+        'rightNsPerCall', nsR, ...
+        'rightMcallsPerS', tputR, ...
+        'leftDrift',   lDrift, ...
+        'rightDrift',  rDrift, ...
+        'logRatio',    logRatio);
+end
+
+function acc = runQueries_(x, vals, direction)
+    %RUNQUERIES_ Issue numel(vals) binary searches; accumulate idx as a sink
+    %   so the loop body cannot be optimized away.
+    acc = 0;
+    for q = 1:numel(vals)
+        acc = acc + binary_search(x, vals(q), direction);
+    end
+end
+
+function ns = perCallNs_(fn, nWarm, nReps, nCalls)
+    %PERCALLNS_ Median per-call latency (ns) of fn (which issues nCalls calls).
+    for w = 1:nWarm
+        fn();
+    end
+    ts = zeros(1, nReps);
+    for r = 1:nReps
+        tic;
+        fn();
+        ts(r) = toc;
+    end
+    ns = (median(ts) / nCalls) * 1e9;
+end
+
+function s = pathLabel_(useMex)
+    %PATHLABEL_ Human label for the active kernel implementation.
+    if useMex
+        s = 'MEX (compiled)';
+    else
+        s = 'pure-MATLAB fallback';
+    end
+end
+
+function s = logDriftLabel_(drift)
+    %LOGDRIFTLABEL_ Soft verdict on latency growth vs O(log N).
+    %   Reference: a 50x array growth costs pure O(log N) only ~1.3x. Mild
+    %   excess above that is cache-miss cost on the larger array, still
+    %   logarithmic in comparison count; a large multiple means the search
+    %   has degraded toward O(N).
+    if drift > 3.0
+        s = '<< WATCH: growth exceeds O(log N)';
+    elseif drift > 1.35
+        s = '(grows ~O(log N) + cache)';
+    else
+        s = '(within O(log N); flat if dispatch-bound)';
+    end
+end
diff --git a/benchmarks/bench_derived_resolve_chain.m b/benchmarks/bench_derived_resolve_chain.m
new file mode 100644
index 00000000..f280fb99
--- /dev/null
+++ b/benchmarks/bench_derived_resolve_chain.m
@@ -0,0 +1,173 @@
+function result = bench_derived_resolve_chain()
+%BENCH_DERIVED_RESOLVE_CHAIN Recompute cost of a DerivedTag dependency chain.
+%
+%   Measures the chain-depth resolve cost of DerivedTag, the lazy-memoized
+%   derived tag whose getXY() recomputes from its parents on
+%   demand. When a base sensor updates in live mode, the invalidation
+%   cascades down every derived tag that depends on it, and the next refresh
+%   recomputes the whole chain. That full-chain recompute is the live cost of
+%   derived tags, and no benchmark exercised it: bench_compositetag_merge.m
+%   covers the CompositeTag k-way merge specifically, bench_tag_pipeline_1k.m
+%   measures aggregate pipeline tickOnce throughput, and bench_dashboard_load.m
+%   times getXY on a populated dashboard — none isolate how resolve cost
+%   scales with dependency-chain DEPTH.
+%
+%   Topology: a base SensorTag T0 feeds a linear chain
+%       T0 (sensor) -> T1=f(T0) -> T2=f(T1) -> ... -> TD=f(T(D-1))
+%   where each f is a cheap O(N) elementwise transform. Resolving the leaf TD
+%   after invalidating the chain recomputes all D nodes top-to-bottom.
+%
+%   What it measures (deterministic, no RNG; no figures):
+%     - COLD getXY: invalidate every node, then time leaf getXY() — the full
+%       D-node recompute (the live-refresh cost).
+%     - WARM getXY: time leaf getXY() again with nothing dirty — returns the
+%       memoized cache (shows the lazy-memo payoff).
+%     - "us per node" for the cold path: flat => recompute is linear in chain
+%       depth; a rise would signal super-linear per-node overhead.
+%
+%   Throughput bench, not a pass/fail gate: it PRINTS results and a soft
+%   scaling advisory. The next /bench-guard (or /perf-watch) run baselines the
+%   numbers.
+%
+%   Run:
+%     octave --no-gui --eval "install(); bench_derived_resolve_chain();"
+%     % or in MATLAB:
+%     bench_derived_resolve_chain
+%
+%   Returns a struct (depths, cold/warm ms, us/node, scaling drift) for baselining.
+%
+%   See also DerivedTag, CompositeTag, bench_compositetag_merge,
+%   bench_tag_pipeline_1k, bench_dashboard_load.
+
+    here = fileparts(mfilename('fullpath'));
+    addpath(fullfile(here, '..'));
+    install();
+
+    % ---- Configuration ----
+    depths = [1, 2, 4, 8, 16, 32];
+    nPts   = 1e6;       % points in the base sensor (flows through every node).
+                        % Sized so each node's O(N) recompute dominates timer
+                        % noise — at 1e5 the per-node cost was sub-0.1 ms and
+                        % the sweep was noise-bound.
+    nWarm  = 2;
+    nReps  = 10;        % cold recompute allocates a fresh array per node, so
+                        % deep chains see bursty GC; MIN over reps is the
+                        % GC-robust compute-time estimator (standard for CPU
+                        % microbenchmarks) and keeps the sweep monotonic.
+
+    baseX = linspace(0, 1000, nPts);
+    baseY = sin(baseX / 7) + 0.2 * sin(baseX * 1.3);
+
+    fprintf('\n================================================================\n');
+    fprintf('  FastSense DerivedTag Resolve-Chain Benchmark\n');
+    fprintf('  Full-chain recompute cost vs dependency depth (live resolve)\n');
+    fprintf('================================================================\n');
+    fprintf('  points = %d (flows through every node)   warmup = %d   reps = %d (min)\n', ...
+        nPts, nWarm, nReps);
+    fprintf('  %s\n', repmat('-', 1, 72));
+    fprintf('  %-6s | %-9s | %-13s | %-10s | %-13s\n', ...
+        'depth', 'nodes', 'cold ms', 'us/node', 'warm (cached) ms');
+    fprintf('  %s\n', repmat('-', 1, 72));
+
+    nD = numel(depths);
+    coldMs   = zeros(1, nD);
+    warmMs   = zeros(1, nD);
+    usPerNode = zeros(1, nD);
+
+    for c = 1:nD
+        D = depths(c);
+
+        % Build the chain: sensor T0, then D derived tags each transforming
+        % its single parent. @chainStep_ is shared by every node.
+        t0 = SensorTag('chain-src', 'X', baseX, 'Y', baseY);
+        dt = cell(1, D);
+        prev = t0;
+        for i = 1:D
+            dt{i} = DerivedTag(sprintf('chain-d%d', i), {prev}, @chainStep_);
+            prev = dt{i};
+        end
+        leaf = dt{D};
+
+        % Warmup (populate caches, dissolve JIT).
+        for w = 1:nWarm
+            invalidateChain_(dt);
+            leaf.getXY();
+        end
+
+        % COLD: invalidate every node, then time the full-chain recompute.
+        cold = zeros(1, nReps);
+        for r = 1:nReps
+            invalidateChain_(dt);
+            tic;
+            [xo, ~] = leaf.getXY(); %#ok<ASGLU>
+            cold(r) = toc;
+        end
+        coldMs(c)   = min(cold) * 1000;   % GC-robust compute-time estimate
+        usPerNode(c) = (coldMs(c) * 1000) / D;
+
+        % WARM: nothing dirty -> returns memoized cache.
+        warm = zeros(1, nReps);
+        for r = 1:nReps
+            tic;
+            leaf.getXY();
+            warm(r) = toc;
+        end
+        warmMs(c) = min(warm) * 1000;
+
+        % Correctness touch: leaf length must match the source.
+        assert(numel(xo) == nPts, 'bench_derived_resolve_chain:badOutput', ...
+            'leaf getXY returned %d points, expected %d', numel(xo), nPts);
+
+        fprintf('  %-6d | %-9d | %13.3f | %10.1f | %13.4f\n', ...
+            D, D, coldMs(c), usPerNode(c), warmMs(c));
+    end
+
+    fprintf('  %s\n', repmat('-', 1, 72));
+
+    % ---- Soft scaling advisory ----
+    % Each node does O(N) work, so the cold recompute should be ~linear in
+    % depth => us/node flat. Compare us/node at the deepest chain against the
+    % shallowest; a large rise means super-linear per-node overhead.
+    perNodeDrift = usPerNode(end) / usPerNode(1);
+
+    fprintf('  Scaling (depth %d -> %d):\n', depths(1), depths(end));
+    fprintf('    cold us/node : %.1f -> %.1f  (%.2fx)  %s\n', ...
+        usPerNode(1), usPerNode(end), perNodeDrift, perNodeLabel_(perNodeDrift));
+    fprintf('    warm cache hit stays ~flat: %.4f -> %.4f ms\n', warmMs(1), warmMs(end));
+    fprintf('  %s\n', repmat('-', 1, 72));
+    fprintf('  Note: throughput bench (no time gate). /bench-guard baselines these numbers.\n\n');
+
+    result = struct( ...
+        'depths',       depths, ...
+        'nPts',         nPts, ...
+        'coldMs',       coldMs, ...
+        'warmMs',       warmMs, ...
+        'usPerNode',    usPerNode, ...
+        'perNodeDrift', perNodeDrift);
+end
+
+function [x, y] = chainStep_(parents)
+    %CHAINSTEP_ One derived node: pull the single parent's (X,Y) and apply a
+    %   cheap O(N) elementwise transform. Deterministic, no RNG.
+    [x, y] = parents{1}.getXY();
+    y = y * 0.999 + 0.001;
+end
+
+function invalidateChain_(dt)
+    %INVALIDATECHAIN_ Mark every node dirty so the next leaf getXY recomputes
+    %   the full chain (not just the leaf).
+    for i = 1:numel(dt)
+        dt{i}.invalidate();
+    end
+end
+
+function s = perNodeLabel_(drift)
+    %PERNODELABEL_ Soft verdict on per-node recompute growth.
+    if drift > 2.0
+        s = '<< WATCH: super-linear chain-depth cost';
+    elseif drift > 1.5
+        s = '(mild per-node rise)';
+    else
+        s = '(linear in depth — O(depth * N))';
+    end
+end
diff --git a/benchmarks/bench_detached_mirror_refresh.m b/benchmarks/bench_detached_mirror_refresh.m
new file mode 100644
index 00000000..79e8e2e1
--- /dev/null
+++ b/benchmarks/bench_detached_mirror_refresh.m
@@ -0,0 +1,217 @@
+function result = bench_detached_mirror_refresh()
+%BENCH_DETACHED_MIRROR_REFRESH Refresh-rate cost of detached live mirrors.
+%
+%   Directly exercises the project's headline performance constraint:
+%   "detached live-mirrored widgets must not degrade dashboard refresh rate."
+%   No committed benchmark covered it — bench_dashboard_live.m times
+%   onLiveTick() for a fixed all-in-grid dashboard, never with a detached
+%   mirror attached.
+%
+%   DashboardEngine.detachWidget() pops a widget into a standalone figure as
+%   a DetachedMirror, and onLiveTick() ticks every mirror in-line on the same
+%   refresh path (DashboardEngine.onLiveTick -> the DetachedMirrors loop ->
+%   DetachedMirror.tick). So a mirror's per-tick cost is paid by the live
+%   dashboard tick itself — this bench measures and baselines that cost.
+%
+%   Experiment design (isolates the mirror variable):
+%     - Fixed total widget count N. Each scenario detaches K of them
+%       (K = 0,1,2,4) and re-measures active refresh latency.
+%     - A detached widget leaves the grid but its mirror still ticks, so
+%       TOTAL widgets serviced per tick is constant; only how many are
+%       mirrored changes. Rising refresh time => mirrors add cost.
+%     - Per-tag data size is held CONSTANT every tick (fresh Y on a fixed X,
+%       no array growth) so data volume is not a confound.
+%
+%   Measurement method (learned from the data, not assumed):
+%     - The path needs a LONG warmup: per-tick cost decays over ~15-20 ticks
+%       (JIT + render-data caches). A large GLOBAL warmup precedes all
+%       measurement so scenario ordering cannot bias the result.
+%     - onLiveTick() time is BIMODAL because drawnow('limitrate') throttles
+%       figure flushes — some ticks flush (expensive), some are coalesced
+%       (cheap). A per-tick median is unstable on bimodal data, so this bench
+%       reports the AMORTIZED average over a fixed tick batch (total / nTicks),
+%       which is the reciprocal of the effective refresh rate and is stable. (Same
+%       amortization bench_dashboard_live.m uses.)
+%
+%   What it measures (deterministic, no RNG; headless):
+%     - amortized active refresh time (ms/tick) per mirror count.
+%     - effective refresh rate (Hz) and overhead vs the 0-mirror baseline.
+%
+%   Throughput bench, not a pass/fail gate: it PRINTS results and a soft
+%   advisory. The next /bench-guard (or /perf-watch) run baselines the
+%   numbers and — most importantly — flags if the per-mirror overhead GROWS
+%   over time. Complements the /refresh-budget watchdog with a committed,
+%   baseline-able file.
+%
+%   Run (MATLAB only — see note):
+%     bench_detached_mirror_refresh
+%
+%   Note: the DetachedMirror path is currently MATLAB-only. cloneWidget goes
+%   through DashboardWidgetRegistry.fromStruct, whose dotted-static-method
+%   call feval('FastSenseWidget.fromStruct', ...) is not resolved by Octave
+%   ("function 'FastSenseWidget.fromStruct' not found"). This bench detects
+%   Octave and skips cleanly so CI/watchdog sweeps do not crash. (The Octave
+%   gap is a library-side compat issue worth flagging to maintainers.)
+%
+%   Returns a struct (mirror counts, amortized ms, Hz, overhead %) for baselining.
+%
+%   See also DashboardEngine.detachWidget, DashboardEngine.onLiveTick,
+%   DetachedMirror, bench_dashboard_live, bench_fastsense_multiline.
+
+    here = fileparts(mfilename('fullpath'));
+    addpath(fullfile(here, '..'));
+    install();
+
+    % cloneWidget -> DashboardWidgetRegistry.fromStruct uses the dotted feval form
+    % feval('FastSenseWidget.fromStruct', ...), which Octave cannot resolve; skip.
+    if exist('OCTAVE_VERSION', 'builtin') ~= 0
+        fprintf('\n[bench_detached_mirror_refresh] SKIPPED on Octave: the DetachedMirror\n');
+        fprintf('  path requires MATLAB (feval ''FastSenseWidget.fromStruct'' unsupported).\n\n');
+        result = struct('skipped', true, 'reason', 'octave-detach-unsupported');
+        return;
+    end
+
+    % ---- Configuration ----
+    N_WIDGETS    = 8;
+    N_PTS        = 2e4;          % points per tag (held constant every tick)
+    N_GLOBAL_WARM = 20;          % global warmup — path settles over ~15-20 ticks
+    N_WARM       = 12;           % per-scenario settle (new mirror figures need
+                                 % several flushes before their cost stabilizes)
+    N_TICKS      = 30;           % amortized batch per scenario (total / N_TICKS)
+    detachCounts = [0, 1, 2, 4];
+
+    baseX = linspace(0, 1000, N_PTS);
+
+    % ---- Build N Tag-bound FastSense widgets (deterministic data) ----
+    tags = cell(1, N_WIDGETS);
+    for i = 1:N_WIDGETS
+        yi = sin(baseX / 7 + i) + 0.2 * sin(baseX * 1.3 + i);
+        tags{i} = SensorTag(sprintf('mir-tag-%d', i), 'X', baseX, 'Y', yi);
+    end
+
+    d = DashboardEngine('BenchMirror');
+    for i = 1:N_WIDGETS
+        col = mod(i - 1, 2) * 12 + 1;
+        row = ceil(i / 2);
+        d.addWidget('fastsense', ...
+            'Title', sprintf('Tag %d', i), ...
+            'Position', [col, row, 12, 2], ...
+            'Tag', tags{i});
+    end
+
+    % Render headless; mute warnings only around render (e.g. legend caps).
+    wsR = warning('off', 'all');
+    d.render();
+    warning(wsR);
+
+    widgets = d.activePageWidgets();   % capture handles in order (pre-detach)
+
+    % ---- Global warmup so scenario ordering cannot bias the baseline ----
+    phase = 0;
+    for w = 1:N_GLOBAL_WARM
+        phase = phase + 0.01;
+        doTick_(d, tags, baseX, phase);
+    end
+
+    fprintf('\n================================================================\n');
+    fprintf('  FastSense Detached-Mirror Refresh Benchmark\n');
+    fprintf('  Constraint: detaching a live mirror must NOT degrade refresh\n');
+    fprintf('================================================================\n');
+    fprintf('  widgets = %d   points/tag = %d (constant/tick)   amortized over %d ticks\n', ...
+        N_WIDGETS, N_PTS, N_TICKS);
+    fprintf('  global warmup = %d ticks   (onLiveTick is bimodal; amortized avg is the stable metric)\n', ...
+        N_GLOBAL_WARM);
+    fprintf('  %s\n', repmat('-', 1, 72));
+    fprintf('  %-8s | %-7s | %-7s | %-13s | %-9s | %-10s\n', ...
+        'mirrors', 'in-grid', 'total', 'refresh ms', 'refresh', 'vs base');
+    fprintf('  %s\n', repmat('-', 1, 72));
+
+    nC = numel(detachCounts);
+    tickMs    = zeros(1, nC);
+    refreshHz = zeros(1, nC);
+    overhead  = zeros(1, nC);
+    detachedSoFar = 0;
+
+    for c = 1:nC
+        target = detachCounts(c);
+
+        % Detach progressively up to the target count.
+        while detachedSoFar < target
+            detachedSoFar = detachedSoFar + 1;
+            wsD = warning('off', 'all');
+            d.detachWidget(widgets{detachedSoFar});
+            warning(wsD);
+        end
+
+        % Per-scenario settle (mirror figures need a first flush after detach).
+        for w = 1:N_WARM
+            phase = phase + 0.01;
+            doTick_(d, tags, baseX, phase);
+        end
+
+        % Amortized refresh cost: wall time of the whole batch / N_TICKS.
+        % Amortization absorbs the bimodal drawnow-limitrate flush pattern.
+        tBatch = tic;
+        for k = 1:N_TICKS
+            phase = phase + 0.01;
+            doTick_(d, tags, baseX, phase);
+        end
+        tickMs(c)    = toc(tBatch) * 1000 / N_TICKS;
+        refreshHz(c) = 1000 / tickMs(c);
+        overhead(c)  = (tickMs(c) / tickMs(1) - 1) * 100;
+
+        fprintf('  %-8d | %-7d | %-7d | %13.3f | %6.0fHz | %+9.1f%%\n', ...
+            target, N_WIDGETS - target, N_WIDGETS, tickMs(c), refreshHz(c), overhead(c));
+    end
+
+    fprintf('  %s\n', repmat('-', 1, 72));
+
+    maxOver = max(overhead);
+    fprintf('  Baseline (0 mirrors): %.3f ms (%.0f Hz)\n', tickMs(1), refreshHz(1));
+    fprintf('  Refresh overhead from mirrors (max): %+.1f%%  %s\n', ...
+        maxOver, constraintLabel_(maxOver));
+    fprintf('  %s\n', repmat('-', 1, 72));
+    fprintf('  Note: composite bench (no time gate). /bench-guard baselines these + watches for growth.\n\n');
+
+    % ---- Cleanup: close mirror figures + dashboard ----
+    try
+        for i = 1:numel(d.DetachedMirrors)
+            try, delete(d.DetachedMirrors{i}); catch, end
+        end
+        close(d.hFigure);
+    catch
+    end
+
+    result = struct( ...
+        'detachCounts', detachCounts, ...
+        'nWidgets',     N_WIDGETS, ...
+        'nPts',         N_PTS, ...
+        'tickMs',       tickMs, ...
+        'refreshHz',    refreshHz, ...
+        'overheadPct',  overhead, ...
+        'maxOverheadPct', maxOver);
+end
+
+function doTick_(d, tags, baseX, phase)
+    %DOTICK_ Replace every tag's data (fixed size, fresh Y) then run one
+    %   onLiveTick(). Constant per-tag size isolates mirror overhead from data
+    %   volume. The whole call is timed in batch by the caller (amortized).
+    for i = 1:numel(tags)
+        newY = sin(baseX / 7 + i + phase) + 0.2 * sin(baseX * 1.3 + i + phase);
+        tags{i}.updateData(baseX, newY);
+    end
+    d.onLiveTick();
+end
+
+function s = constraintLabel_(maxOverPct)
+    %CONSTRAINTLABEL_ Soft verdict on the mirror refresh overhead.
+    %   Mirrors necessarily add draw work to the shared tick, so some overhead
+    %   is expected; the point is to baseline it and catch regressions.
+    if maxOverPct > 150
+        s = '<< notable: mirrors add heavy refresh cost — baseline + watch';
+    elseif maxOverPct > 50
+        s = '(mirrors add meaningful refresh cost — expected; watch for growth)';
+    else
+        s = '(refresh largely preserved)';
+    end
+end
diff --git a/benchmarks/bench_downsample_kernels.m b/benchmarks/bench_downsample_kernels.m
new file mode 100644
index 00000000..23a2561a
--- /dev/null
+++ b/benchmarks/bench_downsample_kernels.m
@@ -0,0 +1,195 @@
+function result = bench_downsample_kernels()
+%BENCH_DOWNSAMPLE_KERNELS Throughput microbenchmark for the core downsamplers.
+%
+%   Isolates and times the two kernels at the heart of FastSense's
+%   resolve -> downsample -> render hot path:
+%
+%     lttb_downsample    (LTTB, MEX: lttb_core_mex)   — shape-preserving
+%     minmax_downsample  (MinMax, MEX: minmax_core_mex) — envelope-preserving
+%
+%   Both run on EVERY render and EVERY zoom/pan re-downsample, reducing the
+%   full N-point input (millions of samples) to a ~2000-point display
+%   budget. They touch more data per frame than any other operation on the
+%   live path, so their per-point throughput is the dominant cost of
+%   plotting large series. Yet until now no benchmark exercised them
+%   directly: benchmark.m called minmax_downsample exactly once at a single
+%   size (as a side-measurement of the FastSense-vs-plot() comparison) and
+%   LTTB was never timed at all. benchmark_zoom.m measures full per-frame
+%   latency, but that number is dominated by drawnow/getframe GPU flush and
+%   hides the kernel's contribution. This bench fills that gap.
+%
+%   What it measures (deterministic, no RNG):
+%     - Per-call wall time (median of nReps after warmup) for each kernel
+%       across a size sweep (10K -> 20M points).
+%     - Per-point throughput in millions of points/second (Mpts/s).
+%     - "ms per million points" — the O(N) cost coefficient. For a correct
+%       linear-time kernel this stays roughly FLAT as N grows; a creeping
+%       value is the early signature of an accidental super-linear (e.g.
+%       O(N log N) sort, repeated allocation) regression.
+%
+%   It is a throughput bench, not a pass/fail gate: it PRINTS results and a
+%   soft scaling advisory rather than asserting a machine-specific time
+%   budget. The next /bench-guard (or /perf-watch) run baselines the
+%   absolute numbers and flags regressions against that local baseline.
+%
+%   Each kernel's active path (compiled MEX vs pure-MATLAB fallback) is
+%   detected and labelled, since the MEX speedup is exactly what this bench
+%   protects.
+%
+%   Run:
+%     octave --no-gui --eval "install(); bench_downsample_kernels();"
+%     % or in MATLAB:
+%     bench_downsample_kernels
+%
+%   Returns a struct (sizes, per-kernel ms / throughput / output counts,
+%   MEX flags, scaling coefficients) for programmatic baselining.
+%
+%   See also lttb_downsample, minmax_downsample, benchmark_zoom, benchmark,
+%   bench_sensortag_getxy.
+
+    here = fileparts(mfilename('fullpath'));
+    addpath(fullfile(here, '..'));
+    install();
+
+    % lttb_downsample / minmax_downsample are PRIVATE helpers of FastSense.
+    % MATLAB refuses private directories on the path, so instead we cd into
+    % the private folder for the duration of the bench: current-folder
+    % functions are always callable, and the compiled MEX kernels
+    % (lttb_core_mex / minmax_core_mex) live there too, so this exercises the
+    % real MEX path. onCleanup restores the original cwd even on error. This
+    % is portable across MATLAB and Octave.
+    privDir = fullfile(here, '..', 'libs', 'FastSense', 'private');
+    origDir = pwd;
+    restoreCwd = onCleanup(@() cd(origDir)); %#ok<NASGU>
+    cd(privDir);
+
+    % ---- Configuration ----
+    sizes  = [1e4, 1e5, 1e6, 5e6, 2e7];
+    labels = {'10K', '100K', '1M', '5M', '20M'};
+
+    numOut     = 2000;   % LTTB output budget (representative display target)
+    numBuckets = 1000;   % MinMax buckets -> ~2000 output points (comparable)
+
+    nWarm = 2;           % warmup calls absorb JIT / first-call MEX load
+    nReps = 5;           % median over nReps damps one-off spikes
+
+    lttbMex   = (exist('lttb_core_mex',   'file') == 3);
+    minmaxMex = (exist('minmax_core_mex', 'file') == 3);
+
+    fprintf('\n================================================================\n');
+    fprintf('  FastSense Downsample Kernel Throughput Benchmark\n');
+    fprintf('  Core resolve->downsample->render hot path (per-frame cost)\n');
+    fprintf('================================================================\n');
+    fprintf('  LTTB    target numOut     = %d   path: %s\n', numOut, pathLabel_(lttbMex));
+    fprintf('  MinMax  target numBuckets = %d   path: %s\n', numBuckets, pathLabel_(minmaxMex));
+    fprintf('  warmup = %d   reps = %d (median)   signal: deterministic 3-tone\n', nWarm, nReps);
+    fprintf('  %s\n', repmat('-', 1, 76));
+    fprintf('  %-6s | %-28s | %-28s\n', 'N', 'LTTB', 'MinMax');
+    fprintf('  %-6s | %10s %9s %6s | %10s %9s %6s\n', ...
+        '', 'ms/call', 'Mpts/s', 'out', 'ms/call', 'Mpts/s', 'out');
+    fprintf('  %s\n', repmat('-', 1, 76));
+
+    nS = numel(sizes);
+    lttbMs   = zeros(1, nS); lttbTput = zeros(1, nS); lttbN = zeros(1, nS);
+    mmMs     = zeros(1, nS); mmTput   = zeros(1, nS); mmN   = zeros(1, nS);
+
+    for c = 1:nS
+        n = sizes(c);
+
+        % Deterministic ascending X (~100 Hz) and a 3-tone signal whose
+        % high-frequency component keeps per-bucket min/max non-degenerate.
+        % No RNG: identical workload on MATLAB and Octave, every run.
+        x = linspace(0, n / 100, n);
+        y = sin(x * 0.1) + 0.3 * sin(x * 1.7) + 0.2 * sin(x * 13.0);
+
+        % --- LTTB (linear mode -> MEX when available) ---
+        [lx, ~] = lttb_downsample(x, y, numOut);   %#ok<ASGLU> untimed call; only numel(lx) is recorded
+        tL = medianTime_(@() lttb_downsample(x, y, numOut), nWarm, nReps);
+        lttbMs(c)   = tL * 1000;
+        lttbTput(c) = (n / tL) / 1e6;
+        lttbN(c)    = numel(lx);
+
+        % --- MinMax (NaN-free fast path -> MEX when available) ---
+        [mx, ~] = minmax_downsample(x, y, numBuckets, false);  %#ok<ASGLU>
+        tM = medianTime_(@() minmax_downsample(x, y, numBuckets, false), nWarm, nReps);
+        mmMs(c)   = tM * 1000;
+        mmTput(c) = (n / tM) / 1e6;
+        mmN(c)    = numel(mx);
+
+        fprintf('  %-6s | %10.3f %9.1f %6d | %10.3f %9.1f %6d\n', ...
+            labels{c}, lttbMs(c), lttbTput(c), lttbN(c), ...
+            mmMs(c), mmTput(c), mmN(c));
+    end
+
+    fprintf('  %s\n', repmat('-', 1, 76));
+
+    % ---- Soft scaling advisory ----
+    % "ms per million points" is the O(N) coefficient. Compare the largest
+    % size against the 1M reference (index 3); flat => linear. A >2x rise is
+    % the early signature of super-linear creep — advisory only, not a gate.
+    refIdx = 3;  % 1M — past fixed-overhead, below memory-pressure regime
+    lttbCoef = lttbMs ./ (sizes / 1e6);
+    mmCoef   = mmMs   ./ (sizes / 1e6);
+    lttbDrift = lttbCoef(end) / lttbCoef(refIdx);
+    mmDrift   = mmCoef(end)   / mmCoef(refIdx);
+
+    fprintf('  Scaling (ms per 1M pts, %s -> %s):\n', labels{refIdx}, labels{end});
+    fprintf('    LTTB   : %.3f -> %.3f  (%.2fx)  %s\n', ...
+        lttbCoef(refIdx), lttbCoef(end), lttbDrift, driftLabel_(lttbDrift));
+    fprintf('    MinMax : %.3f -> %.3f  (%.2fx)  %s\n', ...
+        mmCoef(refIdx), mmCoef(end), mmDrift, driftLabel_(mmDrift));
+    fprintf('  %s\n', repmat('-', 1, 76));
+    fprintf('  Note: throughput bench (no time gate). /bench-guard baselines these numbers.\n\n');
+
+    result = struct( ...
+        'sizes',         sizes, ...
+        'labels',        {labels}, ...
+        'numOut',        numOut, ...
+        'numBuckets',    numBuckets, ...
+        'lttbMex',       lttbMex, ...
+        'minmaxMex',     minmaxMex, ...
+        'lttbMs',        lttbMs, ...
+        'lttbTputMpts',  lttbTput, ...
+        'lttbOutN',      lttbN, ...
+        'minmaxMs',      mmMs, ...
+        'minmaxTputMpts', mmTput, ...
+        'minmaxOutN',    mmN, ...
+        'lttbCoefMsPerM', lttbCoef, ...
+        'minmaxCoefMsPerM', mmCoef, ...
+        'lttbDrift',     lttbDrift, ...
+        'minmaxDrift',   mmDrift);
+end
+
+function t = medianTime_(fn, nWarm, nReps)
+    %MEDIANTIME_ Median wall time of fn() over nReps after nWarm warmups.
+    for w = 1:nWarm
+        fn();
+    end
+    ts = zeros(1, nReps);
+    for r = 1:nReps
+        t0 = tic;          % handle form: a bare tic inside fn() cannot reset it
+        fn();
+        ts(r) = toc(t0);
+    end
+    t = median(ts);
+end
+
+function s = pathLabel_(useMex)
+    %PATHLABEL_ Human label for the active kernel implementation.
+    if useMex
+        s = 'MEX (compiled)';
+    else
+        s = 'pure-MATLAB fallback';
+    end
+end
+
+function s = driftLabel_(drift)
+    %DRIFTLABEL_ Soft verdict on the O(N) coefficient drift.
+    if drift > 2.0
+        s = '<< WATCH: possible super-linear creep';
+    elseif drift > 1.5
+        s = '(mild rise — memory pressure expected at large N)';
+    else
+        s = '(linear — O(N) holds)';
+    end
+end
diff --git a/benchmarks/bench_fastsense_multiline.m b/benchmarks/bench_fastsense_multiline.m
new file mode 100644
index 00000000..8eae28ce
--- /dev/null
+++ b/benchmarks/bench_fastsense_multiline.m
@@ -0,0 +1,181 @@
+function result = bench_fastsense_multiline()
+%BENCH_FASTSENSE_MULTILINE Live-update scaling vs line count on one axes.
+%
+%   Measures how the live refresh path scales as a SINGLE FastSense axes
+%   accumulates lines — the multi-sensor overlay case. The focus is
+%   updateData(), the live hot path: per its own contract it re-downsamples
+%   *all* lines on every call, so a one-line live update costs O(lineCount).
+%   That per-call cost is what the project's refresh-rate constraint rides
+%   on, so the headline metric is the achievable refresh rate (Hz) as line
+%   count grows.
+%
+%   Why this is the gap: the kernel microbenches (bench_downsample_kernels,
+%   bench_violation_cull, bench_binary_search) now cover every MEX kernel
+%   actually wired into production. What was NOT covered is the COMPOSITE
+%   scaling — how the whole update path grows with line count on one axes.
+%   benchmark.m and benchmark_zoom.m use a single line; the dashboard benches
+%   vary widget count, not lines-per-axes. This bench isolates that axis.
+%
+%   What it measures (deterministic, no RNG; headless invisible figure):
+%     - updateData() wall time (median, SkipViewMode to isolate the
+%       re-downsample cost from view-mode / xlim adjustment).
+%     - "us per line" — if it stays flat the live cost is linear in line
+%       count; if it FALLS, a fixed per-call overhead is amortizing (good);
+%       a sharp RISE would signal super-linear per-line overhead.
+%     - effective refresh rate (Hz = 1000 / updateData ms).
+%     - one-time render() setup cost (single sample, figure-realization
+%       dominated — reported for context, not a clean scaling signal).
+%
+%   The setup render is run with ShowProgress disabled so the console
+%   progress bar does not pollute the timing or the output.
+%
+%   Throughput bench, not a pass/fail gate: it PRINTS results and a soft
+%   scaling advisory. The next /bench-guard (or /perf-watch) run baselines
+%   the absolute numbers.
+%
+%   Run:
+%     octave --no-gui --eval "install(); bench_fastsense_multiline();"
+%     % or in MATLAB:
+%     bench_fastsense_multiline
+%
+%   Returns a struct (line counts, update ms, us/line, Hz, setup ms,
+%   scaling drift) for programmatic baselining.
+%
+%   See also FastSense, FastSense.updateData, FastSense.render,
+%   bench_downsample_kernels, bench_dashboard_live.
+
+    here = fileparts(mfilename('fullpath'));
+    addpath(fullfile(here, '..'));
+    install();
+
+    % ---- Configuration ----
+    lineCounts = [1, 4, 16, 64];
+    nPerLine   = 1e5;       % points per line (fixed — isolates line-count axis)
+    nWarm      = 2;         % updateData warmups
+    nUpdateRep = 7;         % updateData measurements (median)
+
+    fprintf('\n================================================================\n');
+    fprintf('  FastSense Multi-Line Live-Update Scaling Benchmark\n');
+    fprintf('  updateData() re-downsamples all lines -> refresh cost vs lines\n');
+    fprintf('================================================================\n');
+    fprintf('  points/line = %d   update reps = %d (median)   SkipViewMode = true\n', ...
+        nPerLine, nUpdateRep);
+    fprintf('  %s\n', repmat('-', 1, 76));
+    fprintf('  %-6s | %-9s | %-11s | %-13s | %-9s | %-8s\n', ...
+        'lines', 'total pts', 'setup ms', 'updateData ms', 'us/line', 'refresh');
+    fprintf('  %s\n', repmat('-', 1, 76));
+
+    nL = numel(lineCounts);
+    setupMs   = zeros(1, nL);
+    updateMs  = zeros(1, nL);
+    usPerLine = zeros(1, nL);
+    refreshHz = zeros(1, nL);
+
+    for c = 1:nL
+        L = lineCounts(c);
+
+        % Deterministic per-line data: distinct phase per line, no RNG.
+        x = linspace(0, nPerLine / 100, nPerLine);
+        Y = zeros(L, nPerLine);
+        for k = 1:L
+            Y(k, :) = sin(x * 0.1 + k) + 0.3 * sin(x * 1.7 + k) + 0.2 * sin(x * 13.0 + k);
+        end
+
+        % Build headless instance and render once (silently) for setup.
+        % Warnings muted only around the one-time setup render (e.g. MATLAB
+        % caps the auto-legend at 50 entries for high line counts) — restored
+        % immediately so the measured update path is unaffected.
+        [fp, fig] = buildFS_(x, Y);
+        ws = warning('off', 'all');
+        tSetup = tic;      % handle form: a bare tic inside render() cannot reset it
+        fp.render();
+        setupMs(c) = toc(tSetup) * 1000;
+        warning(ws);
+
+        % updateData timing: update line 1 with fresh data each call; this
+        % re-downsamples ALL lines. SkipViewMode isolates the re-downsample
+        % cost from xlim / view-mode logic.
+        for w = 1:nWarm
+            fp.updateData(1, x, Y(1, :) + 0.01 * w, 'SkipViewMode', true);
+        end
+        uTimes = zeros(1, nUpdateRep);
+        for u = 1:nUpdateRep
+            ynew = Y(1, :) + 0.001 * u;   % force a genuine data replace each rep
+            tU = tic;      % handle form: a bare tic inside updateData() cannot reset it
+            fp.updateData(1, x, ynew, 'SkipViewMode', true);
+            uTimes(u) = toc(tU);
+        end
+        updateMs(c)  = median(uTimes) * 1000;
+        usPerLine(c) = (updateMs(c) * 1000) / L;
+        refreshHz(c) = 1000 / updateMs(c);
+
+        close(fig);
+
+        fprintf('  %-6d | %9s | %11.1f | %13.3f | %9.1f | %6.0fHz\n', ...
+            L, humanCount_(L * nPerLine), setupMs(c), updateMs(c), usPerLine(c), refreshHz(c));
+    end
+
+    fprintf('  %s\n', repmat('-', 1, 76));
+
+    % ---- Soft scaling advisory ----
+    % updateData re-downsamples every line, so its TOTAL cost grows with line
+    % count, but a fixed per-call overhead (drawnow, dispatch) amortizes — so
+    % us/line should be flat or FALLING. A sharp rise means super-linear
+    % per-line overhead crept in. Advisory only, not a gate.
+    perLineDrift = usPerLine(end) / usPerLine(1);
+
+    fprintf('  Scaling (%d -> %d lines):\n', lineCounts(1), lineCounts(end));
+    fprintf('    updateData us/line : %.1f -> %.1f  (%.2fx)  %s\n', ...
+        usPerLine(1), usPerLine(end), perLineDrift, perLineLabel_(perLineDrift));
+    fprintf('    refresh rate       : %.0f Hz (1 line) -> %.0f Hz (%d lines)\n', ...
+        refreshHz(1), refreshHz(end), lineCounts(end));
+    fprintf('  %s\n', repmat('-', 1, 76));
+    fprintf('  Note: composite bench (no time gate). /bench-guard baselines these numbers.\n\n');
+
+    result = struct( ...
+        'lineCounts',   lineCounts, ...
+        'nPerLine',     nPerLine, ...
+        'setupMs',      setupMs, ...
+        'updateMs',     updateMs, ...
+        'usPerLine',    usPerLine, ...
+        'refreshHz',    refreshHz, ...
+        'perLineDrift', perLineDrift);
+end
+
+function [fp, fig] = buildFS_(x, Y)
+    %BUILDFS_ Headless FastSense with size(Y,1) lines on an invisible figure.
+    %   ShowProgress disabled so the console progress bar does not pollute
+    %   timing or output.
+    fig = figure('Visible', 'off', 'Position', [100 100 800 400]);
+    ax = axes('Parent', fig);
+    fp = FastSense('Parent', ax);
+    fp.ShowProgress = false;
+    L = size(Y, 1);
+    for k = 1:L
+        fp.addLine(x, Y(k, :), 'DisplayName', sprintf('line %d', k));
+    end
+end
+
+function s = humanCount_(n)
+    %HUMANCOUNT_ Compact count label (e.g. 6.4M).
+    if n >= 1e6
+        s = sprintf('%.1fM', n / 1e6);
+    elseif n >= 1e3
+        s = sprintf('%.0fK', n / 1e3);
+    else
+        s = sprintf('%d', n);
+    end
+end
+
+function s = perLineLabel_(drift)
+    %PERLINELABEL_ Soft verdict on per-line update cost growth.
+    if drift > 2.0
+        s = '<< WATCH: super-linear per-line update cost';
+    elseif drift > 1.5
+        s = '(mild per-line rise)';
+    elseif drift >= 0.8
+        s = '(~linear in line count)';
+    else
+        s = '(sublinear — fixed overhead amortizes)';
+    end
+end
diff --git a/benchmarks/bench_violation_cull.m b/benchmarks/bench_violation_cull.m
new file mode 100644
index 00000000..cad2abc9
--- /dev/null
+++ b/benchmarks/bench_violation_cull.m
@@ -0,0 +1,208 @@
+function result = bench_violation_cull()
+%BENCH_VIOLATION_CULL Throughput microbenchmark for the threshold-marker kernel.
+%
+%   Isolates and times violation_cull — the fused threshold-violation
+%   detection + pixel-density culling kernel (MEX: violation_cull_mex). It
+%   is the threshold-marker counterpart of the LTTB/MinMax downsamplers
+%   covered by bench_downsample_kernels: whenever a threshold is attached to
+%   a FastSense line, this kernel runs on EVERY render and EVERY zoom/pan,
+%   scanning the full visible data window (O(N)) to find violating points
+%   and immediately cull them to at most one marker per pixel column.
+%
+%   It fuses what used to be two passes — compute_violations (or
+%   compute_violations_dynamic) plus downsample_violations — into a single
+%   compiled call, so it is exactly the per-frame cost that thresholded
+%   dashboards pay on top of the data downsample. Until now it had no direct
+%   benchmark: bench_tag_pipeline_1k.m only lists the kernel name in a
+%   profiler top-N filter, never times it, and no bench_*.m exercises the
+%   detect-and-cull path in isolation.
+%
+%   Two threshold regimes are timed at each size, covering BOTH internal
+%   branches of the kernel:
+%     constant — scalar threshold knot (compute_violations branch)
+%     step     — multi-knot piecewise-constant / ZOH threshold
+%                (compute_violations_dynamic branch, the time-varying path
+%                 that was wholly unbenched)
+%
+%   What it measures (deterministic, no RNG):
+%     - Per-call wall time (median of nReps after warmup) for each regime
+%       across a size sweep (10K -> 20M points).
+%     - Per-point throughput in millions of points/second (Mpts/s).
+%     - "ms per million points" — the O(N) cost coefficient; flat => linear.
+%     - Output marker count (bounded by the ~1000 pixel columns).
+%
+%   Throughput bench, not a pass/fail gate: it PRINTS results and a soft
+%   scaling advisory rather than asserting a machine-specific time budget.
+%   The next /bench-guard (or /perf-watch) run baselines the absolute
+%   numbers and flags regressions against that local baseline. The active
+%   path (compiled MEX vs pure-MATLAB fallback) is detected and labelled,
+%   since the MEX speedup is exactly what this bench protects.
+%
+%   Run:
+%     octave --no-gui --eval "install(); bench_violation_cull();"
+%     % or in MATLAB:
+%     bench_violation_cull
+%
+%   Returns a struct (sizes, per-regime ms / throughput / marker counts,
+%   MEX flag, scaling coefficients) for programmatic baselining.
+%
+%   See also violation_cull, compute_violations, compute_violations_dynamic,
+%   downsample_violations, bench_downsample_kernels.
+
+    here = fileparts(mfilename('fullpath'));
+    addpath(fullfile(here, '..'));
+    install();
+
+    % violation_cull is a PRIVATE helper of FastSense. MATLAB refuses private
+    % directories on the path, so we cd into the private folder for the
+    % duration of the bench: current-folder functions are always callable,
+    % and the compiled MEX (violation_cull_mex) lives there too, so this
+    % exercises the real MEX path. onCleanup restores cwd even on error.
+    % Portable across MATLAB and Octave.
+    privDir = fullfile(here, '..', 'libs', 'FastSense', 'private');
+    origDir = pwd;
+    restoreCwd = onCleanup(@() cd(origDir)); %#ok<NASGU>
+    cd(privDir);
+
+    % ---- Configuration ----
+    sizes  = [1e4, 1e5, 1e6, 5e6, 2e7];
+    labels = {'10K', '100K', '1M', '5M', '20M'};
+
+    nPixels = 1000;      % display width in pixels -> sets cull bucket count
+    nKnots  = 8;         % step-threshold knots (time-varying regime)
+    thLevel = 1.0;       % nominal upper threshold (signal peaks ~1.5)
+    nWarm   = 2;         % warmup calls absorb JIT / first-call MEX load
+    nReps   = 5;         % median over nReps damps one-off spikes
+
+    cullMex = (exist('violation_cull_mex', 'file') == 3);
+
+    fprintf('\n================================================================\n');
+    fprintf('  FastSense Violation-Cull Kernel Throughput Benchmark\n');
+    fprintf('  Fused threshold detect + pixel cull (per-frame marker cost)\n');
+    fprintf('================================================================\n');
+    fprintf('  pixels = %d   step knots = %d   threshold = %.2f (upper)\n', ...
+        nPixels, nKnots, thLevel);
+    fprintf('  path: %s   warmup = %d   reps = %d (median)\n', ...
+        pathLabel_(cullMex), nWarm, nReps);
+    fprintf('  %s\n', repmat('-', 1, 76));
+    fprintf('  %-6s | %-28s | %-28s\n', 'N', 'constant threshold', 'step threshold (ZOH)');
+    fprintf('  %-6s | %10s %9s %6s | %10s %9s %6s\n', ...
+        '', 'ms/call', 'Mpts/s', 'mark', 'ms/call', 'Mpts/s', 'mark');
+    fprintf('  %s\n', repmat('-', 1, 76));
+
+    nS = numel(sizes);
+    cMs = zeros(1, nS); cTput = zeros(1, nS); cMark = zeros(1, nS);
+    sMs = zeros(1, nS); sTput = zeros(1, nS); sMark = zeros(1, nS);
+
+    for c = 1:nS
+        n = sizes(c);
+
+        % Deterministic ascending X and 3-tone signal (peaks ~1.5), so a
+        % realistic minority of points exceed thLevel = 1.0. No RNG: identical
+        % workload on MATLAB and Octave, every run.
+        x = linspace(0, n / 100, n);
+        y = sin(x * 0.1) + 0.3 * sin(x * 1.7) + 0.2 * sin(x * 13.0);
+        xmin = x(1);
+        xmax = x(end);
+        pixelWidth = (xmax - xmin) / nPixels;
+
+        % Constant-threshold regime: scalar knot -> compute_violations branch
+        thXc = xmin;
+        thYc = thLevel;
+
+        % Step-threshold regime: multi-knot ZOH -> compute_violations_dynamic
+        % branch. Deterministic oscillation around thLevel keeps the
+        % violation rate realistic and the path non-trivial.
+        thXs = linspace(xmin, xmax, nKnots);
+        thYs = thLevel + 0.3 * sin(1:nKnots);
+
+        % --- Constant threshold ---
+        [xc, ~] = violation_cull(x, y, thXc, thYc, 'upper', pixelWidth, xmin); %#ok<ASGLU>
+        tc = medianTime_(@() violation_cull(x, y, thXc, thYc, 'upper', pixelWidth, xmin), nWarm, nReps);
+        cMs(c)   = tc * 1000;
+        cTput(c) = (n / tc) / 1e6;
+        cMark(c) = numel(xc);
+
+        % --- Step threshold (time-varying / ZOH) ---
+        [xs, ~] = violation_cull(x, y, thXs, thYs, 'upper', pixelWidth, xmin); %#ok<ASGLU>
+        ts = medianTime_(@() violation_cull(x, y, thXs, thYs, 'upper', pixelWidth, xmin), nWarm, nReps);
+        sMs(c)   = ts * 1000;
+        sTput(c) = (n / ts) / 1e6;
+        sMark(c) = numel(xs);
+
+        fprintf('  %-6s | %10.3f %9.1f %6d | %10.3f %9.1f %6d\n', ...
+            labels{c}, cMs(c), cTput(c), cMark(c), sMs(c), sTput(c), sMark(c));
+    end
+
+    fprintf('  %s\n', repmat('-', 1, 76));
+
+    % ---- Soft scaling advisory ----
+    % "ms per million points" is the O(N) coefficient. Compare the largest
+    % size against the 1M reference; flat => linear. A >2x rise is the early
+    % signature of super-linear creep — advisory only, not a gate.
+    refIdx = 3;  % 1M
+    cCoef = cMs ./ (sizes / 1e6);
+    sCoef = sMs ./ (sizes / 1e6);
+    cDrift = cCoef(end) / cCoef(refIdx);
+    sDrift = sCoef(end) / sCoef(refIdx);
+
+    fprintf('  Scaling (ms per 1M pts, %s -> %s):\n', labels{refIdx}, labels{end});
+    fprintf('    constant : %.3f -> %.3f  (%.2fx)  %s\n', ...
+        cCoef(refIdx), cCoef(end), cDrift, driftLabel_(cDrift));
+    fprintf('    step     : %.3f -> %.3f  (%.2fx)  %s\n', ...
+        sCoef(refIdx), sCoef(end), sDrift, driftLabel_(sDrift));
+    fprintf('  %s\n', repmat('-', 1, 76));
+    fprintf('  Note: throughput bench (no time gate). /bench-guard baselines these numbers.\n\n');
+
+    result = struct( ...
+        'sizes',          sizes, ...
+        'labels',         {labels}, ...
+        'nPixels',        nPixels, ...
+        'nKnots',         nKnots, ...
+        'thLevel',        thLevel, ...
+        'cullMex',        cullMex, ...
+        'constMs',        cMs, ...
+        'constTputMpts',  cTput, ...
+        'constMarkers',   cMark, ...
+        'stepMs',         sMs, ...
+        'stepTputMpts',   sTput, ...
+        'stepMarkers',    sMark, ...
+        'constCoefMsPerM', cCoef, ...
+        'stepCoefMsPerM',  sCoef, ...
+        'constDrift',     cDrift, ...
+        'stepDrift',      sDrift);
+end
+
+function t = medianTime_(fn, nWarm, nReps)
+    %MEDIANTIME_ Median wall time of fn() over nReps after nWarm warmups.
+    for w = 1:nWarm
+        fn();
+    end
+    ts = zeros(1, nReps);
+    for r = 1:nReps
+        t0 = tic;          % handle form: a bare tic inside fn() cannot reset it
+        fn();
+        ts(r) = toc(t0);
+    end
+    t = median(ts);
+end
+
+function s = pathLabel_(useMex)
+    %PATHLABEL_ Human label for the active kernel implementation.
+    if useMex
+        s = 'MEX (compiled)';
+    else
+        s = 'pure-MATLAB fallback';
+    end
+end
+
+function s = driftLabel_(drift)
+    %DRIFTLABEL_ Soft verdict on the O(N) coefficient drift.
+    if drift > 2.0
+        s = '<< WATCH: possible super-linear creep';
+    elseif drift > 1.5
+        s = '(mild rise — memory pressure expected at large N)';
+    else
+        s = '(linear — O(N) holds)';
+    end
+end