From 95e203db83b8700a63c7825aa8e9b0c12b57534b Mon Sep 17 00:00:00 2001
From: Hannes Suhr <sannahrush@googlemail.com>
Date: Wed, 24 Jun 2026 18:40:43 +0200
Subject: [PATCH] test(benchmarks): add isolated microbenchmarks for core
 hot-path kernels

Grows benchmark coverage over the performance-critical surface with five
focused, deterministic bench_*.m files, each isolating one hot-path kernel
as pure computation (no figure/render) and guarding it with a
machine-independent scaling gate:

  - bench_downsample_kernels.m  MinMax + LTTB downsampling (minmax_core_mex /
    lttb_core_mex). LTTB had zero coverage anywhere before this.
  - bench_binary_search.m       range-window lookup (binary_search_mex);
    log-scaling gate catches an O(log N) -> O(N) regression.
  - bench_violation_cull.m      fused threshold-marker detect+cull
    (violation_cull_mex), constant + step-function branches.
  - bench_datastore_range.m     disk-backed range query
    (FastSenseDataStore.getRange); gate asserts fixed-window query time stays
    ~constant as the dataset grows (catches a full-scan regression).
  - bench_delimited_parse.m     CSV ingestion (delimited_parse_mex);
    row-scaling gate. Tag-pipeline ingestion was previously unbenchmarked.

Each follows the existing bench_*.m house style. The benchmarks whose wrappers
are private (downsample, violation cull, delimited parse) reach them by cd-ing
into the owning private/ folder, which works in both MATLAB and Octave, unlike
addpath of a private dir (MATLAB rejects it). benchmarks/.reports/coverage.md
records the ranked surface, what each run added, and the remaining gaps so
coverage keeps expanding toward what matters.

No library/production code changed; new files only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
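Reviewer note (illustrative only, not part of the change): the scaling gates
described above fit a log-log slope to the per-size timings. The sketch below
applies the same polyfit form as the benches' scalingExponent_ helper to
synthetic timings from two assumed cost models, to show how the binary-search
gate of 0.6 separates a logarithmic lookup from a linear scan. The timing
constants are hypothetical, not measurements.

```matlab
% Hypothetical, synthetic timings (NOT measured): an O(log N) lookup vs. an
% O(N) linear scan, evaluated at the sizes used for the large-N fit.
ns   = [1e5, 1e6, 1e7, 5e7];
tLog = 1e-7 * log2(ns);   % assumed cost model: ~log2(N) comparisons per query
tLin = 1e-9 * ns;         % assumed cost model: one comparison per element

% Log-log slope = empirical growth exponent (same form as scalingExponent_).
pLog = polyfit(log10(ns), log10(tLog), 1);
pLin = polyfit(log10(ns), log10(tLin), 1);
slopeLog = pLog(1);
slopeLin = pLin(1);

gate = 0.6;
fprintf('log-model exponent   : %.2f (passes gate: %d)\n', slopeLog, slopeLog <= gate);
fprintf('linear-model exponent: %.2f (passes gate: %d)\n', slopeLin, slopeLin <= gate);
```

Under these assumed models the logarithmic curve fits to an exponent well
below 0.1 and the linear one to 1.0, so a gate of 0.6 leaves a wide margin on
both sides without depending on host speed.
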
 benchmarks/.reports/coverage.md       | 137 +++++++++++++++++++
 benchmarks/bench_binary_search.m      | 152 +++++++++++++++++++++
 benchmarks/bench_datastore_range.m    | 158 ++++++++++++++++++++++
 benchmarks/bench_delimited_parse.m    | 182 ++++++++++++++++++++++++++
 benchmarks/bench_downsample_kernels.m | 163 +++++++++++++++++++++++
 benchmarks/bench_violation_cull.m     | 174 ++++++++++++++++++++++++
 6 files changed, 966 insertions(+)
 create mode 100644 benchmarks/.reports/coverage.md
 create mode 100644 benchmarks/bench_binary_search.m
 create mode 100644 benchmarks/bench_datastore_range.m
 create mode 100644 benchmarks/bench_delimited_parse.m
 create mode 100644 benchmarks/bench_downsample_kernels.m
 create mode 100644 benchmarks/bench_violation_cull.m

diff --git a/benchmarks/.reports/coverage.md b/benchmarks/.reports/coverage.md
new file mode 100644
index 00000000..b1198353
--- /dev/null
+++ b/benchmarks/.reports/coverage.md
@@ -0,0 +1,137 @@
+# Benchmark coverage notes
+
+Tracks what `/bench-evolve` has added and which performance-critical paths
+still lack isolated benchmark coverage. Newest entries first.
+
+## Performance-critical surface (ranked) and coverage status
+
+| Path | Why it matters | Coverage |
+|------|----------------|----------|
+| **Downsampling kernels** (`minmax_downsample` / `lttb_downsample` → `minmax_core_mex` / `lttb_core_mex`) | Runs on every render + every zoom/pan, over the full dataset (≤50M pts). The library's core value. | ✅ `bench_downsample_kernels.m` (isolated, both methods) — *added 2026-06-24*. Also exercised indirectly in `benchmark.m` / `benchmark_zoom.m` / `benchmark_features.m` (render-mixed). |
+| **`binary_search`** (`binary_search_mex`) | Range-window lookup on raw full-N sorted arrays; on the resolve path for every zoom/pan + every tag range query (`FastSense.m`, `FastSenseToolbar.m`, `SensorTag.m`). | ✅ `bench_binary_search.m` (isolated, log-scaling gate) — *added 2026-06-24*. |
+| **Violation marker path** (`violation_cull` → `violation_cull_mex`; constant + step-function branches) | Fused detect+cull on every threshold render/zoom for thresholds with `ShowViolations` (incl. time-varying step thresholds). | ✅ `bench_violation_cull.m` (isolated, both branches, linear-scaling gate) — *added 2026-06-24*. |
+| **Disk range-query** (`FastSenseDataStore.getRange`, `resolve_disk_mex`) | Out-of-core read on every zoom/pan of a disk-backed line. The large-data story's hot read path. | ✅ `bench_datastore_range.m` (fixed-window query, indexed-read gate) — *added 2026-06-24*. Store create/slice still only exploratory (`benchmark_datastore.m` / `profile_datastore.m`). |
+| **CSV ingestion** (`dispatchDelimitedParse_` → `delimited_parse_mex`, fallback `readRawDelimited_`) | Front door for raw sensor data into the Tag pipeline; MEX is ~10–40× the textscan fallback. Slow parse = slow load for big logs. | ✅ `bench_delimited_parse.m` (isolated, row-scaling gate) — *added 2026-06-24*. |
+| **Pyramid build** (`FastSense.buildPyramidLevel`) | Multi-level pre-downsample cache built at render for large lines (powers O(1) zoom). Full-N at render. | ◐ Partial — it is essentially `minmax_downsample` per level (already gated by `bench_downsample_kernels.m`) + chunked disk reads; only *memory*-benchmarked end-to-end (`benchmark_memory.m`). Low marginal value to isolate; private method. |
+| **`to_step_function_mex`** | SIMD step-function conversion — a compiled, deployed, correctness-tested kernel (`TestToStepFunctionMex`). | ⏸️ **DEFERRED** — no confirmed production caller. `MonitorTag.recompute_` emits a binary vector (no step conversion); `StateTag.getXY` is pass-through; only the test suite calls it. The `dispatchDelimitedParse_` comment citing it is stale. **Investigate whether it's still wired into any render path (or is vestigial) before benchmarking.** |
+| **Tag layer** (SensorTag/MonitorTag/CompositeTag getXY, resolve, append) | Live-tick recompute path. | ✅ `bench_sensortag_getxy`, `bench_monitortag_tick`, `bench_monitortag_append`, `bench_compositetag_merge`, `bench_consumer_migration_tick`, `bench_tag_pipeline_1k`. |
+| **Dashboard refresh / load** | Live dashboard refresh rate. | ✅ `bench_dashboard`, `bench_dashboard_live`, `bench_dashboard_load`. |
+| **Full render vs plot(), zoom/pan, memory, features** | End-to-end render comparison. | ✅ `benchmark.m`, `benchmark_zoom.m`, `benchmark_memory.m`, `benchmark_features.m`. |
+
+## Change log
+
+### 2026-06-24 — `bench_downsample_kernels.m`
+- **Gap closed:** isolated downsampling-kernel microbenchmark. Previously the
+  only coverage was a single `minmax_downsample(x,y,1000)` call buried inside
+  the render-heavy `benchmark.m`; **LTTB had zero coverage anywhere**.
+- **What it does:** times `minmax_downsample` and `lttb_downsample` as pure
+  computation (no figure/render) across a 10K→10M size sweep, same ~2000-pt
+  output budget for both, reporting per-call ms + throughput (Mpts/s).
+- **Gate:** machine-independent — fits the empirical log-log scaling exponent
+  over the large-N portion and asserts it stays ≤ 1.3 (catches super-linear
+  creep regardless of host speed).
+- **Reaches the private wrappers** by `cd`-ing into `libs/FastSense/private`
+  (current folder is always searched, even when named `private`) — works in
+  both MATLAB and Octave, unlike the `addpath(.../private)` trick that
+  `benchmark.m` uses (Octave-only; MATLAB rejects private dirs on the path).
+- **First run (MATLAB R2025b, MEX active):** MinMax ~764 Mpts/s @ 10M,
+  LTTB ~349 Mpts/s @ 10M; scaling exponents 0.88 / 0.87 → PASS.
+
+### 2026-06-24 — `bench_binary_search.m`
+- **Gap closed:** isolated range-lookup microbenchmark. `binary_search` is the
+  most broadly-used uncovered kernel — the resolve/zoom window lookup in
+  `FastSense.m` (4060/4103/4178/4460), timestamp lookup (1683), toolbar
+  click/range, and tag range resolve (`SensorTag.m:152`), on raw full-N sorted
+  arrays, every zoom/pan. Re-prioritised **above** the violation marker path
+  this run: `violation_cull` runs on already-downsampled display data
+  (small-N, per-frame), whereas `binary_search` hits the raw full-N array.
+- **What it does:** times 20k scalar `'left'`/`'right'` lookups across a
+  10K→50M sweep, reporting per-query µs + Mqueries/s.
+- **Gate:** machine-independent — fits the per-query log-log exponent over the
+  large-N portion and asserts it stays ≤ 0.6, catching the catastrophic
+  O(log N)→O(N) (linear-scan) regression regardless of host speed.
+- **MEX detection caveat (baked into the bench):** `binary_search_mex` lives in
+  `libs/FastSense/private` and is visible to `binary_search.m` (its parent) but
+  NOT from `benchmarks/`. A plain `exist('binary_search_mex','file')` in the
+  bench misreports as fallback; the bench instead checks the built binary for
+  the current platform on disk (`['binary_search_mex.' mexext]`).
+- **First run (MATLAB R2025b, MEX active):** ~0.95 µs/query @ 10K → ~1.8 µs @ 50M;
+  exponent 0.09 (firmly logarithmic), growth 1.9× over the sweep → PASS.
+
+### 2026-06-24 — `bench_violation_cull.m`
+- **Gap closed:** isolated threshold-marker microbenchmark. `violation_cull` is
+  the fused detect+cull kernel called per (threshold x line) on every
+  render/zoom (`FastSense.m:1368/1371`, `4468/4471`); only
+  `bench_event_marker_regression.m` touched a neighbouring path before.
+- **What it does:** times both threshold branches as pure computation — a
+  constant threshold (thX=0 sentinel) and a 5-knot step-function threshold —
+  across a 1K→1M input sweep, reporting per-call ms + throughput. The bench
+  notes that production input is the displayed/downsampled data (a few thousand
+  pts, the low end of the sweep); the upper sizes only verify linear scaling.
+- **Gate:** machine-independent — log-log scaling exponent over N >= 1e4 must
+  stay <= 1.3 (catches super-linear creep in detect+cull).
+- **Reaches the private wrapper** via the `cd`-into-`libs/FastSense/private`
+  trick (see [[benchmarking-private-mex-kernels]]).
+- **First run (MATLAB R2025b, MEX active):** constant ~288 Mpts/s @ 1M, step
+  ~261 Mpts/s @ 1M; at the realistic ~1K size both are sub-10 µs. Scaling
+  exponents 0.93 / 0.92 → PASS.
+
+### 2026-06-24 — `bench_datastore_range.m`
+- **Gap closed:** focused, deterministic gate for the disk-backed range-query
+  path (`FastSenseDataStore.getRange`), which every zoom/pan on a disk-backed
+  line hits. Previously only exploratory scripts existed (`benchmark_datastore.m`
+  is a .mat-vs-SQLite sweep and Linux-only — shells out to `free`;
+  `profile_datastore.m` is a profiler script). No figure needed.
+- **What it does:** builds a chunked store at each size, fires fixed-size view
+  windows (width scaled so each query returns ~10k pts regardless of N), times
+  `getRange`, and reports create time + per-query ms + queries/s.
+- **Gate:** machine-independent — the indexed store must read only the window,
+  so per-query time must stay ~constant as the dataset grows; asserts the
+  query-time-vs-total-N exponent <= 0.5 (a full-scan regression → ~1.0).
+- **Robustness:** warms up a throwaway store first (absorbs one-time SQLite/MEX
+  init), and always `cleanup()`s each store (try/catch + post-loop) so temp DBs
+  never leak even if the gate trips.
+- **First run (MATLAB R2025b, mksqlite active):** query time flat at ~0.16 ms
+  across 100K→5M (50× more data), exponent −0.11, exactly 10002 pts/query → PASS.
+
+### Pivot note this run
+Intended target was `to_step_function_mex`, but a fresh survey found it has **no
+confirmed production caller** (see table) — benchmarking it would violate the
+"path that matters" rule. Deferred it (flagged for investigation) and pivoted to
+the disk range-query gate instead.
+
+### 2026-06-24 — `bench_delimited_parse.m`
+- **Gap closed:** isolated CSV-ingestion microbenchmark. `delimited_parse_mex`
+  (via `dispatchDelimitedParse_`) is the parse front door for the Tag pipeline,
+  documented at ~10–40× the textscan fallback, with zero coverage
+  (BatchTagPipeline / delimited ingestion was entirely unbenchmarked).
+- **What it does:** generates deterministic 4-column CSVs of growing row count,
+  times `dispatchDelimitedParse_` (file generation excluded), reports parse ms +
+  rows/s + MB/s. Always deletes its temp files (per-iter + onCleanup backstop).
+- **Gate:** machine-independent — log-log row-scaling exponent over rows ≥ 1e4
+  must stay ≤ 1.3 (catches super-linear parse creep, e.g. O(rows²) realloc).
+- **Reaches the private wrapper** via `cd`-into-`libs/SensorThreshold/private`
+  (see [[benchmarking-private-mex-kernels]]).
+- **First run (MATLAB R2025b, MEX active):** ~5.7 M rows/s (~205 MB/s) at 100K–500K
+  rows; exponent 0.98 (essentially linear) → PASS.
+
+### Pivot notes this run
+Two earmarked targets were rejected on fresh survey:
+- **Pyramid build** — `buildPyramidLevel` is just `minmax_downsample` per level
+  (already gated) + chunked reads; private; low marginal value. Downgraded to
+  ◐ Partial in the table, not benchmarked.
+- **DerivedTag.recompute_** — thin dispatch around a user-supplied `ComputeFn`
+  (`[X,Y] = ComputeFn(Parents)`), so a microbench would mostly measure the test
+  closure, not a FastSense kernel. Deferred unless paired with a built-in
+  compute/alignment path worth isolating.
+
+### Next gap for the following iteration
+Re-survey before picking, but the leading candidates are higher-level paths now
+that the core MEX kernels are covered:
+- **EventStore persistence scaling** — `EventStore.save` (atomic temp-rename
+  write) / `load` as event count grows; relevant for long-running live
+  dashboards. Confirm it isn't already covered by `bench_event_marker_regression`
+  / `bench_dashboard_*` (those attach stores but may not stress save/load at scale).
+- **LiveEventPipeline per-tick processing** (`processMonitorTag_`) on the live
+  refresh path — confirm it isn't already covered by `bench_monitortag_tick`.
+- Still open: the `to_step_function_mex` wiring question (filed as a background task).
diff --git a/benchmarks/bench_binary_search.m b/benchmarks/bench_binary_search.m
new file mode 100644
index 00000000..a6921229
--- /dev/null
+++ b/benchmarks/bench_binary_search.m
@@ -0,0 +1,152 @@
+function bench_binary_search()
+%BENCH_BINARY_SEARCH Isolated microbenchmark of the range-lookup hot path.
+%
+%   binary_search is the gateway to every range query in FastSense. On each
+%   zoom/pan and render it locates the visible index window in a raw, sorted,
+%   full-length X array — FastSense.m (resolve/zoom window, timestamp lookup),
+%   FastSenseToolbar.m (click-to-point, range select) and SensorTag.m (tag
+%   range resolve) all call it, against arrays up to tens of millions of
+%   points. It is MEX-accelerated (binary_search_mex) with a pure-MATLAB
+%   fallback, yet has no benchmark anywhere.
+%
+%   The cost of any single call is tiny (O(log N) comparisons), so absolute
+%   throughput is not the point. The point is the GATE: binary search must
+%   stay logarithmic. If the MEX silently stops loading, or a change turns
+%   the search into a linear scan, large-data zoom/pan responsiveness
+%   collapses — and nothing else in the suite would catch it. This benchmark
+%   times many scalar lookups (both 'left' and 'right') across a wide size
+%   sweep and asserts the per-query time scales sub-linearly with N.
+%
+%   Per-query time grows only weakly with N (a mix of ~log2(N) comparisons
+%   and cache-miss penalty as the array spills out of cache), so the
+%   empirical log-log exponent stays well below the linear-scan exponent of
+%   ~1.0. The gate (exponent <= 0.6) cleanly separates the two regimes and
+%   is machine-independent.
+%
+%   Warmup absorbs first-call/JIT overhead; each measurement loops over a
+%   fixed query batch so per-call dispatch stays representative of production
+%   (binary_search is always called scalar); median of nRuns damps spikes.
+%
+%   Run:
+%     octave --no-gui --eval "install(); bench_binary_search();"
+%
+%   Exits 0 with "PASS: ..." on success; raises assert() (non-zero exit) if
+%   either direction's per-query scaling exponent exceeds the gate.
+%
+%   See also binary_search, binary_search_mex, bench_downsample_kernels.
+
+    here = fileparts(mfilename('fullpath'));
+    addpath(fullfile(here, '..'));
+    install();
+    % binary_search lives in libs/FastSense/ (not a private/ folder), so
+    % install() puts it on the path and it is directly callable here.
+
+    sizes  = [1e4, 1e5, 1e6, 1e7, 5e7];
+    labels = {'10K', '100K', '1M', '10M', '50M'};
+
+    nQueries = 20000;   % scalar lookups timed per (size, direction, run)
+    nRuns    = 5;       % median of nRuns
+
+    % Deterministic seed — works in both MATLAB and Octave
+    if exist('rng', 'file') == 2
+        rng(0);
+    else
+        rand('state', 0); %#ok<RAND>
+    end
+
+    % binary_search_mex lives in libs/FastSense/private. It is visible to
+    % binary_search.m (its parent folder) and is what the wrapper actually
+    % dispatches to — but it is NOT visible from this benchmark's context,
+    % so a plain exist('binary_search_mex','file') here would misreport as a
+    % fallback. Detect the built binary for THIS platform on disk instead.
+    mexPath = fullfile(here, '..', 'libs', 'FastSense', 'private', ...
+        ['binary_search_mex.' mexext]);
+    useMex = (exist(mexPath, 'file') ~= 0);
+
+    nSizes = numel(sizes);
+    tLeft  = zeros(1, nSizes);   % per-query seconds, 'left'
+    tRight = zeros(1, nSizes);   % per-query seconds, 'right'
+
+    fprintf('\n=== binary_search range-lookup microbenchmark ===\n');
+    fprintf('  binary_search_mex: %s\n', tf_(useMex));
+    fprintf('  %d scalar lookups per measurement, median of %d runs\n', nQueries, nRuns);
+    fprintf('  %s\n', repmat('-', 1, 74));
+    fprintf('  %-6s | %-14s %-12s | %-14s %-12s\n', ...
+        'N', 'left (us/q)', 'left Mq/s', 'right (us/q)', 'right Mq/s');
+    fprintf('  %s\n', repmat('-', 1, 74));
+
+    for c = 1:nSizes
+        n = sizes(c);
+        x = linspace(0, 100, n);          % sorted ascending (binary_search contract)
+        vals = 100 * rand(1, nQueries);   % query targets within range (not timed)
+
+        tLeft(c)  = timeSearch_(x, vals, 'left',  nRuns);
+        tRight(c) = timeSearch_(x, vals, 'right', nRuns);
+
+        fprintf('  %-6s | %12.4f   %10.2f   | %12.4f   %10.2f\n', ...
+            labels{c}, ...
+            tLeft(c)  * 1e6, 1 / tLeft(c)  / 1e6, ...
+            tRight(c) * 1e6, 1 / tRight(c) / 1e6);
+
+        clear x vals;
+    end
+    fprintf('  %s\n', repmat('-', 1, 74));
+
+    % ---- Scaling gate: per-query time must stay sub-linear in N ----
+    % Fit over N >= 1e5 (small N is dominated by fixed call/dispatch overhead
+    % and would flatten the slope). O(log N) + cache effects keep the exponent
+    % well under 1.0; a linear-scan regression drives it toward 1.0.
+    fitMask = sizes >= 1e5;
+    slopeLeft  = scalingExponent_(sizes(fitMask), tLeft(fitMask));
+    slopeRight = scalingExponent_(sizes(fitMask), tRight(fitMask));
+    growthLeft = tLeft(end) / max(tLeft(1), eps);
+
+    gate = 0.6;
+    fprintf('  Per-query scaling exponent (large-N fit, linear-scan ~1.0):\n');
+    fprintf('    left  : %.2f   (gate: <= %.1f)\n', slopeLeft, gate);
+    fprintf('    right : %.2f   (gate: <= %.1f)\n', slopeRight, gate);
+    fprintf('    per-query growth 10K->50M (left): %.1fx\n', growthLeft);
+    fprintf('  %s\n', repmat('-', 1, 74));
+
+    assert(slopeLeft <= gate, ...
+        sprintf(['FAIL: binary_search ''left'' per-query exponent %.2f exceeds %.1f — ' ...
+                 'search is no longer logarithmic (linear-scan regression?).'], slopeLeft, gate));
+    assert(slopeRight <= gate, ...
+        sprintf(['FAIL: binary_search ''right'' per-query exponent %.2f exceeds %.1f — ' ...
+                 'search is no longer logarithmic (linear-scan regression?).'], slopeRight, gate));
+    fprintf('  PASS: lookups stay sub-linear (gate: exponent <= %.1f).\n\n', gate);
+end
+
+function t = timeSearch_(x, vals, dir, nRuns)
+    %TIMESEARCH_ Median-of-nRuns per-query time of binary_search over a batch.
+    %   Warms up first, then times nQueries back-to-back scalar lookups per
+    %   run and returns the median run divided by nQueries.
+    nq = numel(vals);
+    binary_search(x, vals(1),   dir); %#ok<*NASGU> % warmup
+    binary_search(x, vals(end), dir);
+    runTimes = zeros(1, nRuns);
+    for r = 1:nRuns
+        t0 = tic;
+        for q = 1:nq
+            binary_search(x, vals(q), dir);
+        end
+        runTimes(r) = toc(t0);
+    end
+    t = median(runTimes) / nq;
+end
+
+function slope = scalingExponent_(ns, times)
+    %SCALINGEXPONENT_ Log-log slope of per-query time vs N (the growth exponent).
+    %   slope -> 0 indicates flat/logarithmic scaling; -> 1 indicates linear.
+    times = max(times, eps);
+    p = polyfit(log10(ns(:)), log10(times(:)), 1);
+    slope = p(1);
+end
+
+function s = tf_(b)
+    if b
+        s = 'active';
+    else
+        s = 'fallback (pure MATLAB)';
+    end
+end
diff --git a/benchmarks/bench_datastore_range.m b/benchmarks/bench_datastore_range.m
new file mode 100644
index 00000000..231c0633
--- /dev/null
+++ b/benchmarks/bench_datastore_range.m
@@ -0,0 +1,158 @@
+function bench_datastore_range()
+%BENCH_DATASTORE_RANGE Isolated gate for the disk-backed range-query hot path.
+%
+%   FastSenseDataStore is FastSense's out-of-core backend: datasets too large
+%   for RAM live in a chunked SQLite store, and every zoom/pan on a
+%   disk-backed line issues a range query (getRange) to pull just the visible
+%   window before downsampling. This is the large-data story's hot read path
+%   (resolve_disk_mex + chunked SQLite reads, WAL mode for live use).
+%
+%   Today the store has only EXPLORATORY coverage: benchmark_datastore.m
+%   (a .mat-vs-SQLite size sweep, and Linux-only — it shells out to `free`)
+%   and profile_datastore.m (a MATLAB-profiler bottleneck script). Neither is
+%   a focused, deterministic regression GATE. This benchmark fills that gap.
+%
+%   The key property a chunked, indexed store must preserve: for a FIXED-size
+%   view window, query latency should stay roughly CONSTANT as the total
+%   dataset grows — the store seeks to the window (≈ O(log N) index/chunk
+%   lookup) and reads only the window's points, never the whole dataset. To
+%   hold the returned point count constant across sizes, the window width is
+%   scaled inversely with dataset density (each query returns ~targetPts
+%   points regardless of N).
+%
+%   Gate (machine-independent): the log-log exponent of per-query time vs
+%   TOTAL dataset size must stay near zero (<= 0.5). A full-scan regression —
+%   where query cost grows with the whole dataset rather than the window —
+%   drives the exponent toward 1.0 and trips the gate, regardless of host
+%   speed or which backend (SQLite vs binary fallback) is active.
+%
+%   Store creation time (inherently O(N) chunked write) is reported for
+%   context but NOT gated.
+%
+%   Run:
+%     octave --no-gui --eval "install(); bench_datastore_range();"
+%
+%   Exits 0 with "PASS: ..." on success; raises assert() (non-zero exit) if
+%   range-query latency scales with total dataset size.
+%
+%   See also FastSenseDataStore, benchmark_datastore, profile_datastore,
+%            bench_binary_search.
+
+    here = fileparts(mfilename('fullpath'));
+    addpath(fullfile(here, '..'));
+    install();
+
+    sizes  = [1e5, 5e5, 1e6, 5e6];
+    labels = {'100K', '500K', '1M', '5M'};
+
+    targetPts = 10000;   % each range query returns ~this many points
+    nQueries  = 30;      % random view windows per measurement
+    nRuns     = 3;       % median of nRuns
+
+    xSpan = 1000;        % data spans X in [0, xSpan]
+
+    % Deterministic seed — works in both MATLAB and Octave
+    if exist('rng', 'file') == 2
+        rng(0);
+    else
+        rand('state', 0); randn('state', 0); %#ok<RAND>
+    end
+
+    hasSqlite = (exist('mksqlite', 'file') == 3);
+
+    % Warmup: absorb one-time SQLite/MEX/file-creation init on a throwaway
+    % store so the first sized store below isn't penalised (which would bias
+    % the scaling fit negative and inflate its create time).
+    wx = linspace(0, xSpan, 1000);
+    wds = FastSenseDataStore(wx, sin(wx / 50));
+    wds.getRange(0, xSpan / 10);
+    wds.cleanup();
+    clear wx wds;
+
+    nSizes  = numel(sizes);
+    tQuery  = zeros(1, nSizes);   % per-query seconds
+    tCreate = zeros(1, nSizes);   % store creation seconds
+    avgPts  = zeros(1, nSizes);   % points returned per query (first window)
+
+    fprintf('\n=== FastSenseDataStore range-query microbenchmark ===\n');
+    fprintf('  backend: %s\n', backend_(hasSqlite));
+    fprintf('  fixed view window ~%d pts, %d queries x median of %d runs\n', ...
+        targetPts, nQueries, nRuns);
+    fprintf('  %s\n', repmat('-', 1, 76));
+    fprintf('  %-6s | %-12s | %-14s %-12s | %-10s\n', ...
+        'N', 'create (s)', 'query (ms)', 'queries/s', 'pts/query');
+    fprintf('  %s\n', repmat('-', 1, 76));
+
+    for c = 1:nSizes
+        n = sizes(c);
+        x = linspace(0, xSpan, n);
+        y = sin(x / 50) + 0.1 * randn(1, n);
+
+        % Window width that returns ~targetPts points at this density
+        w = max(xSpan * targetPts / n, eps);
+        centers = (w / 2) + (xSpan - w) * rand(1, nQueries);
+
+        t0 = tic;
+        ds = FastSenseDataStore(x, y);
+        tCreate(c) = toc(t0);
+        clear x y;
+
+        try
+            % Warmup; record the first window's returned point count (all windows share w)
+            [wx, ~] = ds.getRange(centers(1) - w/2, centers(1) + w/2);
+            ds.getRange(centers(2) - w/2, centers(2) + w/2);
+            avgPts(c) = numel(wx);
+
+            runTimes = zeros(1, nRuns);
+            for r = 1:nRuns
+                tq = tic;
+                for q = 1:nQueries
+                    ds.getRange(centers(q) - w/2, centers(q) + w/2);
+                end
+                runTimes(r) = toc(tq);
+            end
+            tQuery(c) = median(runTimes) / nQueries;
+        catch err
+            ds.cleanup();   % never leak the temp store on failure
+            rethrow(err);
+        end
+        ds.cleanup();       % release SQLite handle + temp file before next size
+
+        fprintf('  %-6s | %10.3f   | %12.4f   %10.1f   | %9.0f\n', ...
+            labels{c}, tCreate(c), tQuery(c) * 1000, 1 / tQuery(c), avgPts(c));
+    end
+    fprintf('  %s\n', repmat('-', 1, 76));
+
+    % ---- Gate: fixed-window query time must NOT scale with total dataset ----
+    slope = scalingExponent_(sizes, tQuery);
+    growth = tQuery(end) / max(tQuery(1), eps);
+
+    gate = 0.5;
+    fprintf('  Query-time vs total-N exponent (indexed read ~0, full scan ~1.0):\n');
+    fprintf('    exponent : %.2f   (gate: <= %.1f)\n', slope, gate);
+    fprintf('    100K->5M query-time growth: %.2fx (50x more data)\n', growth);
+    fprintf('  %s\n', repmat('-', 1, 76));
+
+    assert(slope <= gate, ...
+        sprintf(['FAIL: getRange per-query time scales with total dataset size ' ...
+                 '(exponent %.2f > %.1f) — fixed-window queries should be ~constant; ' ...
+                 'full-scan / unindexed-read regression suspected.'], slope, gate));
+    fprintf('  PASS: fixed-window range queries stay ~constant vs dataset size (exponent <= %.1f).\n\n', gate);
+end
+
+function slope = scalingExponent_(ns, times)
+    %SCALINGEXPONENT_ Log-log slope of per-query time vs total dataset size.
+    %   slope -> 0 indicates the indexed store reads only the window;
+    %   slope -> 1 indicates cost grows with the whole dataset (full scan).
+    times = max(times, eps);
+    p = polyfit(log10(ns(:)), log10(times(:)), 1);
+    slope = p(1);
+end
+
+function s = backend_(hasSqlite)
+    if hasSqlite
+        s = 'mksqlite/SQLite (chunked)';
+    else
+        s = 'binary fallback (mksqlite absent)';
+    end
+end
diff --git a/benchmarks/bench_delimited_parse.m b/benchmarks/bench_delimited_parse.m
new file mode 100644
index 00000000..f0bce1c0
--- /dev/null
+++ b/benchmarks/bench_delimited_parse.m
@@ -0,0 +1,182 @@
+function bench_delimited_parse()
+%BENCH_DELIMITED_PARSE Isolated microbenchmark of the CSV-ingestion hot path.
+%
+%   The Tag pipeline ingests raw sensor data from delimited text (CSV/TSV)
+%   files. dispatchDelimitedParse_ is the parse entry point: it prefers the
+%   compiled delimited_parse_mex kernel and falls back to the pure
+%   MATLAB/Octave textscan-based readRawDelimited_ when the binary is absent.
+%   Per the in-repo note (Phase 1028), the MEX is ~10-40x faster than the
+%   fallback at harness scale — yet BatchTagPipeline / delimited ingestion has
+%   no benchmark at all. This is the front door for getting data into
+%   FastSense, and slow parsing directly inflates load time for large logs.
+%
+%   This benchmark generates deterministic multi-column CSV files of growing
+%   row count, times dispatchDelimitedParse_ on each (whichever path is active,
+%   MEX or fallback, is reported), and reports parse latency plus row and
+%   byte throughput. File generation is done once per size and is NOT timed.
+%
+%   Gate (machine-independent): delimited parsing is an O(rows) sweep, so the
+%   empirical log-log scaling exponent over the large-N portion must stay
+%   near-linear (<= 1.3). Super-linear creep (e.g. an accidental O(rows^2)
+%   reallocation in the fallback, or per-row overhead growth) trips the gate
+%   regardless of host speed.
+%
+%   Warmup parse absorbs first-call/JIT overhead; small files are parsed
+%   over an inner repeat loop so sub-millisecond parses stay measurable;
+%   median of nRuns damps one-off spikes. Temp files are always deleted.
+%
+%   Run:
+%     octave --no-gui --eval "install(); bench_delimited_parse();"
+%
+%   Exits 0 with "PASS: ..." on success; raises assert() (non-zero exit) if
+%   parse time scales super-linearly with row count.
+%
+%   See also dispatchDelimitedParse_, readRawDelimited_, delimited_parse_mex,
+%            BatchTagPipeline, bench_datastore_range.
+
+    here = fileparts(mfilename('fullpath'));
+    addpath(fullfile(here, '..'));
+    install();
+    % dispatchDelimitedParse_ / delimited_parse_mex / readRawDelimited_ live in
+    % SensorThreshold's private/ folder, which cannot be put on the path. The
+    % current working folder is always searched regardless of its name, so
+    % cd-ing into private makes them directly callable in both MATLAB and
+    % Octave (and makes the exist() MEX check accurate). onCleanup restores the
+    % original folder even if an assert below trips.
+    privDir = fullfile(here, '..', 'libs', 'SensorThreshold', 'private');
+    origDir = pwd;
+    restoreDir = onCleanup(@() cd(origDir)); %#ok<NASGU>
+    cd(privDir);
+
+    rows   = [1e3, 1e4, 1e5, 5e5];
+    labels = {'1K', '10K', '100K', '500K'};
+    nCols  = 4;          % time + 3 value columns (a modest "wide" sensor CSV)
+
+    nRuns      = 5;      % median of nRuns per size
+    targetRows = 2e5;    % inner-loop repeats sized to parse ~this many rows
+
+    % Deterministic seed — works in both MATLAB and Octave
+    if exist('rng', 'file') == 2
+        rng(0);
+    else
+        rand('state', 0); randn('state', 0); %#ok<RAND>
+    end
+
+    useMex = (exist('delimited_parse_mex', 'file') == 3);
+
+    nSizes  = numel(rows);
+    tParse  = zeros(1, nSizes);   % per-parse seconds
+    fileMB  = zeros(1, nSizes);
+
+    % Pre-generate the temp paths: onCleanup's handle captures tmpFiles by value
+    % at creation, so the list must be complete first (else the backstop is a no-op).
+    tmpFiles = arrayfun(@(k) [tempname, '.csv'], 1:nSizes, 'UniformOutput', false);
+    cleanupTmp = onCleanup(@() deleteFiles_(tmpFiles)); %#ok<NASGU>
+
+    fprintf('\n=== Delimited-parse (CSV ingestion) microbenchmark ===\n');
+    fprintf('  delimited_parse_mex: %s\n', tf_(useMex));
+    fprintf('  %d columns (time + %d values), median of %d runs\n', nCols, nCols - 1, nRuns);
+    fprintf('  %s\n', repmat('-', 1, 72));
+    fprintf('  %-6s | %-9s | %-13s %-12s %-10s\n', ...
+        'rows', 'file MB', 'parse (ms)', 'rows/s (M)', 'MB/s');
+    fprintf('  %s\n', repmat('-', 1, 72));
+
+    for c = 1:nSizes
+        n = rows(c);
+        path = tmpFiles{c};
+        writeCsv_(path, n, nCols);
+        d = dir(path);
+        fileMB(c) = d.bytes / 1e6;
+
+        nInner = max(1, ceil(targetRows / n));
+        tParse(c) = timeParse_(path, nInner, nRuns);
+
+        fprintf('  %-6s | %8.2f  | %11.4f   %10.2f   %8.1f\n', ...
+            labels{c}, fileMB(c), ...
+            tParse(c) * 1000, n / tParse(c) / 1e6, fileMB(c) / tParse(c));
+
+        delete(path);                 % free disk eagerly between sizes
+        tmpFiles{c} = '';             % already gone — don't double-delete
+    end
+    fprintf('  %s\n', repmat('-', 1, 72));
+
+    % ---- Scaling gate: fit exponent over the large-N portion (>= 1e4) ----
+    fitMask = rows >= 1e4;
+    slope = scalingExponent_(rows(fitMask), tParse(fitMask));
+
+    gate = 1.3;
+    fprintf('  Scaling exponent (large-N fit, ideal ~1.0): %.2f   (gate: <= %.1f)\n', slope, gate);
+    fprintf('  %s\n', repmat('-', 1, 72));
+
+    assert(slope <= gate, ...
+        sprintf(['FAIL: delimited parse scaling exponent %.2f exceeds %.1f — ' ...
+                 'super-linear creep in the CSV-ingestion path.'], slope, gate));
+    fprintf('  PASS: parsing scales near-linearly (gate: exponent <= %.1f).\n\n', gate);
+end
+
+function writeCsv_(path, n, nCols)
+    %WRITECSV_ Write a deterministic n-row, nCols-column CSV with a header.
+    %   Column 1 is a monotonic time axis; remaining columns are smooth
+    %   signals plus light noise. Generation is intentionally outside the
+    %   timed region.
+    x = linspace(0, 1000, n);
+    M = zeros(n, nCols);
+    M(:, 1) = x(:);
+    for k = 2:nCols
+        M(:, k) = sin(x(:) / (10 * k)) + 0.1 * randn(n, 1);
+    end
+
+    fid = fopen(path, 'w');
+    if fid == -1
+        error('bench:fileOpen', 'Cannot open temp file for writing: %s', path);
+    end
+    closer = onCleanup(@() fclose(fid)); %#ok<NASGU>
+
+    hdr = 't';
+    for k = 2:nCols
+        hdr = [hdr, sprintf(',c%d', k - 1)]; %#ok<AGROW>
+    end
+    fprintf(fid, '%s\n', hdr);
+
+    rowFmt = ['%.6g', repmat(',%.6g', 1, nCols - 1), '\n'];
+    fprintf(fid, rowFmt, M.');   % transpose: fprintf consumes column-major
+end
+
+function t = timeParse_(path, nInner, nRuns)
+    %TIMEPARSE_ Median-of-nRuns per-parse time of dispatchDelimitedParse_.
+    dispatchDelimitedParse_(path); % warmup (also primes OS file cache)
+    runTimes = zeros(1, nRuns);
+    for r = 1:nRuns
+        t0 = tic;
+        for i = 1:nInner
+            dispatchDelimitedParse_(path);
+        end
+        runTimes(r) = toc(t0);
+    end
+    t = median(runTimes) / nInner;
+end
+
+function slope = scalingExponent_(ns, times)
+    %SCALINGEXPONENT_ Log-log slope of per-parse time vs row count.
+    times = max(times, eps);
+    p = polyfit(log10(ns(:)), log10(times(:)), 1);
+    slope = p(1);
+end
+
+function deleteFiles_(files)
+    %DELETEFILES_ Best-effort cleanup of any temp files still present.
+    for i = 1:numel(files)
+        f = files{i};
+        if ~isempty(f) && exist(f, 'file')
+            delete(f);
+        end
+    end
+end
+
+function s = tf_(b)
+    if b
+        s = 'active';
+    else
+        s = 'fallback (pure MATLAB/Octave)';
+    end
+end
diff --git a/benchmarks/bench_downsample_kernels.m b/benchmarks/bench_downsample_kernels.m
new file mode 100644
index 00000000..992c5162
--- /dev/null
+++ b/benchmarks/bench_downsample_kernels.m
@@ -0,0 +1,163 @@
+function bench_downsample_kernels()
+%BENCH_DOWNSAMPLE_KERNELS Isolated microbenchmark of the downsampling hot path.
+%
+%   Downsampling is the single most performance-critical computation in
+%   FastSense: minmax_downsample / lttb_downsample run on every render and
+%   on every zoom/pan, over the full dataset (up to tens of millions of
+%   points). They are the reason the library exists. Yet the only existing
+%   coverage is a single minmax_downsample(x, y, 1000) call buried inside
+%   the render-heavy benchmark.m — mixed with figure creation and drawnow,
+%   never isolated, and LTTB is not benchmarked anywhere at all.
+%
+%   This benchmark times BOTH downsamplers as PURE computation (no figure,
+%   no rendering) across a size sweep, reporting per-call latency and
+%   throughput (Mpts/s). With the MEX kernels compiled it exercises
+%   minmax_core_mex / lttb_core_mex (the production path); without them it
+%   transparently times the pure-MATLAB fallbacks (a flag reports which).
+%
+%   Both methods are driven to the same output budget (~2000 points, a
+%   realistic display width) so their throughput is directly comparable.
+%
+%   Gate (machine-independent): downsampling is an O(N) sweep, so per-call
+%   time must scale near-linearly with N. The benchmark fits the empirical
+%   scaling exponent over the large-N portion of the sweep (where O(N)
+%   dominates measurement noise) and asserts it stays near-linear
+%   (exponent <= 1.3). Super-linear creep, the classic downsampling
+%   regression, trips this gate regardless of absolute machine speed.
+%
+%   Warmup passes absorb JIT/first-call overhead; small sizes are timed
+%   over an inner repeat loop so sub-millisecond calls stay measurable;
+%   the median of nRuns suppresses one-off spikes.
+%
+%   Run:
+%     octave --no-gui --eval "install(); bench_downsample_kernels();"
+%
+%   Exits 0 with "PASS: ..." on success; raises assert() (non-zero exit) if
+%   either kernel's empirical scaling exponent exceeds the gate.
+%
+%   See also minmax_downsample, lttb_downsample, benchmark, benchmark_zoom.
+
+    here = fileparts(mfilename('fullpath'));
+    addpath(fullfile(here, '..'));
+    install();
+    % minmax_downsample / lttb_downsample live in FastSense's private/ folder.
+    % A private/ folder cannot portably be put on the path (Octave permits it
+    % but MATLAB rejects it), so the wrappers are not callable from here. The
+    % current working folder is ALWAYS searched regardless of its name,
+    % however, so cd-ing into the private folder makes them directly callable
+    % in both MATLAB and Octave — no path manipulation, no touching libs/.
+    % onCleanup restores the original folder even if an assert below trips.
+    privDir = fullfile(here, '..', 'libs', 'FastSense', 'private');
+    origDir = pwd;
+    restoreDir = onCleanup(@() cd(origDir)); %#ok<NASGU>
+    cd(privDir);
+
+    sizes  = [1e4, 1e5, 1e6, 5e6, 1e7];
+    labels = {'10K', '100K', '1M', '5M', '10M'};
+
+    % Equal output budget so the two methods are directly comparable:
+    %   minmax emits ~2*numBuckets points -> numBuckets = 1000 -> ~2000 pts
+    %   lttb   emits numOut points          -> numOut     = 2000 -> 2000 pts
+    minmaxBuckets = 1000;
+    lttbOut       = 2000;
+
+    nRuns       = 5;     % median of nRuns per (method, size)
+    targetWork  = 2e6;   % inner-loop repeats sized to process ~this many pts
+
+    % Deterministic seed — works in both MATLAB and Octave
+    if exist('rng', 'file') == 2
+        rng(0);
+    else
+        rand('state', 0); randn('state', 0); %#ok<RAND>
+    end
+
+    mexMinmax = (exist('minmax_core_mex', 'file') == 3);
+    mexLttb   = (exist('lttb_core_mex',   'file') == 3);
+
+    nSizes = numel(sizes);
+    tMinmax = zeros(1, nSizes);   % per-call seconds
+    tLttb   = zeros(1, nSizes);
+
+    fprintf('\n=== Downsampling kernel microbenchmark (pure computation) ===\n');
+    fprintf('  MinMax MEX: %s   |   LTTB MEX: %s\n', tf_(mexMinmax), tf_(mexLttb));
+    fprintf('  Output budget: minmax numBuckets=%d (~%d pts)  lttb numOut=%d\n', ...
+        minmaxBuckets, 2 * minmaxBuckets, lttbOut);
+    fprintf('  %s\n', repmat('-', 1, 74));
+    fprintf('  %-6s | %-13s %-12s | %-13s %-12s\n', ...
+        'N', 'MinMax (ms)', 'MinMax Mpts/s', 'LTTB (ms)', 'LTTB Mpts/s');
+    fprintf('  %s\n', repmat('-', 1, 74));
+
+    for c = 1:nSizes
+        n = sizes(c);
+        x = linspace(0, 100, n);
+        y = sin(x * 2 * pi / 10) + 0.5 * randn(1, n);
+
+        nInner = max(1, ceil(targetWork / n));
+
+        tMinmax(c) = timeCall_(@() minmax_downsample(x, y, minmaxBuckets), nInner, nRuns);
+        tLttb(c)   = timeCall_(@() lttb_downsample(x, y, lttbOut),         nInner, nRuns);
+
+        fprintf('  %-6s | %11.3f   %10.1f   | %11.3f   %10.1f\n', ...
+            labels{c}, ...
+            tMinmax(c) * 1000, n / tMinmax(c) / 1e6, ...
+            tLttb(c)   * 1000, n / tLttb(c)   / 1e6);
+
+        clear x y;
+    end
+    fprintf('  %s\n', repmat('-', 1, 74));
+
+    % ---- Scaling gate: fit exponent over the large-N portion (>= 1e5) ----
+    % Small N is dominated by fixed dispatch/allocation overhead and would
+    % bias the slope; restrict the fit to where the O(N) sweep dominates.
+    fitMask = sizes >= 1e5;
+    slopeMinmax = scalingExponent_(sizes(fitMask), tMinmax(fitMask));
+    slopeLttb   = scalingExponent_(sizes(fitMask), tLttb(fitMask));
+
+    gate = 1.3;
+    fprintf('  Scaling exponent (large-N fit, ideal ~1.0):\n');
+    fprintf('    MinMax : %.2f   (gate: <= %.1f)\n', slopeMinmax, gate);
+    fprintf('    LTTB   : %.2f   (gate: <= %.1f)\n', slopeLttb, gate);
+    fprintf('  %s\n', repmat('-', 1, 74));
+
+    assert(slopeMinmax <= gate, ...
+        sprintf(['FAIL: minmax_downsample scaling exponent %.2f exceeds %.1f — ' ...
+                 'super-linear creep in the downsampling hot path.'], slopeMinmax, gate));
+    assert(slopeLttb <= gate, ...
+        sprintf(['FAIL: lttb_downsample scaling exponent %.2f exceeds %.1f — ' ...
+                 'super-linear creep in the downsampling hot path.'], slopeLttb, gate));
+    fprintf('  PASS: both kernels scale near-linearly (gate: exponent <= %.1f).\n\n', gate);
+end
+
+function t = timeCall_(fn, nInner, nRuns)
+    %TIMECALL_ Median-of-nRuns per-call time of fn, averaged over nInner reps.
+    %   Warms up first to absorb JIT/first-call overhead, then times nInner
+    %   back-to-back calls per run and returns the median run divided by
+    %   nInner — a robust per-call estimate that keeps sub-ms calls measurable.
+    fn(); fn(); % warmup
+    runTimes = zeros(1, nRuns);
+    for r = 1:nRuns
+        t0 = tic;
+        for i = 1:nInner
+            fn();
+        end
+        runTimes(r) = toc(t0);
+    end
+    t = median(runTimes) / nInner;
+end
+
+function slope = scalingExponent_(ns, times)
+    %SCALINGEXPONENT_ Log-log slope of per-call time vs N (the O(N) exponent).
+    %   slope ~ 1.0 indicates linear scaling; > 1 indicates super-linear creep.
+    %   Guards against a degenerate fit when timings are too small to resolve.
+    times = max(times, eps);
+    p = polyfit(log10(ns(:)), log10(times(:)), 1);
+    slope = p(1);
+end
+
+function s = tf_(b)
+    if b
+        s = 'active';
+    else
+        s = 'fallback (pure MATLAB/Octave)';
+    end
+end
diff --git a/benchmarks/bench_violation_cull.m b/benchmarks/bench_violation_cull.m
new file mode 100644
index 00000000..6d9d53ae
--- /dev/null
+++ b/benchmarks/bench_violation_cull.m
@@ -0,0 +1,174 @@
+function bench_violation_cull()
+%BENCH_VIOLATION_CULL Isolated microbenchmark of the threshold-marker hot path.
+%
+%   violation_cull is the fused detect-and-cull kernel behind threshold
+%   violation markers. On every render and every zoom/pan, FastSense calls it
+%   once per (threshold x line) for each threshold with ShowViolations: it
+%   finds the points that cross the threshold and culls them to one marker per
+%   pixel column in a single pass (FastSense.m:1368/1371, 4468/4471). It is
+%   MEX-accelerated (violation_cull_mex) with a pure-MATLAB fallback, and
+%   handles both constant thresholds and time-varying (step-function)
+%   thresholds — the latter a recent feature (per-widget time-varying spec).
+%   No benchmark exercises it directly; only bench_event_marker_regression
+%   touches a neighbouring render path (getEventsForTag).
+%
+%   This benchmark times BOTH threshold branches as pure computation (no
+%   figure, no rendering): a constant threshold (thX = 0 sentinel) and a
+%   multi-knot step-function threshold, across an input-size sweep, reporting
+%   per-call latency and throughput (input Mpts/s).
+%
+%   In production the input is the line's DISPLAYED (downsampled) data —
+%   typically a few thousand points (~2 x pixel width). The lower sizes here
+%   bracket that realistic range; the larger sizes exist to verify the kernel
+%   scales linearly with input length (the regression we actually guard).
+%
+%   Gate (machine-independent): detection + culling is an O(N) sweep, so the
+%   empirical log-log scaling exponent over the large-N portion must stay
+%   near-linear (<= 1.3). Super-linear creep trips the gate regardless of
+%   absolute host speed.
+%
+%   Warmup absorbs JIT/first-call overhead; small sizes are timed over an
+%   inner repeat loop so sub-millisecond calls stay measurable; the median
+%   of nRuns suppresses one-off spikes.
+%
+%   Run:
+%     octave --no-gui --eval "install(); bench_violation_cull();"
+%
+%   Exits 0 with "PASS: ..." on success; raises assert() (non-zero exit) if
+%   either branch's scaling exponent exceeds the gate.
+%
+%   See also violation_cull, compute_violations, compute_violations_dynamic,
+%            downsample_violations, bench_downsample_kernels.
+
+    here = fileparts(mfilename('fullpath'));
+    addpath(fullfile(here, '..'));
+    install();
+    % violation_cull (and violation_cull_mex) live in FastSense's private/
+    % folder, which cannot be put on the path. The current working folder is
+    % always searched regardless of its name, so cd-ing into private makes the
+    % wrapper directly callable in both MATLAB and Octave. onCleanup restores
+    % the original folder even if an assert below trips.
+    privDir = fullfile(here, '..', 'libs', 'FastSense', 'private');
+    origDir = pwd;
+    restoreDir = onCleanup(@() cd(origDir)); %#ok<NASGU>
+    cd(privDir);
+
+    sizes  = [1e3, 1e4, 1e5, 1e6];
+    labels = {'1K', '10K', '100K', '1M'};
+
+    nRuns      = 5;     % median of nRuns per (branch, size)
+    targetWork = 2e6;   % inner-loop repeats sized to process ~this many pts
+
+    % Deterministic seed — works in both MATLAB and Octave
+    if exist('rng', 'file') == 2
+        rng(0);
+    else
+        rand('state', 0); randn('state', 0); %#ok<RAND>
+    end
+
+    useMex = (exist('violation_cull_mex', 'file') == 3);
+
+    % Threshold configuration. Signal is sin + 0.5*randn (mostly within about
+    % [-2, 2]); an upper threshold at 0.5 yields a healthy fraction of
+    % violations so the culling stage does real work. The step-function branch
+    % uses 5 knots across the X range to exercise the piecewise-constant path.
+    direction  = 'upper';
+    constLevel = 0.5;
+    PixelWidth = 1000;                       % nominal axis width in pixels
+    stepKnotsN = 5;
+
+    nSizes = numel(sizes);
+    tConst = zeros(1, nSizes);   % per-call seconds, constant threshold
+    tStep  = zeros(1, nSizes);   % per-call seconds, step-function threshold
+
+    fprintf('\n=== violation_cull threshold-marker microbenchmark (pure computation) ===\n');
+    fprintf('  violation_cull_mex: %s\n', tf_(useMex));
+    fprintf('  direction=%s  constLevel=%.2f  stepKnots=%d  pixelWidth=%d\n', ...
+        direction, constLevel, stepKnotsN, PixelWidth);
+    fprintf('  (production input = displayed/downsampled data, ~few thousand pts)\n');
+    fprintf('  %s\n', repmat('-', 1, 74));
+    fprintf('  %-6s | %-13s %-12s | %-13s %-12s\n', ...
+        'N', 'const (ms)', 'const Mpts/s', 'step (ms)', 'step Mpts/s');
+    fprintf('  %s\n', repmat('-', 1, 74));
+
+    for c = 1:nSizes
+        n = sizes(c);
+        x = linspace(0, 100, n);                       % sorted ascending
+        y = sin(x * 2 * pi / 10) + 0.5 * randn(1, n);  % mostly within ~[-2, 2]
+
+        pw   = (x(end) - x(1)) / PixelWidth;           % X units per pixel
+        xmin = x(1);
+
+        % Step-function threshold: knots across the X range, varying levels
+        thX = linspace(x(1), x(end), stepKnotsN);
+        thY = constLevel + 0.2 * sin(1:stepKnotsN);
+
+        nInner = max(1, ceil(targetWork / n));
+
+        % Constant threshold uses the thX = 0 sentinel (matches FastSense.m)
+        tConst(c) = timeCall_(@() violation_cull(x, y, 0, constLevel, direction, pw, xmin), ...
+            nInner, nRuns);
+        tStep(c)  = timeCall_(@() violation_cull(x, y, thX, thY, direction, pw, xmin), ...
+            nInner, nRuns);
+
+        fprintf('  %-6s | %11.4f   %10.1f   | %11.4f   %10.1f\n', ...
+            labels{c}, ...
+            tConst(c) * 1000, n / tConst(c) / 1e6, ...
+            tStep(c)  * 1000, n / tStep(c)  / 1e6);
+
+        clear x y;
+    end
+    fprintf('  %s\n', repmat('-', 1, 74));
+
+    % ---- Scaling gate: fit exponent over the large-N portion (>= 1e4) ----
+    fitMask = sizes >= 1e4;
+    slopeConst = scalingExponent_(sizes(fitMask), tConst(fitMask));
+    slopeStep  = scalingExponent_(sizes(fitMask), tStep(fitMask));
+
+    gate = 1.3;
+    fprintf('  Scaling exponent (large-N fit, ideal ~1.0):\n');
+    fprintf('    constant : %.2f   (gate: <= %.1f)\n', slopeConst, gate);
+    fprintf('    step     : %.2f   (gate: <= %.1f)\n', slopeStep, gate);
+    fprintf('  %s\n', repmat('-', 1, 74));
+
+    assert(slopeConst <= gate, ...
+        sprintf(['FAIL: violation_cull (constant) scaling exponent %.2f exceeds %.1f — ' ...
+                 'super-linear creep in the threshold-marker path.'], slopeConst, gate));
+    assert(slopeStep <= gate, ...
+        sprintf(['FAIL: violation_cull (step) scaling exponent %.2f exceeds %.1f — ' ...
+                 'super-linear creep in the threshold-marker path.'], slopeStep, gate));
+    fprintf('  PASS: both branches scale near-linearly (gate: exponent <= %.1f).\n\n', gate);
+end
+
+function t = timeCall_(fn, nInner, nRuns)
+    %TIMECALL_ Median-of-nRuns per-call time of fn, averaged over nInner reps.
+    %   Warms up first to absorb JIT/first-call overhead, then times nInner
+    %   back-to-back calls per run and returns the median run divided by
+    %   nInner — a robust per-call estimate that keeps sub-ms calls measurable.
+    fn(); fn(); % warmup
+    runTimes = zeros(1, nRuns);
+    for r = 1:nRuns
+        t0 = tic;
+        for i = 1:nInner
+            fn();
+        end
+        runTimes(r) = toc(t0);
+    end
+    t = median(runTimes) / nInner;
+end
+
+function slope = scalingExponent_(ns, times)
+    %SCALINGEXPONENT_ Log-log slope of per-call time vs N (the O(N) exponent).
+    %   slope ~ 1.0 indicates linear scaling; > 1 indicates super-linear creep.
+    times = max(times, eps);
+    p = polyfit(log10(ns(:)), log10(times(:)), 1);
+    slope = p(1);
+end
+
+function s = tf_(b)
+    if b
+        s = 'active';
+    else
+        s = 'fallback (pure MATLAB/Octave)';
+    end
+end