133 changes: 133 additions & 0 deletions benchmarks/.reports/coverage.md
@@ -0,0 +1,133 @@
# Benchmark coverage map

Maintained by `/bench-evolve`. Each run adds **one** new benchmark targeting the
most important performance-critical path that still lacks coverage, then records
it here so the next run can see history and pick the next-biggest gap.

## Covered paths

| Path / hot spot | Benchmark(s) |
|---|---|
| FastSense vs `plot()`, point reduction | `benchmark.m` |
| Feature overhead (theme/band/shaded/fill/marker/grid) | `benchmark_features.m` |
| Memory: in-RAM vs disk-backed | `benchmark_memory.m` |
| Zoom/pan per-frame latency | `benchmark_zoom.m` |
| DataStore write / range query / disk render | `benchmark_datastore.m`, `profile_datastore.m` |
| Dashboard create / render / live tick | `bench_dashboard.m`, `bench_dashboard_live.m`, `bench_dashboard_load.m` |
| Legacy-vs-Tag consumer tick parity | `bench_consumer_migration_tick.m` |
| 1000-tag live pipeline tick | `bench_tag_pipeline_1k.m` |
| MonitorTag append vs recompute / tick | `bench_monitortag_append.m`, `bench_monitortag_tick.m` |
| SensorTag.getXY zero-copy | `bench_sensortag_getxy.m` |
| CompositeTag k-way merge | `bench_compositetag_merge.m` |
| Event-marker rendering overhead | `bench_event_marker_regression.m` |
| **Downsample kernels (LTTB + MinMax) throughput** | **`bench_downsample_kernels.m`** ← 2026-06-24 |
| **Threshold violation detect + pixel cull (constant & step)** | **`bench_violation_cull.m`** ← 2026-06-24 |
| **Binary-search (viewport-clip / bucket-boundary) per-call latency** | **`bench_binary_search.m`** ← 2026-06-24 |
| **Multi-line live-update scaling (lines-per-axes, refresh Hz)** | **`bench_fastsense_multiline.m`** ← 2026-06-24 |
| **Detached-mirror refresh overhead (headline constraint)** | **`bench_detached_mirror_refresh.m`** ← 2026-06-24 |
| **DerivedTag resolve-chain recompute vs depth** | **`bench_derived_resolve_chain.m`** ← 2026-06-24 |

## Run log

- **2026-06-24** — Added `bench_downsample_kernels.m`. Gap closed: the core
`resolve → downsample → render` kernels (`lttb_downsample`, `minmax_downsample`)
had **no direct microbenchmark** — `benchmark.m` called MinMax once at a single
size and LTTB was never timed. New bench sweeps 10K→20M points, times both
kernels (MEX path when available), reports Mpts/s throughput, and checks the
O(N) coefficient for super-linear creep. First MATLAB R2025b (MEX) run:
LTTB 42→344 Mpts/s, MinMax 133→1100 Mpts/s, scaling linear (0.91x / 0.74x).
- **2026-06-24** — Added `bench_violation_cull.m`. Gap closed: the fused
threshold violation-detection + pixel-culling kernel (`violation_cull`, MEX:
`violation_cull_mex`) — the threshold-marker counterpart of the downsamplers,
run on every render/zoom when a threshold is attached — had no direct bench.
New bench sweeps 10K→20M points across BOTH the constant and step/ZOH threshold
branches, reports Mpts/s + culled-marker counts, checks O(N) scaling. First
MATLAB R2025b (MEX) run: constant 118→504 Mpts/s, step 37→457 Mpts/s, markers
cap at the 1000-pixel budget by 5M, scaling linear (0.93x / 0.85x).
- **2026-06-24** — Added `bench_binary_search.m`. Gap closed: the O(log N)
`binary_search` kernel (MEX: `binary_search_mex`) — the most ubiquitous hot-path
call (every viewport clip / zoom does left+right edge searches; downsamplers map
bucket boundaries through it) — had no direct bench. New bench times per-call
latency over 1e5 scattered queries across 10K→50M-element arrays, both 'left'
and 'right', and checks growth vs O(log N). First MATLAB R2025b (MEX) run:
~918→1610 ns/call, 1M→50M growth 1.45x (vs pure-log 1.28x — log + cache), well
short of any O(N) degradation.
- **2026-06-24** — Added `bench_fastsense_multiline.m` (FIRST composite-path
bench; per-kernel surface now complete, see "Per-kernel MEX surface" below). Sweeps line count
(1→64) on one FastSense axes at 100K pts/line, timing updateData() (the live
path, which re-downsamples *all* lines per call) with SkipViewMode. First
MATLAB R2025b run: updateData 1.8→26 ms, us/line falls 1775→406 (sublinear —
fixed per-call overhead amortizes), effective refresh 564 Hz (1 line) → 39 Hz
(64 lines). Setup render ~1.1 s (figure-realization dominated, informational).
Verified portable (Octave API smoke passed).
- **2026-06-24** — Added `bench_detached_mirror_refresh.m` (the project's HEADLINE
constraint: "detached live-mirrored widgets must not degrade refresh rate").
Holds total widgets constant (8), detaches K=0/1/2/4 as mirrors, measures
amortized active onLiveTick(). **Key finding: detaching DOES add real refresh
overhead** — baseline ~18-20 ms (≈55 Hz), rising to +35% (1 mirror) … +120-200%
(4 mirrors). Mirrors tick in-line on the shared refresh path, so this is
expected; the bench baselines it so regressions (overhead GROWING) are caught.
Two methodology lessons baked in: (a) the path needs ~15-20 warmup ticks before
it settles — a short warmup made the first scenario eat all the JIT/cache noise
and falsely read "-61%"; (b) onLiveTick is BIMODAL under drawnow('limitrate'),
so the stable metric is the amortized average over a tick batch, not a per-tick
median. Verified MATLAB R2025b; **Octave-skipped** (see Notes).
- **2026-06-24** — Added `bench_derived_resolve_chain.m`. Gap closed: DerivedTag
(lazy-memoized derived-tag resolve) had NO bench — `bench_compositetag_merge`
covers the CompositeTag merge only. Builds a sensor->T1->...->TD chain, sweeps
depth 1→32, and times (a) COLD getXY after invalidating the whole chain (full
recompute = live cost) vs (b) WARM getXY (memoized cache). First MATLAB R2025b
run (1e6 pts/node): cold 0.69→14.2 ms ~linear in depth, warm flat ~0.02 ms
(memo saves the whole chain walk). Two tuning notes: 1e5 pts was noise-bound
(per-node <0.1 ms) so bumped to 1e6; cold recompute allocates a fresh array
per node (bursty GC) so the estimator is MIN-over-reps, not median. Verified
portable (full Octave run passed: cold 30 ms / warm 0.02 ms at depth 32).
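
The estimator choices recorded in the log above (median for microbenchmarks, amortized batch average for the bimodal `onLiveTick` path, MIN-over-reps for the GC-bursty cold resolve) can be compared side by side with a small helper. This is a minimal sketch under stated assumptions: `timing_estimators` and its parameter names are hypothetical and are not part of the repository, and the benches themselves may structure their timing loops differently.

```matlab
function est = timing_estimators(fn, nWarm, nReps, nBatch)
%TIMING_ESTIMATORS Summarize one callable three ways (hypothetical helper).
%   fn      zero-argument function handle to time
%   nWarm   untimed warmup calls (the log suggests 15-20 for the live path)
%   nReps   individually timed calls, used for the median and the minimum
%   nBatch  calls timed as a single batch, used for the amortized average

% Warmup: let JIT compilation and cache effects settle before timing.
for w = 1:nWarm
    fn();
end

% Per-call samples, one timer per call.
ts = zeros(1, nReps);
for r = 1:nReps
    t0 = tic;
    fn();
    ts(r) = toc(t0);
end

% One timer around the whole batch: stays stable when single calls are bimodal.
t0 = tic;
for b = 1:nBatch
    fn();
end
batchSec = toc(t0);

est = struct( ...
    'medianMs',    median(ts) * 1e3, ...
    'minMs',       min(ts) * 1e3, ...
    'amortizedMs', batchSec / nBatch * 1e3);
end
```

For example, `est = timing_estimators(@() sort(rand(1, 1e5)), 20, 9, 50)` returns all three figures in milliseconds, and `1e3 / est.amortizedMs` converts the amortized figure to an effective refresh rate in Hz.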

## Per-kernel MEX surface: COMPLETE

Every MEX kernel **actually wired into production** now has direct coverage:

| Kernel | Production caller | Bench |
|---|---|---|
| `lttb_core_mex`, `minmax_core_mex` | `lttb_downsample`, `minmax_downsample` | `bench_downsample_kernels.m` |
| `violation_cull_mex` | `violation_cull` | `bench_violation_cull.m` |
| `binary_search_mex` | `binary_search` (public + private) | `bench_binary_search.m` |
| `build_store_mex` | `FastSenseDataStore.m:672` | `benchmark_datastore.m` (write timing) |

**Three kernels are test-only / unwired** — they ship and have parity tests but
no production `.m` call site (verified by grepping `libs/**/*.m` for `<name>(`):
`compute_violations_mex`, `to_step_function_mex`, `resolve_disk_mex`. They are
**deliberately NOT benched** (a bench would not reflect any real hot path). This
is worth flagging to maintainers: either wire them in or treat them as staged.
Do **not** keep re-picking them in future runs.
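
The call-site check described above (grepping `libs/**/*.m` for `<name>(`) can be reproduced from inside MATLAB or Octave. The sketch below is one way to do it, not the command that was actually run: `find_call_sites` is a hypothetical helper name, and it uses a plain substring match, so a mention inside a comment or string also counts as a hit.

```matlab
function hits = find_call_sites(rootDir, name)
%FIND_CALL_SITES List .m files under rootDir whose text contains "<name>(".
%   Hypothetical helper, not part of the repo. Walks the tree by hand
%   instead of relying on '**' globbing, so it behaves the same in MATLAB
%   and Octave.
hits = {};
entries = dir(rootDir);
for k = 1:numel(entries)
    e = entries(k);
    p = fullfile(rootDir, e.name);
    if e.isdir
        % Recurse into real subfolders; skip the '.' and '..' entries.
        if ~any(strcmp(e.name, {'.', '..'}))
            hits = [hits, find_call_sites(p, name)]; %#ok<AGROW>
        end
    else
        [~, ~, ext] = fileparts(e.name);
        if strcmp(ext, '.m')
            txt = fileread(p);
            if ~isempty(strfind(txt, [name '('])) %#ok<STREMP>
                hits{end + 1} = p; %#ok<AGROW>
            end
        end
    end
end
end
```

Calling `find_call_sites('libs', 'resolve_disk_mex')` from the repository root should return an empty cell array for as long as that kernel stays unwired.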

## Remaining gaps (ranked — composite paths)

The loop has now pivoted from kernels to composite hot paths. Candidates:

1. **Pyramid-cache rebuild cost** — `FastSenseDataStore` pyramid level rebuild on
live append; touched by `benchmark_datastore.m` but not isolated.
**Top next gap:** deterministic (disk-backed) and needs no figures.
2. **Widget-count refresh sweep** — `bench_dashboard_live.m` fixes 8 widgets; a
widget-count sweep (no detach) would baseline the in-grid refresh scaling.
3. **DerivedTag fan-OUT / CompositeTag fan-IN width** — `bench_derived_resolve_chain.m`
covers chain DEPTH; the complementary axis is one sensor feeding many derived
tags (fan-out) and many sensors merged by one CompositeTag (fan-in width).

## Notes / corrections
- `compute_violations` + `compute_violations_dynamic` + `downsample_violations`
are covered **indirectly** by `bench_violation_cull.m`: the fused `violation_cull`
kernel dispatches to them in its pure-MATLAB fallback, and the bench drives both
the constant (`compute_violations`) and step/ZOH (`compute_violations_dynamic`)
branches. `compute_violations.m` standalone is a trivial one-line vectorized mask
(no MEX dispatch) — not worth an isolated bench.
- **Octave-compat finding (for maintainers; spun off as a separate task):** the
actual cause of the Octave failure is `DashboardWidgetRegistry.fromStruct` at
`libs/Dashboard/DashboardWidgetRegistry.m:92` — `w = feval([className '.fromStruct'], s)`.
Octave does not resolve that dotted-static-method feval form ("function
'FastSenseWidget.fromStruct' not found"). This breaks BOTH serialized dashboard
load (`DashboardEngine.load`) and the detach feature (`detachWidget` →
`DetachedMirror.cloneWidget:178` → the registry call) under Octave, despite
Octave being "fully supported." `bench_detached_mirror_refresh.m` guards with an
Octave skip. Fix candidate: `fn = str2func([className '.fromStruct']); w = fn(s);`.
Not changed here (additive-only remit).
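
The Octave skip mentioned in the last note can be sketched as below. This is an illustration only: `bench_octave_guard_sketch` and `is_octave_` are hypothetical names, and the actual guard in `bench_detached_mirror_refresh.m` may be written differently. The detection idiom relies on `OCTAVE_VERSION` being a builtin under Octave and absent under MATLAB.

```matlab
function result = bench_octave_guard_sketch()
%BENCH_OCTAVE_GUARD_SKETCH Skip-under-Octave pattern (hypothetical bench stub).
if is_octave_()
    % Report the skip and return a struct so callers can still baseline.
    fprintf('SKIP: detach path is not usable under Octave (registry feval issue).\n');
    result = struct('skipped', true);
    return;
end
% ... the MATLAB-only measurement would run here ...
result = struct('skipped', false);
end

function tf = is_octave_()
%IS_OCTAVE_ True under GNU Octave, false under MATLAB.
tf = exist('OCTAVE_VERSION', 'builtin') ~= 0;
end
```

Returning a struct with a `skipped` flag, rather than raising an error, keeps a skipped bench distinguishable from a failed one in whatever consumes the results.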
195 changes: 195 additions & 0 deletions benchmarks/bench_binary_search.m
@@ -0,0 +1,195 @@
function result = bench_binary_search()
%BENCH_BINARY_SEARCH Per-call latency microbenchmark for the binary-search kernel.
%
% Times binary_search (MEX: binary_search_mex), the O(log N) lower/upper
% bound search over a sorted time array. It is the most ubiquitously called
% kernel on the hot path: every viewport clip and every zoom/pan locates
% the visible window by two binary searches (left edge + right edge) into
% the full sorted X array, and the downsamplers call it to map bucket
% boundaries to indices. The private copy's own header calls it "used
% extensively by the downsampling and viewport-clipping routines."
%
% Unlike the downsample/violation kernels (O(N), tens of ms), a single
% binary search costs on the order of a microsecond, so throughput-per-point
% is the wrong metric. This bench instead measures PER-CALL latency over a
% large batch of scattered queries and reports how it scales with log2(N).
% It times the binary_search wrapper rather than binary_search_mex
% directly, so the measured number includes the MATLAB->MEX dispatch the
% caller actually pays and is the real per-query cost on the hot path.
% No bench_*.m exercised it before.
%
% What it measures (deterministic, no RNG):
% - Per-call latency (ns) for both 'left' and 'right' searches, median
% of nReps after warmup, over a size sweep (10K -> 50M element arrays).
% - Throughput in millions of calls/second.
% - Whether latency tracks O(log N) (each doubling of the array costs the
% search one extra comparison) rather than degrading to O(N).
%
% Queries are a deterministic low-discrepancy scramble across [min, max]
% plus a 5% out-of-range margin (exercising the clamp branches), so the
% access pattern into the searched array is scattered — realistic for
% viewport clipping, not an artificially cache-friendly monotone sweep.
%
% Latency bench, not a pass/fail gate: it PRINTS results and a soft
% scaling advisory. The next /bench-guard (or /perf-watch) run baselines
% the absolute numbers. The active path (compiled MEX vs pure-MATLAB
% fallback) is detected and labelled — the MEX speedup is what it protects.
%
% Run:
% octave --no-gui --eval "install(); bench_binary_search();"
% % or in MATLAB:
% bench_binary_search
%
% Returns a struct (sizes, per-direction ns/call + Mcalls/s, MEX flag,
% scaling drift) for programmatic baselining.
%
% See also binary_search, binary_search_mex, lttb_downsample,
% minmax_downsample, bench_downsample_kernels.

here = fileparts(mfilename('fullpath'));
addpath(fullfile(here, '..'));
install();

% binary_search has a public copy on the path, but binary_search_mex is a
% PRIVATE helper. We cd into the private folder for the bench: the private
% binary_search.m is callable from cwd and sees the MEX (which lives there
% too), so this exercises the real MEX path AND lets us detect it reliably
% via exist(). onCleanup restores cwd even on error. Portable MATLAB/Octave.
privDir = fullfile(here, '..', 'libs', 'FastSense', 'private');
origDir = pwd;
restoreCwd = onCleanup(@() cd(origDir)); %#ok<NASGU>
cd(privDir);

% ---- Configuration ----
sizes = [1e4, 1e5, 1e6, 1e7, 5e7];
labels = {'10K', '100K', '1M', '10M', '50M'};

nQueries = 1e5; % calls per timed run (per direction)
nWarm = 2; % warmup runs absorb JIT / first-call MEX load
nReps = 5; % median over nReps damps one-off spikes
scramble = 31337; % prime, coprime to nQueries -> stride hits every index once, scattered

bsMex = (exist('binary_search_mex', 'file') == 3);

fprintf('\n================================================================\n');
fprintf(' FastSense Binary-Search Kernel Latency Benchmark\n');
fprintf(' O(log N) viewport-clip / bucket-boundary search (per-call cost)\n');
fprintf('================================================================\n');
fprintf(' queries/run = %d path: %s warmup = %d reps = %d (median)\n', ...
    nQueries, pathLabel_(bsMex), nWarm, nReps);
fprintf(' %s\n', repmat('-', 1, 72));
fprintf(' %-6s | %7s | %-21s | %-21s\n', 'N', 'log2 N', 'left search', 'right search');
fprintf(' %-6s | %7s | %11s %9s | %11s %9s\n', ...
    '', '(cmp)', 'ns/call', 'Mcall/s', 'ns/call', 'Mcall/s');
fprintf(' %s\n', repmat('-', 1, 72));

nS = numel(sizes);
nsL = zeros(1, nS); tputL = zeros(1, nS);
nsR = zeros(1, nS); tputR = zeros(1, nS);

% Deterministic scrambled query order (computed once; same length each size)
perm = mod((0:nQueries - 1) * scramble, nQueries) + 1;

for c = 1:nS
    n = sizes(c);

    % Sorted ascending array to search. Only X is needed.
    x = linspace(0, n / 100, n);
    xmin = x(1);
    xmax = x(end);
    span = xmax - xmin;

    % Query values span the range + 5% margin on each side (clamp paths),
    % visited in a scattered, deterministic order (no RNG).
    valsSorted = linspace(xmin - 0.05 * span, xmax + 0.05 * span, nQueries);
    vals = valsSorted(perm);

    nsL(c) = perCallNs_(@() runQueries_(x, vals, 'left'), nWarm, nReps, nQueries);
    tputL(c) = 1e3 / nsL(c); % Mcalls/s = (1e9 / ns) / 1e6 = 1e3 / ns
    nsR(c) = perCallNs_(@() runQueries_(x, vals, 'right'), nWarm, nReps, nQueries);
    tputR(c) = 1e3 / nsR(c);

    fprintf(' %-6s | %7.1f | %11.1f %9.1f | %11.1f %9.1f\n', ...
        labels{c}, log2(n), nsL(c), tputL(c), nsR(c), tputR(c));
end

fprintf(' %s\n', repmat('-', 1, 72));

% ---- Soft scaling advisory ----
% Pure O(log N): latency from 1M to 50M should rise by at most
% log2(50M)/log2(1M) ~= 1.3x (or stay flat if dispatch-bound). A >3x rise
% means the search is degrading worse than logarithmic. Advisory only.
refIdx = 3; % 1M
lDrift = nsL(end) / nsL(refIdx);
rDrift = nsR(end) / nsR(refIdx);
logRatio = log2(sizes(end)) / log2(sizes(refIdx));

fprintf(' Scaling (ns/call, %s -> %s; pure O(log N) ~= %.2fx):\n', ...
    labels{refIdx}, labels{end}, logRatio);
fprintf(' left : %.1f -> %.1f (%.2fx) %s\n', ...
    nsL(refIdx), nsL(end), lDrift, logDriftLabel_(lDrift));
fprintf(' right : %.1f -> %.1f (%.2fx) %s\n', ...
    nsR(refIdx), nsR(end), rDrift, logDriftLabel_(rDrift));
fprintf(' %s\n', repmat('-', 1, 72));
fprintf(' Note: latency bench (no time gate). /bench-guard baselines these numbers.\n\n');

result = struct( ...
    'sizes', sizes, ...
    'labels', {labels}, ...
    'nQueries', nQueries, ...
    'binarySearchMex', bsMex, ...
    'leftNsPerCall', nsL, ...
    'leftMcallsPerS', tputL, ...
    'rightNsPerCall', nsR, ...
    'rightMcallsPerS', tputR, ...
    'leftDrift', lDrift, ...
    'rightDrift', rDrift, ...
    'logRatio', logRatio);
end

function acc = runQueries_(x, vals, direction)
%RUNQUERIES_ Issue numel(vals) binary searches; accumulate idx as a sink
% so the loop body cannot be optimized away.
acc = 0;
for q = 1:numel(vals)
    acc = acc + binary_search(x, vals(q), direction);
end
end

function ns = perCallNs_(fn, nWarm, nReps, nCalls)
%PERCALLNS_ Median per-call latency (ns) of fn (which issues nCalls calls).
for w = 1:nWarm
    fn();
end
ts = zeros(1, nReps);
for r = 1:nReps
    tic;
    fn();
    ts(r) = toc;
end
ns = (median(ts) / nCalls) * 1e9;
end

function s = pathLabel_(useMex)
%PATHLABEL_ Human label for the active kernel implementation.
if useMex
    s = 'MEX (compiled)';
else
    s = 'pure-MATLAB fallback';
end
end

function s = logDriftLabel_(drift)
%LOGDRIFTLABEL_ Soft verdict on latency growth vs O(log N).
% Reference: a 50x array growth costs pure O(log N) only ~1.3x. Mild
% excess above that is cache-miss cost on the larger array, still
% logarithmic in comparison count; a large multiple means the search
% has degraded toward O(N).
if drift > 3.0
    s = '<< WATCH: growth exceeds O(log N)';
elseif drift > 1.35
    s = '(grows ~O(log N) + cache)';
else
    s = '(flat — dispatch-bound)';
end
end