diff --git a/benchmarks/.reports/coverage.md b/benchmarks/.reports/coverage.md new file mode 100644 index 00000000..fc24fe84 --- /dev/null +++ b/benchmarks/.reports/coverage.md @@ -0,0 +1,133 @@ +# Benchmark coverage map + +Maintained by `/bench-evolve`. Each run adds **one** new benchmark targeting the +most important performance-critical path that still lacks coverage, then records +it here so the next run can see history and pick the next-biggest gap. + +## Covered paths + +| Path / hot spot | Benchmark(s) | +|---|---| +| FastSense vs `plot()`, point reduction | `benchmark.m` | +| Feature overhead (theme/band/shaded/fill/marker/grid) | `benchmark_features.m` | +| Memory: in-RAM vs disk-backed | `benchmark_memory.m` | +| Zoom/pan per-frame latency | `benchmark_zoom.m` | +| DataStore write / range query / disk render | `benchmark_datastore.m`, `profile_datastore.m` | +| Dashboard create / render / live tick | `bench_dashboard.m`, `bench_dashboard_live.m`, `bench_dashboard_load.m` | +| Legacy-vs-Tag consumer tick parity | `bench_consumer_migration_tick.m` | +| 1000-tag live pipeline tick | `bench_tag_pipeline_1k.m` | +| MonitorTag append vs recompute / tick | `bench_monitortag_append.m`, `bench_monitortag_tick.m` | +| SensorTag.getXY zero-copy | `bench_sensortag_getxy.m` | +| CompositeTag k-way merge | `bench_compositetag_merge.m` | +| Event-marker rendering overhead | `bench_event_marker_regression.m` | +| **Downsample kernels (LTTB + MinMax) throughput** | **`bench_downsample_kernels.m`** ← 2026-06-24 | +| **Threshold violation detect + pixel cull (constant & step)** | **`bench_violation_cull.m`** ← 2026-06-24 | +| **Binary-search (viewport-clip / bucket-boundary) per-call latency** | **`bench_binary_search.m`** ← 2026-06-24 | +| **Multi-line live-update scaling (lines-per-axes, refresh Hz)** | **`bench_fastsense_multiline.m`** ← 2026-06-24 | +| **Detached-mirror refresh overhead (headline constraint)** | **`bench_detached_mirror_refresh.m`** ← 2026-06-24 | +| 
**DerivedTag resolve-chain recompute vs depth** | **`bench_derived_resolve_chain.m`** ← 2026-06-24 | + +## Run log + +- **2026-06-24** — Added `bench_downsample_kernels.m`. Gap closed: the core + `resolve → downsample → render` kernels (`lttb_downsample`, `minmax_downsample`) + had **no direct microbenchmark** — `benchmark.m` called MinMax once at a single + size and LTTB was never timed. New bench sweeps 10K→20M points, times both + kernels (MEX path when available), reports Mpts/s throughput, and checks the + O(N) coefficient for super-linear creep. First MATLAB R2025b (MEX) run: + LTTB 42→344 Mpts/s, MinMax 133→1100 Mpts/s, scaling linear (0.91x / 0.74x). +- **2026-06-24** — Added `bench_violation_cull.m`. Gap closed: the fused + threshold violation-detection + pixel-culling kernel (`violation_cull`, MEX: + `violation_cull_mex`) — the threshold-marker counterpart of the downsamplers, + run on every render/zoom when a threshold is attached — had no direct bench. + New bench sweeps 10K→20M points across BOTH the constant and step/ZOH threshold + branches, reports Mpts/s + culled-marker counts, checks O(N) scaling. First + MATLAB R2025b (MEX) run: constant 118→504 Mpts/s, step 37→457 Mpts/s, markers + cap at the 1000-pixel budget by 5M, scaling linear (0.93x / 0.85x). +- **2026-06-24** — Added `bench_binary_search.m`. Gap closed: the O(log N) + `binary_search` kernel (MEX: `binary_search_mex`) — the most ubiquitous hot-path + call (every viewport clip / zoom does left+right edge searches; downsamplers map + bucket boundaries through it) — had no direct bench. New bench times per-call + latency over 1e5 scattered queries across 10K→50M-element arrays, both 'left' + and 'right', and checks growth vs O(log N). First MATLAB R2025b (MEX) run: + ~918→1610 ns/call, 1M→50M growth 1.45x (vs pure-log 1.28x — log + cache), well + short of any O(N) degradation. 
+- **2026-06-24** — Added `bench_fastsense_multiline.m` (FIRST composite-path + bench; per-kernel surface now complete — see Notes). Sweeps line count + (1→64) on one FastSense axes at 100K pts/line, timing updateData() (the live + path, which re-downsamples *all* lines per call) with SkipViewMode. First + MATLAB R2025b run: updateData 1.8→26 ms, us/line falls 1775→406 (sublinear — + fixed per-call overhead amortizes), effective refresh 564 Hz (1 line) → 39 Hz + (64 lines). Setup render ~1.1 s (figure-realization dominated, informational). + Verified portable (Octave API smoke passed). +- **2026-06-24** — Added `bench_detached_mirror_refresh.m` (the project's HEADLINE + constraint: "detached live-mirrored widgets must not degrade refresh rate"). + Holds total widgets constant (8), detaches K=0/1/2/4 as mirrors, measures + amortized active onLiveTick(). **Key finding: detaching DOES add real refresh + overhead** — baseline ~18-20 ms (≈55 Hz), rising to +35% (1 mirror) … +120-200% + (4 mirrors). Mirrors tick in-line on the shared refresh path, so this is + expected; the bench baselines it so regressions (overhead GROWING) are caught. + Two methodology lessons baked in: (a) the path needs ~15-20 warmup ticks before + it settles — a short warmup made the first scenario eat all the JIT/cache noise + and falsely read "-61%"; (b) onLiveTick is BIMODAL under drawnow('limitrate'), + so the stable metric is the amortized average over a tick batch, not a per-tick + median. Verified MATLAB R2025b; **Octave-skipped** (see Notes). +- **2026-06-24** — Added `bench_derived_resolve_chain.m`. Gap closed: DerivedTag + (lazy-memoized derived-tag resolve) had NO bench — `bench_compositetag_merge` + covers the CompositeTag merge only. Builds a sensor->T1->...->TD chain, sweeps + depth 1→32, and times (a) COLD getXY after invalidating the whole chain (full + recompute = live cost) vs (b) WARM getXY (memoized cache). 
First MATLAB R2025b + run (1e6 pts/node): cold 0.69→14.2 ms ~linear in depth, warm flat ~0.02 ms + (memo saves the whole chain walk). Two tuning notes: 1e5 pts was noise-bound + (per-node <0.1 ms) so bumped to 1e6; cold recompute allocates a fresh array + per node (bursty GC) so the estimator is MIN-over-reps, not median. Verified + portable (full Octave run passed: cold 30 ms / warm 0.02 ms at depth 32). + +## Per-kernel MEX surface: COMPLETE + +Every MEX kernel **actually wired into production** now has direct coverage: + +| Kernel | Production caller | Bench | +|---|---|---| +| `lttb_core_mex`, `minmax_core_mex` | `lttb_downsample`, `minmax_downsample` | `bench_downsample_kernels.m` | +| `violation_cull_mex` | `violation_cull` | `bench_violation_cull.m` | +| `binary_search_mex` | `binary_search` (public + private) | `bench_binary_search.m` | +| `build_store_mex` | `FastSenseDataStore.m:672` | `benchmark_datastore.m` (write timing) | + +**Three kernels are test-only / unwired** — they ship and have parity tests but +no production `.m` call site (verified by grepping `libs/**/*.m` for `(`): +`compute_violations_mex`, `to_step_function_mex`, `resolve_disk_mex`. They are +**deliberately NOT benched** (a bench would not reflect any real hot path). This +is worth flagging to maintainers: either wire them in or treat them as staged. +Do **not** keep re-picking them in future runs. + +## Remaining gaps (ranked — composite paths) + +The loop has now pivoted from kernels to composite hot paths. Candidates: + +1. **Pyramid-cache rebuild cost** — `FastSenseDataStore` pyramid level rebuild on + live append; touched by `benchmark_datastore.m` but not isolated. + **Top next gap** — deterministic (disk), no figures. +2. **Widget-count refresh sweep** — `bench_dashboard_live.m` fixes 8 widgets; a + widget-count sweep (no detach) would baseline the in-grid refresh scaling. +3. 
**DerivedTag fan-OUT / CompositeTag fan-IN width** — `bench_derived_resolve_chain.m` + covers chain DEPTH; the complementary axis is one sensor feeding many derived + tags (fan-out) and many sensors merged by one CompositeTag (fan-in width). + +### Notes / corrections +- `compute_violations` + `compute_violations_dynamic` + `downsample_violations` + are covered **indirectly** by `bench_violation_cull.m`: the fused `violation_cull` + kernel dispatches to them in its pure-MATLAB fallback, and the bench drives both + the constant (`compute_violations`) and step/ZOH (`compute_violations_dynamic`) + branches. `compute_violations.m` standalone is a trivial one-line vectorized mask + (no MEX dispatch) — not worth an isolated bench. +- **Octave-compat finding (for maintainers; spun off as a separate task):** the + real culprit is `DashboardWidgetRegistry.fromStruct` at + `libs/Dashboard/DashboardWidgetRegistry.m:92` — `w = feval([className '.fromStruct'], s)`. + Octave does not resolve that dotted-static-method feval form ("function + 'FastSenseWidget.fromStruct' not found"). This breaks BOTH serialized dashboard + load (`DashboardEngine.load`) and the detach feature (`detachWidget` → + `DetachedMirror.cloneWidget:178` → the registry call) under Octave, despite + Octave being "fully supported." `bench_detached_mirror_refresh.m` guards with an + Octave skip. Fix candidate: `fn = str2func([className '.fromStruct']); w = fn(s);`. + Not changed here (additive-only remit). diff --git a/benchmarks/bench_binary_search.m b/benchmarks/bench_binary_search.m new file mode 100644 index 00000000..87849a22 --- /dev/null +++ b/benchmarks/bench_binary_search.m @@ -0,0 +1,195 @@ +function result = bench_binary_search() +%BENCH_BINARY_SEARCH Per-call latency microbenchmark for the binary-search kernel. +% +% Times binary_search (MEX: binary_search_mex), the O(log N) lower/upper +% bound search over a sorted time array. 
It is the most ubiquitously called +% kernel on the hot path: every viewport clip and every zoom/pan locates +% the visible window by two binary searches (left edge + right edge) into +% the full sorted X array, and the downsamplers call it to map bucket +% boundaries to indices. The private copy's own header calls it "used +% extensively by the downsampling and viewport-clipping routines." +% +% Unlike the downsample/violation kernels (O(N), tens of ms), a single +% binary search takes on the order of a microsecond, so throughput-per-point +% is the wrong metric. This bench instead measures PER-CALL latency over a +% large batch of scattered queries, and reports how it scales with log2(N). +% Because the dominant cost is the MATLAB->MEX dispatch the caller actually pays +% (the bench times the binary_search wrapper, not binary_search_mex +% directly), the measured number is the real per-query cost on the hot +% path. No bench_*.m exercised it before. +% +% What it measures (deterministic, no RNG): +% - Per-call latency (ns) for both 'left' and 'right' searches, median +% of nReps after warmup, over a size sweep (10K -> 50M element arrays). +% - Throughput in millions of calls/second. +% - Whether latency tracks O(log N) (each doubling of the array costs the +% search one extra comparison) rather than degrading to O(N). +% +% Queries are a deterministic low-discrepancy scramble across [min, max] +% plus a 5% out-of-range margin (exercising the clamp branches), so the +% access pattern into the searched array is scattered: realistic for +% viewport clipping, not an artificially cache-friendly monotone sweep. +% +% Latency bench, not a pass/fail gate: it PRINTS results and a soft +% scaling advisory. The next /bench-guard (or /perf-watch) run baselines +% the absolute numbers. The active path (compiled MEX vs pure-MATLAB +% fallback) is detected and labelled; the MEX speedup is what it protects. 
+% +% Run: +% octave --no-gui --eval "install(); bench_binary_search();" +% % or in MATLAB: +% bench_binary_search +% +% Returns a struct (sizes, per-direction ns/call + Mcalls/s, MEX flag, +% scaling drift) for programmatic baselining. +% +% See also binary_search, binary_search_mex, lttb_downsample, +% minmax_downsample, bench_downsample_kernels. + + here = fileparts(mfilename('fullpath')); + addpath(fullfile(here, '..')); + install(); + + % binary_search has a public copy on the path, but binary_search_mex is a + % PRIVATE helper. We cd into the private folder for the bench: the private + % binary_search.m is callable from cwd and sees the MEX (which lives there + % too), so this exercises the real MEX path AND lets us detect it reliably + % via exist(). onCleanup restores cwd even on error. Portable MATLAB/Octave. + privDir = fullfile(here, '..', 'libs', 'FastSense', 'private'); + origDir = pwd; + restoreCwd = onCleanup(@() cd(origDir)); %#ok + cd(privDir); + + % ---- Configuration ---- + sizes = [1e4, 1e5, 1e6, 1e7, 5e7]; + labels = {'10K', '100K', '1M', '10M', '50M'}; + + nQueries = 1e5; % calls per timed run (per direction) + nWarm = 2; % warmup runs dissolve JIT / first-call MEX load + nReps = 5; % median over nReps defuses one-off spikes + scramble = 31337; % prime, coprime to nQueries -> low-discrepancy order + + bsMex = (exist('binary_search_mex', 'file') == 3); + + fprintf('\n================================================================\n'); + fprintf(' FastSense Binary-Search Kernel Latency Benchmark\n'); + fprintf(' O(log N) viewport-clip / bucket-boundary search (per-call cost)\n'); + fprintf('================================================================\n'); + fprintf(' queries/run = %d path: %s warmup = %d reps = %d (median)\n', ... 
+ nQueries, pathLabel_(bsMex), nWarm, nReps); + fprintf(' %s\n', repmat('-', 1, 72)); + fprintf(' %-6s | %7s | %-21s | %-21s\n', 'N', 'log2 N', 'left search', 'right search'); + fprintf(' %-6s | %7s | %11s %9s | %11s %9s\n', ... + '', '(cmp)', 'ns/call', 'Mcall/s', 'ns/call', 'Mcall/s'); + fprintf(' %s\n', repmat('-', 1, 72)); + + nS = numel(sizes); + nsL = zeros(1, nS); tputL = zeros(1, nS); + nsR = zeros(1, nS); tputR = zeros(1, nS); + + % Deterministic scrambled query order (computed once; same length each size) + perm = mod((0:nQueries - 1) * scramble, nQueries) + 1; + + for c = 1:nS + n = sizes(c); + + % Sorted ascending array to search. Only X is needed. + x = linspace(0, n / 100, n); + xmin = x(1); + xmax = x(end); + span = xmax - xmin; + + % Query values span the range + 5% margin on each side (clamp paths), + % visited in a scattered, deterministic order (no RNG). + valsSorted = linspace(xmin - 0.05 * span, xmax + 0.05 * span, nQueries); + vals = valsSorted(perm); + + nsL(c) = perCallNs_(@() runQueries_(x, vals, 'left'), nWarm, nReps, nQueries); + tputL(c) = 1e3 / nsL(c); % Mcalls/s: (1e9 / ns) calls/s divided by 1e6 == 1e3 / ns + nsR(c) = perCallNs_(@() runQueries_(x, vals, 'right'), nWarm, nReps, nQueries); + tputR(c) = 1e3 / nsR(c); + + fprintf(' %-6s | %7.1f | %11.1f %9.1f | %11.1f %9.1f\n', ... + labels{c}, log2(n), nsL(c), tputL(c), nsR(c), tputR(c)); + end + + fprintf(' %s\n', repmat('-', 1, 72)); + + % ---- Soft scaling advisory ---- + % Pure O(log N): latency from 1M to 50M should rise by at most + % log2(50M)/log2(1M) ~= 1.3x (or stay flat if dispatch-bound). A >3x rise + % means the search is degrading worse than logarithmic. Advisory only. + refIdx = 3; % 1M + lDrift = nsL(end) / nsL(refIdx); + rDrift = nsR(end) / nsR(refIdx); + logRatio = log2(sizes(end)) / log2(sizes(refIdx)); + + fprintf(' Scaling (ns/call, %s -> %s; pure O(log N) ~= %.2fx):\n', ... + labels{refIdx}, labels{end}, logRatio); + fprintf(' left : %.1f -> %.1f (%.2fx) %s\n', ... 
+ nsL(refIdx), nsL(end), lDrift, logDriftLabel_(lDrift)); + fprintf(' right : %.1f -> %.1f (%.2fx) %s\n', ... + nsR(refIdx), nsR(end), rDrift, logDriftLabel_(rDrift)); + fprintf(' %s\n', repmat('-', 1, 72)); + fprintf(' Note: latency bench (no time gate). /bench-guard baselines these numbers.\n\n'); + + result = struct( ... + 'sizes', sizes, ... + 'labels', {labels}, ... + 'nQueries', nQueries, ... + 'binarySearchMex', bsMex, ... + 'leftNsPerCall', nsL, ... + 'leftMcallsPerS', tputL, ... + 'rightNsPerCall', nsR, ... + 'rightMcallsPerS', tputR, ... + 'leftDrift', lDrift, ... + 'rightDrift', rDrift, ... + 'logRatio', logRatio); +end + +function acc = runQueries_(x, vals, direction) + %RUNQUERIES_ Issue numel(vals) binary searches; accumulate idx as a sink + % so the loop body cannot be optimized away. + acc = 0; + for q = 1:numel(vals) + acc = acc + binary_search(x, vals(q), direction); + end +end + +function ns = perCallNs_(fn, nWarm, nReps, nCalls) + %PERCALLNS_ Median per-call latency (ns) of fn (which issues nCalls calls). + for w = 1:nWarm + fn(); + end + ts = zeros(1, nReps); + for r = 1:nReps + tic; + fn(); + ts(r) = toc; + end + ns = (median(ts) / nCalls) * 1e9; +end + +function s = pathLabel_(useMex) + %PATHLABEL_ Human label for the active kernel implementation. + if useMex + s = 'MEX (compiled)'; + else + s = 'pure-MATLAB fallback'; + end +end + +function s = logDriftLabel_(drift) + %LOGDRIFTLABEL_ Soft verdict on latency growth vs O(log N). + % Reference: a 50x array growth costs pure O(log N) only ~1.3x. Mild + % excess above that is cache-miss cost on the larger array, still + % logarithmic in comparison count; a large multiple means the search + % has degraded toward O(N). 
+ if drift > 3.0 + s = '<< WATCH: growth exceeds O(log N)'; + elseif drift > 1.35 + s = '(grows ~O(log N) + cache)'; + else + s = '(flat — dispatch-bound)'; + end +end diff --git a/benchmarks/bench_derived_resolve_chain.m b/benchmarks/bench_derived_resolve_chain.m new file mode 100644 index 00000000..f280fb99 --- /dev/null +++ b/benchmarks/bench_derived_resolve_chain.m @@ -0,0 +1,173 @@ +function result = bench_derived_resolve_chain() +%BENCH_DERIVED_RESOLVE_CHAIN Recompute cost of a DerivedTag dependency chain. +% +% Measures the resolve fan-out cost of DerivedTag — the lazy-memoized +% derived/composite tag whose getXY() recomputes from its parents on +% demand. When a base sensor updates in live mode, the invalidation +% cascades down every derived tag that depends on it, and the next refresh +% recomputes the whole chain. That full-chain recompute is the live cost of +% derived tags, and no benchmark exercised it: bench_compositetag_merge.m +% covers the CompositeTag k-way merge specifically, bench_tag_pipeline_1k.m +% measures aggregate pipeline tickOnce throughput, and bench_dashboard_load.m +% times getXY on a populated dashboard — none isolate how resolve cost +% scales with dependency-chain DEPTH. +% +% Topology: a base SensorTag T0 feeds a linear chain +% T0 (sensor) -> T1=f(T0) -> T2=f(T1) -> ... -> TD=f(T(D-1)) +% where each f is a cheap O(N) elementwise transform. Resolving the leaf TD +% after invalidating the chain recomputes all D nodes top-to-bottom. +% +% What it measures (deterministic, no RNG; no figures): +% - COLD getXY: invalidate every node, then time leaf getXY() — the full +% D-node recompute (the live-refresh cost). +% - WARM getXY: time leaf getXY() again with nothing dirty — returns the +% memoized cache (shows the lazy-memo payoff). +% - "us per node" for the cold path — flat => recompute is linear in chain +% depth; a rise would signal super-linear fan-out overhead. 
+% +% Throughput bench, not a pass/fail gate: it PRINTS results and a soft +% scaling advisory. The next /bench-guard (or /perf-watch) run baselines the +% numbers. +% +% Run: +% octave --no-gui --eval "install(); bench_derived_resolve_chain();" +% % or in MATLAB: +% bench_derived_resolve_chain +% +% Returns a struct (depths, cold/warm ms, us/node, scaling drift) for baselining. +% +% See also DerivedTag, CompositeTag, bench_compositetag_merge, +% bench_tag_pipeline_1k, bench_dashboard_load. + + here = fileparts(mfilename('fullpath')); + addpath(fullfile(here, '..')); + install(); + + % ---- Configuration ---- + depths = [1, 2, 4, 8, 16, 32]; + nPts = 1e6; % points in the base sensor (flows through every node). + % Sized so each node's O(N) recompute dominates timer + % noise — at 1e5 the per-node cost was sub-0.1 ms and + % the sweep was noise-bound. + nWarm = 2; + nReps = 10; % cold recompute allocates a fresh array per node, so + % deep chains see bursty GC; MIN over reps is the + % GC-robust compute-time estimator (standard for CPU + % microbenchmarks) and keeps the sweep monotonic. + + baseX = linspace(0, 1000, nPts); + baseY = sin(baseX / 7) + 0.2 * sin(baseX * 1.3); + + fprintf('\n================================================================\n'); + fprintf(' FastSense DerivedTag Resolve-Chain Benchmark\n'); + fprintf(' Full-chain recompute cost vs dependency depth (live resolve)\n'); + fprintf('================================================================\n'); + fprintf(' points = %d (flows through every node) warmup = %d reps = %d (min)\n', ... + nPts, nWarm, nReps); + fprintf(' %s\n', repmat('-', 1, 72)); + fprintf(' %-6s | %-9s | %-13s | %-10s | %-13s\n', ... 
+ 'depth', 'nodes', 'cold ms', 'us/node', 'warm (cached) ms'); + fprintf(' %s\n', repmat('-', 1, 72)); + + nD = numel(depths); + coldMs = zeros(1, nD); + warmMs = zeros(1, nD); + usPerNode = zeros(1, nD); + + for c = 1:nD + D = depths(c); + + % Build the chain: sensor T0, then D derived tags each transforming + % its single parent. @chainStep_ is shared by every node. + t0 = SensorTag('chain-src', 'X', baseX, 'Y', baseY); + dt = cell(1, D); + prev = t0; + for i = 1:D + dt{i} = DerivedTag(sprintf('chain-d%d', i), {prev}, @chainStep_); + prev = dt{i}; + end + leaf = dt{D}; + + % Warmup (populate caches, dissolve JIT). + for w = 1:nWarm + invalidateChain_(dt); + leaf.getXY(); + end + + % COLD: invalidate every node, then time the full-chain recompute. + cold = zeros(1, nReps); + for r = 1:nReps + invalidateChain_(dt); + tic; + [xo, ~] = leaf.getXY(); %#ok + cold(r) = toc; + end + coldMs(c) = min(cold) * 1000; % GC-robust compute-time estimate + usPerNode(c) = (coldMs(c) * 1000) / D; + + % WARM: nothing dirty -> returns memoized cache. + warm = zeros(1, nReps); + for r = 1:nReps + tic; + leaf.getXY(); + warm(r) = toc; + end + warmMs(c) = min(warm) * 1000; + + % Correctness touch: leaf length must match the source. + assert(numel(xo) == nPts, 'bench_derived_resolve_chain:badOutput', ... + 'leaf getXY returned %d points, expected %d', numel(xo), nPts); + + fprintf(' %-6d | %-9d | %13.3f | %10.1f | %13.4f\n', ... + D, D, coldMs(c), usPerNode(c), warmMs(c)); + end + + fprintf(' %s\n', repmat('-', 1, 72)); + + % ---- Soft scaling advisory ---- + % Each node does O(N) work, so the cold recompute should be ~linear in + % depth => us/node flat. Compare us/node at the deepest chain against the + % shallowest; a large rise means super-linear fan-out overhead. + perNodeDrift = usPerNode(end) / usPerNode(1); + + fprintf(' Scaling (depth %d -> %d):\n', depths(1), depths(end)); + fprintf(' cold us/node : %.1f -> %.1f (%.2fx) %s\n', ... 
+ usPerNode(1), usPerNode(end), perNodeDrift, perNodeLabel_(perNodeDrift)); + fprintf(' warm cache hit stays ~flat: %.4f -> %.4f ms\n', warmMs(1), warmMs(end)); + fprintf(' %s\n', repmat('-', 1, 72)); + fprintf(' Note: throughput bench (no time gate). /bench-guard baselines these numbers.\n\n'); + + result = struct( ... + 'depths', depths, ... + 'nPts', nPts, ... + 'coldMs', coldMs, ... + 'warmMs', warmMs, ... + 'usPerNode', usPerNode, ... + 'perNodeDrift', perNodeDrift); +end + +function [x, y] = chainStep_(parents) + %CHAINSTEP_ One derived node: pull the single parent's (X,Y) and apply a + % cheap O(N) elementwise transform. Deterministic, no RNG. + [x, y] = parents{1}.getXY(); + y = y * 0.999 + 0.001; +end + +function invalidateChain_(dt) + %INVALIDATECHAIN_ Mark every node dirty so the next leaf getXY recomputes + % the full chain (not just the leaf). + for i = 1:numel(dt) + dt{i}.invalidate(); + end +end + +function s = perNodeLabel_(drift) + %PERNODELABEL_ Soft verdict on per-node recompute growth. + if drift > 2.0 + s = '<< WATCH: super-linear fan-out cost'; + elseif drift > 1.5 + s = '(mild per-node rise)'; + else + s = '(linear in depth — O(depth * N))'; + end +end diff --git a/benchmarks/bench_detached_mirror_refresh.m b/benchmarks/bench_detached_mirror_refresh.m new file mode 100644 index 00000000..79e8e2e1 --- /dev/null +++ b/benchmarks/bench_detached_mirror_refresh.m @@ -0,0 +1,217 @@ +function result = bench_detached_mirror_refresh() +%BENCH_DETACHED_MIRROR_REFRESH Refresh-rate cost of detached live mirrors. +% +% Directly exercises the project's headline performance constraint: +% "detached live-mirrored widgets must not degrade dashboard refresh rate." +% No committed benchmark covered it — bench_dashboard_live.m times +% onLiveTick() for a fixed all-in-grid dashboard, never with a detached +% mirror attached. 
+% +% DashboardEngine.detachWidget() pops a widget into a standalone figure as +% a DetachedMirror, and onLiveTick() ticks every mirror in-line on the same +% refresh path (DashboardEngine.onLiveTick -> the DetachedMirrors loop -> +% DetachedMirror.tick). So a mirror's per-tick cost is paid by the live +% dashboard tick itself — this bench measures and baselines that cost. +% +% Experiment design (isolates the mirror variable): +% - Fixed total widget count N. Each scenario detaches K of them +% (K = 0,1,2,4) and re-measures active refresh latency. +% - A detached widget leaves the grid but its mirror still ticks, so +% TOTAL widgets serviced per tick is constant; only how many are +% mirrored changes. Rising refresh time => mirrors add cost. +% - Per-tag data size is held CONSTANT every tick (fresh Y on a fixed X, +% no array growth) so data volume is not a confound. +% +% Measurement method (learned from the data, not assumed): +% - The path needs a LONG warmup: per-tick cost decays over ~15-20 ticks +% (JIT + render-data caches). A large GLOBAL warmup precedes all +% measurement so scenario ordering cannot bias the result. +% - onLiveTick() time is BIMODAL because drawnow('limitrate') throttles +% figure flushes — some ticks flush (expensive), some are coalesced +% (cheap). A per-tick median is unstable on bimodal data, so this bench +% reports the AMORTIZED average over a fixed tick batch (total / nTicks) +% — which is exactly the effective refresh rate and is stable. (Same +% amortization bench_dashboard_live.m uses.) +% +% What it measures (deterministic, no RNG; headless): +% - amortized active refresh time (ms/tick) per mirror count. +% - effective refresh rate (Hz) and overhead vs the 0-mirror baseline. +% +% Throughput bench, not a pass/fail gate: it PRINTS results and a soft +% advisory. The next /bench-guard (or /perf-watch) run baselines the +% numbers and — most importantly — flags if the per-mirror overhead GROWS +% over time. 
Complements the /refresh-budget watchdog with a committed, +% baseline-able file. +% +% Run (MATLAB only — see note): +% bench_detached_mirror_refresh +% +% Note: the DetachedMirror path is currently MATLAB-only — under Octave, +% DetachedMirror.cloneWidget calls feval('FastSenseWidget.fromStruct', ...) +% and Octave does not resolve that dotted-static-method feval form +% ("function 'FastSenseWidget.fromStruct' not found"). This bench detects +% Octave and skips cleanly so CI/watchdog sweeps do not crash. (The Octave +% gap is a library-side compat issue worth flagging to maintainers.) +% +% Returns a struct (mirror counts, amortized ms, Hz, overhead %) for baselining. +% +% See also DashboardEngine.detachWidget, DashboardEngine.onLiveTick, +% DetachedMirror, bench_dashboard_live, bench_fastsense_multiline. + + here = fileparts(mfilename('fullpath')); + addpath(fullfile(here, '..')); + install(); + + % DetachedMirror.cloneWidget uses feval('FastSenseWidget.fromStruct', ...), + % which Octave does not resolve. Skip cleanly rather than crash a sweep. 
+ if exist('OCTAVE_VERSION', 'builtin') ~= 0 + fprintf('\n[bench_detached_mirror_refresh] SKIPPED on Octave: the DetachedMirror\n'); + fprintf(' path requires MATLAB (feval ''FastSenseWidget.fromStruct'' unsupported).\n\n'); + result = struct('skipped', true, 'reason', 'octave-detach-unsupported'); + return; + end + + % ---- Configuration ---- + N_WIDGETS = 8; + N_PTS = 2e4; % points per tag (held constant every tick) + N_GLOBAL_WARM = 20; % global warmup — path settles over ~15-20 ticks + N_WARM = 12; % per-scenario settle (new mirror figures need + % several flushes before their cost stabilizes) + N_TICKS = 30; % amortized batch per scenario (total / N_TICKS) + detachCounts = [0, 1, 2, 4]; + + baseX = linspace(0, 1000, N_PTS); + + % ---- Build N Tag-bound FastSense widgets (deterministic data) ---- + tags = cell(1, N_WIDGETS); + for i = 1:N_WIDGETS + yi = sin(baseX / 7 + i) + 0.2 * sin(baseX * 1.3 + i); + tags{i} = SensorTag(sprintf('mir-tag-%d', i), 'X', baseX, 'Y', yi); + end + + d = DashboardEngine('BenchMirror'); + for i = 1:N_WIDGETS + col = mod(i - 1, 2) * 12 + 1; + row = ceil(i / 2); + d.addWidget('fastsense', ... + 'Title', sprintf('Tag %d', i), ... + 'Position', [col, row, 12, 2], ... + 'Tag', tags{i}); + end + + % Render headless; mute warnings only around render (e.g. legend caps). 
+ wsR = warning('off', 'all'); + d.render(); + warning(wsR); + + widgets = d.activePageWidgets(); % capture handles in order (pre-detach) + + % ---- Global warmup so scenario ordering cannot bias the baseline ---- + phase = 0; + for w = 1:N_GLOBAL_WARM + phase = phase + 0.01; + doTick_(d, tags, baseX, phase); + end + + fprintf('\n================================================================\n'); + fprintf(' FastSense Detached-Mirror Refresh Benchmark\n'); + fprintf(' Constraint: detaching a live mirror must NOT degrade refresh\n'); + fprintf('================================================================\n'); + fprintf(' widgets = %d points/tag = %d (constant/tick) amortized over %d ticks\n', ... + N_WIDGETS, N_PTS, N_TICKS); + fprintf(' global warmup = %d ticks (onLiveTick is bimodal; amortized avg is the stable metric)\n', ... + N_GLOBAL_WARM); + fprintf(' %s\n', repmat('-', 1, 72)); + fprintf(' %-8s | %-7s | %-7s | %-13s | %-9s | %-10s\n', ... + 'mirrors', 'in-grid', 'total', 'refresh ms', 'refresh', 'vs base'); + fprintf(' %s\n', repmat('-', 1, 72)); + + nC = numel(detachCounts); + tickMs = zeros(1, nC); + refreshHz = zeros(1, nC); + overhead = zeros(1, nC); + detachedSoFar = 0; + + for c = 1:nC + target = detachCounts(c); + + % Detach progressively up to the target count. + while detachedSoFar < target + detachedSoFar = detachedSoFar + 1; + wsD = warning('off', 'all'); + d.detachWidget(widgets{detachedSoFar}); + warning(wsD); + end + + % Per-scenario settle (mirror figures need several flushes after detach). + for w = 1:N_WARM + phase = phase + 0.01; + doTick_(d, tags, baseX, phase); + end + + % Amortized active refresh: total wall time of an N_TICKS batch, divided by N_TICKS. + % Amortization absorbs the bimodal drawnow-limitrate flush pattern. 
+ tBatch = tic; + for k = 1:N_TICKS + phase = phase + 0.01; + doTick_(d, tags, baseX, phase); + end + tickMs(c) = toc(tBatch) * 1000 / N_TICKS; + refreshHz(c) = 1000 / tickMs(c); + overhead(c) = (tickMs(c) / tickMs(1) - 1) * 100; + + fprintf(' %-8d | %-7d | %-7d | %13.3f | %6.0fHz | %+9.1f%%\n', ... + target, N_WIDGETS - target, N_WIDGETS, tickMs(c), refreshHz(c), overhead(c)); + end + + fprintf(' %s\n', repmat('-', 1, 72)); + + maxOver = max(overhead); + fprintf(' Baseline (0 mirrors): %.3f ms (%.0f Hz)\n', tickMs(1), refreshHz(1)); + fprintf(' Refresh overhead from mirrors (max): %+.1f%% %s\n', ... + maxOver, constraintLabel_(maxOver)); + fprintf(' %s\n', repmat('-', 1, 72)); + fprintf(' Note: composite bench (no time gate). /bench-guard baselines these + watches for growth.\n\n'); + + % ---- Cleanup: close mirror figures + dashboard ---- + try + for i = 1:numel(d.DetachedMirrors) + try, delete(d.DetachedMirrors{i}); catch, end + end + close(d.hFigure); + catch + end + + result = struct( ... + 'detachCounts', detachCounts, ... + 'nWidgets', N_WIDGETS, ... + 'nPts', N_PTS, ... + 'tickMs', tickMs, ... + 'refreshHz', refreshHz, ... + 'overheadPct', overhead, ... + 'maxOverheadPct', maxOver); +end + +function doTick_(d, tags, baseX, phase) + %DOTICK_ Replace every tag's data (fixed size, fresh Y) then run one + % onLiveTick(). Constant per-tag size isolates mirror overhead from data + % volume. The whole call is timed in batch by the caller (amortized). + for i = 1:numel(tags) + newY = sin(baseX / 7 + i + phase) + 0.2 * sin(baseX * 1.3 + i + phase); + tags{i}.updateData(baseX, newY); + end + d.onLiveTick(); +end + +function s = constraintLabel_(maxOverPct) + %CONSTRAINTLABEL_ Soft verdict on the mirror refresh overhead. + % Mirrors necessarily add draw work to the shared tick, so some overhead + % is expected; the point is to baseline it and catch regressions. 
+ if maxOverPct > 150 + s = '<< notable: mirrors add heavy refresh cost — baseline + watch'; + elseif maxOverPct > 50 + s = '(mirrors add meaningful refresh cost — expected; watch for growth)'; + else + s = '(refresh largely preserved)'; + end +end diff --git a/benchmarks/bench_downsample_kernels.m b/benchmarks/bench_downsample_kernels.m new file mode 100644 index 00000000..23a2561a --- /dev/null +++ b/benchmarks/bench_downsample_kernels.m @@ -0,0 +1,195 @@ +function result = bench_downsample_kernels() +%BENCH_DOWNSAMPLE_KERNELS Throughput microbenchmark for the core downsamplers. +% +% Isolates and times the two kernels at the heart of FastSense's +% resolve -> downsample -> render hot path: +% +% lttb_downsample (LTTB, MEX: lttb_core_mex) — shape-preserving +% minmax_downsample (MinMax, MEX: minmax_core_mex) — envelope-preserving +% +% Both run on EVERY render and EVERY zoom/pan re-downsample, scanning the +% full N-point input (millions of samples) down to a ~2000-point display +% budget. They touch more data per frame than any other operation on the +% live path, so their per-point throughput is the dominant cost of +% plotting large series. Yet until now no benchmark exercised them +% directly: benchmark.m called minmax_downsample exactly once at a single +% size (as a side-measurement of the FastSense-vs-plot() comparison) and +% LTTB was never timed at all. benchmark_zoom.m measures full per-frame +% latency, but that number is dominated by drawnow/getframe GPU flush and +% hides the kernel's contribution. This bench fills that gap. +% +% What it measures (deterministic, no RNG): +% - Per-call wall time (median of nReps after warmup) for each kernel +% across a size sweep (10K -> 20M points). +% - Per-point throughput in millions of points/second (Mpts/s). +% - "ms per million points" — the O(N) cost coefficient. 
For a correct +% linear-time kernel this stays roughly FLAT as N grows; a creeping +% value is the early signature of an accidental super-linear (e.g. +% O(N log N) sort, repeated allocation) regression. +% +% It is a throughput bench, not a pass/fail gate: it PRINTS results and a +% soft scaling advisory rather than asserting a machine-specific time +% budget. The next /bench-guard (or /perf-watch) run baselines the +% absolute numbers and flags regressions against that local baseline. +% +% Each kernel's active path (compiled MEX vs pure-MATLAB fallback) is +% detected and labelled, since the MEX speedup is exactly what this bench +% protects. +% +% Run: +% octave --no-gui --eval "install(); bench_downsample_kernels();" +% % or in MATLAB: +% bench_downsample_kernels +% +% Returns a struct (sizes, per-kernel ms / throughput / output counts, +% MEX flags, scaling coefficients) for programmatic baselining. +% +% See also lttb_downsample, minmax_downsample, benchmark_zoom, benchmark, +% bench_sensortag_getxy. + + here = fileparts(mfilename('fullpath')); + addpath(fullfile(here, '..')); + install(); + + % lttb_downsample / minmax_downsample are PRIVATE helpers of FastSense. + % MATLAB refuses private directories on the path, so instead we cd into + % the private folder for the duration of the bench: current-folder + % functions are always callable, and the compiled MEX kernels + % (lttb_core_mex / minmax_core_mex) live there too, so this exercises the + % real MEX path. onCleanup restores the original cwd even on error. This + % is portable across MATLAB and Octave. 
+ privDir = fullfile(here, '..', 'libs', 'FastSense', 'private'); + origDir = pwd; + restoreCwd = onCleanup(@() cd(origDir)); %#ok + cd(privDir); + + % ---- Configuration ---- + sizes = [1e4, 1e5, 1e6, 5e6, 2e7]; + labels = {'10K', '100K', '1M', '5M', '20M'}; + + numOut = 2000; % LTTB output budget (representative display target) + numBuckets = 1000; % MinMax buckets -> ~2000 output points (comparable) + + nWarm = 2; % warmup calls dissolve JIT / first-call MEX load + nReps = 5; % median over nReps defuses one-off spikes + + lttbMex = (exist('lttb_core_mex', 'file') == 3); + minmaxMex = (exist('minmax_core_mex', 'file') == 3); + + fprintf('\n================================================================\n'); + fprintf(' FastSense Downsample Kernel Throughput Benchmark\n'); + fprintf(' Core resolve->downsample->render hot path (per-frame cost)\n'); + fprintf('================================================================\n'); + fprintf(' LTTB target numOut = %d path: %s\n', numOut, pathLabel_(lttbMex)); + fprintf(' MinMax target numBuckets = %d path: %s\n', numBuckets, pathLabel_(minmaxMex)); + fprintf(' warmup = %d reps = %d (median) signal: deterministic 3-tone\n', nWarm, nReps); + fprintf(' %s\n', repmat('-', 1, 76)); + fprintf(' %-6s | %-28s | %-28s\n', 'N', 'LTTB', 'MinMax'); + fprintf(' %-6s | %10s %9s %6s | %10s %9s %6s\n', ... + '', 'ms/call', 'Mpts/s', 'out', 'ms/call', 'Mpts/s', 'out'); + fprintf(' %s\n', repmat('-', 1, 76)); + + nS = numel(sizes); + lttbMs = zeros(1, nS); lttbTput = zeros(1, nS); lttbN = zeros(1, nS); + mmMs = zeros(1, nS); mmTput = zeros(1, nS); mmN = zeros(1, nS); + + for c = 1:nS + n = sizes(c); + + % Deterministic ascending X (~100 Hz) and a 3-tone signal whose + % high-frequency component keeps per-bucket min/max non-degenerate. + % No RNG: identical workload on MATLAB and Octave, every run. 
+ x = linspace(0, n / 100, n); + y = sin(x * 0.1) + 0.3 * sin(x * 1.7) + 0.2 * sin(x * 13.0); + + % --- LTTB (linear mode -> MEX when available) --- + [lx, ~] = lttb_downsample(x, y, numOut); %#ok correctness touch + tL = medianTime_(@() lttb_downsample(x, y, numOut), nWarm, nReps); + lttbMs(c) = tL * 1000; + lttbTput(c) = (n / tL) / 1e6; + lttbN(c) = numel(lx); + + % --- MinMax (NaN-free fast path -> MEX when available) --- + [mx, ~] = minmax_downsample(x, y, numBuckets, false); %#ok + tM = medianTime_(@() minmax_downsample(x, y, numBuckets, false), nWarm, nReps); + mmMs(c) = tM * 1000; + mmTput(c) = (n / tM) / 1e6; + mmN(c) = numel(mx); + + fprintf(' %-6s | %10.3f %9.1f %6d | %10.3f %9.1f %6d\n', ... + labels{c}, lttbMs(c), lttbTput(c), lttbN(c), ... + mmMs(c), mmTput(c), mmN(c)); + end + + fprintf(' %s\n', repmat('-', 1, 76)); + + % ---- Soft scaling advisory ---- + % "ms per million points" is the O(N) coefficient. Compare the largest + % size against the 1M reference (index 3); flat => linear. A >2x rise is + % the early signature of super-linear creep — advisory only, not a gate. + refIdx = 3; % 1M — past fixed-overhead, below memory-pressure regime + lttbCoef = lttbMs ./ (sizes / 1e6); + mmCoef = mmMs ./ (sizes / 1e6); + lttbDrift = lttbCoef(end) / lttbCoef(refIdx); + mmDrift = mmCoef(end) / mmCoef(refIdx); + + fprintf(' Scaling (ms per 1M pts, %s -> %s):\n', labels{refIdx}, labels{end}); + fprintf(' LTTB : %.3f -> %.3f (%.2fx) %s\n', ... + lttbCoef(refIdx), lttbCoef(end), lttbDrift, driftLabel_(lttbDrift)); + fprintf(' MinMax : %.3f -> %.3f (%.2fx) %s\n', ... + mmCoef(refIdx), mmCoef(end), mmDrift, driftLabel_(mmDrift)); + fprintf(' %s\n', repmat('-', 1, 76)); + fprintf(' Note: throughput bench (no time gate). /bench-guard baselines these numbers.\n\n'); + + result = struct( ... + 'sizes', sizes, ... + 'labels', {labels}, ... + 'numOut', numOut, ... + 'numBuckets', numBuckets, ... + 'lttbMex', lttbMex, ... + 'minmaxMex', minmaxMex, ... 
+ 'lttbMs', lttbMs, ... + 'lttbTputMpts', lttbTput, ... + 'lttbOutN', lttbN, ... + 'minmaxMs', mmMs, ... + 'minmaxTputMpts', mmTput, ... + 'minmaxOutN', mmN, ... + 'lttbCoefMsPerM', lttbCoef, ... + 'minmaxCoefMsPerM', mmCoef, ... + 'lttbDrift', lttbDrift, ... + 'minmaxDrift', mmDrift); +end + +function t = medianTime_(fn, nWarm, nReps) + %MEDIANTIME_ Median wall time of fn() over nReps after nWarm warmups. + for w = 1:nWarm + fn(); + end + ts = zeros(1, nReps); + for r = 1:nReps + tic; + fn(); + ts(r) = toc; + end + t = median(ts); +end + +function s = pathLabel_(useMex) + %PATHLABEL_ Human label for the active kernel implementation. + if useMex + s = 'MEX (compiled)'; + else + s = 'pure-MATLAB fallback'; + end +end + +function s = driftLabel_(drift) + %DRIFTLABEL_ Soft verdict on the O(N) coefficient drift. + if drift > 2.0 + s = '<< WATCH: possible super-linear creep'; + elseif drift > 1.5 + s = '(mild rise — memory pressure expected at large N)'; + else + s = '(linear — O(N) holds)'; + end +end diff --git a/benchmarks/bench_fastsense_multiline.m b/benchmarks/bench_fastsense_multiline.m new file mode 100644 index 00000000..8eae28ce --- /dev/null +++ b/benchmarks/bench_fastsense_multiline.m @@ -0,0 +1,181 @@ +function result = bench_fastsense_multiline() +%BENCH_FASTSENSE_MULTILINE Live-update scaling vs line count on one axes. +% +% Measures how the live refresh path scales as a SINGLE FastSense axes +% accumulates lines — the multi-sensor overlay case. The focus is +% updateData(), the live hot path: per its own contract it re-downsamples +% *all* lines on every call, so a one-line live update costs O(lineCount). +% That per-call cost is what the project's refresh-rate constraint rides +% on, so the headline metric is the achievable refresh rate (Hz) as line +% count grows. +% +% Why this is the gap: the kernel microbenches (bench_downsample_kernels, +% bench_violation_cull, bench_binary_search) now cover every MEX kernel +% actually wired into production. 
What was NOT covered is the COMPOSITE +% scaling — how the whole update path grows with line count on one axes. +% benchmark.m and benchmark_zoom.m use a single line; the dashboard benches +% vary widget count, not lines-per-axes. This bench isolates that axis. +% +% What it measures (deterministic, no RNG; headless invisible figure): +% - updateData() wall time (median, SkipViewMode to isolate the +% re-downsample cost from view-mode / xlim adjustment). +% - "us per line" — if it stays flat the live cost is linear in line +% count; if it FALLS, a fixed per-call overhead is amortizing (good); +% a sharp RISE would signal super-linear per-line overhead. +% - effective refresh rate (Hz = 1000 / updateData ms). +% - one-time render() setup cost (single sample, figure-realization +% dominated — reported for context, not a clean scaling signal). +% +% The setup render is run with ShowProgress disabled so the console +% progress bar does not pollute the timing or the output. +% +% Throughput bench, not a pass/fail gate: it PRINTS results and a soft +% scaling advisory. The next /bench-guard (or /perf-watch) run baselines +% the absolute numbers. +% +% Run: +% octave --no-gui --eval "install(); bench_fastsense_multiline();" +% % or in MATLAB: +% bench_fastsense_multiline +% +% Returns a struct (line counts, update ms, us/line, Hz, setup ms, +% scaling drift) for programmatic baselining. +% +% See also FastSense, FastSense.updateData, FastSense.render, +% bench_downsample_kernels, bench_dashboard_live. 
+ + here = fileparts(mfilename('fullpath')); + addpath(fullfile(here, '..')); + install(); + + % ---- Configuration ---- + lineCounts = [1, 4, 16, 64]; + nPerLine = 1e5; % points per line (fixed — isolates line-count axis) + nWarm = 2; % updateData warmups + nUpdateRep = 7; % updateData measurements (median) + + fprintf('\n================================================================\n'); + fprintf(' FastSense Multi-Line Live-Update Scaling Benchmark\n'); + fprintf(' updateData() re-downsamples all lines -> refresh cost vs lines\n'); + fprintf('================================================================\n'); + fprintf(' points/line = %d update reps = %d (median) SkipViewMode = true\n', ... + nPerLine, nUpdateRep); + fprintf(' %s\n', repmat('-', 1, 76)); + fprintf(' %-6s | %-9s | %-11s | %-13s | %-9s | %-8s\n', ... + 'lines', 'total pts', 'setup ms', 'updateData ms', 'us/line', 'refresh'); + fprintf(' %s\n', repmat('-', 1, 76)); + + nL = numel(lineCounts); + setupMs = zeros(1, nL); + updateMs = zeros(1, nL); + usPerLine = zeros(1, nL); + refreshHz = zeros(1, nL); + + for c = 1:nL + L = lineCounts(c); + + % Deterministic per-line data: distinct phase per line, no RNG. + x = linspace(0, nPerLine / 100, nPerLine); + Y = zeros(L, nPerLine); + for k = 1:L + Y(k, :) = sin(x * 0.1 + k) + 0.3 * sin(x * 1.7 + k) + 0.2 * sin(x * 13.0 + k); + end + + % Build headless instance and render once (silently) for setup. + % Warnings muted only around the one-time setup render (e.g. MATLAB + % caps the auto-legend at 50 entries for high line counts) — restored + % immediately so the measured update path is unaffected. + [fp, fig] = buildFS_(x, Y); + ws = warning('off', 'all'); + tic; + fp.render(); + setupMs(c) = toc * 1000; + warning(ws); + + % updateData timing: update line 1 with fresh data each call; this + % re-downsamples ALL lines. SkipViewMode isolates the re-downsample + % cost from xlim / view-mode logic. 
+ for w = 1:nWarm + fp.updateData(1, x, Y(1, :) + 0.01 * w, 'SkipViewMode', true); + end + uTimes = zeros(1, nUpdateRep); + for u = 1:nUpdateRep + ynew = Y(1, :) + 0.001 * u; % force a genuine data replace each rep + tic; + fp.updateData(1, x, ynew, 'SkipViewMode', true); + uTimes(u) = toc; + end + updateMs(c) = median(uTimes) * 1000; + usPerLine(c) = (updateMs(c) * 1000) / L; + refreshHz(c) = 1000 / updateMs(c); + + close(fig); + + fprintf(' %-6d | %9s | %11.1f | %13.3f | %9.1f | %6.0fHz\n', ... + L, humanCount_(L * nPerLine), setupMs(c), updateMs(c), usPerLine(c), refreshHz(c)); + end + + fprintf(' %s\n', repmat('-', 1, 76)); + + % ---- Soft scaling advisory ---- + % updateData re-downsamples every line, so its TOTAL cost grows with line + % count, but a fixed per-call overhead (drawnow, dispatch) amortizes — so + % us/line should be flat or FALLING. A sharp rise means super-linear + % per-line overhead crept in. Advisory only, not a gate. + perLineDrift = usPerLine(end) / usPerLine(1); + + fprintf(' Scaling (%d -> %d lines):\n', lineCounts(1), lineCounts(end)); + fprintf(' updateData us/line : %.1f -> %.1f (%.2fx) %s\n', ... + usPerLine(1), usPerLine(end), perLineDrift, perLineLabel_(perLineDrift)); + fprintf(' refresh rate : %.0f Hz (1 line) -> %.0f Hz (%d lines)\n', ... + refreshHz(1), refreshHz(end), lineCounts(end)); + fprintf(' %s\n', repmat('-', 1, 76)); + fprintf(' Note: composite bench (no time gate). /bench-guard baselines these numbers.\n\n'); + + result = struct( ... + 'lineCounts', lineCounts, ... + 'nPerLine', nPerLine, ... + 'setupMs', setupMs, ... + 'updateMs', updateMs, ... + 'usPerLine', usPerLine, ... + 'refreshHz', refreshHz, ... + 'perLineDrift', perLineDrift); +end + +function [fp, fig] = buildFS_(x, Y) + %BUILDFS_ Headless FastSense with size(Y,1) lines on an invisible figure. + % ShowProgress disabled so the console progress bar does not pollute + % timing or output. 
+ fig = figure('Visible', 'off', 'Position', [100 100 800 400]); + ax = axes('Parent', fig); + fp = FastSense('Parent', ax); + fp.ShowProgress = false; + L = size(Y, 1); + for k = 1:L + fp.addLine(x, Y(k, :), 'DisplayName', sprintf('line %d', k)); + end +end + +function s = humanCount_(n) + %HUMANCOUNT_ Compact count label (e.g. 6.4M). + if n >= 1e6 + s = sprintf('%.1fM', n / 1e6); + elseif n >= 1e3 + s = sprintf('%.0fK', n / 1e3); + else + s = sprintf('%d', n); + end +end + +function s = perLineLabel_(drift) + %PERLINELABEL_ Soft verdict on per-line update cost growth. + if drift > 2.0 + s = '<< WATCH: super-linear per-line update cost'; + elseif drift > 1.5 + s = '(mild per-line rise)'; + elseif drift >= 0.8 + s = '(~linear in line count)'; + else + s = '(sublinear — fixed overhead amortizes)'; + end +end diff --git a/benchmarks/bench_violation_cull.m b/benchmarks/bench_violation_cull.m new file mode 100644 index 00000000..cad2abc9 --- /dev/null +++ b/benchmarks/bench_violation_cull.m @@ -0,0 +1,208 @@ +function result = bench_violation_cull() +%BENCH_VIOLATION_CULL Throughput microbenchmark for the threshold-marker kernel. +% +% Isolates and times violation_cull — the fused threshold-violation +% detection + pixel-density culling kernel (MEX: violation_cull_mex). It +% is the threshold-marker counterpart of the LTTB/MinMax downsamplers +% covered by bench_downsample_kernels: whenever a threshold is attached to +% a FastSense line, this kernel runs on EVERY render and EVERY zoom/pan, +% scanning the full visible data window (O(N)) to find violating points +% and immediately cull them to at most one marker per pixel column. +% +% It fuses what used to be two passes — compute_violations (or +% compute_violations_dynamic) plus downsample_violations — into a single +% compiled call, so it is exactly the per-frame cost that thresholded +% dashboards pay on top of the data downsample. 
Until now it had no direct +% benchmark: bench_tag_pipeline_1k.m only lists the kernel name in a +% profiler top-N filter, never times it, and no bench_*.m exercises the +% detect-and-cull path in isolation. +% +% Two threshold regimes are timed at each size, covering BOTH internal +% branches of the kernel: +% constant — scalar threshold knot (compute_violations branch) +% step — multi-knot piecewise-constant / ZOH threshold +% (compute_violations_dynamic branch, the time-varying path +% that was wholly unbenched) +% +% What it measures (deterministic, no RNG): +% - Per-call wall time (median of nReps after warmup) for each regime +% across a size sweep (10K -> 20M points). +% - Per-point throughput in millions of points/second (Mpts/s). +% - "ms per million points" — the O(N) cost coefficient; flat => linear. +% - Output marker count (bounded by the ~1000 pixel columns). +% +% Throughput bench, not a pass/fail gate: it PRINTS results and a soft +% scaling advisory rather than asserting a machine-specific time budget. +% The next /bench-guard (or /perf-watch) run baselines the absolute +% numbers and flags regressions against that local baseline. The active +% path (compiled MEX vs pure-MATLAB fallback) is detected and labelled, +% since the MEX speedup is exactly what this bench protects. +% +% Run: +% octave --no-gui --eval "install(); bench_violation_cull();" +% % or in MATLAB: +% bench_violation_cull +% +% Returns a struct (sizes, per-regime ms / throughput / marker counts, +% MEX flag, scaling coefficients) for programmatic baselining. +% +% See also violation_cull, compute_violations, compute_violations_dynamic, +% downsample_violations, bench_downsample_kernels. + + here = fileparts(mfilename('fullpath')); + addpath(fullfile(here, '..')); + install(); + + % violation_cull is a PRIVATE helper of FastSense. 
MATLAB refuses private + % directories on the path, so we cd into the private folder for the + % duration of the bench: current-folder functions are always callable, + % and the compiled MEX (violation_cull_mex) lives there too, so this + % exercises the real MEX path. onCleanup restores cwd even on error. + % Portable across MATLAB and Octave. + privDir = fullfile(here, '..', 'libs', 'FastSense', 'private'); + origDir = pwd; + restoreCwd = onCleanup(@() cd(origDir)); %#ok + cd(privDir); + + % ---- Configuration ---- + sizes = [1e4, 1e5, 1e6, 5e6, 2e7]; + labels = {'10K', '100K', '1M', '5M', '20M'}; + + nPixels = 1000; % display width in pixels -> sets cull bucket count + nKnots = 8; % step-threshold knots (time-varying regime) + thLevel = 1.0; % nominal upper threshold (signal peaks ~1.5) + nWarm = 2; % warmup calls dissolve JIT / first-call MEX load + nReps = 5; % median over nReps defuses one-off spikes + + cullMex = (exist('violation_cull_mex', 'file') == 3); + + fprintf('\n================================================================\n'); + fprintf(' FastSense Violation-Cull Kernel Throughput Benchmark\n'); + fprintf(' Fused threshold detect + pixel cull (per-frame marker cost)\n'); + fprintf('================================================================\n'); + fprintf(' pixels = %d step knots = %d threshold = %.2f (upper)\n', ... + nPixels, nKnots, thLevel); + fprintf(' path: %s warmup = %d reps = %d (median)\n', ... + pathLabel_(cullMex), nWarm, nReps); + fprintf(' %s\n', repmat('-', 1, 76)); + fprintf(' %-6s | %-28s | %-28s\n', 'N', 'constant threshold', 'step threshold (ZOH)'); + fprintf(' %-6s | %10s %9s %6s | %10s %9s %6s\n', ... 
+ '', 'ms/call', 'Mpts/s', 'mark', 'ms/call', 'Mpts/s', 'mark'); + fprintf(' %s\n', repmat('-', 1, 76)); + + nS = numel(sizes); + cMs = zeros(1, nS); cTput = zeros(1, nS); cMark = zeros(1, nS); + sMs = zeros(1, nS); sTput = zeros(1, nS); sMark = zeros(1, nS); + + for c = 1:nS + n = sizes(c); + + % Deterministic ascending X and 3-tone signal (peaks ~1.5), so a + % realistic minority of points exceed thLevel = 1.0. No RNG: identical + % workload on MATLAB and Octave, every run. + x = linspace(0, n / 100, n); + y = sin(x * 0.1) + 0.3 * sin(x * 1.7) + 0.2 * sin(x * 13.0); + xmin = x(1); + xmax = x(end); + pixelWidth = (xmax - xmin) / nPixels; + + % Constant-threshold regime: scalar knot -> compute_violations branch + thXc = xmin; + thYc = thLevel; + + % Step-threshold regime: multi-knot ZOH -> compute_violations_dynamic + % branch. Deterministic oscillation around thLevel keeps the + % violation rate realistic and the path non-trivial. + thXs = linspace(xmin, xmax, nKnots); + thYs = thLevel + 0.3 * sin(1:nKnots); + + % --- Constant threshold --- + [xc, ~] = violation_cull(x, y, thXc, thYc, 'upper', pixelWidth, xmin); %#ok + tc = medianTime_(@() violation_cull(x, y, thXc, thYc, 'upper', pixelWidth, xmin), nWarm, nReps); + cMs(c) = tc * 1000; + cTput(c) = (n / tc) / 1e6; + cMark(c) = numel(xc); + + % --- Step threshold (time-varying / ZOH) --- + [xs, ~] = violation_cull(x, y, thXs, thYs, 'upper', pixelWidth, xmin); %#ok + ts = medianTime_(@() violation_cull(x, y, thXs, thYs, 'upper', pixelWidth, xmin), nWarm, nReps); + sMs(c) = ts * 1000; + sTput(c) = (n / ts) / 1e6; + sMark(c) = numel(xs); + + fprintf(' %-6s | %10.3f %9.1f %6d | %10.3f %9.1f %6d\n', ... + labels{c}, cMs(c), cTput(c), cMark(c), sMs(c), sTput(c), sMark(c)); + end + + fprintf(' %s\n', repmat('-', 1, 76)); + + % ---- Soft scaling advisory ---- + % "ms per million points" is the O(N) coefficient. Compare the largest + % size against the 1M reference; flat => linear. 
A >2x rise is the early + % signature of super-linear creep — advisory only, not a gate. + refIdx = 3; % 1M + cCoef = cMs ./ (sizes / 1e6); + sCoef = sMs ./ (sizes / 1e6); + cDrift = cCoef(end) / cCoef(refIdx); + sDrift = sCoef(end) / sCoef(refIdx); + + fprintf(' Scaling (ms per 1M pts, %s -> %s):\n', labels{refIdx}, labels{end}); + fprintf(' constant : %.3f -> %.3f (%.2fx) %s\n', ... + cCoef(refIdx), cCoef(end), cDrift, driftLabel_(cDrift)); + fprintf(' step : %.3f -> %.3f (%.2fx) %s\n', ... + sCoef(refIdx), sCoef(end), sDrift, driftLabel_(sDrift)); + fprintf(' %s\n', repmat('-', 1, 76)); + fprintf(' Note: throughput bench (no time gate). /bench-guard baselines these numbers.\n\n'); + + result = struct( ... + 'sizes', sizes, ... + 'labels', {labels}, ... + 'nPixels', nPixels, ... + 'nKnots', nKnots, ... + 'thLevel', thLevel, ... + 'cullMex', cullMex, ... + 'constMs', cMs, ... + 'constTputMpts', cTput, ... + 'constMarkers', cMark, ... + 'stepMs', sMs, ... + 'stepTputMpts', sTput, ... + 'stepMarkers', sMark, ... + 'constCoefMsPerM', cCoef, ... + 'stepCoefMsPerM', sCoef, ... + 'constDrift', cDrift, ... + 'stepDrift', sDrift); +end + +function t = medianTime_(fn, nWarm, nReps) + %MEDIANTIME_ Median wall time of fn() over nReps after nWarm warmups. + for w = 1:nWarm + fn(); + end + ts = zeros(1, nReps); + for r = 1:nReps + tic; + fn(); + ts(r) = toc; + end + t = median(ts); +end + +function s = pathLabel_(useMex) + %PATHLABEL_ Human label for the active kernel implementation. + if useMex + s = 'MEX (compiled)'; + else + s = 'pure-MATLAB fallback'; + end +end + +function s = driftLabel_(drift) + %DRIFTLABEL_ Soft verdict on the O(N) coefficient drift. + if drift > 2.0 + s = '<< WATCH: possible super-linear creep'; + elseif drift > 1.5 + s = '(mild rise — memory pressure expected at large N)'; + else + s = '(linear — O(N) holds)'; + end +end
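The header comments say each benchmark returns a struct "for programmatic baselining". A minimal sketch of what consuming that struct could look like is below. It is not how `/bench-guard` is implemented, which this diff does not show. Only the field names `sizes`, `constMs`, and `stepMs` come from `bench_violation_cull`'s result struct. The timing values, the `tolPct` tolerance, and the variable names are hypothetical stand-ins chosen so the snippet runs without FastSense installed.

```matlab
% Hypothetical stand-ins for two runs of bench_violation_cull().
% The numbers are NOT measurements; they only illustrate the comparison.
baseline = struct('sizes', [1e4 1e5 1e6], ...
                  'constMs', [0.10 0.80 8.0], ...
                  'stepMs',  [0.30 1.20 9.0]);
current  = struct('sizes', [1e4 1e5 1e6], ...
                  'constMs', [0.11 0.82 13.0], ...
                  'stepMs',  [0.29 1.25 9.1]);

tolPct = 25;                       % assumed tolerance; tune to machine noise
fields = {'constMs', 'stepMs'};    % per-regime ms/call vectors to compare
regressed = false(numel(fields), numel(current.sizes));

for f = 1:numel(fields)
    name = fields{f};
    % Percent change of the current run relative to the stored baseline.
    deltaPct = (current.(name) ./ baseline.(name) - 1) * 100;
    regressed(f, :) = deltaPct > tolPct;
    for k = 1:numel(deltaPct)
        flag = '';
        if regressed(f, k)
            flag = '  << regression';
        end
        fprintf('%-8s N=%-8g %+7.1f%%%s\n', name, current.sizes(k), deltaPct(k), flag);
    end
end
```

In practice `baseline` would be loaded from a file saved on the same machine (for example with `save`/`load`) and `current` would be the struct returned by a fresh run. Comparing against a local baseline matches the benchmarks' stated choice not to assert machine-specific time budgets.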