
Commit 15b778b

web: recursive-CTE skip scans for summary, discovery, and filter universe
Replace the three remaining whole-group/whole-table scans on the read path with loose index scans (PR-5.1.5, cold-render fix):

- collectQuerySummary: latest-per-(query_idx, engine, format) via a recursive successor walk of idx_query_measurements_summary plus a per-series latest probe (NULLS LAST semantics preserved via an IS NOT NULL descent with an IS NULL fallback). Prod tpcds: 2796ms -> 63ms.
- collectQueryGroups: distinct discovery tuples via a 15-branch NULL-aware successor walk of idx_query_measurements_chart instead of a 4.85M-row GROUP BY. Prod: 2333ms -> 20ms.
- collectFilterUniverse: distinct engine/format on query_measurements via single-column skip scans over the 006 indexes. Prod: 565ms -> 0.2ms.

Every probe spells out the full index-prefix ORDER BY because IS NULL pins do not join planner equivalence classes (a bare ORDER BY on the suffix columns forces a Sort over the whole group), and successor seeks are single-inequality branches under a lazy Append because row comparisons are only btree index quals from the index's first column.

Results verified byte-identical against the replaced queries on both the testcontainer seed and the full prod seed.

Signed-off-by: "Connor Tsui" <connor@spiraldb.com>
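For readers unfamiliar with the technique, the loose index scan named in the message above can be modelled in memory. This is a minimal sketch, not code from the commit: the sorted array stands in for a single-column btree index, `upperBound` and `distinctBySkipScan` are hypothetical names, and the engine strings are made-up sample values.

```typescript
// Illustrative in-memory model of a loose index scan (skip scan).
// Assumption: `sortedKeys` plays the role of a btree index on one column,
// i.e. it is sorted and contains many duplicates per distinct value.

/** Position of the first key strictly greater than `target` (binary search). */
function upperBound(sortedKeys: readonly string[], target: string): number {
  let lo = 0;
  let hi = sortedKeys.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sortedKeys[mid]! <= target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

/** Enumerate distinct keys with one O(log n) seek per distinct value. */
function distinctBySkipScan(sortedKeys: readonly string[]): {
  values: string[];
  seeks: number;
} {
  const values: string[] = [];
  let seeks = 0;
  let pos = 0; // anchor: the smallest key is the first index entry
  while (pos < sortedKeys.length) {
    const current = sortedKeys[pos]!;
    values.push(current);
    // step: jump to the first key strictly greater than the current one,
    // skipping every duplicate in between without reading it
    pos = upperBound(sortedKeys, current);
    seeks += 1;
  }
  return { values, seeks };
}

// 3000 index entries, 3 distinct values (sample data, not from the benchmark DB).
const sampleIndex: string[] = [
  ...new Array<string>(1000).fill('datafusion'),
  ...new Array<string>(1000).fill('duckdb'),
  ...new Array<string>(1000).fill('vortex'),
];
const skipScanResult = distinctBySkipScan(sampleIndex);
console.log(skipScanResult.values, skipScanResult.seeks);
```

The work scales with the number of distinct values rather than the number of index entries, which is the property the recursive CTEs in this commit rely on.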
1 parent 93ccf97 commit 15b778b

2 files changed

Lines changed: 271 additions & 39 deletions


benchmarks-website/web/lib/queries.ts

Lines changed: 150 additions & 10 deletions
@@ -755,17 +755,118 @@ type QueryGroupRow = {
   query_idx: number;
 };
 
+/**
+ * The five discovery dimensions of `query_measurements`, in the column order of
+ * `idx_query_measurements_chart`. Doubles as the probes' ORDER BY: spelling out
+ * the full index prefix is what lets the planner prove each probe is an ordered
+ * index descent even under `IS NULL` pins (see `collectQuerySummary` in
+ * `summary.ts` for the pathkey rationale).
+ */
+const DISCOVERY_COLS = 'q.dataset, q.dataset_variant, q.scale_factor, q.storage, q.query_idx';
+
+/**
+ * The successor probe of the discovery skip scan: given the current tuple `s`,
+ * find the next distinct `(dataset, dataset_variant, scale_factor, storage,
+ * query_idx)` tuple in index order (ASC, NULLS LAST on the two nullable
+ * columns). A single row comparison cannot express this (it would not be a
+ * btree index qual past column 1, and NULL components poison it), so the
+ * successor is a `UNION ALL` of mutually-ordered single-inequality branches,
+ * evaluated lazily under the caller's `LIMIT 1` (Append stops at the first
+ * row): deepest level first (next query_idx within the same group), then next
+ * storage, scale_factor, dataset_variant, dataset.
+ *
+ * The nullable levels (scale_factor, dataset_variant) follow NULLS LAST order
+ * with two branches each: `col > s.col` walks the non-NULL values (vacuously
+ * empty when `s.col` is NULL, since a NULL comparison is never true), then
+ * `col IS NULL AND s.col IS NOT NULL` steps from the last non-NULL value into
+ * the NULL partition; the `s.col IS NOT NULL` guard keeps the NULL partition
+ * from succeeding itself forever. Equality pins on a nullable column likewise
+ * need both forms (`= s.col` / `IS NULL AND s.col IS NULL`) because
+ * `IS NOT DISTINCT FROM` is not index-sargable; the dead combination returns
+ * no rows at the btree layer for free. Every branch is a pure O(log n) descent
+ * of `idx_query_measurements_chart`.
+ */
+function discoverySuccessorSql(): string {
+  const variantPins = [
+    'q.dataset_variant = s.dataset_variant',
+    'q.dataset_variant IS NULL AND s.dataset_variant IS NULL',
+  ];
+  const scalePins = [
+    'q.scale_factor = s.scale_factor',
+    'q.scale_factor IS NULL AND s.scale_factor IS NULL',
+  ];
+  const branches: string[] = [];
+  for (const variantPin of variantPins) {
+    for (const scalePin of scalePins) {
+      branches.push(
+        `q.dataset = s.dataset AND ${variantPin} AND ${scalePin}
+         AND q.storage = s.storage AND q.query_idx > s.query_idx`,
+      );
+    }
+  }
+  for (const variantPin of variantPins) {
+    for (const scalePin of scalePins) {
+      branches.push(
+        `q.dataset = s.dataset AND ${variantPin} AND ${scalePin} AND q.storage > s.storage`,
+      );
+    }
+  }
+  for (const variantPin of variantPins) {
+    branches.push(`q.dataset = s.dataset AND ${variantPin} AND q.scale_factor > s.scale_factor`);
+  }
+  for (const variantPin of variantPins) {
+    branches.push(
+      `q.dataset = s.dataset AND ${variantPin}
+       AND q.scale_factor IS NULL AND s.scale_factor IS NOT NULL`,
+    );
+  }
+  branches.push('q.dataset = s.dataset AND q.dataset_variant > s.dataset_variant');
+  branches.push(
+    'q.dataset = s.dataset AND q.dataset_variant IS NULL AND s.dataset_variant IS NOT NULL',
+  );
+  branches.push('q.dataset > s.dataset');
+  return branches
+    .map(
+      (branch) => `(SELECT ${DISCOVERY_COLS}
+        FROM query_measurements q
+        WHERE ${branch}
+        ORDER BY ${DISCOVERY_COLS}
+        LIMIT 1)`,
+    )
+    .join('\n UNION ALL\n ');
+}
+
 /**
  * Distinct query groups, one per `(dataset, dataset_variant, scale_factor,
  * storage)` tuple, each with one `Q{idx}` chart link per query index. Rows
  * arrive grouped by the tuple (ORDER BY matches `groups.rs`), so a new group
  * starts whenever the tuple changes.
+ *
+ * The distinct tuples come from a recursive-CTE skip scan (anchor = first index
+ * tuple, step = [`discoverySuccessorSql`]) instead of a `GROUP BY` that scans
+ * all of `query_measurements` (~1.3s at the prod seed vs ~ms; PR-5.1.5, the
+ * same loose-index-scan treatment as `collectQuerySummary`). The skip scan
+ * walks `idx_query_measurements_chart` in its native order (NULLS LAST), so the
+ * outer ORDER BY re-sorts the few hundred result tuples into the v3-parity
+ * NULLS FIRST order the group builder expects.
  */
 async function collectQueryGroups(): Promise<Group[]> {
   const text = `
+    WITH RECURSIVE tuples AS (
+      (SELECT ${DISCOVERY_COLS}
+       FROM query_measurements q
+       ORDER BY ${DISCOVERY_COLS}
+       LIMIT 1)
+      UNION ALL
+      SELECT nxt.dataset, nxt.dataset_variant, nxt.scale_factor, nxt.storage, nxt.query_idx
+      FROM tuples s
+      CROSS JOIN LATERAL (
+        ${discoverySuccessorSql()}
+        LIMIT 1
+      ) nxt
+    )
     SELECT dataset, dataset_variant, scale_factor, storage, query_idx
-    FROM query_measurements
-    GROUP BY dataset, dataset_variant, scale_factor, storage, query_idx
+    FROM tuples
     ORDER BY dataset, dataset_variant NULLS FIRST,
              scale_factor NULLS FIRST, storage, query_idx
   `;
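The NULLS LAST successor rule described in the `discoverySuccessorSql` doc comment can be checked on a single nullable column. The following is a hedged sketch rather than the commit's code: `nextDistinct` and `distinctNullsLast` are hypothetical names, the array stands in for one nullable index column sorted ASC with NULLs last, and a linear `find` replaces the O(log n) btree descent for brevity.

```typescript
// Hypothetical in-memory model of the two-branch successor for one nullable
// column. Assumption: `sortedCol` is sorted ASC with NULLs last, which is the
// default btree order in PostgreSQL.
type NullableNum = number | null;

/** Next distinct value after `current`, or undefined when the walk is done. */
function nextDistinct(
  sortedCol: readonly NullableNum[],
  current: NullableNum,
): NullableNum | undefined {
  if (current !== null) {
    const cur: number = current;
    // Branch 1, `col > s.col`: walks the non-NULL values. It is skipped when
    // `current` is NULL, mirroring SQL where a comparison with NULL is never true.
    const greater = sortedCol.find((v) => v !== null && v > cur);
    if (greater !== undefined) return greater;
    // Branch 2, `col IS NULL AND s.col IS NOT NULL`: steps from the last
    // non-NULL value into the NULL partition exactly once.
    if (sortedCol.includes(null)) return null;
  }
  // `current` is NULL (or nothing is left): the guard stops the NULL
  // partition from succeeding itself forever.
  return undefined;
}

/** Distinct values in index order, NULL last if present. */
function distinctNullsLast(sortedCol: readonly NullableNum[]): NullableNum[] {
  const out: NullableNum[] = [];
  if (sortedCol.length === 0) return out;
  let current: NullableNum | undefined = sortedCol[0]; // anchor: first index entry
  while (current !== undefined) {
    out.push(current);
    current = nextDistinct(sortedCol, current);
  }
  return out;
}

console.log(distinctNullsLast([1, 1, 2, 2, 2, null, null]));
```

Branch order matters here for the same reason it does in the SQL: the first branch that yields a row is taken, so the non-NULL walk has to be tried before the step into the NULL partition.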
@@ -1076,18 +1177,57 @@ export async function collectGroupCharts(
 export async function collectFilterUniverse(): Promise<FilterUniverse> {
   // The two queries are independent; run them on parallel pool connections
   // since this collector sits on every force-dynamic page render.
+  //
+  // The `query_measurements` DISTINCTs are recursive-CTE skip scans over the 006
+  // single-column indexes (`idx_query_measurements_engine` / `_format`) instead
+  // of full index scans (~460ms each at the prod seed vs ~ms; PR-5.1.5, the same
+  // loose-index-scan treatment as `collectQuerySummary` -- see `summary.ts` for
+  // the mechanics). One descent per distinct value: the anchor takes the
+  // smallest, each step seeks the first value strictly greater than the last.
+  // `engine`/`format` are NOT NULL by schema; the `IS NOT NULL` anchor guards
+  // are kept so an all-NULL column would yield `[]` rather than `[null]`,
+  // preserving the prior queries' explicit-filter semantics. The three small
+  // fact tables (a few thousand rows each) stay plain DISTINCT arms; `UNION`
+  // dedupes across all four sources.
   const [engines, formats] = await Promise.all([
     getPool().query<{ value: string }>(
-      `SELECT DISTINCT engine AS value FROM query_measurements
-       WHERE engine IS NOT NULL`,
+      `WITH RECURSIVE engines AS (
+         (SELECT q.engine AS value FROM query_measurements q
+          WHERE q.engine IS NOT NULL
+          ORDER BY q.engine
+          LIMIT 1)
+         UNION ALL
+         SELECT nxt.value
+         FROM engines e
+         CROSS JOIN LATERAL (
+           SELECT q.engine AS value FROM query_measurements q
+           WHERE q.engine > e.value
+           ORDER BY q.engine
+           LIMIT 1
+         ) nxt
+       )
+       SELECT value FROM engines`,
     ),
     getPool().query<{ value: string }>(
-      `SELECT DISTINCT value FROM (
-         SELECT format AS value FROM query_measurements WHERE format IS NOT NULL
-         UNION SELECT format AS value FROM compression_times WHERE format IS NOT NULL
-         UNION SELECT format AS value FROM compression_sizes WHERE format IS NOT NULL
-         UNION SELECT format AS value FROM random_access_times WHERE format IS NOT NULL
-       ) AS f`,
+      `WITH RECURSIVE qm_formats AS (
+         (SELECT q.format AS value FROM query_measurements q
+          WHERE q.format IS NOT NULL
+          ORDER BY q.format
+          LIMIT 1)
+         UNION ALL
+         SELECT nxt.value
+         FROM qm_formats f
+         CROSS JOIN LATERAL (
+           SELECT q.format AS value FROM query_measurements q
+           WHERE q.format > f.value
+           ORDER BY q.format
+           LIMIT 1
+         ) nxt
+       )
+       SELECT value FROM qm_formats
+       UNION SELECT format AS value FROM compression_times WHERE format IS NOT NULL
+       UNION SELECT format AS value FROM compression_sizes WHERE format IS NOT NULL
+       UNION SELECT format AS value FROM random_access_times WHERE format IS NOT NULL`,
     ),
   ]);
   return {
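The `engines` and `qm_formats` CTEs above share one shape, so a reviewer might wonder whether it could be factored out. A possible sketch follows, assuming a hypothetical helper `distinctSkipScanSql` that is not part of this commit; the recursive alias `c` and the identifier check are likewise illustrative choices.

```typescript
// Hypothetical helper (not in the commit): builds the single-column skip-scan
// CTE text used for `engine` and `format`. Identifiers are interpolated into
// SQL, so they are restricted to plain lowercase names and must never come
// from user input.
const PLAIN_IDENT = /^[a-z_][a-z0-9_]*$/;

function distinctSkipScanSql(cteName: string, table: string, column: string): string {
  for (const ident of [cteName, table, column]) {
    if (!PLAIN_IDENT.test(ident)) {
      throw new Error(`refusing to interpolate identifier: ${ident}`);
    }
  }
  return `WITH RECURSIVE ${cteName} AS (
  (SELECT q.${column} AS value FROM ${table} q
   WHERE q.${column} IS NOT NULL
   ORDER BY q.${column}
   LIMIT 1)
  UNION ALL
  SELECT nxt.value
  FROM ${cteName} c
  CROSS JOIN LATERAL (
    SELECT q.${column} AS value FROM ${table} q
    WHERE q.${column} > c.value
    ORDER BY q.${column}
    LIMIT 1
  ) nxt
)
SELECT value FROM ${cteName}`;
}

const enginesSql = distinctSkipScanSql('engines', 'query_measurements', 'engine');
console.log(enginesSql);
```

For the `format` query, the three `UNION SELECT ... WHERE format IS NOT NULL` arms over the small fact tables would be appended to the returned text, as the diff does inline. Whether the indirection is worth it for two call sites is a judgment call; the commit keeps both queries spelled out.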

benchmarks-website/web/lib/summary.ts

Lines changed: 121 additions & 29 deletions
@@ -352,23 +352,56 @@ async function collectQuerySummary(
   // penalty model: each series scores the geomean of `(10 + value) / (10 +
   // best)` over every query, imputing a penalty where the series has no value.
   //
-  // "Latest per series" is a `DISTINCT ON (query_idx, engine, format)` ordered by
-  // the DENORMALIZED `commit_timestamp DESC` (migrations 006/007, PR-5.1.5 fix c)
-  // instead of joining `commits` and `row_number()`-windowing the whole group.
-  // The covering index `idx_query_measurements_summary` (dataset, dataset_variant,
+  // "Latest per series" is a recursive-CTE skip scan (loose index scan) over the
+  // covering index `idx_query_measurements_summary` (dataset, dataset_variant,
   // scale_factor, storage, query_idx, engine, format, commit_timestamp DESC)
-  // INCLUDE (value_ns) makes this an Index Only Scan: the sargable group filter
-  // seeks the index prefix, the remaining (query_idx, engine, format,
-  // commit_timestamp DESC) order matches the DISTINCT ON + ORDER BY, and value_ns
-  // (both the `> 0` filter and the projection) is read from the index leaf so
-  // there is no heap fetch. ~1.95s warm for tpcds at the prod seed, vs ~6.2s for
-  // the original commits-join `row_number()` window (006's non-covering index was
-  // ignored because value_ns forced a per-row heap fetch -- see 007). `NULLS LAST`
-  // keeps a transient NULL (a row from a writer not yet populating
-  // `commit_timestamp`, before the post-deploy re-backfill) from winning "latest";
-  // the timestamp value is the same `commits.timestamp` the join used, so the
-  // same-second-tie behavior (an accepted tradeoff) is unchanged. `$1` = dataset,
-  // `$2` = storage; variant/scale params append only when non-null.
+  // INCLUDE (value_ns) from migrations 006/007 (PR-5.1.5 fix c). The previous
+  // `DISTINCT ON (query_idx, engine, format) ... ORDER BY commit_timestamp DESC
+  // NULLS LAST` form scanned the group's entire history (~1.8M index entries,
+  // ~2.4s warm for tpcds at the prod seed); the skip scan instead does one
+  // O(log n) index descent per distinct series tuple plus one per series for its
+  // latest value (~1.3K descents, ~20ms), which is what makes the cache-cold
+  // `/api/groups` render fast.
+  //
+  // Three non-obvious constructions keep every probe a pure index descent:
+  //
+  // - The successor probe cannot be a single row comparison
+  //   `(query_idx, engine, format) > (...)`: a row comparison is only a btree
+  //   index qual when its first column is the index's FIRST column, and these
+  //   are index columns 5-7 (the planner degrades the row form to a filter over
+  //   a full scan). It is instead three single-column-inequality branches (next
+  //   format within the series' query_idx+engine, then next engine within its
+  //   query_idx, then next query_idx), each fully index-sargable. `UNION ALL`
+  //   under `LIMIT 1` evaluates the branches lazily in order (Append stops at
+  //   the first row), and branch order equals tuple-successor order, so the
+  //   first non-empty branch IS the successor.
+  //
+  // - Every probe's ORDER BY spells out the full index prefix (dataset,
+  //   dataset_variant, scale_factor, storage, ...) even though those columns
+  //   are pinned by the WHERE. An `IS NULL` pin (NULL-variant/scale groups)
+  //   does not join the planner's equivalence classes, so with the short
+  //   `ORDER BY query_idx, engine, format` form the planner cannot prove the
+  //   index already provides the order and inserts a Sort over the whole
+  //   group. The pinned columns are constant per group, so the long form is
+  //   semantically identical.
+  //
+  // - "Latest" must order `commit_timestamp DESC NULLS LAST` (a transient NULL
+  //   -- a row from a writer not yet populating `commit_timestamp`, before the
+  //   post-deploy re-backfill -- must not win over real timestamps), but the
+  //   index is `commit_timestamp DESC`, i.e. NULLS FIRST, so that order is not
+  //   index-provided. The per-series latest probe therefore takes the newest
+  //   `commit_timestamp IS NOT NULL` row first (index-ordered descent past the
+  //   NULL block) and falls back to an arbitrary NULL-timestamp row only when
+  //   the series has no timestamped rows, which is exactly the NULLS LAST
+  //   semantics. (Which row wins among all-NULL ties is unspecified, as it
+  //   already was under `DISTINCT ON`.)
+  //
+  // The `value_ns > 0` filter rides inside every probe: it is read from the
+  // index leaf (INCLUDE), so the enumeration lands directly on series that have
+  // at least one valid row, the same set `DISTINCT ON` produced. The timestamp
+  // value is the same `commits.timestamp` the original join used, so the
+  // same-second-tie behavior (an accepted tradeoff) is unchanged. `$1` =
+  // dataset, `$2` = storage; variant/scale params append only when non-null.
   const params: unknown[] = [dataset, storage];
   const variantPred =
     datasetVariant === null
@@ -378,22 +411,81 @@ async function collectQuerySummary(
     scaleFactor === null
       ? 'q.scale_factor IS NULL'
       : `q.scale_factor = $${params.push(scaleFactor)}`;
-  const text = `
-    SELECT query_idx, series, value_ns
-    FROM (
-      SELECT DISTINCT ON (q.query_idx, q.engine, q.format)
-        q.query_idx AS query_idx,
-        q.engine || ':' || q.format AS series,
-        q.value_ns::float8 AS value_ns
-      FROM query_measurements q
-      WHERE q.dataset = $1
+  const groupPred = `q.dataset = $1
         AND ${variantPred}
         AND ${scalePred}
-        AND q.storage = $2
-        AND q.value_ns > 0
-      ORDER BY q.query_idx, q.engine, q.format, q.commit_timestamp DESC NULLS LAST
+        AND q.storage = $2`;
+  const indexOrder =
+    'q.dataset, q.dataset_variant, q.scale_factor, q.storage, q.query_idx, q.engine, q.format';
+  const text = `
+    WITH RECURSIVE series AS (
+      (SELECT q.query_idx, q.engine, q.format
+       FROM query_measurements q
+       WHERE ${groupPred}
+         AND q.value_ns > 0
+       ORDER BY ${indexOrder}
+       LIMIT 1)
+      UNION ALL
+      SELECT nxt.query_idx, nxt.engine, nxt.format
+      FROM series s
+      CROSS JOIN LATERAL (
+        (SELECT q.query_idx, q.engine, q.format
+         FROM query_measurements q
+         WHERE ${groupPred}
+           AND q.query_idx = s.query_idx
+           AND q.engine = s.engine
+           AND q.format > s.format
+           AND q.value_ns > 0
+         ORDER BY ${indexOrder}
+         LIMIT 1)
+        UNION ALL
+        (SELECT q.query_idx, q.engine, q.format
+         FROM query_measurements q
+         WHERE ${groupPred}
+           AND q.query_idx = s.query_idx
+           AND q.engine > s.engine
+           AND q.value_ns > 0
+         ORDER BY ${indexOrder}
+         LIMIT 1)
+        UNION ALL
+        (SELECT q.query_idx, q.engine, q.format
+         FROM query_measurements q
+         WHERE ${groupPred}
+           AND q.query_idx > s.query_idx
+           AND q.value_ns > 0
+         ORDER BY ${indexOrder}
+         LIMIT 1)
+        LIMIT 1
+      ) nxt
+    )
+    SELECT s.query_idx AS query_idx,
+           s.engine || ':' || s.format AS series,
+           latest.value_ns AS value_ns
+    FROM series s
+    CROSS JOIN LATERAL (
+      (SELECT q.value_ns::float8 AS value_ns
+       FROM query_measurements q
+       WHERE ${groupPred}
+         AND q.query_idx = s.query_idx
+         AND q.engine = s.engine
+         AND q.format = s.format
+         AND q.value_ns > 0
+         AND q.commit_timestamp IS NOT NULL
+       ORDER BY ${indexOrder}, q.commit_timestamp DESC
+       LIMIT 1)
+      UNION ALL
+      (SELECT q.value_ns::float8 AS value_ns
+       FROM query_measurements q
+       WHERE ${groupPred}
+         AND q.query_idx = s.query_idx
+         AND q.engine = s.engine
+         AND q.format = s.format
+         AND q.value_ns > 0
+         AND q.commit_timestamp IS NULL
+       LIMIT 1)
+      LIMIT 1
     ) latest
-    ORDER BY query_idx, series
+    ORDER BY s.query_idx, s.engine || ':' || s.format
   `;
   const rows = (
     await getPool().query<{ query_idx: number; series: string; value_ns: number }>(text, params)
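The two-arm `latest` probe above gets `commit_timestamp DESC NULLS LAST` semantics out of an index that stores NULLs first. A minimal in-memory sketch of that rule follows, assuming a hypothetical `SeriesRow` shape and `latestValue` function that do not appear in the commit; the rows stand in for one series' index entries in `commit_timestamp DESC` order.

```typescript
// Hypothetical model of the per-series "latest" probe. Assumption: the rows
// are one series' index entries in `commit_timestamp DESC` order, which in
// PostgreSQL places NULL timestamps first.
type SeriesRow = { commitTimestamp: number | null; valueNs: number };

function latestValue(seriesRows: readonly SeriesRow[]): number | undefined {
  // Arm 1: newest row with a real timestamp and a valid value. In index order
  // this is the first qualifying entry after the leading NULL block.
  const timestamped = seriesRows.find(
    (r) => r.commitTimestamp !== null && r.valueNs > 0,
  );
  if (timestamped !== undefined) return timestamped.valueNs;
  // Arm 2: only reached when no timestamped row qualifies; any valid
  // NULL-timestamp row will do, since ties among NULLs are unspecified.
  const untimed = seriesRows.find(
    (r) => r.commitTimestamp === null && r.valueNs > 0,
  );
  return untimed?.valueNs;
}

// Sample rows (made-up values): a NULL-timestamp row sorts first but must not win.
const sampleSeries: SeriesRow[] = [
  { commitTimestamp: null, valueNs: 5 },
  { commitTimestamp: 300, valueNs: 0 }, // filtered out by `value_ns > 0`
  { commitTimestamp: 200, valueNs: 7 },
  { commitTimestamp: 100, valueNs: 9 },
];
console.log(latestValue(sampleSeries));
```

As in the SQL, the `value_ns > 0` filter is applied inside each arm, so a newer row with an invalid value does not hide an older valid one.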
