Commit dd700fc

MagicalTux and claude committed
feat(eqp): render LIST SUBQUERY + BLOOM FILTER for a non-correlated IN (SELECT) (B9a)
A single non-correlated `[NOT] IN (SELECT …)` in the WHERE now renders a `LIST SUBQUERY 1` node (child: the body's plan, then a `CREATE BLOOM FILTER` child) after the table access, matching SQLite. graphite previously rendered just the bare access.

Gated to the provably byte-exact subset so nothing is emitted into a non-matching plan: graphite's access is a bare `SCAN` (so there is no seek to diverge from SQLite's cost-model outer choice; an `… AND c=?` that graphite seeks makes the line a SEARCH and declines here, dodging the cases where SQLite scans-plus-bloom where graphite would seek), and either the form is `NOT IN` (never seeks the IN column) or the IN column is not seekable (so SQLite also scans). This covers the common `NOT IN (SELECT …)`.

A positive `IN` on an indexed / rowid column (SQLite serves a per-candidate SEARCH), and a correlated / compound / cross-position subquery, keep graphite's prior bare SCAN. The executor per-value seek is a documented follow-up (B9a-seek).

New free fns `collect_in_selects` / `single_where_in_select`. Removed the now-handled `b IN (SELECT)` case from the scalar-subquery decline test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
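The gate described in the message reduces to a small boolean predicate. The following is a minimal sketch of that rule only; `InSelectShape` and `renders_list_subquery` are hypothetical names that do not appear in the commit, and the body checks (non-correlated, non-compound, sole subquery) are folded into a single flag.

```rust
/// Hypothetical summary of the facts the gate looks at (illustrative names,
/// not taken from graphite's source).
#[derive(Clone, Copy)]
struct InSelectShape {
    /// graphite's outer access renders as a bare `SCAN <table>`.
    access_is_bare_scan: bool,
    /// The predicate is `NOT IN (SELECT …)`.
    negated: bool,
    /// The IN operand is a rowid / INTEGER PRIMARY KEY / leading index column.
    in_col_seekable: bool,
    /// The body is non-correlated, non-compound, and the only subquery.
    body_renderable: bool,
}

/// True when the `LIST SUBQUERY 1` + `CREATE BLOOM FILTER` nodes are emitted;
/// false means the plan keeps its prior shape.
fn renders_list_subquery(s: InSelectShape) -> bool {
    s.access_is_bare_scan && (s.negated || !s.in_col_seekable) && s.body_renderable
}
```

Under this reading, `b NOT IN (SELECT …)` on an indexed `b` passes (negated), `d IN (SELECT …)` on an unindexed `d` passes (not seekable), and `b IN (SELECT …)` on an indexed `b` declines.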
1 parent 94426fb commit dd700fc

4 files changed

Lines changed: 244 additions & 7 deletions

ROADMAP.md

Lines changed: 15 additions & 5 deletions
@@ -1173,11 +1173,21 @@ plan **and** row differential parity vs the pinned `sqlite3 3.50.4`; the executo
 already returns correct rows for every one (B9i differs only in an SQL-unspecified
 tie/representative order), so they are perf/EQP-fidelity work, not correctness:
 
-- **B9a — `IN (SELECT …)` `LIST SUBQUERY` + bloom filter.** A `WHERE col IN
-  (<non-correlated SELECT>)` seeks the outer table per candidate and materializes the
-  subquery as a `LIST SUBQUERY N` sibling with a `CREATE BLOOM FILTER` node; graphite
-  currently `SCAN`s and folds the `IN` in the tree-walker. Needs the list-materialize
-  + bloom EQP model (and, for the VDBE, the seek-per-value plan).
+- **B9a — `IN (SELECT …)` `LIST SUBQUERY` + bloom filter.** **Partly done (EQP).**
+  A single non-correlated `[NOT] IN (SELECT …)` now renders `LIST SUBQUERY 1` (child =
+  the body's plan, then a `CREATE BLOOM FILTER` child) after the access, for the
+  *provably byte-exact* subset: graphite's access is a bare `SCAN` (so no seek to
+  diverge from SQLite's cost-model outer choice — an `… AND c=?` that graphite seeks
+  makes the line a SEARCH and declines, dodging the cases where SQLite scans-plus-bloom
+  where graphite would seek), and either the form is `NOT IN` or the IN column is not
+  seekable. This covers the common `NOT IN (SELECT …)`. *Still open (**B9a-seek**):* a
+  positive `IN` on an **indexed / rowid** column, which SQLite serves with a
+  per-candidate `SEARCH t (col=?)` — graphite folds the `IN` in the tree-walker and
+  scans, so it declines (no wrong node emitted). That needs the executor to evaluate
+  the non-correlated subquery to a value list and seek per value (the `find_in_constraint`
+  + `try_index_in` path already seeks a literal `IN` list), plus the outer-access EQP;
+  and a compound / correlated body (`CORRELATED LIST SUBQUERY`, shared-id bump) stays
+  deferred.
 - **B9b — window-function EQP.** `… OVER (…)` renders `CO-ROUTINE (subquery-N)` over
   the windowed input; the `(subquery-N)` label is codegen-order-fragile (see the
   `schema-sql-canonicalization` note), so this needs a deterministic-numbering model
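The "IN column is not seekable" condition in the B9a entry can be read as a lookup against table metadata. The sketch below is one way to express it, assuming a simplified, hypothetical `TableMeta` (graphite's real metadata type is not part of this diff) and treating the operand as a plain column name.

```rust
/// Hypothetical, simplified table metadata for illustration only.
struct TableMeta {
    columns: Vec<String>,
    /// Position of the INTEGER PRIMARY KEY column, if any.
    ipk: Option<usize>,
    /// For each index on the table, the position of its leading column.
    index_leading_cols: Vec<usize>,
}

/// SQLite's built-in names for the implicit rowid.
fn is_rowid_alias(name: &str) -> bool {
    ["rowid", "_rowid_", "oid"]
        .iter()
        .any(|a| a.eq_ignore_ascii_case(name))
}

/// True when an equality on `operand` could be served by a seek.
fn in_col_seekable(meta: &TableMeta, operand: &str) -> bool {
    let pos = meta
        .columns
        .iter()
        .position(|c| c.eq_ignore_ascii_case(operand));
    match pos {
        // A declared column: seekable if it is the IPK or leads an index.
        Some(c) => meta.ipk == Some(c) || meta.index_leading_cols.contains(&c),
        // Not a declared column: seekable only if it names the implicit rowid.
        None => is_rowid_alias(operand),
    }
}
```

For a table `t(a INTEGER PRIMARY KEY, b, c, d)` with an index on `b`, this classifies `a`, `b`, and `rowid` as seekable and `c`, `d` as not, which is the split the roadmap entry uses to decide whether a positive `IN` stays in the rendered subset.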

src/exec/mod.rs

Lines changed: 120 additions & 0 deletions
@@ -14358,6 +14358,45 @@ impl Connection {
                 }
             }
         }
+        // A single non-correlated `[NOT] IN (SELECT …)` in the WHERE renders a
+        // `LIST SUBQUERY 1` node (child = the body's plan, then a `CREATE BLOOM FILTER`
+        // sibling under it) after the scan. We render it only where the whole plan is
+        // provably byte-exact: graphite's access is a bare `SCAN {label}` — so there is
+        // no seek to diverge from SQLite's cost-model choice (an `… AND c=?` that
+        // graphite seeks makes the line a SEARCH and declines here, dodging the case
+        // where SQLite scans-plus-bloom where graphite would seek) — AND either the form
+        // is `NOT IN` (which never seeks the IN column) or the IN column is not seekable
+        // (so SQLite also scans; a positive `IN` on an indexed / rowid column, which
+        // SQLite SEARCHes per candidate, declines — that seek is roadmap B9a-seek).
+        if from.joins.is_empty()
+            && single_scan_detail.as_deref() == Some(alloc::format!("SCAN {label}").as_str())
+        {
+            if let Some((body, negated, operand)) = sel
+                .where_clause
+                .as_ref()
+                .and_then(|w| single_where_in_select(w))
+            {
+                let operand_is_rowid = matches!(operand, Expr::Column { column, .. }
+                    if is_rowid_alias(column)
+                        && !meta.columns.iter().any(|c| c.name.eq_ignore_ascii_case(column)));
+                let in_col_seekable = operand_is_rowid
+                    || col_index(operand, &meta.columns).is_some_and(|c| {
+                        meta.ipk == Some(c)
+                            || self
+                                .indexes_of(&from.first.name)
+                                .is_ok_and(|ixs| ixs.iter().any(|i| i.cols.first() == Some(&c)))
+                    });
+                if (negated || !in_col_seekable) && self.eqp_scalar_bodies_renderable(&[body]) {
+                    let list_id = *next_id;
+                    *next_id += 1;
+                    out.push((list_id, parent, String::from("LIST SUBQUERY 1")));
+                    self.eqp_select(body, list_id, next_id, out, params)?;
+                    let bloom_id = *next_id;
+                    *next_id += 1;
+                    out.push((bloom_id, list_id, String::from("CREATE BLOOM FILTER")));
+                }
+            }
+        }
         // Fold each join in FROM order, tracking the accumulated left columns so
         // the rowid-seek decision (shared with the executor via `rowid_join_seek`)
         // can print `SEARCH … USING INTEGER PRIMARY KEY (rowid=?)` in lockstep
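The node emission in the hunk above follows a simple pattern: allocate an id, push the `LIST SUBQUERY 1` row under the current parent, render the body under it, then push `CREATE BLOOM FILTER` as the last child. A minimal sketch of that pattern, assuming `i32` ids and a flat body passed in as `body_details` in place of the recursive `eqp_select` call (both are assumptions; `push_list_subquery` and `EqpRow` are hypothetical names):

```rust
/// One EXPLAIN QUERY PLAN row: (id, parent id, detail text).
/// The integer type is an assumption for this sketch.
type EqpRow = (i32, i32, String);

/// Append a `LIST SUBQUERY 1` node under `parent`, the body's rows under that
/// node, and a `CREATE BLOOM FILTER` node as its last child.
fn push_list_subquery(
    parent: i32,
    next_id: &mut i32,
    body_details: &[&str],
    out: &mut Vec<EqpRow>,
) {
    let list_id = *next_id;
    *next_id += 1;
    out.push((list_id, parent, String::from("LIST SUBQUERY 1")));
    // Stand-in for rendering the subquery body's own plan.
    for d in body_details {
        let id = *next_id;
        *next_id += 1;
        out.push((id, list_id, String::from(*d)));
    }
    let bloom_id = *next_id;
    *next_id += 1;
    out.push((bloom_id, list_id, String::from("CREATE BLOOM FILTER")));
}
```

Because the bloom-filter row takes its id after the body has been rendered, it always sorts after the body's rows while sharing the same parent.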
@@ -27149,6 +27188,87 @@ fn collect_where_scalar_subqueries<'a>(e: &'a Expr, out: &mut Vec<&'a Select>) -
     }
 }
 
+/// Collect every `[NOT] IN (SELECT …)` in `e` as `(body, negated, operand)`, setting
+/// `other` if any scalar `(SELECT …)` / `EXISTS` is present. Does NOT descend into a
+/// subquery body (that is the body's own plan). Lifetime-preserving (mirrors
+/// [`collect_where_scalar_subqueries`]) so the caller can hold the borrowed refs.
+fn collect_in_selects<'a>(
+    e: &'a Expr,
+    ins: &mut Vec<(&'a Select, bool, &'a Expr)>,
+    other: &mut bool,
+) {
+    match e {
+        Expr::Subquery(_) | Expr::Exists { .. } => *other = true,
+        Expr::InSelect {
+            expr,
+            select,
+            negated,
+        } => {
+            ins.push((select.as_ref(), *negated, expr.as_ref()));
+            collect_in_selects(expr, ins, other);
+        }
+        Expr::Unary { expr, .. }
+        | Expr::IsNull { expr, .. }
+        | Expr::Cast { expr, .. }
+        | Expr::Paren(expr)
+        | Expr::Collate { expr, .. } => collect_in_selects(expr, ins, other),
+        Expr::Binary { left, right, .. } => {
+            collect_in_selects(left, ins, other);
+            collect_in_selects(right, ins, other);
+        }
+        Expr::Function { args, .. } | Expr::RowValue(args) => {
+            for a in args {
+                collect_in_selects(a, ins, other);
+            }
+        }
+        Expr::InList { expr, list, .. } => {
+            collect_in_selects(expr, ins, other);
+            for a in list {
+                collect_in_selects(a, ins, other);
+            }
+        }
+        Expr::Between {
+            expr, low, high, ..
+        } => {
+            collect_in_selects(expr, ins, other);
+            collect_in_selects(low, ins, other);
+            collect_in_selects(high, ins, other);
+        }
+        Expr::Case {
+            operand,
+            when_then,
+            else_result,
+        } => {
+            if let Some(o) = operand {
+                collect_in_selects(o, ins, other);
+            }
+            for (w, t) in when_then {
+                collect_in_selects(w, ins, other);
+                collect_in_selects(t, ins, other);
+            }
+            if let Some(el) = else_result {
+                collect_in_selects(el, ins, other);
+            }
+        }
+        _ => {}
+    }
+}
+
+/// The single `[NOT] IN (SELECT …)` subquery in `e`, as `(body, negated, operand)`,
+/// but only when it is the *sole* subquery anywhere in `e` (any scalar `(SELECT …)`,
+/// `EXISTS`, or a second `IN (SELECT)` returns `None`, so the shared subquery-id
+/// counter stays a clean `1`). Used to render a `LIST SUBQUERY 1` + `CREATE BLOOM
+/// FILTER` node.
+fn single_where_in_select(e: &Expr) -> Option<(&Select, bool, &Expr)> {
+    let mut ins = Vec::new();
+    let mut other = false;
+    collect_in_selects(e, &mut ins, &mut other);
+    if other || ins.len() != 1 {
+        return None;
+    }
+    ins.into_iter().next()
+}
+
 /// Reconstruct the `sql` text stored in `sqlite_schema` for a `CREATE` statement
 /// the way SQLite does (`sqlite3EndTable` / `sqlite3CreateIndex`): a regenerated
 /// `CREATE <TYPE> ` head — which drops `TEMP`/`IF NOT EXISTS` and normalises the
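The "sole subquery" rule implemented by the two functions above can be exercised on a much smaller tree. This is a sketch only: the `Expr` enum here is a hypothetical cut-down stand-in for graphite's real type, subquery bodies are reduced to a string label, and `collect` / `single_in_select` are illustrative names.

```rust
/// Hypothetical, cut-down expression tree for illustration.
enum Expr {
    Column(&'static str),
    Binary(Box<Expr>, Box<Expr>),
    /// `expr [NOT] IN (SELECT …)`; `body` is a label for the subquery.
    InSelect { expr: Box<Expr>, body: &'static str, negated: bool },
    /// A scalar `(SELECT …)` or `EXISTS (…)`.
    OtherSubquery,
}

/// Record each `IN (SELECT …)` as `(body, negated)` and flag any other subquery.
fn collect<'a>(e: &'a Expr, ins: &mut Vec<(&'a str, bool)>, other: &mut bool) {
    match e {
        Expr::OtherSubquery => *other = true,
        Expr::InSelect { expr, body, negated } => {
            ins.push((*body, *negated));
            // Descend into the operand only, never into the subquery body.
            collect(expr, ins, other);
        }
        Expr::Binary(l, r) => {
            collect(l, ins, other);
            collect(r, ins, other);
        }
        Expr::Column(_) => {}
    }
}

/// `Some` only when exactly one `IN (SELECT …)` is the sole subquery in `e`.
fn single_in_select(e: &Expr) -> Option<(&str, bool)> {
    let mut ins = Vec::new();
    let mut other = false;
    collect(e, &mut ins, &mut other);
    if other || ins.len() != 1 {
        return None;
    }
    ins.into_iter().next()
}
```

Returning `None` for any second subquery is what lets the caller hard-code the `1` in `LIST SUBQUERY 1`: with only one subquery in the WHERE, there is nothing else that could have claimed that id.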

tests/eqp_in_subquery.rs

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
+//! B9a — a single non-correlated `[NOT] IN (SELECT …)` in the WHERE renders a
+//! `LIST SUBQUERY 1` node (its body plan + a `CREATE BLOOM FILTER` child) after the
+//! table access, matching SQLite. graphite used to render just the bare access.
+//!
+//! Rendered only where the whole plan is provably byte-exact: graphite's access is a
+//! bare `SCAN` (so there is no seek to diverge from SQLite's cost-model choice), and
+//! either the form is `NOT IN` (which never seeks the IN column) or the IN column is
+//! not seekable (so SQLite also scans). A *positive* `IN` on an indexed / rowid column
+//! — which SQLite serves with a per-candidate `SEARCH` — declines (that seek is a
+//! separate follow-up), as do a correlated / compound / cross-position subquery.
+//! Verified vs the sqlite3 3.50.4 CLI.
+
+#![cfg(feature = "std")]
+
+use std::process::Command;
+
+fn sqlite3_available() -> bool {
+    Command::new("sqlite3").arg("--version").output().is_ok()
+}
+
+fn plan(bin: &str, base: &str, sql: &str) -> String {
+    let full = format!("{base} EXPLAIN QUERY PLAN {sql}");
+    let out = Command::new(bin)
+        .arg(":memory:")
+        .arg(&full)
+        .output()
+        .unwrap();
+    String::from_utf8_lossy(&out.stdout)
+        .lines()
+        .filter(|l| !l.trim().is_empty() && !l.starts_with("QUERY PLAN"))
+        .map(|l| l.trim_start_matches(|c: char| " |`*+_-".contains(c)))
+        .collect::<Vec<_>>()
+        .join("#")
+}
+
+fn rows(bin: &str, base: &str, sql: &str) -> String {
+    let full = format!("{base} {sql}");
+    let out = Command::new(bin)
+        .arg(":memory:")
+        .arg(&full)
+        .output()
+        .unwrap();
+    String::from_utf8_lossy(&out.stdout).trim_end().to_string()
+}
+
+// `b` is indexed, `c`/`d` are not; `u`/`w` are the subquery sources.
+const BASE: &str = "CREATE TABLE t(a INTEGER PRIMARY KEY, b, c, d); CREATE INDEX tb ON t(b); \
+    CREATE TABLE u(x,y); CREATE INDEX ux ON u(x); CREATE TABLE w(z);";
+
+#[test]
+fn in_subquery_list_subquery_matches_sqlite() {
+    if !sqlite3_available() {
+        eprintln!("sqlite3 CLI not found; skipping");
+        return;
+    }
+    let g = env!("CARGO_BIN_EXE_graphitesql");
+    for q in [
+        // NOT IN never seeks the IN column → the access stays SCAN in both.
+        "SELECT * FROM t WHERE b NOT IN (SELECT y FROM u)",
+        "SELECT * FROM t WHERE c NOT IN (SELECT z FROM w)",
+        "SELECT * FROM t WHERE b NOT IN (SELECT z FROM w WHERE z>3)",
+        "SELECT * FROM t WHERE b NOT IN (SELECT y FROM u) AND c=5",
+        // Positive IN on an UNINDEXED column → SQLite also scans.
+        "SELECT * FROM t WHERE d IN (SELECT y FROM u)",
+        "SELECT d FROM t WHERE d IN (SELECT y FROM u)",
+    ] {
+        assert_eq!(plan("sqlite3", BASE, q), plan(g, BASE, q), "plan for {q}");
+    }
+}
+
+#[test]
+fn in_subquery_out_of_subset_declines_to_bare_scan() {
+    // These render a SEARCH / different node shape in sqlite that graphite doesn't
+    // reproduce yet; graphite must keep its prior bare `SCAN t` (no LIST SUBQUERY /
+    // bloom node emitted into a non-matching plan).
+    let g = env!("CARGO_BIN_EXE_graphitesql");
+    for q in [
+        "SELECT * FROM t WHERE b IN (SELECT y FROM u)", // positive IN on indexed col → SQLite SEARCHes
+        "SELECT * FROM t WHERE a IN (SELECT x FROM u)", // rowid seek
+        "SELECT * FROM t WHERE rowid IN (SELECT x FROM u)",
+        "SELECT * FROM t WHERE b IN (SELECT y FROM u WHERE u.y=t.a)", // correlated
+        "SELECT * FROM t WHERE b NOT IN (SELECT y FROM u UNION SELECT x FROM u)", // compound body
+        "SELECT * FROM t WHERE d IN (SELECT y FROM u) AND (SELECT count(*) FROM w)>0", // cross-position
+    ] {
+        assert_eq!(plan(g, BASE, q), "SCAN t", "expected bare SCAN for {q}");
+    }
+}
+
+#[test]
+fn in_subquery_rows_unaffected() {
+    if !sqlite3_available() {
+        eprintln!("sqlite3 CLI not found; skipping");
+        return;
+    }
+    let g = env!("CARGO_BIN_EXE_graphitesql");
+    let base = format!(
+        "{BASE} INSERT INTO t VALUES(1,10,5,7),(2,20,6,8),(3,30,7,7),(4,40,8,9); \
+         INSERT INTO u VALUES(7,7),(8,8);"
+    );
+    for q in [
+        "SELECT a FROM t WHERE b NOT IN (SELECT y FROM u) ORDER BY a",
+        "SELECT a FROM t WHERE d IN (SELECT y FROM u) ORDER BY a",
+        "SELECT count(*) FROM t WHERE c NOT IN (SELECT x FROM u)",
+    ] {
+        assert_eq!(rows("sqlite3", &base, q), rows(g, &base, q), "rows for {q}");
+    }
+}
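The `plan` helper in the test file compares plans after flattening the CLI's tree output. The sketch below isolates that normalization as a pure function so its effect is easier to see; `normalize_plan` is a hypothetical name, and the sample input used in the note afterwards is an illustrative tree in the `sqlite3` CLI's usual drawing style, not output captured from either binary.

```rust
/// Flatten `EXPLAIN QUERY PLAN` CLI output to one `#`-joined line: drop the
/// `QUERY PLAN` header and blank lines, then strip each row's leading
/// tree-drawing characters (spaces, `|`, backtick, `*`, `+`, `_`, `-`).
fn normalize_plan(stdout: &str) -> String {
    stdout
        .lines()
        .filter(|l| !l.trim().is_empty() && !l.starts_with("QUERY PLAN"))
        .map(|l| l.trim_start_matches(|c: char| " |`*+_-".contains(c)))
        .collect::<Vec<_>>()
        .join("#")
}
```

Given a five-line tree with a `QUERY PLAN` header, `SCAN t`, `LIST SUBQUERY 1`, and the two children `SCAN u` and `CREATE BLOOM FILTER`, this yields `SCAN t#LIST SUBQUERY 1#SCAN u#CREATE BLOOM FILTER`. One consequence of this design is that nesting depth is discarded, so the comparison checks row text and order but not parentage.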

tests/eqp_where_scalar_subquery.rs

Lines changed: 2 additions & 2 deletions
@@ -173,8 +173,8 @@ fn where_scalar_subquery_declines_unrenderable() {
         "SELECT * FROM t WHERE a=(SELECT x FROM u WHERE x=a)",
         // EXISTS → CORRELATED SCALAR SUBQUERY (a different node).
         "SELECT * FROM t WHERE EXISTS(SELECT 1 FROM u WHERE x=a)",
-        // IN (SELECT) → LIST SUBQUERY + CREATE BLOOM FILTER.
-        "SELECT * FROM t WHERE b IN (SELECT x FROM u)",
+        // (`b IN (SELECT x FROM u)` now renders LIST SUBQUERY + CREATE BLOOM FILTER;
+        // see B9a and `tests/eqp_in_subquery.rs`.)
         // A CTE reference bumps the id counter → SQLite numbers it 2.
         "WITH cte AS (SELECT x FROM u) SELECT * FROM t WHERE a=(SELECT min(x) FROM cte)",
         // A compound (UNION) body bumps the counter too.
