You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
feat(sql): aggregates / GROUP BY / DISTINCT / HAVING over JOIN results (SQLR-6) (#164)
Generalize the SQLR-3 aggregation pipeline from (table, rowid) to the
RowScope trait, so the joined row stream feeds the same accumulator the
single-table path uses:
- aggregate_rows is now generic over an iterator of RowScopes; group
keys and aggregate args resolve through scope.lookup, so NULL-padded
outer-join rows group under NULL and COUNT(col) skips their NULLs.
- GROUP BY keys carry an optional t. qualifier (GROUP BY customers.name)
via the new GroupByKey struct; AggregateArg::Column keeps its
qualifier too (SUM(orders.amount)).
- The bare-column-must-be-in-GROUP-BY check stays in the parser for
single-table queries and moves to the executor for joined ones, where
qualifier resolution needs the schemas (resolve_scope_column).
- SELECT DISTINCT over a join dedupes the projected output rows, with
LIMIT deferred past the dedupe (mirrors the single-table path).
- HAVING composes over joins through the shared
lower_having_into_hidden_slots + run_aggregation_pipeline helpers.
- Bonus fix: SELECT * FROM t GROUP BY c used to panic on the
'validated to be in GROUP BY' expect (parser validation skips
Projection::All); it now surfaces the standard 'must appear in
GROUP BY' error.
- Stale 'HAVING is not yet supported' error message and docs updated.
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Copy file name to clipboardExpand all lines: README.md
+2-2Lines changed: 2 additions & 2 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -175,7 +175,7 @@ sqlrite> DELETE FROM users WHERE age < 30;
175
175
|`CREATE TABLE`|`PRIMARY KEY`, `UNIQUE`, `NOT NULL`; `IF NOT EXISTS` (idempotent re-create); duplicate-column detection; types `INTEGER`/`INT`/`BIGINT`/`SMALLINT`, `TEXT`/`VARCHAR`, `REAL`/`FLOAT`/`DOUBLE`/`DECIMAL`, `BOOLEAN`. Auto-creates `sqlrite_autoindex_<table>_<col>` for every PK + UNIQUE column |
176
176
|`CREATE [UNIQUE] INDEX`| Single-column, named indexes; `IF NOT EXISTS`; persists as a dedicated cell-based B-Tree. INTEGER + TEXT columns only |
177
177
|`INSERT INTO`| Explicit column list required; auto-ROWID for `INTEGER PRIMARY KEY`; multi-row `VALUES (…), (…)`; UNIQUE enforcement; clean type errors (no panics); NULL padding for omitted columns |
178
-
|`SELECT`|`*` or column list with optional `AS alias`; `WHERE`; `DISTINCT`; `GROUP BY col[, col …]`; aggregate projections `COUNT(*)` / `COUNT([DISTINCT] col)` / `SUM` / `AVG` / `MIN` / `MAX`; `[INNER\|LEFT OUTER\|RIGHT OUTER\|FULL OUTER\|CROSS] JOIN` with `ON ...` / `USING (...)` / `NATURAL` constraints, table aliases and qualified `t.col` references; single-column `ORDER BY [ASC\|DESC]` (also resolves alias and aggregate display names); `LIMIT n`. `WHERE col = literal` probes an index when one exists. Catalog introspection via `SELECT … FROM sqlrite_master`|
178
+
|`SELECT`|`*` or column list with optional `AS alias`; `WHERE`; `DISTINCT`; `GROUP BY col[, t.col …]` (qualified keys allowed); `HAVING`; aggregate projections `COUNT(*)` / `COUNT([DISTINCT] col)` / `SUM` / `AVG` / `MIN` / `MAX`; `[INNER\|LEFT OUTER\|RIGHT OUTER\|FULL OUTER\|CROSS] JOIN` with `ON ...` / `USING (...)` / `NATURAL` constraints, table aliases and qualified `t.col` references — aggregates / `GROUP BY` / `DISTINCT` / `HAVING` all compose over join results; single-column `ORDER BY [ASC\|DESC]` (also resolves alias and aggregate display names); `LIMIT n`. `WHERE col = literal` probes an index when one exists. Catalog introspection via `SELECT … FROM sqlrite_master`|
179
179
|`UPDATE`| Multi-column `SET`; `WHERE`; UNIQUE + type enforcement; arithmetic in assignments (`SET age = age + 1`) |
180
180
|`DELETE`|`WHERE` predicate or full-table delete |
181
181
|`BEGIN` / `COMMIT` / `ROLLBACK`| Real transactions, snapshot-based; WAL-backed commit; single-level (no savepoints); auto-rollback if `COMMIT`'s disk write fails |
@@ -193,7 +193,7 @@ Expressions in `WHERE` and `UPDATE`'s `SET` RHS:
193
193
- String concat — `||`
194
194
- Literals — integer + real numbers, `'single-quoted strings'`, `TRUE` / `FALSE`, `NULL`; parentheses for grouping
195
195
196
-
**Not yet supported** (common ones): subqueries, CTEs, `HAVING`, `LIKE … ESCAPE '<char>'`, `IN (subquery)`, `DISTINCT` on `SUM`/`AVG`/`MIN`/`MAX`, GROUP BY on expressions, expressions in the projection list, `OFFSET`, multi-column `ORDER BY`, savepoints, comma joins (`FROM a, b`), aggregates / DISTINCT / GROUP BY *over* JOIN results. The [full list with context](docs/supported-sql.md#not-yet-supported) lives in the reference.
196
+
**Not yet supported** (common ones): subqueries, CTEs, `HAVING` without `GROUP BY`, `LIKE … ESCAPE '<char>'`, `IN (subquery)`, `DISTINCT` on `SUM`/`AVG`/`MIN`/`MAX`, GROUP BY on expressions, expressions in the projection list, `OFFSET`, multi-column `ORDER BY`, savepoints, comma joins (`FROM a, b`). The [full list with context](docs/supported-sql.md#not-yet-supported) lives in the reference.
Copy file name to clipboardExpand all lines: docs/architecture.md
-1Lines changed: 0 additions & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -139,7 +139,6 @@ Steps 1–7 are purely in-memory; step 8 is the only disk contact, and after the
139
139
The roadmap has shipped far enough that the original "deliberately missing" list mostly turned into shipped features. What's still left:
140
140
141
141
-**No query optimizer** beyond the bounded-heap top-k pass for KNN (Phase 7c) and the HNSW probe shortcut (7d.2). Equality-on-PK probes are direct; everything else is a table scan. Joins use plain nested-loop (O(N×M) per join level); hash / merge joins on equi-join shapes are a future increment.
142
-
-**Aggregates / GROUP BY / DISTINCT over joined results.** The single-table aggregator is wired against one rowid stream; the multi-table join executor produces joined rows but doesn't yet feed them through the aggregator. Surfaces as a clean `NotImplemented` at parse time. The single-table aggregation path (SQLR-3) is fully shipped.
143
142
-**No network layer.** SQLRite is embedded-only. The closest thing is the [`sqlrite-mcp`](mcp.md) server, which is stdio (not network). A real wire protocol isn't on the roadmap.
144
143
-**No streaming row cursor.**`Rows` is currently backed by an eager `Vec` (Phase 5a). The `Rows::next` API is shaped to support a real cursor — the swap is deferred to **5a.2**.
Copy file name to clipboardExpand all lines: docs/roadmap.md
+2-2Lines changed: 2 additions & 2 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -561,7 +561,7 @@ The biggest single SQL-surface jump in the project's history.
561
561
- Self-joins require an alias on at least one side.
562
562
-`WHERE` runs after joins (the standard `LEFT JOIN ... WHERE right.col IS NULL` anti-join idiom works).
563
563
564
-
`ON`, `USING (...)`, `NATURAL`, and `CROSS JOIN` are all supported. Not yet supported: comma-separated FROMs (`FROM a, b`), aggregates / `GROUP BY` / `DISTINCT`*over* a join, `fts_match` / `bm25_score` inside a join expression. Algorithm: plain nested-loop, O(N×M) per level — hash / merge joins are a future optimization.
564
+
`ON`, `USING (...)`, `NATURAL`, and `CROSS JOIN` are all supported, and aggregates / `GROUP BY` / `DISTINCT` / `HAVING` compose over join results (SQLR-6). Not yet supported: comma-separated FROMs (`FROM a, b`), `fts_match` / `bm25_score` inside a join expression. Algorithm: plain nested-loop, O(N×M) per level — hash / merge joins are a future optimization.
-~~`HAVING` (post-aggregation filter)~~ ✅ Shipped (SQLR-52) — group-row filter after aggregation; references GROUP BY keys, aggregate aliases, and direct aggregate calls (hidden-slot computation for HAVING-only aggregates). `HAVING` without `GROUP BY` stays rejected in v0.
738
738
-`CASE WHEN … THEN … END`, `BETWEEN`, `GLOB`, `REGEXP`, `LIKE … ESCAPE '<char>'`
739
-
- Aggregates / `GROUP BY` / `DISTINCT`*over* joins (needs a single executor pass that knows about multiple input streams)
739
+
-~~Aggregates / `GROUP BY` / `DISTINCT`*over* joins~~ ✅ Shipped (SQLR-6) — the joined row stream feeds the same scope-generic aggregation pipeline the single-table path uses; `GROUP BY` keys accept `t.col` qualifiers; `HAVING` and `SELECT DISTINCT` compose too.
Copy file name to clipboardExpand all lines: docs/sql-engine.md
+10-2Lines changed: 10 additions & 2 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -49,11 +49,11 @@ The `sqlparser` AST is designed to cover every SQL dialect, so its types are hug
49
49
50
50
`UPDATE` and `DELETE` don't have a dedicated internal struct — the executor pattern-matches the sqlparser types directly because there's less transformation needed.
51
51
52
-
`SelectQuery::projection` is now `Projection::All | Projection::Items(Vec<ProjectionItem>)`, where each item carries a `ProjectionKind::Column { qualifier, name }` (qualifier is `Some` for `t.col` shapes, used by JOIN execution to disambiguate) or `ProjectionKind::Aggregate(AggregateCall)` plus an optional `AS alias`. `AggregateCall` covers `COUNT(*)`, `COUNT([DISTINCT] col)`, `SUM` / `AVG` / `MIN` / `MAX` of a bare column.`group_by` is a `Vec<String>` of bare column names (empty = no GROUP BY); the parser validates that every non-aggregate projection item appears in `GROUP BY`.
52
+
`SelectQuery::projection` is now `Projection::All | Projection::Items(Vec<ProjectionItem>)`, where each item carries a `ProjectionKind::Column { qualifier, name }` (qualifier is `Some` for `t.col` shapes, used by JOIN execution to disambiguate) or `ProjectionKind::Aggregate(AggregateCall)` plus an optional `AS alias`. `AggregateCall` covers `COUNT(*)`, `COUNT([DISTINCT] col)`, `SUM` / `AVG` / `MIN` / `MAX` of a column reference (optionally qualified, `SUM(o.amount)`).`group_by` is a `Vec<GroupByKey>` of optionally-qualified column references (`GROUP BY dept`, `GROUP BY customers.name`; empty = no GROUP BY). The parser validates that every non-aggregate projection item appears in `GROUP BY` for single-table queries; joined queries defer that check to the executor, which resolves qualifiers against the in-scope table schemas (SQLR-6).
53
53
54
54
`SelectQuery::joins` (SQLR-5) is a `Vec<JoinClause>` evaluated left-to-right by `execute_select_rows_joined`. Each clause carries a `JoinType` (`Inner` / `LeftOuter` / `RightOuter` / `FullOuter`), the right-table name + optional alias, and a required `ON` expression. Empty = single-table SELECT, the existing fast path with HNSW / FTS / bounded-heap optimizations.
55
55
56
-
Each parser module still rejects features we don't implement with `SQLRiteError::NotImplemented` — comma joins (`FROM a, b`), aggregates / GROUP BY / DISTINCT over JOINs, `HAVING`, `DISTINCT ON (...)`, `GROUP BY` on expressions, `LIKE … ESCAPE '<char>'`, `IN (subquery)`, `OFFSET`, multi-table DELETE, tuple assignment targets, etc. These errors carry the feature name in the message so the user knows what isn't there. (`JOIN ... USING`, `NATURAL JOIN`, and `CROSS JOIN` are now supported — see [`supported-sql.md`](supported-sql.md#join-semantics-sqlr-5).)
56
+
Each parser module still rejects features we don't implement with `SQLRiteError::NotImplemented` — comma joins (`FROM a, b`), `HAVING` without `GROUP BY`, `DISTINCT ON (...)`, `GROUP BY` on expressions, `LIKE … ESCAPE '<char>'`, `IN (subquery)`, `OFFSET`, multi-table DELETE, tuple assignment targets, etc. These errors carry the feature name in the message so the user knows what isn't there. (`JOIN ... USING`, `NATURAL JOIN`, and `CROSS JOIN` are now supported — see [`supported-sql.md`](supported-sql.md#join-semantics-sqlr-5).)
57
57
58
58
## Statement dispatch
59
59
@@ -176,6 +176,14 @@ contributes to its group), so the executor takes a separate path:
176
176
against the *output* row by alias, bare column name, or aggregate
177
177
display form), then LIMIT.
178
178
179
+
SQLR-6 made steps 2–6 scope-generic: the accumulator consumes
180
+
`RowScope`s instead of `(table, rowid)` pairs, so a joined SELECT feeds
181
+
its fully-joined, WHERE-filtered rows (each one a `JoinedScope`)
182
+
through the exact same pipeline. `GROUP BY` keys and aggregate args
183
+
carry an optional `t.` qualifier for disambiguation; NULL-padded
184
+
outer-join rows group under a `NULL` key and are skipped by
185
+
`COUNT(col)` like any other NULL.
186
+
179
187
Aggregate function names (`COUNT`/`SUM`/`AVG`/`MIN`/`MAX`) used in WHERE
180
188
or any other scalar position get a friendly error redirecting the user
181
189
to the projection list (`HAVING` is where post-aggregate filters go).
-**Projection**: `*` (all columns in declaration order), a bare column list, or an explicit list mixing bare columns and aggregate calls. Each item can carry an optional `AS alias` (the alias becomes the output column header and is recognized by `ORDER BY`).
205
205
-**`WHERE`**: any [expression](#expressions). Evaluated per row; NULL-as-false in WHERE context (three-valued logic collapsed to two-valued for filtering). Includes **`IS NULL`** / **`IS NOT NULL`** for explicit null tests, **`LIKE` / `NOT LIKE` / `ILIKE`** for pattern matching, and **`IN (list) / NOT IN (list)`** for set-membership against literal lists.
206
206
-**`DISTINCT`**: `SELECT DISTINCT` deduplicates result rows after projection (and after aggregation, when both apply). `NULL` values compare equal to other `NULL`s for dedupe, matching SQL's DISTINCT semantic.
207
-
-**`GROUP BY`**: one or more bare column names. Every non-aggregate item in the projection must appear in the `GROUP BY` list (the parser rejects the violation with a clear message). `GROUP BY <col>` without any aggregate behaves like an implicit `DISTINCT <col>`.
207
+
-**`GROUP BY`**: one or more column names, optionally qualified (`GROUP BY customers.name`) — the qualifier disambiguates same-named columns across joined tables (SQLR-6). Every non-aggregate item in the projection must appear in the `GROUP BY` list (rejected with a clear message — by the parser for single-table queries, by the executor for joined ones, where resolving qualifiers needs the schemas). `GROUP BY <col>` without any aggregate behaves like an implicit `DISTINCT <col>`.
208
208
-**`HAVING`** (SQLR-52): post-aggregation filter over the grouped output. `WHERE` filters rows before grouping; `HAVING` filters groups after aggregation. Requires `GROUP BY` (see [HAVING semantics](#having-semantics-sqlr-52)).
209
209
-**Aggregates** (SQLR-3): `COUNT(*)`, `COUNT(col)`, `COUNT(DISTINCT col)`, `SUM(col)`, `AVG(col)`, `MIN(col)`, `MAX(col)`. `SUM` over an integer column stays `INTEGER` until a `REAL` input arrives or the running sum overflows `i64` (one-time promotion to `REAL`). `AVG` always returns `REAL` (or `NULL` on empty / all-NULL groups). `MIN` / `MAX` skip NULLs and use the same total order as `ORDER BY`. Aggregates over an empty table or empty group return `0` for `COUNT(*)` / `COUNT(col)` and `NULL` for the rest.
210
210
-**`ORDER BY`**: single sort key, `ASC` (default) or `DESC`. For non-aggregating queries the key is any expression — including function calls — so KNN queries like `ORDER BY vec_distance_l2(embedding, [...]) LIMIT k` work end-to-end *(Phase 7b)*. For aggregating queries the key resolves against the *output* row by name: a bare identifier matches an alias or a `GROUP BY` column, and a function call like `COUNT(*)` matches an aggregate projection by its canonical display form. Sort key types must match across rows.
@@ -237,12 +237,12 @@ conditions, plus `CROSS JOIN`:
237
237
-**Self-joins** require an alias on at least one side: `FROM nodes AS p INNER JOIN nodes AS c ON p.id = c.parent_id`. Without one, you get a `duplicate table reference` error so qualifiers stay unambiguous.
238
238
-**`WHERE` runs after joins.** A `WHERE right.col IS NULL` filter on a `LEFT JOIN` correctly returns left rows with no match (the standard "anti-join via outer-join" idiom).
239
239
-**`ORDER BY` and `LIMIT`** apply to the fully joined row stream.
240
+
-**Aggregates / `GROUP BY` / `DISTINCT` / `HAVING` over joins** (SQLR-6): the fully-joined, `WHERE`-filtered row stream feeds the same aggregation pipeline single-table queries use. `GROUP BY` keys may be qualified (`GROUP BY customers.name`) and must resolve unambiguously; NULL-padded outer-join rows group under a `NULL` key, and `COUNT(col)` skips their NULLs while `COUNT(*)` counts them. `SELECT DISTINCT` dedupes the projected join output (with `LIMIT` applied after the dedupe).
240
241
-**Algorithm:** plain nested-loop join, O(N×M) per join level. Adequate for an embedded learning database; hash / merge joins on equi-join shapes are a future optimization.
241
242
242
243
#### What's not supported in JOINs
243
244
244
245
- Comma-separated FROM lists (`FROM a, b`) — use an explicit `JOIN` / `CROSS JOIN` instead.
245
-
- Aggregates / `GROUP BY` / `DISTINCT`*over* a join. The single-table aggregator is wired against one rowid stream; rewiring it for joined rows is a separate increment. Surfaces as a clean `NotImplemented` at parse time.
246
246
-`fts_match` / `bm25_score` inside a JOIN expression. They need to look up an FTS index by column, which is single-table-bound today. Use them on a single-table SELECT first, or fold the FTS lookup into the FROM side.
247
247
248
248
### Index probing
@@ -281,7 +281,6 @@ SELECT dept FROM emp GROUP BY dept HAVING COUNT(*) > 1 AND SUM(salary) > 100;
281
281
### What doesn't work
282
282
283
283
-**Comma-separated FROM lists** (`FROM a, b`) — use an explicit `JOIN` / `CROSS JOIN`. `INNER` / `LEFT` / `RIGHT` / `FULL OUTER` / `CROSS` with `ON` / `USING` / `NATURAL` are all supported (see [JOIN semantics](#join-semantics-sqlr-5))
284
-
-**Aggregates** / **`GROUP BY`** / **`DISTINCT`** over a JOIN — pipe through a subquery once subqueries land
285
284
-**Subqueries**, CTEs (`WITH`), views
286
285
-**`HAVING` without `GROUP BY`** — the degenerate single-group form is rejected; `HAVING` with `GROUP BY` works (see [HAVING semantics](#having-semantics-sqlr-52))
287
286
-**`DISTINCT`** on `SUM` / `AVG` / `MIN` / `MAX` (only `COUNT(DISTINCT col)` is supported)
@@ -725,8 +724,7 @@ A REPL launched with `sqlrite --readonly foo.sqlrite` (or `sqlrite::open_databas
725
724
For context when you hit `NotImplemented`. See [Roadmap](roadmap.md) for when these land:
726
725
727
726
### Joins & composition
728
-
-`INNER` / `LEFT` / `RIGHT` / `FULL OUTER` / `CROSS JOIN` with `ON` / `USING (...)` / `NATURAL` all work (SQLR-5). Comma-separated FROM joins (`FROM a, b`) don't — use an explicit `JOIN` / `CROSS JOIN`
729
-
- Aggregates / `GROUP BY` / `DISTINCT`*over* a JOIN — pipe through a subquery once subqueries land
727
+
-`INNER` / `LEFT` / `RIGHT` / `FULL OUTER` / `CROSS JOIN` with `ON` / `USING (...)` / `NATURAL` all work (SQLR-5), and aggregates / `GROUP BY` / `DISTINCT` / `HAVING` compose over join results (SQLR-6). Comma-separated FROM joins (`FROM a, b`) don't — use an explicit `JOIN` / `CROSS JOIN`
730
728
-`fts_match` / `bm25_score` inside a JOIN expression — single-table-bound today
Copy file name to clipboardExpand all lines: docs/usage.md
+1-1Lines changed: 1 addition & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -74,7 +74,7 @@ Quick hits worth knowing when you're working at the REPL:
74
74
-**Arithmetic stays honest.** Integer-only operations stay integer; any `REAL` operand promotes to `f64`; divide-by-zero is a typed runtime error, never a panic.
75
75
-**NULL follows three-valued logic.**`NULL = NULL` is unknown (not true) — treated as false in `WHERE`. Use `IS NULL` / `IS NOT NULL` for explicit null tests, e.g. `SELECT id FROM t WHERE qty IS NULL;`.
76
76
-**Identifiers are case-sensitive** (table / column names; no normalization), but keywords aren't. String literals preserve case.
77
-
-**Not yet supported**: joins, subqueries, `GROUP BY` / aggregates, `DISTINCT`, `LIKE` / `IN`, projection expressions, column aliases, `OFFSET`, multi-column `ORDER BY`, savepoints, `ALTER TABLE`, `DROP TABLE`, `DROP INDEX`. See the [full list in the reference](supported-sql.md#not-yet-supported).
77
+
-**Not yet supported**: subqueries, CTEs, views, comma joins (`FROM a, b`), projection expressions beyond aggregate calls, `OFFSET`, multi-column `ORDER BY`, savepoints. See the [full list in the reference](supported-sql.md#not-yet-supported).
0 commit comments