Commit 79eff1d

joaoh82 and claude authored
feat(sql): aggregates / GROUP BY / DISTINCT / HAVING over JOIN results (SQLR-6) (#164)
Generalize the SQLR-3 aggregation pipeline from (table, rowid) to the RowScope trait, so the joined row stream feeds the same accumulator the single-table path uses:

- aggregate_rows is now generic over an iterator of RowScopes; group keys and aggregate args resolve through scope.lookup, so NULL-padded outer-join rows group under NULL and COUNT(col) skips their NULLs.
- GROUP BY keys carry an optional t. qualifier (GROUP BY customers.name) via the new GroupByKey struct; AggregateArg::Column keeps its qualifier too (SUM(orders.amount)).
- The bare-column-must-be-in-GROUP-BY check stays in the parser for single-table queries and moves to the executor for joined ones, where qualifier resolution needs the schemas (resolve_scope_column).
- SELECT DISTINCT over a join dedupes the projected output rows, with LIMIT deferred past the dedupe (mirrors the single-table path).
- HAVING composes over joins through the shared lower_having_into_hidden_slots + run_aggregation_pipeline helpers.
- Bonus fix: SELECT * FROM t GROUP BY c used to panic on the 'validated to be in GROUP BY' expect (parser validation skips Projection::All); it now surfaces the standard 'must appear in GROUP BY' error.
- Stale 'HAVING is not yet supported' error message and docs updated.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
1 parent e67ca01 commit 79eff1d

11 files changed

Lines changed: 666 additions & 180 deletions

README.md

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -175,7 +175,7 @@ sqlrite> DELETE FROM users WHERE age < 30;
175175
| `CREATE TABLE` | `PRIMARY KEY`, `UNIQUE`, `NOT NULL`; `IF NOT EXISTS` (idempotent re-create); duplicate-column detection; types `INTEGER`/`INT`/`BIGINT`/`SMALLINT`, `TEXT`/`VARCHAR`, `REAL`/`FLOAT`/`DOUBLE`/`DECIMAL`, `BOOLEAN`. Auto-creates `sqlrite_autoindex_<table>_<col>` for every PK + UNIQUE column |
176176
| `CREATE [UNIQUE] INDEX` | Single-column, named indexes; `IF NOT EXISTS`; persists as a dedicated cell-based B-Tree. INTEGER + TEXT columns only |
177177
| `INSERT INTO` | Explicit column list required; auto-ROWID for `INTEGER PRIMARY KEY`; multi-row `VALUES (…), (…)`; UNIQUE enforcement; clean type errors (no panics); NULL padding for omitted columns |
178-
| `SELECT` | `*` or column list with optional `AS alias`; `WHERE`; `DISTINCT`; `GROUP BY col[, col …]`; aggregate projections `COUNT(*)` / `COUNT([DISTINCT] col)` / `SUM` / `AVG` / `MIN` / `MAX`; `[INNER\|LEFT OUTER\|RIGHT OUTER\|FULL OUTER\|CROSS] JOIN` with `ON ...` / `USING (...)` / `NATURAL` constraints, table aliases and qualified `t.col` references; single-column `ORDER BY [ASC\|DESC]` (also resolves alias and aggregate display names); `LIMIT n`. `WHERE col = literal` probes an index when one exists. Catalog introspection via `SELECT … FROM sqlrite_master` |
178+
| `SELECT` | `*` or column list with optional `AS alias`; `WHERE`; `DISTINCT`; `GROUP BY col[, t.col …]` (qualified keys allowed); `HAVING`; aggregate projections `COUNT(*)` / `COUNT([DISTINCT] col)` / `SUM` / `AVG` / `MIN` / `MAX`; `[INNER\|LEFT OUTER\|RIGHT OUTER\|FULL OUTER\|CROSS] JOIN` with `ON ...` / `USING (...)` / `NATURAL` constraints, table aliases and qualified `t.col` references — aggregates / `GROUP BY` / `DISTINCT` / `HAVING` all compose over join results; single-column `ORDER BY [ASC\|DESC]` (also resolves alias and aggregate display names); `LIMIT n`. `WHERE col = literal` probes an index when one exists. Catalog introspection via `SELECT … FROM sqlrite_master` |
179179
| `UPDATE` | Multi-column `SET`; `WHERE`; UNIQUE + type enforcement; arithmetic in assignments (`SET age = age + 1`) |
180180
| `DELETE` | `WHERE` predicate or full-table delete |
181181
| `BEGIN` / `COMMIT` / `ROLLBACK` | Real transactions, snapshot-based; WAL-backed commit; single-level (no savepoints); auto-rollback if `COMMIT`'s disk write fails |
@@ -193,7 +193,7 @@ Expressions in `WHERE` and `UPDATE`'s `SET` RHS:
193193
- String concat — `||`
194194
- Literals — integer + real numbers, `'single-quoted strings'`, `TRUE` / `FALSE`, `NULL`; parentheses for grouping
195195

196-
**Not yet supported** (common ones): subqueries, CTEs, `HAVING`, `LIKE … ESCAPE '<char>'`, `IN (subquery)`, `DISTINCT` on `SUM`/`AVG`/`MIN`/`MAX`, GROUP BY on expressions, expressions in the projection list, `OFFSET`, multi-column `ORDER BY`, savepoints, comma joins (`FROM a, b`), aggregates / DISTINCT / GROUP BY *over* JOIN results. The [full list with context](docs/supported-sql.md#not-yet-supported) lives in the reference.
196+
**Not yet supported** (common ones): subqueries, CTEs, `HAVING` without `GROUP BY`, `LIKE … ESCAPE '<char>'`, `IN (subquery)`, `DISTINCT` on `SUM`/`AVG`/`MIN`/`MAX`, GROUP BY on expressions, expressions in the projection list, `OFFSET`, multi-column `ORDER BY`, savepoints, comma joins (`FROM a, b`). The [full list with context](docs/supported-sql.md#not-yet-supported) lives in the reference.
197197

198198
#### Meta commands
199199

docs/architecture.md

Lines changed: 0 additions & 1 deletion
@@ -139,7 +139,6 @@ Steps 1–7 are purely in-memory; step 8 is the only disk contact, and after the
139139
The roadmap has shipped far enough that the original "deliberately missing" list mostly turned into shipped features. What's still left:
140140

141141
- **No query optimizer** beyond the bounded-heap top-k pass for KNN (Phase 7c) and the HNSW probe shortcut (7d.2). Equality-on-PK probes are direct; everything else is a table scan. Joins use plain nested-loop (O(N×M) per join level); hash / merge joins on equi-join shapes are a future increment.
142-
- **Aggregates / GROUP BY / DISTINCT over joined results.** The single-table aggregator is wired against one rowid stream; the multi-table join executor produces joined rows but doesn't yet feed them through the aggregator. Surfaces as a clean `NotImplemented` at parse time. The single-table aggregation path (SQLR-3) is fully shipped.
143142
- **No network layer.** SQLRite is embedded-only. The closest thing is the [`sqlrite-mcp`](mcp.md) server, which is stdio (not network). A real wire protocol isn't on the roadmap.
144143
- **No streaming row cursor.** `Rows` is currently backed by an eager `Vec` (Phase 5a). The `Rows::next` API is shaped to support a real cursor — the swap is deferred to **5a.2**.
145144

docs/roadmap.md

Lines changed: 2 additions & 2 deletions
@@ -561,7 +561,7 @@ The biggest single SQL-surface jump in the project's history.
561561
- Self-joins require an alias on at least one side.
562562
- `WHERE` runs after joins (the standard `LEFT JOIN ... WHERE right.col IS NULL` anti-join idiom works).
563563

564-
`ON`, `USING (...)`, `NATURAL`, and `CROSS JOIN` are all supported. Not yet supported: comma-separated FROMs (`FROM a, b`), aggregates / `GROUP BY` / `DISTINCT` *over* a join, `fts_match` / `bm25_score` inside a join expression. Algorithm: plain nested-loop, O(N×M) per level — hash / merge joins are a future optimization.
564+
`ON`, `USING (...)`, `NATURAL`, and `CROSS JOIN` are all supported, and aggregates / `GROUP BY` / `DISTINCT` / `HAVING` compose over join results (SQLR-6). Not yet supported: comma-separated FROMs (`FROM a, b`), `fts_match` / `bm25_score` inside a join expression. Algorithm: plain nested-loop, O(N×M) per level — hash / merge joins are a future optimization.
565565

566566
### ✅ Phase 9g — Prepared statements + parameter binding *(v0.9.0, SQLR-23)*
567567

@@ -736,7 +736,7 @@ The remaining items — actually open, not retroactively rewritten:
736736
- Subqueries (scalar, `IN (SELECT ...)`, correlated) and CTEs (`WITH`, recursive)
737737
- ~~`HAVING` (post-aggregation filter)~~ ✅ Shipped (SQLR-52) — group-row filter after aggregation; references GROUP BY keys, aggregate aliases, and direct aggregate calls (hidden-slot computation for HAVING-only aggregates). `HAVING` without `GROUP BY` stays rejected in v0.
738738
- `CASE WHEN … THEN … END`, `BETWEEN`, `GLOB`, `REGEXP`, `LIKE … ESCAPE '<char>'`
739-
- Aggregates / `GROUP BY` / `DISTINCT` *over* joins (needs a single executor pass that knows about multiple input streams)
739+
- ~~Aggregates / `GROUP BY` / `DISTINCT` *over* joins~~ ✅ Shipped (SQLR-6) — the joined row stream feeds the same scope-generic aggregation pipeline the single-table path uses; `GROUP BY` keys accept `t.col` qualifiers; `HAVING` and `SELECT DISTINCT` compose too.
740740
- Multi-column / expression `ORDER BY`, `OFFSET`, `NULLS FIRST/LAST`
741741
- `UNION` / `INTERSECT` / `EXCEPT`, `INSERT ... SELECT`
742742
- Composite + expression indexes
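
The roadmap entry for `HAVING` mentions hidden-slot computation for HAVING-only aggregates. As a rough illustration of that idea (not the project's code), the sketch below appends any aggregate that appears only in `HAVING` as an extra slot after the projected ones, evaluates the filter against the full slot row, and truncates the hidden slots before output. The `Agg` enum, the `i64`-only rows, the closure predicate, and both function signatures are assumptions made for this example; only the helper name `lower_having_into_hidden_slots` comes from the commit message.

```rust
/// Hypothetical aggregate descriptor: COUNT(*) or SUM over an input column index.
#[derive(Debug, Clone, PartialEq)]
pub enum Agg {
    CountStar,
    Sum(usize),
}

/// Returns the slot list (projected aggregates first, hidden ones appended)
/// and, for each HAVING aggregate, the slot index that holds its value.
pub fn lower_having_into_hidden_slots(projected: &[Agg], having: &[Agg]) -> (Vec<Agg>, Vec<usize>) {
    let mut slots = projected.to_vec();
    let mut refs = Vec::new();
    for agg in having {
        // Reuse a slot when the same aggregate is already computed.
        let idx = match slots.iter().position(|s| s == agg) {
            Some(i) => i,
            None => {
                slots.push(agg.clone());
                slots.len() - 1
            }
        };
        refs.push(idx);
    }
    (slots, refs)
}

/// Evaluate one aggregate over the rows of a single group.
fn eval(agg: &Agg, group: &[Vec<i64>]) -> i64 {
    match agg {
        Agg::CountStar => group.len() as i64,
        Agg::Sum(col) => group.iter().map(|row| row[*col]).sum(),
    }
}

/// Compute one output row for a group, or `None` when HAVING rejects it.
pub fn run_group(
    projected: &[Agg],
    having: &[Agg],
    pred: impl Fn(&[i64]) -> bool,
    group: &[Vec<i64>],
) -> Option<Vec<i64>> {
    let (slots, refs) = lower_having_into_hidden_slots(projected, having);
    let mut row: Vec<i64> = slots.iter().map(|a| eval(a, group)).collect();
    let having_vals: Vec<i64> = refs.iter().map(|&i| row[i]).collect();
    if !pred(&having_vals) {
        return None;
    }
    row.truncate(projected.len()); // hidden slots never reach the output
    Some(row)
}
```

With `projected = [CountStar]` and `having = [CountStar, Sum(0)]`, the `COUNT(*)` slot is shared and only `SUM` becomes a hidden slot, so a group that passes the filter still yields a one-column output row.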

docs/sql-engine.md

Lines changed: 10 additions & 2 deletions
@@ -49,11 +49,11 @@ The `sqlparser` AST is designed to cover every SQL dialect, so its types are hug
4949

5050
`UPDATE` and `DELETE` don't have a dedicated internal struct — the executor pattern-matches the sqlparser types directly because there's less transformation needed.
5151

52-
`SelectQuery::projection` is now `Projection::All | Projection::Items(Vec<ProjectionItem>)`, where each item carries a `ProjectionKind::Column { qualifier, name }` (qualifier is `Some` for `t.col` shapes, used by JOIN execution to disambiguate) or `ProjectionKind::Aggregate(AggregateCall)` plus an optional `AS alias`. `AggregateCall` covers `COUNT(*)`, `COUNT([DISTINCT] col)`, `SUM` / `AVG` / `MIN` / `MAX` of a bare column. `group_by` is a `Vec<String>` of bare column names (empty = no GROUP BY); the parser validates that every non-aggregate projection item appears in `GROUP BY`.
52+
`SelectQuery::projection` is now `Projection::All | Projection::Items(Vec<ProjectionItem>)`, where each item carries a `ProjectionKind::Column { qualifier, name }` (qualifier is `Some` for `t.col` shapes, used by JOIN execution to disambiguate) or `ProjectionKind::Aggregate(AggregateCall)` plus an optional `AS alias`. `AggregateCall` covers `COUNT(*)`, `COUNT([DISTINCT] col)`, `SUM` / `AVG` / `MIN` / `MAX` of a column reference (optionally qualified, `SUM(o.amount)`). `group_by` is a `Vec<GroupByKey>` of optionally-qualified column references (`GROUP BY dept`, `GROUP BY customers.name`; empty = no GROUP BY). The parser validates that every non-aggregate projection item appears in `GROUP BY` for single-table queries; joined queries defer that check to the executor, which resolves qualifiers against the in-scope table schemas (SQLR-6).
5353

5454
`SelectQuery::joins` (SQLR-5) is a `Vec<JoinClause>` evaluated left-to-right by `execute_select_rows_joined`. Each clause carries a `JoinType` (`Inner` / `LeftOuter` / `RightOuter` / `FullOuter`), the right-table name + optional alias, and a required `ON` expression. Empty = single-table SELECT, the existing fast path with HNSW / FTS / bounded-heap optimizations.
5555

56-
Each parser module still rejects features we don't implement with `SQLRiteError::NotImplemented` — comma joins (`FROM a, b`), aggregates / GROUP BY / DISTINCT over JOINs, `HAVING`, `DISTINCT ON (...)`, `GROUP BY` on expressions, `LIKE … ESCAPE '<char>'`, `IN (subquery)`, `OFFSET`, multi-table DELETE, tuple assignment targets, etc. These errors carry the feature name in the message so the user knows what isn't there. (`JOIN ... USING`, `NATURAL JOIN`, and `CROSS JOIN` are now supported — see [`supported-sql.md`](supported-sql.md#join-semantics-sqlr-5).)
56+
Each parser module still rejects features we don't implement with `SQLRiteError::NotImplemented` — comma joins (`FROM a, b`), `HAVING` without `GROUP BY`, `DISTINCT ON (...)`, `GROUP BY` on expressions, `LIKE … ESCAPE '<char>'`, `IN (subquery)`, `OFFSET`, multi-table DELETE, tuple assignment targets, etc. These errors carry the feature name in the message so the user knows what isn't there. (`JOIN ... USING`, `NATURAL JOIN`, and `CROSS JOIN` are now supported — see [`supported-sql.md`](supported-sql.md#join-semantics-sqlr-5).)
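
The `group_by` paragraph above says joined queries defer the "must appear in `GROUP BY`" check to the executor because qualifiers have to be resolved against the in-scope schemas. The sketch below is one way such a resolution could look, under assumptions: `TableSchema`, `ResolveError`, `check_in_group_by`, the return type, and the error wording are all invented for illustration; only the names `GroupByKey` and `resolve_scope_column` appear in the commit, and their real definitions may differ.

```rust
/// An optionally qualified column reference, e.g. `name` or `customers.name`.
#[derive(Debug, Clone, PartialEq)]
pub struct GroupByKey {
    pub qualifier: Option<String>,
    pub name: String,
}

/// Hypothetical in-scope table: its name (or alias) plus its column names.
pub struct TableSchema {
    pub name: String,
    pub columns: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum ResolveError {
    UnknownTable(String),
    UnknownColumn(String),
    Ambiguous(String),
}

/// Resolve a reference to (table index, column index) across the joined tables.
pub fn resolve_scope_column(
    key: &GroupByKey,
    tables: &[TableSchema],
) -> Result<(usize, usize), ResolveError> {
    let mut hits = Vec::new();
    let mut saw_table = false;
    for (ti, table) in tables.iter().enumerate() {
        if let Some(q) = &key.qualifier {
            if *q != table.name {
                continue; // a qualifier restricts the search to one table
            }
            saw_table = true;
        }
        if let Some(ci) = table.columns.iter().position(|c| *c == key.name) {
            hits.push((ti, ci));
        }
    }
    match (hits.len(), &key.qualifier) {
        (1, _) => Ok(hits[0]),
        (0, Some(q)) if !saw_table => Err(ResolveError::UnknownTable(q.clone())),
        (0, _) => Err(ResolveError::UnknownColumn(key.name.clone())),
        _ => Err(ResolveError::Ambiguous(key.name.clone())),
    }
}

/// A projected bare column is accepted only if it resolves to the same
/// (table, column) slot as one of the GROUP BY keys.
pub fn check_in_group_by(
    col: &GroupByKey,
    group_by: &[GroupByKey],
    tables: &[TableSchema],
) -> Result<(), String> {
    let target = resolve_scope_column(col, tables).map_err(|e| format!("{e:?}"))?;
    for key in group_by {
        if resolve_scope_column(key, tables) == Ok(target) {
            return Ok(());
        }
    }
    // Message text is illustrative, not the engine's exact wording.
    Err(format!("column '{}' must appear in GROUP BY", col.name))
}
```

Comparing resolved slots rather than raw strings is what lets `SELECT name ... GROUP BY customers.name` pass while an unqualified `id` shared by both tables is reported as ambiguous.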
5757

5858
## Statement dispatch
5959

@@ -176,6 +176,14 @@ contributes to its group), so the executor takes a separate path:
176176
against the *output* row by alias, bare column name, or aggregate
177177
display form), then LIMIT.
178178

179+
SQLR-6 made steps 2–6 scope-generic: the accumulator consumes
180+
`RowScope`s instead of `(table, rowid)` pairs, so a joined SELECT feeds
181+
its fully-joined, WHERE-filtered rows (each one a `JoinedScope`)
182+
through the exact same pipeline. `GROUP BY` keys and aggregate args
183+
carry an optional `t.` qualifier for disambiguation; NULL-padded
184+
outer-join rows group under a `NULL` key and are skipped by
185+
`COUNT(col)` like any other NULL.
186+
179187
Aggregate function names (`COUNT`/`SUM`/`AVG`/`MIN`/`MAX`) used in WHERE
180188
or any other scalar position get a friendly error redirecting the user
181189
to the projection list (`HAVING` is where post-aggregate filters go).
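
To make the scope-generic accumulator from the SQLR-6 paragraph above more concrete, here is a minimal sketch assuming a much smaller trait than the real one. `RowScope` and `aggregate_rows` are names from the commit, but the `lookup` signature, the `Value` enum, the `JoinedRow` stand-in for `JoinedScope`, and the `Counts` struct are guesses made for this example. It groups any stream of scopes by one key and tracks `COUNT(*)` and `COUNT(col)` side by side.

```rust
use std::collections::BTreeMap;

/// SQL value reduced to what the sketch needs; `Null` is an ordinary map key.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum Value {
    Null,
    Int(i64),
    Text(String),
}

/// Hypothetical trait shape: resolve an optionally qualified column to a value.
pub trait RowScope {
    fn lookup(&self, qualifier: Option<&str>, name: &str) -> Value;
}

/// One joined row as (table, column, value) triples.
pub struct JoinedRow(pub Vec<(&'static str, &'static str, Value)>);

impl RowScope for JoinedRow {
    fn lookup(&self, qualifier: Option<&str>, name: &str) -> Value {
        self.0
            .iter()
            .find(|(t, c, _)| *c == name && qualifier.map_or(true, |q| q == *t))
            .map(|(_, _, v)| v.clone())
            .unwrap_or(Value::Null)
    }
}

/// Per-group counters for COUNT(*) and COUNT(col).
#[derive(Debug, Default, PartialEq)]
pub struct Counts {
    pub star: i64,
    pub col: i64,
}

/// Group a stream of scopes by `key` and count the `arg` column per group.
pub fn aggregate_rows<S: RowScope>(
    rows: impl IntoIterator<Item = S>,
    key: (Option<&str>, &str),
    arg: (Option<&str>, &str),
) -> BTreeMap<Value, Counts> {
    let mut groups: BTreeMap<Value, Counts> = BTreeMap::new();
    for row in rows {
        let entry = groups.entry(row.lookup(key.0, key.1)).or_default();
        entry.star += 1; // COUNT(*) counts every row, NULL-padded or not
        if row.lookup(arg.0, arg.1) != Value::Null {
            entry.col += 1; // COUNT(col) skips NULLs
        }
    }
    groups
}

/// customers LEFT JOIN orders, GROUP BY customers.name, COUNT(orders.id).
pub fn demo() -> BTreeMap<Value, Counts> {
    let text = |s: &str| Value::Text(s.to_string());
    let rows = vec![
        JoinedRow(vec![("customers", "name", text("alice")), ("orders", "id", Value::Int(1))]),
        JoinedRow(vec![("customers", "name", text("alice")), ("orders", "id", Value::Int(2))]),
        // bob has no orders, so the outer join pads the orders side with NULL
        JoinedRow(vec![("customers", "name", text("bob")), ("orders", "id", Value::Null)]),
    ];
    aggregate_rows(rows, (Some("customers"), "name"), (Some("orders"), "id"))
}
```

Because the accumulator only sees `lookup`, a single-table scope could implement the same trait and reuse it unchanged, which is the point the paragraph makes about sharing one pipeline.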

docs/supported-sql.md

Lines changed: 3 additions & 5 deletions
@@ -204,7 +204,7 @@ COUNT([DISTINCT] <column>) -- counts non-NULL values, option
204204
- **Projection**: `*` (all columns in declaration order), a bare column list, or an explicit list mixing bare columns and aggregate calls. Each item can carry an optional `AS alias` (the alias becomes the output column header and is recognized by `ORDER BY`).
205205
- **`WHERE`**: any [expression](#expressions). Evaluated per row; NULL-as-false in WHERE context (three-valued logic collapsed to two-valued for filtering). Includes **`IS NULL`** / **`IS NOT NULL`** for explicit null tests, **`LIKE` / `NOT LIKE` / `ILIKE`** for pattern matching, and **`IN (list) / NOT IN (list)`** for set-membership against literal lists.
206206
- **`DISTINCT`**: `SELECT DISTINCT` deduplicates result rows after projection (and after aggregation, when both apply). `NULL` values compare equal to other `NULL`s for dedupe, matching SQL's DISTINCT semantic.
207-
- **`GROUP BY`**: one or more bare column names. Every non-aggregate item in the projection must appear in the `GROUP BY` list (the parser rejects the violation with a clear message). `GROUP BY <col>` without any aggregate behaves like an implicit `DISTINCT <col>`.
207+
- **`GROUP BY`**: one or more column names, optionally qualified (`GROUP BY customers.name`) — the qualifier disambiguates same-named columns across joined tables (SQLR-6). Every non-aggregate item in the projection must appear in the `GROUP BY` list (rejected with a clear message — by the parser for single-table queries, by the executor for joined ones, where resolving qualifiers needs the schemas). `GROUP BY <col>` without any aggregate behaves like an implicit `DISTINCT <col>`.
208208
- **`HAVING`** (SQLR-52): post-aggregation filter over the grouped output. `WHERE` filters rows before grouping; `HAVING` filters groups after aggregation. Requires `GROUP BY` (see [HAVING semantics](#having-semantics-sqlr-52)).
209209
- **Aggregates** (SQLR-3): `COUNT(*)`, `COUNT(col)`, `COUNT(DISTINCT col)`, `SUM(col)`, `AVG(col)`, `MIN(col)`, `MAX(col)`. `SUM` over an integer column stays `INTEGER` until a `REAL` input arrives or the running sum overflows `i64` (one-time promotion to `REAL`). `AVG` always returns `REAL` (or `NULL` on empty / all-NULL groups). `MIN` / `MAX` skip NULLs and use the same total order as `ORDER BY`. Aggregates over an empty table or empty group return `0` for `COUNT(*)` / `COUNT(col)` and `NULL` for the rest.
210210
- **`ORDER BY`**: single sort key, `ASC` (default) or `DESC`. For non-aggregating queries the key is any expression — including function calls — so KNN queries like `ORDER BY vec_distance_l2(embedding, [...]) LIMIT k` work end-to-end *(Phase 7b)*. For aggregating queries the key resolves against the *output* row by name: a bare identifier matches an alias or a `GROUP BY` column, and a function call like `COUNT(*)` matches an aggregate projection by its canonical display form. Sort key types must match across rows.
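
The `SUM` rule in the Aggregates bullet above (stay `INTEGER` until a `REAL` input arrives or the running sum overflows `i64`, then promote once) can be sketched as a small state machine. The `Num` and `Sum` types and the `step` / `sum_all` functions are hypothetical, and skipping `NULL` inputs is assumed here from standard SQL behaviour rather than stated for `SUM` in the text.

```rust
/// A non-NULL numeric input.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Num {
    Int(i64),
    Real(f64),
}

/// Running SUM state; `Empty` corresponds to a NULL result.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Sum {
    Empty,
    Int(i64),
    Real(f64),
}

impl Sum {
    /// Fold one input (None = NULL) into the running sum.
    pub fn step(self, input: Option<Num>) -> Sum {
        match (self, input) {
            (acc, None) => acc, // assumption: NULL inputs are skipped
            (Sum::Empty, Some(Num::Int(i))) => Sum::Int(i),
            (Sum::Empty, Some(Num::Real(r))) => Sum::Real(r),
            (Sum::Int(a), Some(Num::Int(i))) => match a.checked_add(i) {
                Some(s) => Sum::Int(s),
                // i64 overflow: promote to REAL and stay there
                None => Sum::Real(a as f64 + i as f64),
            },
            (Sum::Int(a), Some(Num::Real(r))) => Sum::Real(a as f64 + r),
            (Sum::Real(a), Some(Num::Int(i))) => Sum::Real(a + i as f64),
            (Sum::Real(a), Some(Num::Real(r))) => Sum::Real(a + r),
        }
    }
}

/// Sum a whole group of inputs.
pub fn sum_all(inputs: &[Option<Num>]) -> Sum {
    inputs.iter().fold(Sum::Empty, |acc, &x| acc.step(x))
}
```

Once the state is `Sum::Real` no arm leads back to `Sum::Int`, which is what makes the promotion one-time.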
@@ -237,12 +237,12 @@ conditions, plus `CROSS JOIN`:
237237
- **Self-joins** require an alias on at least one side: `FROM nodes AS p INNER JOIN nodes AS c ON p.id = c.parent_id`. Without one, you get a `duplicate table reference` error so qualifiers stay unambiguous.
238238
- **`WHERE` runs after joins.** A `WHERE right.col IS NULL` filter on a `LEFT JOIN` correctly returns left rows with no match (the standard "anti-join via outer-join" idiom).
239239
- **`ORDER BY` and `LIMIT`** apply to the fully joined row stream.
240+
- **Aggregates / `GROUP BY` / `DISTINCT` / `HAVING` over joins** (SQLR-6): the fully-joined, `WHERE`-filtered row stream feeds the same aggregation pipeline single-table queries use. `GROUP BY` keys may be qualified (`GROUP BY customers.name`) and must resolve unambiguously; NULL-padded outer-join rows group under a `NULL` key, and `COUNT(col)` skips their NULLs while `COUNT(*)` counts them. `SELECT DISTINCT` dedupes the projected join output (with `LIMIT` applied after the dedupe).
240241
- **Algorithm:** plain nested-loop join, O(N×M) per join level. Adequate for an embedded learning database; hash / merge joins on equi-join shapes are a future optimization.
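
The SQLR-6 bullet above notes that `SELECT DISTINCT` over a join applies `LIMIT` after the dedupe. A small sketch shows why the order matters; `distinct_then_limit` and the use of `Option<T>` cells (so that `NULL` equals `NULL` for dedupe purposes) are illustrative choices, not the engine's code.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Dedupe projected rows (first occurrence wins, order preserved), then apply LIMIT.
pub fn distinct_then_limit<T: Clone + Eq + Hash>(
    rows: impl IntoIterator<Item = Vec<T>>,
    limit: Option<usize>,
) -> Vec<Vec<T>> {
    let mut seen = HashSet::new();
    rows.into_iter()
        // `insert` returns false for a row that was already seen
        .filter(|row| seen.insert(row.clone()))
        // LIMIT runs on the deduplicated stream, not on the raw join output
        .take(limit.unwrap_or(usize::MAX))
        .collect()
}
```

For projected rows `[1], [1], [NULL], [NULL], [2]` with `LIMIT 2`, deduping first yields `[1], [NULL]`; truncating to two rows before the dedupe would have left only `[1]`.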
241242

242243
#### What's not supported in JOINs
243244

244245
- Comma-separated FROM lists (`FROM a, b`) — use an explicit `JOIN` / `CROSS JOIN` instead.
245-
- Aggregates / `GROUP BY` / `DISTINCT` *over* a join. The single-table aggregator is wired against one rowid stream; rewiring it for joined rows is a separate increment. Surfaces as a clean `NotImplemented` at parse time.
246246
- `fts_match` / `bm25_score` inside a JOIN expression. They need to look up an FTS index by column, which is single-table-bound today. Use them on a single-table SELECT first, or fold the FTS lookup into the FROM side.
247247

248248
### Index probing
@@ -281,7 +281,6 @@ SELECT dept FROM emp GROUP BY dept HAVING COUNT(*) > 1 AND SUM(salary) > 100;
281281
### What doesn't work
282282

283283
- **Comma-separated FROM lists** (`FROM a, b`) — use an explicit `JOIN` / `CROSS JOIN`. `INNER` / `LEFT` / `RIGHT` / `FULL OUTER` / `CROSS` with `ON` / `USING` / `NATURAL` are all supported (see [JOIN semantics](#join-semantics-sqlr-5))
284-
- **Aggregates** / **`GROUP BY`** / **`DISTINCT`** over a JOIN — pipe through a subquery once subqueries land
285284
- **Subqueries**, CTEs (`WITH`), views
286285
- **`HAVING` without `GROUP BY`** — the degenerate single-group form is rejected; `HAVING` with `GROUP BY` works (see [HAVING semantics](#having-semantics-sqlr-52))
287286
- **`DISTINCT`** on `SUM` / `AVG` / `MIN` / `MAX` (only `COUNT(DISTINCT col)` is supported)
@@ -725,8 +724,7 @@ A REPL launched with `sqlrite --readonly foo.sqlrite` (or `sqlrite::open_databas
725724
For context when you hit `NotImplemented`. See [Roadmap](roadmap.md) for when these land:
726725

727726
### Joins & composition
728-
- `INNER` / `LEFT` / `RIGHT` / `FULL OUTER` / `CROSS JOIN` with `ON` / `USING (...)` / `NATURAL` all work (SQLR-5). Comma-separated FROM joins (`FROM a, b`) don't — use an explicit `JOIN` / `CROSS JOIN`
729-
- Aggregates / `GROUP BY` / `DISTINCT` *over* a JOIN — pipe through a subquery once subqueries land
727+
- `INNER` / `LEFT` / `RIGHT` / `FULL OUTER` / `CROSS JOIN` with `ON` / `USING (...)` / `NATURAL` all work (SQLR-5), and aggregates / `GROUP BY` / `DISTINCT` / `HAVING` compose over join results (SQLR-6). Comma-separated FROM joins (`FROM a, b`) don't — use an explicit `JOIN` / `CROSS JOIN`
730728
- `fts_match` / `bm25_score` inside a JOIN expression — single-table-bound today
731729
- Subqueries (scalar, `IN (SELECT ...)`, correlated)
732730
- CTEs (`WITH`), recursive CTEs

docs/usage.md

Lines changed: 1 addition & 1 deletion
@@ -74,7 +74,7 @@ Quick hits worth knowing when you're working at the REPL:
7474
- **Arithmetic stays honest.** Integer-only operations stay integer; any `REAL` operand promotes to `f64`; divide-by-zero is a typed runtime error, never a panic.
7575
- **NULL follows three-valued logic.** `NULL = NULL` is unknown (not true) — treated as false in `WHERE`. Use `IS NULL` / `IS NOT NULL` for explicit null tests, e.g. `SELECT id FROM t WHERE qty IS NULL;`.
7676
- **Identifiers are case-sensitive** (table / column names; no normalization), but keywords aren't. String literals preserve case.
77-
- **Not yet supported**: joins, subqueries, `GROUP BY` / aggregates, `DISTINCT`, `LIKE` / `IN`, projection expressions, column aliases, `OFFSET`, multi-column `ORDER BY`, savepoints, `ALTER TABLE`, `DROP TABLE`, `DROP INDEX`. See the [full list in the reference](supported-sql.md#not-yet-supported).
77+
- **Not yet supported**: subqueries, CTEs, views, comma joins (`FROM a, b`), projection expressions beyond aggregate calls, `OFFSET`, multi-column `ORDER BY`, savepoints. See the [full list in the reference](supported-sql.md#not-yet-supported).
7878

7979
## History
8080
