Commit d285c38

joaoh82 and claude authored
feat(sql): HAVING — post-aggregation filter (SQLR-52) (#161)
WHERE filters rows before grouping; HAVING filters groups after aggregation. Closes the Phase 9e aggregates story.

Parser (src/sql/parser/select.rs):
- SelectQuery grows `having: Option<Expr>`, passed through raw from sqlparser like WHERE. parse_aggregate_call / AggregateFn::from_name exposed pub(crate) for the executor's HAVING lowering.
- HAVING without GROUP BY rejected with a typed NotImplemented (the degenerate single-group form SQLite allows isn't worth the executor branch in v0). HAVING + JOIN stays covered by the existing GROUP-BY-over-JOIN rejection (SQLR-6 is the follow-up).

Executor (src/sql/executor.rs):
- lower_having_expr rewrites aggregate calls in the HAVING tree to identifiers naming their output slot (SUM(salary) → "SUM(salary)"), registering hidden trailing projection slots for aggregates and GROUP BY keys referenced only in HAVING so aggregate_rows computes them alongside the visible ones.
- New GroupRowScope resolves those identifiers against the group's output row through the shared expression evaluator: comparisons, AND/OR/NOT, arithmetic, IS NULL, LIKE, IN all work, with the same NULL-as-false collapse WHERE applies (design-decisions §13).
- filter_groups_by_having runs after aggregation, before DISTINCT / ORDER BY / LIMIT; hidden slots are stripped after filtering.

Tests: +13 executor tests (612 → 625 in the engine suite): COUNT/SUM thresholds, aggregate alias, aggregate-only-in-HAVING, group-key-only-in-HAVING, compound AND, ORDER BY/LIMIT composition, all-groups-excluded, NULL-aggregate collapse, lowercase call form, no-GROUP-BY rejection, out-of-scope column rejection, all four JOIN flavors rejected cleanly.

Docs: supported-sql.md (HAVING semantics section + syntax block), sql-engine.md aggregation pipeline, roadmap.md shipped entry, README, design-decisions §13 wording, web docs page / sql-ref / roadmap.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
1 parent 2a04fc1 commit d285c38
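
The lowering step described in the commit message can be pictured with a small standalone sketch. This is not the code from src/sql/executor.rs: the `Expr` enum, the `Slots` struct, and `lower_having` are hypothetical stand-ins (the engine itself works on sqlparser's expression type), and the sketch only handles aggregate calls, not GROUP BY keys referenced only in HAVING.

```rust
// Hypothetical, simplified expression tree (the engine itself uses sqlparser's Expr).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Ident(String),
    Int(i64),
    /// Aggregate call such as SUM(salary); `arg` is "*" for COUNT(*).
    Agg { func: String, arg: String },
    Gt(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Output slot names: the visible projection first, hidden slots appended after it.
struct Slots {
    names: Vec<String>,
    visible: usize,
}

impl Slots {
    fn new(visible: &[&str]) -> Self {
        Slots {
            names: visible.iter().map(|s| s.to_string()).collect(),
            visible: visible.len(),
        }
    }

    /// Return the existing slot name (case-insensitive match) or append a hidden slot.
    fn ensure(&mut self, name: &str) -> String {
        if let Some(existing) = self.names.iter().find(|n| n.eq_ignore_ascii_case(name)) {
            return existing.clone();
        }
        self.names.push(name.to_string());
        name.to_string()
    }

    /// Slots that exist only because HAVING referenced them.
    fn hidden(&self) -> &[String] {
        &self.names[self.visible..]
    }
}

/// Rewrite aggregate calls to identifiers naming their output slot.
fn lower_having(expr: &Expr, slots: &mut Slots) -> Expr {
    match expr {
        Expr::Agg { func, arg } => {
            // Canonical display form, e.g. count(*) becomes "COUNT(*)".
            let display = format!("{}({})", func.to_ascii_uppercase(), arg);
            Expr::Ident(slots.ensure(&display))
        }
        Expr::Gt(l, r) => Expr::Gt(
            Box::new(lower_having(l, slots)),
            Box::new(lower_having(r, slots)),
        ),
        Expr::And(l, r) => Expr::And(
            Box::new(lower_having(l, slots)),
            Box::new(lower_having(r, slots)),
        ),
        other => other.clone(),
    }
}
```

For `SELECT dept ... HAVING count(*) > 1 AND SUM(salary) > 100`, starting from the visible slot `dept`, this sketch appends two hidden slots, `COUNT(*)` and `SUM(salary)`, and the lowered tree refers to them by name.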

10 files changed

Lines changed: 488 additions & 24 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -332,7 +332,7 @@ Lockstep versioning — one dispatch bumps every product to the same `vX.Y.Z`. T
 - [ ] *(deferred to Phase 8)* Full-text search with BM25 + hybrid retrieval

 **Possible extras** *(no committed phase)*
-- `HAVING`, `IN (subquery)`, `BETWEEN`, `GLOB` / `REGEXP`, `GROUP_CONCAT`, window functions
+- `IN (subquery)`, `BETWEEN`, `GLOB` / `REGEXP`, `GROUP_CONCAT`, window functions
 - Composite and expression indexes (with cost analysis)
 - Alternate storage engines — LSM/SSTable for write-heavy workloads alongside the B-Tree
 - Benchmarks against SQLite

docs/design-decisions.md

Lines changed: 1 addition & 1 deletion
@@ -398,7 +398,7 @@ A hidden primary that the registry itself owns sidesteps both problems: every `*

 **Decision.** In [`eval_predicate`](../src/sql/executor.rs), a `WHERE` expression evaluating to `NULL` is treated as `false` — the row does *not* match.

-**Why.** Matches SQL's three-valued logic in spirit: `NULL` propagates through comparisons, and a `WHERE` requires a definitely-true predicate. Doing strict 3VL would mean threading an explicit `Option<bool>` / "unknown" state through the evaluator. For a query surface that doesn't have `HAVING` or aggregate post-filters, implicit coercion to `false` at the `WHERE` boundary is equivalent for every statement we execute.
+**Why.** Matches SQL's three-valued logic in spirit: `NULL` propagates through comparisons, and a `WHERE` requires a definitely-true predicate. Doing strict 3VL would mean threading an explicit `Option<bool>` / "unknown" state through the evaluator. Implicit coercion to `false` at the filter boundary is equivalent for every statement we execute; `HAVING` (SQLR-52) reuses the same collapse — a group whose predicate evaluates to `NULL` is dropped.

 **Cost.** Diverges subtly from strict SQL on edge cases involving `NULL` through `NOT` / `AND` / `OR`. If this matters later, the evaluator can be upgraded to 3VL without touching callers.

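The NULL-as-false collapse in the hunk above can be illustrated with a short sketch. It models strict three-valued logic with `Option<bool>` and collapses only at the filter boundary. The names `Tri`, `gt`, `not3`, `and3`, and `keeps` are hypothetical and are not taken from the engine's evaluator.

```rust
/// SQL truth value: Some(true), Some(false), or None for NULL / unknown.
type Tri = Option<bool>;

/// `a > b` where either side may be NULL; NULL propagates through the comparison.
fn gt(a: Option<i64>, b: Option<i64>) -> Tri {
    match (a, b) {
        (Some(x), Some(y)) => Some(x > y),
        _ => None,
    }
}

/// Strict three-valued NOT: NOT NULL is still NULL.
fn not3(a: Tri) -> Tri {
    a.map(|b| !b)
}

/// Strict three-valued AND: a definite false wins over NULL.
fn and3(a: Tri, b: Tri) -> Tri {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// Collapse at the filter boundary: only a definite true keeps the row or group.
fn keeps(predicate: Tri) -> bool {
    predicate == Some(true)
}
```

Under this model a group whose `SUM(salary)` is `NULL` fails `SUM(salary) > 100` because `keeps(gt(None, Some(100)))` is false. The "Cost" paragraph's caveat shows up if the collapse happens too early: `keeps(not3(None))` is false, whereas negating an already-collapsed value, `!keeps(None)`, is true.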
docs/roadmap.md

Lines changed: 1 addition & 1 deletion
@@ -734,7 +734,7 @@ Promotes the plan to a canonical user-facing reference at [`docs/concurrent-writ
 The remaining items — actually open, not retroactively rewritten:

 - Subqueries (scalar, `IN (SELECT ...)`, correlated) and CTEs (`WITH`, recursive)
-- `HAVING` (post-aggregation filter)
+- ~~`HAVING` (post-aggregation filter)~~ ✅ Shipped (SQLR-52) — group-row filter after aggregation; references GROUP BY keys, aggregate aliases, and direct aggregate calls (hidden-slot computation for HAVING-only aggregates). `HAVING` without `GROUP BY` stays rejected in v0.
 - `CASE WHEN … THEN … END`, `BETWEEN`, `GLOB`, `REGEXP`, `LIKE … ESCAPE '<char>'`
 - Aggregates / `GROUP BY` / `DISTINCT` *over* joins (needs a single executor pass that knows about multiple input streams)
 - Multi-column / expression `ORDER BY`, `OFFSET`, `NULLS FIRST/LAST`

docs/sql-engine.md

Lines changed: 9 additions & 3 deletions
@@ -166,14 +166,20 @@ contributes to its group), so the executor takes a separate path:
 4. Emit one output row per group, in projection order — bare-column
 slots emit the captured group-key value, aggregate slots emit
 `AggState::finalize()`.
-5. Apply DISTINCT (post-projection dedup), then ORDER BY (resolved
+5. Apply `HAVING` (SQLR-52): the expression is lowered once — aggregate
+calls become identifiers naming their output slot, with *hidden*
+trailing slots appended for aggregates / GROUP BY keys referenced
+only in HAVING — then evaluated per group row through a
+`GroupRowScope` using the same expression evaluator as WHERE
+(NULL-as-false). Hidden slots are stripped after filtering.
+6. Apply DISTINCT (post-projection dedup), then ORDER BY (resolved
 against the *output* row by alias, bare column name, or aggregate
 display form), then LIMIT.

 Aggregate function names (`COUNT`/`SUM`/`AVG`/`MIN`/`MAX`) used in WHERE
 or any other scalar position get a friendly error redirecting the user
-to the projection list (since `HAVING` isn't supported yet). DISTINCT
-on `SUM`/`AVG`/`MIN`/`MAX` is rejected at parse time; only
+to the projection list (`HAVING` is where post-aggregate filters go).
+DISTINCT on `SUM`/`AVG`/`MIN`/`MAX` is rejected at parse time; only
 `COUNT(DISTINCT col)` is in v1.

 `LIKE` / `ILIKE` use a hand-rolled iterative two-pointer matcher in

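The ordering in steps 5 and 6 (filter groups, strip hidden slots, then sort and limit) can be sketched as one small function. This is a simplified model, not the executor's code: `Row` and `having_order_limit` are hypothetical names, cells are plain `Option<i64>`, DISTINCT is omitted, and ORDER BY is fixed to the first output column, ascending.

```rust
/// One aggregated output row; trailing cells may be hidden HAVING-only slots.
type Row = Vec<Option<i64>>;

/// Apply HAVING, strip hidden slots, then ORDER BY the first column and LIMIT.
fn having_order_limit(
    mut groups: Vec<Row>,
    visible: usize,
    having: impl Fn(&Row) -> Option<bool>,
    limit: usize,
) -> Vec<Row> {
    // HAVING: keep a group only when the predicate is definitely true
    // (a NULL result, modeled as None, drops the group).
    groups.retain(|row| having(row) == Some(true));

    // Hidden slots were only needed by the predicate; drop them before output.
    for row in &mut groups {
        row.truncate(visible);
    }

    // ORDER BY the first visible column, then LIMIT.
    groups.sort_by(|a, b| a[0].cmp(&b[0]));
    groups.truncate(limit);
    groups
}
```

With rows shaped as `[dept_id, hidden COUNT(*)]`, a predicate such as `|row| row[1].map(|c| c > 1)` drops groups whose count is 1 or NULL before the hidden count column is removed, so the sort and limit only ever see the visible projection.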
docs/supported-sql.md

Lines changed: 20 additions & 2 deletions
@@ -184,6 +184,7 @@ FROM <table> [AS <alias>]
 [{INNER | LEFT [OUTER] | RIGHT [OUTER] | FULL [OUTER]} JOIN <table> [AS <alias>] ON <expr>]*
 [WHERE <expr>]
 [GROUP BY <col>[, <col>, ...]]
+[HAVING <expr>]
 [ORDER BY <expr> [ASC|DESC]]
 [LIMIT <non-negative-integer>];
 ```
@@ -204,6 +205,7 @@ COUNT([DISTINCT] <column>) -- counts non-NULL values, option
 - **`WHERE`**: any [expression](#expressions). Evaluated per row; NULL-as-false in WHERE context (three-valued logic collapsed to two-valued for filtering). Includes **`IS NULL`** / **`IS NOT NULL`** for explicit null tests, **`LIKE` / `NOT LIKE` / `ILIKE`** for pattern matching, and **`IN (list) / NOT IN (list)`** for set-membership against literal lists.
 - **`DISTINCT`**: `SELECT DISTINCT` deduplicates result rows after projection (and after aggregation, when both apply). `NULL` values compare equal to other `NULL`s for dedupe, matching SQL's DISTINCT semantic.
 - **`GROUP BY`**: one or more bare column names. Every non-aggregate item in the projection must appear in the `GROUP BY` list (the parser rejects the violation with a clear message). `GROUP BY <col>` without any aggregate behaves like an implicit `DISTINCT <col>`.
+- **`HAVING`** (SQLR-52): post-aggregation filter over the grouped output. `WHERE` filters rows before grouping; `HAVING` filters groups after aggregation. Requires `GROUP BY` (see [HAVING semantics](#having-semantics-sqlr-52)).
 - **Aggregates** (SQLR-3): `COUNT(*)`, `COUNT(col)`, `COUNT(DISTINCT col)`, `SUM(col)`, `AVG(col)`, `MIN(col)`, `MAX(col)`. `SUM` over an integer column stays `INTEGER` until a `REAL` input arrives or the running sum overflows `i64` (one-time promotion to `REAL`). `AVG` always returns `REAL` (or `NULL` on empty / all-NULL groups). `MIN` / `MAX` skip NULLs and use the same total order as `ORDER BY`. Aggregates over an empty table or empty group return `0` for `COUNT(*)` / `COUNT(col)` and `NULL` for the rest.
 - **`ORDER BY`**: single sort key, `ASC` (default) or `DESC`. For non-aggregating queries the key is any expression — including function calls — so KNN queries like `ORDER BY vec_distance_l2(embedding, [...]) LIMIT k` work end-to-end *(Phase 7b)*. For aggregating queries the key resolves against the *output* row by name: a bare identifier matches an alias or a `GROUP BY` column, and a function call like `COUNT(*)` matches an aggregate projection by its canonical display form. Sort key types must match across rows.
 - **`LIMIT`**: non-negative integer literal. `LIMIT 0` is valid (returns zero rows). When `DISTINCT` is in play, `LIMIT` is applied after deduplication so it counts unique rows.
@@ -260,12 +262,28 @@ The executor includes a tiny optimizer: if the `WHERE` is exactly `<indexed_col>
 - Three-valued logic: if the LHS is `NULL`, the result is `NULL`; if the RHS list contains a `NULL` and no other entry matches, the result is `NULL`. In a `WHERE` both cases collapse to "row excluded", matching SQLite.
 - `IN (subquery)`, `IN UNNEST(...)`, and `BETWEEN` are not supported yet.

+### `HAVING` semantics (SQLR-52)
+
+- Post-aggregation filter: groups whose `HAVING` expression evaluates to false or `NULL` are dropped (NULL-as-false, the same three-valued-logic collapse `WHERE` applies).
+- **Requires `GROUP BY`.** The degenerate no-`GROUP-BY` single-group form SQLite allows is rejected with a clear `NotImplemented` — use `WHERE` for row-level filters.
+- **What's in scope:** the `GROUP BY` key columns (their per-group values), aggregate output columns by alias (`SUM(salary) AS total … HAVING total > 100`), and aggregate calls written out directly (`HAVING COUNT(*) > 1`, matched case-insensitively by canonical display form).
+- Aggregates and `GROUP BY` keys referenced **only** in `HAVING` work too — `SELECT dept FROM emp GROUP BY dept HAVING COUNT(*) > 1` computes the count without projecting it.
+- Any other column reference is an error (matches SQLite: `HAVING` sees the grouped output, not the raw rows).
+- The expression surface is the same as `WHERE`: comparisons, `AND` / `OR` / `NOT`, arithmetic, `IS [NOT] NULL`, `LIKE`, `IN (list)`.
+- Runs after `WHERE` + aggregation, before `DISTINCT`, `ORDER BY`, and `LIMIT`.
+
+```sql
+SELECT dept, COUNT(*) FROM emp GROUP BY dept HAVING COUNT(*) > 1;
+SELECT dept, SUM(salary) AS total FROM emp GROUP BY dept HAVING total > 100000;
+SELECT dept FROM emp GROUP BY dept HAVING COUNT(*) > 1 AND SUM(salary) > 100;
+```
+
 ### What doesn't work

 - **Comma-separated FROM lists** (`FROM a, b`) — use an explicit `JOIN` / `CROSS JOIN`. `INNER` / `LEFT` / `RIGHT` / `FULL OUTER` / `CROSS` with `ON` / `USING` / `NATURAL` are all supported (see [JOIN semantics](#join-semantics-sqlr-5))
 - **Aggregates** / **`GROUP BY`** / **`DISTINCT`** over a JOIN — pipe through a subquery once subqueries land
 - **Subqueries**, CTEs (`WITH`), views
-- **`HAVING`** — pre-aggregation `WHERE` works; post-aggregation filtering does not yet
+- **`HAVING` without `GROUP BY`** — the degenerate single-group form is rejected; `HAVING` with `GROUP BY` works (see [HAVING semantics](#having-semantics-sqlr-52))
 - **`DISTINCT`** on `SUM` / `AVG` / `MIN` / `MAX` (only `COUNT(DISTINCT col)` is supported)
 - **`GROUP BY` on expressions** — bare column names only in v1
 - **`LIKE … ESCAPE '<char>'`**, **`IN (subquery)`**, **`BETWEEN`**, **`GLOB`**, **`REGEXP`**
@@ -715,7 +733,7 @@ For context when you hit `NotImplemented`. See [Roadmap](roadmap.md) for when th
 - Views (`CREATE VIEW`)

 ### Aggregation & grouping
-- `HAVING` — pre-aggregation `WHERE` works; post-aggregation filtering doesn't yet
+- `HAVING` without `GROUP BY` — the degenerate single-group form; `HAVING` over grouped output works (SQLR-52)
 - `DISTINCT` on `SUM` / `AVG` / `MIN` / `MAX` (only `COUNT(DISTINCT col)` is supported)
 - `GROUP BY` on expressions — bare column names only
 - Other aggregate functions (`GROUP_CONCAT`, `STRING_AGG`, …) — only `COUNT` / `SUM` / `AVG` / `MIN` / `MAX` are wired

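The scoping rule documented in the HAVING semantics section (group keys, aliases, and aggregate display forms resolve, anything else is an error) can be sketched as a name lookup over one group's output row. `GroupScope` and `ScopeError` are hypothetical names chosen for illustration; the commit's actual `GroupRowScope` may be structured differently, and the case-insensitive match on every name is an assumption of this sketch.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum ScopeError {
    /// The identifier is not a GROUP BY key, alias, or aggregate display form.
    OutOfScope(String),
}

/// Resolves identifiers in a lowered HAVING expression against one group row.
struct GroupScope<'a> {
    /// Lowercased slot name mapped to its position in the row.
    index: HashMap<String, usize>,
    row: &'a [Option<i64>],
}

impl<'a> GroupScope<'a> {
    fn new(slot_names: &[&str], row: &'a [Option<i64>]) -> Self {
        let index: HashMap<String, usize> = slot_names
            .iter()
            .enumerate()
            .map(|(i, name)| (name.to_ascii_lowercase(), i))
            .collect();
        GroupScope { index, row }
    }

    /// Look up a slot by name; unknown names are rejected rather than read from raw rows.
    fn resolve(&self, name: &str) -> Result<Option<i64>, ScopeError> {
        self.index
            .get(&name.to_ascii_lowercase())
            .map(|&i| self.row[i])
            .ok_or_else(|| ScopeError::OutOfScope(name.to_string()))
    }
}
```

Given slots `dept`, `total`, and `COUNT(*)`, a lookup of `count(*)` or `total` returns the group's value, while a raw column such as `salary` yields `ScopeError::OutOfScope`, mirroring the "any other column reference is an error" rule.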