Commit 9671dbf

Phase 8: query protocol — hardened wire decode, dual-surface AST, logical-plan IR, lowering, result/cursor encoding
1 parent e55d70c commit 9671dbf

17 files changed

Lines changed: 4126 additions & 26 deletions

CHANGELOG.md

Lines changed: 42 additions & 0 deletions
@@ -8,6 +8,48 @@ under a category (`Added` / `Changed` / `Fixed` / `Removed` / `Security`).
 
 ## [Unreleased]
 
+### Phase 8 — Query protocol, surfaces & IR
+
+#### Added
+- `proto`: the hardened wire layer — MessagePack bytes ⇄ a bounded `Doc`
+tree under `DecodeLimits` (max message size checked up front, max depth
+via explicit counter, max node count charged before any allocation);
+rejects the reserved byte `0xC1`, ext types, non-string/duplicate map
+keys, invalid UTF-8, over-`i64` ints, and trailing bytes (`DECISIONS.md`
+D20).
+- `proto`: the typed query AST for **both surfaces** — pipeline stages
+(scan/match/join/group/sort/project/distinct/limit/cursor) and the clause
+form (from/joins/where/group_by/having/order_by/select/distinct/limit/
+offset/cursor) — the full §5.2 expression grammar, and DML
+(insert/update/delete/transaction/explain) with faithful selector
+decoding (`where`/`{all:true}`/absent) for the Phase 9 validator.
+- `proto`: strict grammar enforcement at decode — unknown ops, stages,
+expression nodes, and fields are typed `Validation` errors (queries are
+data, never code); plus the canonical AST → wire encoding (decode ∘
+encode = identity).
+- `proto`: protocol versioning — optional request `v` (missing = 1, other
+values rejected); results always carry `v:1` (`DECISIONS.md` D21).
+- `proto`: the logical-plan IR (`Plan`): Scan, IndexScan (planner-only),
+Filter, Join, Aggregate, Project, Distinct, Sort, Limit, Cursor.
+- `proto`: result encoding per `SPEC.md` §5.6 (`{v, ok, columns, rows,
+cursor, applied, affected}`), the error-result shape (`{v, ok:false,
+code, error}` with the §9 category as `code`), and the opaque cursor-token
+envelope `[version][crc32c][payload]` with tamper rejection.
+- `query`: surface → IR lowering — the pipeline folds directly into a
+`Plan`; the clause form desugars into its fixed-order pipeline and reuses
+the same fold, making clause↔pipeline equivalence true by construction;
+select-list aggregates become named `group` outputs (`DECISIONS.md` D22).
+- Exit-criteria tests: decode round-trips for every grammar node; oversized/
+over-deep/over-budget messages rejected; a seeded 200k-input adversarial
+fuzz suite over the decoder (random bytes + corpus mutations + hostile
+container shapes, `DECISIONS.md` D23); clause↔pipeline equivalence on the
+SPEC worked example plus 2 000 random generated pairs.
+
+#### Changed
+- `common`: gained the in-house CRC32C routine; `pager` now re-exports it
+(`pager::crc32c` unchanged for callers) so `proto` can checksum cursor
+tokens without depending on storage crates.
+
 ### Phase 7 — Indexing
 
 #### Added

DECISIONS.md

Lines changed: 86 additions & 0 deletions
@@ -5,6 +5,92 @@ Per `PLAN.md` §1 rule 6, every resolution of an ambiguity or deviation from
 
 ---
 
+## D23 — Phase 8 fuzzing is a seeded in-house harness; libFuzzer waits for Phase 11
+
+**Phase:** 8 · **Status:** accepted
+
+`PLAN.md` Phase 8's exit demands "fuzzer clean on adversarial input", while
+the dedicated fuzz harness (cargo-fuzz/libFuzzer) is a Phase 11 deliverable.
+Pulling `cargo-fuzz` forward would add a nightly-only toolchain and external
+deps that D4's reasoning avoids.
+
+**Decision:** Phase 8 ships a deterministic, seeded adversarial-input suite
+(`proto/tests/proto_fuzz.rs`): 100k random byte strings, 100k corpus
+mutations (bit flips, truncations, overwrites, splices), and hand-built
+hostile container shapes (depth bombs, claimed-giant containers,
+str32/bin32 length lies). Every decoded request must also survive a
+canonical re-encode/re-decode round-trip. Coverage-guided fuzzing arrives
+with Phase 11's harness; this suite stays as the fast deterministic gate.
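A minimal sketch of how a seeded mutation generator of this kind could look. It is not the code in `proto/tests/proto_fuzz.rs`: the `SplitMix64` generator and the `mutate` and `generate` helpers are hypothetical names chosen for illustration, and only the four mutation kinds (bit flip, truncation, overwrite, splice) come from the decision text.

```rust
/// SplitMix64: a tiny deterministic PRNG (hypothetical stand-in for the
/// suite's seeded RNG; the same seed always yields the same stream).
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Index in `0..n`; `n` must be non-zero. Modulo bias is fine for fuzzing.
    fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Apply one randomly chosen mutation to `input`, using `donor` for splices.
fn mutate(rng: &mut SplitMix64, input: &[u8], donor: &[u8]) -> Vec<u8> {
    let mut out = input.to_vec();
    if out.is_empty() {
        return out;
    }
    match rng.below(4) {
        0 => {
            // Bit flip: toggle a single bit of a single byte.
            let i = rng.below(out.len());
            let bit = rng.below(8);
            out[i] ^= 1u8 << bit;
        }
        1 => {
            // Truncation: keep a strictly shorter prefix.
            let keep = rng.below(out.len());
            out.truncate(keep);
        }
        2 => {
            // Overwrite: replace one byte with a random value.
            let i = rng.below(out.len());
            out[i] = rng.next_u64() as u8;
        }
        _ => {
            // Splice: keep a prefix, then append a suffix of the donor.
            let cut = rng.below(out.len() + 1);
            let start = rng.below(donor.len() + 1);
            out.truncate(cut);
            out.extend_from_slice(&donor[start..]);
        }
    }
    out
}

/// Generate `count` mutated inputs from `corpus`; deterministic in `seed`.
fn generate(seed: u64, count: usize, corpus: &[&[u8]]) -> Vec<Vec<u8>> {
    assert!(!corpus.is_empty(), "corpus must contain at least one input");
    let mut rng = SplitMix64(seed);
    (0..count)
        .map(|_| {
            let base = corpus[rng.below(corpus.len())];
            let donor = corpus[rng.below(corpus.len())];
            mutate(&mut rng, base, donor)
        })
        .collect()
}
```

Because every choice flows from the seed, a failing input can be reproduced by rerunning with the same seed and iteration count, which is what lets a suite like this act as a deterministic gate.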
+
+## D22 — Clause desugaring semantics: aggregates, distinct, having
+
+**Phase:** 8 · **Status:** accepted
+
+`SPEC.md` §5.4 fixes the clause order (FROM → WHERE → GROUP → HAVING →
+PROJECT → ORDER → LIMIT) but leaves three details open. The clause lowerer
+desugars into a pipeline and reuses the pipeline fold, so equivalence is by
+construction; these rules define the desugaring:
+
+- **Aggregates in the clause `select` list** become named `group` outputs:
+the alias names the output (`{as:["spent",{sum:…}]}` → agg `spent`), an
+unaliased aggregate gets its function name (`{count:1}` → `count`), and a
+name collision is a typed error. v1 allows aggregates only as the *whole*
+select item (no `{add:[{sum:x},1]}`); grouping is implied by `group_by`
+*or* select-list aggregates.
+- **`distinct` is an IR operator** (`Plan::Distinct`), placed after PROJECT
+and before ORDER in the clause order. `ARCHITECTURE.md` §3.7's operator
+list omits it though the stage exists in `SPEC.md` §5.3/§11; an explicit
+operator beats desugaring into `Aggregate`, which would entangle lowering
+with planning. (Addition, not contradiction — flagged per rule 6.)
+- **`having` without grouping** is rejected at lowering; `{distinct:false}`
+is an explicit no-op stage. Structural shape (pipeline starts at a `scan`,
+later sources arrive via `join`, `group` outputs really are aggregate
+calls) is also enforced at lowering — names/types/§6 safety stay in the
+Phase 9 validator.
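The desugaring rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the lowerer in the `query` crate: `Stage`, `SelectItem`, `Clause`, `LowerError`, and `desugar` are hypothetical names, expressions are reduced to plain strings, joins, offset, and cursor are left out, `having` is assumed to become a second `match` stage after `group`, and collisions are checked only among aggregate names.

```rust
/// Simplified pipeline stages (hypothetical; expressions are plain strings).
#[derive(Debug, Clone, PartialEq, Eq)]
enum Stage {
    Scan(String),
    Match(String),
    Group { keys: Vec<String>, aggs: Vec<String> },
    Project(Vec<String>),
    Distinct(bool),
    Sort(Vec<String>),
    Limit(u64),
}

#[derive(Debug, Clone, PartialEq, Eq)]
enum SelectItem {
    Column(String),
    /// An aggregate as the whole select item, e.g. `sum` aliased to `spent`.
    Agg { func: String, alias: Option<String> },
}

/// Clause form; `filter` stands in for `where`, which is a Rust keyword.
#[derive(Debug, Clone, Default)]
struct Clause {
    from: String,
    filter: Option<String>,
    group_by: Vec<String>,
    having: Option<String>,
    select: Vec<SelectItem>,
    distinct: Option<bool>,
    order_by: Vec<String>,
    limit: Option<u64>,
}

#[derive(Debug, PartialEq, Eq)]
enum LowerError {
    HavingWithoutGrouping,
    DuplicateOutputName(String),
}

/// Desugar a clause into its fixed-order pipeline:
/// FROM, WHERE, GROUP, HAVING, PROJECT, DISTINCT, ORDER, LIMIT.
fn desugar(clause: &Clause) -> Result<Vec<Stage>, LowerError> {
    let mut aggs: Vec<String> = Vec::new();
    let mut outputs: Vec<String> = Vec::new();
    for item in &clause.select {
        match item {
            SelectItem::Column(name) => outputs.push(name.clone()),
            SelectItem::Agg { func, alias } => {
                // The alias names the output; otherwise use the function name.
                let name = alias.clone().unwrap_or_else(|| func.clone());
                if aggs.contains(&name) {
                    return Err(LowerError::DuplicateOutputName(name));
                }
                aggs.push(name.clone());
                outputs.push(name);
            }
        }
    }
    // Grouping is implied by `group_by` or by select-list aggregates.
    let grouped = !clause.group_by.is_empty() || !aggs.is_empty();
    if clause.having.is_some() && !grouped {
        return Err(LowerError::HavingWithoutGrouping);
    }
    let mut stages = vec![Stage::Scan(clause.from.clone())];
    if let Some(pred) = &clause.filter {
        stages.push(Stage::Match(pred.clone()));
    }
    if grouped {
        stages.push(Stage::Group { keys: clause.group_by.clone(), aggs });
    }
    if let Some(pred) = &clause.having {
        stages.push(Stage::Match(pred.clone()));
    }
    stages.push(Stage::Project(outputs));
    if let Some(flag) = clause.distinct {
        stages.push(Stage::Distinct(flag));
    }
    if !clause.order_by.is_empty() {
        stages.push(Stage::Sort(clause.order_by.clone()));
    }
    if let Some(n) = clause.limit {
        stages.push(Stage::Limit(n));
    }
    Ok(stages)
}
```

Since the clause form only ever produces a stage list, whatever fold turns a pipeline into a `Plan` applies to it unchanged, which is the sense in which the equivalence holds by construction.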
+
+## D21 — Protocol version field and cursor-token envelope
+
+**Phase:** 8 · **Status:** accepted
+
+`PLAN.md` Phase 8 requires a protocol version field and a keyset cursor
+token; `SPEC.md` §5's grammar shows neither a version nor the token's bytes.
+
+**Decision:**
+- **Version:** requests may carry a top-level `v` (int). Missing means
+version 1; any other value is a typed `UnsupportedVersion` error. Results
+always carry `v:1`. Error results are `{v, ok:false, code, error}` with
+`code` the stable `SPEC.md` §9 category identifier (the shape §5.6 leaves
+implicit for the failure case).
+- **Cursor token:** opaque bytes `[version 0x01][crc32c(payload) BE][payload]`.
+The payload (keyset position) is defined with the executor in Phase 9; the
+envelope is fixed now so a truncated or mangled token is a clean
+`Validation` error instead of a nonsense seek. CRC32C moved from `pager`
+to `common` (same in-house routine, D3) so `proto` shares it without
+depending on storage crates.
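The envelope layout can be sketched as below. This is an illustration, not the `proto` implementation: `seal_token`, `open_token`, and `TokenError` are hypothetical names, and the bitwise `crc32c` here is a self-contained stand-in for the table-driven `common::crc32c`. In the real crate, each of these failures would presumably surface under the `Validation` category.

```rust
/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F6_3B78.
/// Stand-in for `common::crc32c`, which is table-driven.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

const TOKEN_VERSION: u8 = 0x01;
/// One version byte plus four big-endian checksum bytes.
const HEADER_LEN: usize = 5;

#[derive(Debug, PartialEq, Eq)]
enum TokenError {
    Truncated,
    UnsupportedVersion(u8),
    ChecksumMismatch,
}

/// Wrap a payload as `[version][crc32c(payload) BE][payload]`.
fn seal_token(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + payload.len());
    out.push(TOKEN_VERSION);
    out.extend_from_slice(&crc32c(payload).to_be_bytes());
    out.extend_from_slice(payload);
    out
}

/// Validate the envelope and return the payload, or a typed error.
fn open_token(token: &[u8]) -> Result<&[u8], TokenError> {
    if token.len() < HEADER_LEN {
        return Err(TokenError::Truncated);
    }
    if token[0] != TOKEN_VERSION {
        return Err(TokenError::UnsupportedVersion(token[0]));
    }
    let stored = u32::from_be_bytes([token[1], token[2], token[3], token[4]]);
    let payload = &token[HEADER_LEN..];
    if stored != crc32c(payload) {
        return Err(TokenError::ChecksumMismatch);
    }
    Ok(payload)
}
```

The checksum catches accidental damage such as a flipped bit or a cut-off token. It is an integrity check rather than an authenticity check, since a client can always recompute the CRC over a payload of its choosing.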
+
+## D20 — Two-stage hardened decode: bytes → bounded Doc tree → AST
+
+**Phase:** 8 · **Status:** accepted
+
+`ARCHITECTURE.md` §6 requires limits enforced *before* allocating and no
+unbounded recursion, but does not fix the decoder's architecture.
+
+**Decision:** decoding is two stages. A small hardened reader produces a
+generic `Doc` tree (null/bool/int/float/str/bin/array/map) under
+`DecodeLimits` — max message size (checked before reading), max depth
+(explicit counter), max node count (budget charged per node; container item
+counts are validated against both the remaining bytes and the remaining
+budget before any `Vec` allocation). It rejects the reserved byte `0xC1`,
+ext types, non-string and duplicate map keys, invalid UTF-8, ints outside
+`i64`, and trailing bytes. The AST mapping then works on the already-safe
+tree and enforces the grammar (unknown ops/stages/expressions/fields are
+typed errors). Defaults: 1 MiB / depth 64 (matching `types::MAX_JSON_DEPTH`)
+/ 100k nodes, embedder-configurable per `SPEC.md` §8. The node cap bounds
+the intermediate tree's memory, so the two-stage shape costs nothing
+adversarially and keeps all byte-level hardening in one ~200-line module.
+Insert-row values that are containers become `json` values (re-encoded
+canonical MessagePack), matching §5.5's `data:{role:"admin"}` example.
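A rough sketch of the first stage, restricted to a small MessagePack subset (nil, bool, positive fixint, and arrays) so the limit checks stay visible. It is not the `proto` reader: the field names on `DecodeLimits`, the `DecodeError` variants, and the `Reader` type are assumptions made for illustration, and every tag outside the subset is simply reported as unsupported.

```rust
#[derive(Debug, Clone, Copy)]
struct DecodeLimits {
    max_bytes: usize,
    max_depth: usize,
    max_nodes: usize,
}

#[derive(Debug, PartialEq, Eq)]
enum Doc {
    Null,
    Bool(bool),
    Int(i64),
    Array(Vec<Doc>),
}

#[derive(Debug, PartialEq, Eq)]
enum DecodeError {
    TooLarge,
    TooDeep,
    TooManyNodes,
    Truncated,
    Reserved,
    Unsupported(u8),
    TrailingBytes,
}

struct Reader<'a> {
    buf: &'a [u8],
    pos: usize,
    max_depth: usize,
    nodes_left: usize,
}

impl<'a> Reader<'a> {
    fn take(&mut self, n: usize) -> Result<&'a [u8], DecodeError> {
        let end = self.pos.checked_add(n).ok_or(DecodeError::Truncated)?;
        let bytes = self.buf.get(self.pos..end).ok_or(DecodeError::Truncated)?;
        self.pos = end;
        Ok(bytes)
    }

    fn value(&mut self, depth: usize) -> Result<Doc, DecodeError> {
        // Charge the node budget before reading or allocating anything.
        self.nodes_left = self.nodes_left.checked_sub(1).ok_or(DecodeError::TooManyNodes)?;
        let tag = self.take(1)?[0];
        match tag {
            0x00..=0x7F => Ok(Doc::Int(i64::from(tag))),
            0x90..=0x9F => self.array(usize::from(tag & 0x0F), depth),
            0xC0 => Ok(Doc::Null),
            0xC1 => Err(DecodeError::Reserved),
            0xC2 => Ok(Doc::Bool(false)),
            0xC3 => Ok(Doc::Bool(true)),
            0xDC => {
                let b = self.take(2)?;
                self.array(usize::from(u16::from_be_bytes([b[0], b[1]])), depth)
            }
            0xDD => {
                let b = self.take(4)?;
                let len = u32::from_be_bytes([b[0], b[1], b[2], b[3]]);
                self.array(len as usize, depth)
            }
            other => Err(DecodeError::Unsupported(other)),
        }
    }

    fn array(&mut self, len: usize, depth: usize) -> Result<Doc, DecodeError> {
        if depth >= self.max_depth {
            return Err(DecodeError::TooDeep);
        }
        // Each item needs at least one byte and one node of budget, so a
        // claimed length above either bound is rejected before allocation.
        if len > self.buf.len() - self.pos {
            return Err(DecodeError::Truncated);
        }
        if len > self.nodes_left {
            return Err(DecodeError::TooManyNodes);
        }
        let mut items = Vec::with_capacity(len);
        for _ in 0..len {
            items.push(self.value(depth + 1)?);
        }
        Ok(Doc::Array(items))
    }
}

fn decode(bytes: &[u8], limits: DecodeLimits) -> Result<Doc, DecodeError> {
    // The size check runs before a single byte is interpreted.
    if bytes.len() > limits.max_bytes {
        return Err(DecodeError::TooLarge);
    }
    let mut reader = Reader {
        buf: bytes,
        pos: 0,
        max_depth: limits.max_depth,
        nodes_left: limits.max_nodes,
    };
    let doc = reader.value(0)?;
    if reader.pos != bytes.len() {
        return Err(DecodeError::TrailingBytes);
    }
    Ok(doc)
}
```

With the defaults named in the decision (1 MiB, depth 64, 100k nodes), a five-byte message that claims a four-billion-item array fails the remaining-bytes check and never reaches `Vec::with_capacity`. The recursion here is bounded by the explicit depth counter, which is one way to satisfy the "no unbounded recursion" requirement.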
+
 ## D19 — Reclamation only runs inside a committing batch
 
 **Phase:** 7 · **Status:** accepted (bug fix of Phase 4 behavior)
Lines changed: 3 additions & 2 deletions
@@ -2,7 +2,8 @@
 //!
 //! Implemented in-house rather than pulled as a dependency: it is a small,
 //! well-understood, non-security-sensitive integrity check (used to detect page
-//! corruption, not to resist tampering). See `DECISIONS.md` D3.
+//! corruption and mangled cursor tokens, not to resist tampering). See
+//! `DECISIONS.md` D3. Lives in `common` so both `pager` and `proto` share it.
 
 /// Reflected Castagnoli polynomial (0x1EDC6F41 reflected).
 const POLY: u32 = 0x82F6_3B78;
@@ -35,7 +36,7 @@ static TABLE: [u32; 256] = build_table();
 ///
 /// ```
 /// // Standard CRC32C check value for the ASCII string "123456789".
-/// assert_eq!(pager::crc32c(b"123456789"), 0xE306_9283);
+/// assert_eq!(common::crc32c(b"123456789"), 0xE306_9283);
 /// ```
 pub fn crc32c(data: &[u8]) -> u32 {
     let mut crc = 0xFFFF_FFFFu32;

crates/common/src/lib.rs

Lines changed: 6 additions & 2 deletions
@@ -4,20 +4,24 @@
 //! (see `ARCHITECTURE.md` §2). It holds only genuinely shared abstractions:
 //!
 //! - the [`ErrorCategory`] taxonomy from `SPEC.md` §9 and the
-//! [`CategorizedError`] trait each crate's error enum implements, and
+//! [`CategorizedError`] trait each crate's error enum implements,
 //! - the injectable host services [`Clock`], [`Rng`], and [`IoBackend`] (with
 //! real-file, in-memory, and fault-injecting backends) that make the lower
-//! layers testable and deterministically simulatable from day one.
+//! layers testable and deterministically simulatable from day one, and
+//! - the in-house [`crc32c`] checksum shared by `pager` (page integrity) and
+//! `proto` (cursor-token integrity).
 //!
 //! Domain newtypes (`PageId`, `TxnId`, `Value`, …) intentionally live in their
 //! owning crates, not here — `common` is not a junk drawer.
 
 mod clock;
+mod crc;
 mod error;
 mod io;
 mod rng;
 
 pub use clock::{Clock, ManualClock, SystemClock};
+pub use crc::crc32c;
 pub use error::{CategorizedError, ErrorCategory};
 pub use io::{FaultInjectingBackend, FaultPoint, IoBackend, IoError, IoResult, MemoryBackend};
 pub use rng::{Rng, SeededRng};

crates/dbms/src/lib.rs

Lines changed: 1 addition & 1 deletion
@@ -93,7 +93,7 @@ mod tests {
 
     #[test]
     fn protocol_malformed_is_validation() {
-        let err: Error = proto::ProtoError::Malformed.into();
+        let err: Error = proto::ProtoError::Truncated.into();
         assert_eq!(err.category(), ErrorCategory::Validation);
     }

crates/pager/src/lib.rs

Lines changed: 1 addition & 2 deletions
@@ -19,15 +19,14 @@
 //! ```
 
 mod cache;
-mod crc;
 mod freelist;
 mod meta;
 mod page;
 mod pager;
 
 use common::{CategorizedError, ErrorCategory};
 
-pub use crc::crc32c;
+pub use common::crc32c;
 pub use meta::Meta;
 pub use page::{Frame, PageType, HEADER_SIZE, PAGE_PAYLOAD_SIZE, PAGE_SIZE};
 pub use pager::{Pager, PagerStats, DEFAULT_CACHE_BYTES};

crates/pager/src/page.rs

Lines changed: 3 additions & 3 deletions
@@ -105,7 +105,7 @@ pub fn page_id(frame: &Frame) -> u64 {
 pub fn verify(bytes: &[u8], expected: PageId) -> Result<&Frame> {
     let frame = as_frame(bytes, expected)?;
     let stored = read_u32(frame, CRC_OFF);
-    let computed = crate::crc::crc32c(&frame[TYPE_OFF..]);
+    let computed = common::crc32c(&frame[TYPE_OFF..]);
     if stored != computed {
         return Err(corrupt(CorruptionKind::Checksum {
             page: expected.get(),
@@ -129,7 +129,7 @@ pub fn finalize(frame: &mut Frame, page_type: PageType, id: PageId) {
     frame[6] = 0;
     frame[7] = 0;
     write_u64(frame, ID_OFF, id.get());
-    let crc = crate::crc::crc32c(&frame[TYPE_OFF..]);
+    let crc = common::crc32c(&frame[TYPE_OFF..]);
     write_u32(frame, CRC_OFF, crc);
 }
 
@@ -203,7 +203,7 @@ mod tests {
         finalize(&mut frame, PageType::Data, PageId::new(2));
         frame[TYPE_OFF] = 99;
         // Re-checksum so the type byte is what fails, not the CRC.
-        let crc = crate::crc::crc32c(&frame[TYPE_OFF..]);
+        let crc = common::crc32c(&frame[TYPE_OFF..]);
         write_u32(&mut frame, CRC_OFF, crc);
         let verified = verify(&frame[..], PageId::new(2)).unwrap();
         assert!(matches!(
