You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
fix: SS-2 invariant — nulls always sort last, regardless of direction (#6343)
* fix: SS-2 invariant — nulls always sort last, regardless of direction
The SS-2 invariant incorrectly specified that nulls sort before non-null
for descending columns. The actual production code uses nulls_first=false
everywhere — nulls always sort after non-null values, regardless of
direction. This is required for sorted_series key correctness: null
columns are omitted from the key, and their keys naturally sort after
keys that include those columns.
Fixed in all locations:
- TLA+ SortSchema.tla: CompareValues and SS2_NullOrdering
- Rust invariants/sort.rs: compare_with_null_ordering
- Rust models/sort_schema.rs: SS-2 property check and compare_values test
- Rust invariants/registry.rs: SS-2 description
- ADR-002: SS-2 invariant table
- Phase 1 design doc: null handling descriptions
Stateright exhaustive_sort_schema BFS passes with corrected model.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: verify model detects null ordering violations
Added two tests to the Stateright sort_schema model:
- ss2_detects_null_before_non_null_in_descending: constructs rows
with null before non-null in a descending column, verifies both
row_leq and is_sorted reject this ordering (nulls always last).
- ss2_accepts_non_null_before_null_in_descending: verifies the
correct ordering (non-null then null) is accepted.
These confirm the model can catch the exact class of bug that the
SS-2 invariant fix addresses.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy file name to clipboardExpand all lines: docs/internals/adr/002-sort-schema-parquet-splits.md
+1-1Lines changed: 1 addition & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -119,7 +119,7 @@ These invariants must hold across all code paths (ingestion, compaction, query).
119
119
| ID | Invariant | Rationale |
120
120
|----|-----------|-----------|
121
121
|**SS-1**| All rows within a split are sorted according to the sort schema recorded in that split's metadata | Foundation for page-level pruning and sorted merge. Violated data produces incorrect merge results |
122
-
|**SS-2**| Nulls sort after non-null values for ascending columns and before non-null values for descending columns | Consistent null ordering across ingestion and merge. Matches Husky convention|
122
+
|**SS-2**| Nulls always sort after non-null values, regardless of sort direction (nulls last) | Consistent null ordering across ingestion and merge. Enables nulls to be implicit in sorted_series key encoding|
123
123
|**SS-3**| If a sort column is missing from a split, all rows in that split are treated as null for that column. This is not an error | Enables schema evolution — columns can be added to the sort schema without rewriting existing splits |
124
124
|**SS-4**| The sort schema stored in a split's metadata is the schema that was in effect when that split was written. Already-written splits are never re-sorted | Changes propagate forward only. Old splits age out via retention |
125
125
|**SS-5**| The sort schema string is the same in the metastore (per-split metadata), the Parquet `key_value_metadata`, and the Parquet `sorting_columns` field for a given split | Three representations of the same truth. Inconsistency between them would cause incorrect merge or pruning behavior |
-**Float types** (f32, f64): Sorted numerically, with NaN handling matching IEEE 754 total order (NaN sorts after all other values).
88
-
-**Null handling:** Null values sort **after** all non-null values for ascending columns, and **before** all non-null values for descending columns. This matches Husky's behavior and ensures nulls cluster at the end of each column's range.
88
+
-**Null handling:** Null values always sort **after** all non-null values, regardless of sort direction (`nulls_first: false`). This enables nulls to be implicit in the sorted_series key encoding — absent columns are simply omitted from the key, and their keys naturally sort after keys that include those columns.
89
89
90
90
### Timeseries ID (Optional Locality Tiebreaker)
91
91
@@ -286,7 +286,7 @@ The Parquet reader always reads the file footer first \-- the footer contains th
286
286
287
287
Input splits may not have identical column sets. Schema evolution (adding or removing columns over time) means that splits from different time periods may have different columns. The merge handles this as follows:
288
288
289
-
-**Sort columns.** If a sort column is missing from an input split, all rows from that split are treated as having null values for that column. Nulls sort after non-null values (ascending) or before non-null values (descending), as specified in the sort schema. The k-way merge handles this naturally.
289
+
-**Sort columns.** If a sort column is missing from an input split, all rows from that split are treated as having null values for that column. Nulls always sort after non-null values (`nulls_first: false`), regardless of direction. The k-way merge handles this naturally.
290
290
291
291
-**Non-sort columns.** The merge computes the **union** of all column names across all input splits. The output split contains every column that appears in at least one input. When streaming a column through the merge, rows originating from inputs that lack that column are filled with nulls. The output Parquet schema uses the type from whichever input(s) contain the column; if multiple inputs have the same column name with different types, the merge fails with an error (this indicates a schema evolution conflict that must be resolved at the index configuration level).
0 commit comments