[phase-31 3/4] Writer + pipeline wiring#6244
[phase-31 3/4] Writer + pipeline wiring#6244g-talbot wants to merge 1 commit intogtt/phase-31-compaction-metadatafrom
Conversation
c012908 to
780c585
Compare
3bbfb71 to
95c3596
Compare
08577b5 to
955f230
Compare
179ccd2 to
ed6d687
Compare
955f230 to
3e73d80
Compare
ed6d687 to
a4d0d36
Compare
3e73d80 to
2f78fe8
Compare
a4d0d36 to
f05d4e7
Compare
2f78fe8 to
2703ca5
Compare
f05d4e7 to
4a0507e
Compare
2703ca5 to
8fce718
Compare
4a0507e to
bc9458d
Compare
8fce718 to
018a265
Compare
598de1a to
c95095b
Compare
2599f67 to
46903b3
Compare
c95095b to
5b1c080
Compare
46903b3 to
74bfd04
Compare
5b1c080 to
6ebf40b
Compare
de0f8c6 to
2d9e6eb
Compare
6ebf40b to
295f59c
Compare
295f59c to
f3c03dc
Compare
2d9e6eb to
00c7245
Compare
f3c03dc to
acb5d28
Compare
00c7245 to
cff2cc1
Compare
acb5d28 to
a8dc7a2
Compare
0d97e82 to
c48b0de
Compare
a8dc7a2 to
4ce16a3
Compare
c48b0de to
ef0ba36
Compare
b89e965 to
b9566a6
Compare
ef0ba36 to
df6e699
Compare
Wire TableConfig-driven sort order into ParquetWriter and add self-describing Parquet file metadata for compaction: - ParquetWriter::new() takes &TableConfig, resolves sort fields at construction via parse_sort_fields() + ParquetField::from_name() - sort_batch() uses resolved fields with per-column direction (ASC/DESC) - SS-1 debug_assert verification: re-sort and check identity permutation - build_compaction_key_value_metadata(): embeds sort_fields, window_start, window_duration, num_merge_ops, row_keys (base64) in Parquet kv_metadata - SS-5 verify_ss5_kv_consistency(): kv_metadata matches source struct - write_to_file_with_metadata() replaces write_to_file() - prepare_write() shared method for bytes and file paths - ParquetWriterConfig gains to_writer_properties_with_metadata() - ParquetSplitWriter passes TableConfig through - All callers in quickwit-indexing updated with TableConfig::default() - 23 storage tests pass including META-07 self-describing roundtrip Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
b9566a6 to
b6eb595
Compare
df6e699 to
76b703a
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 76b703ad24
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| schema | ||
| .index_of(sf.name.as_str()) | ||
| .ok() |
There was a problem hiding this comment.
Normalize logical tag names before sort-column lookup
resolve_sort_fields() preserves sort keys like tag_service/tag_env, and sort_batch() then does schema.index_of(sf.name) directly; with current metrics batches using physical columns such as service, env, and host, these keys are silently dropped from the actual sort. That means splits are no longer ordered by those tag dimensions (only by the remaining matched keys), which regresses clustering/pruning behavior compared to the previous SORT_ORDER path and can change compaction ordering whenever rows share the same metric/timestamp but differ by tags.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
can ignore, we're updating sort order string to include the physical column names (service, env, etc.)
| let metadata = MetricsSplitMetadata::builder() | ||
| // Compute window_start from the earliest timestamp in the batch. | ||
| let window_duration = self.table_config.window_duration_secs; | ||
| let window_start_secs = if window_duration > 0 && time_range.start_secs > 0 { |
There was a problem hiding this comment.
Compute window_start when earliest timestamp is zero
The window_start calculation is guarded by time_range.start_secs > 0, so batches whose minimum timestamp is exactly Unix epoch 0 skip window_start even when window_duration_secs is configured. This causes Phase-31 window metadata to be omitted for those splits and, in debug builds, violates the builder’s TW-1 pairing invariant (window_start absent while duration is non-zero), leading to avoidable assertion failures.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
i guess technically, we should try and set window_start_secs first, but we should never have the minimum timestamp be epoch 0
| let sorted_batch = take_record_batch(batch, &indices)?; | ||
|
|
||
| // SS-1: verify the output is actually sorted. | ||
| #[cfg(debug_assertions)] |
There was a problem hiding this comment.
[nit] do we need this assertion still? can we not rely on the tests?
Summary
Wire TableConfig into ParquetWriter sort path and add self-describing Parquet file metadata for compaction (Phase 31 Metadata Foundation, PR 3 of 4).
Stacks on
gtt/phase-31-compaction-metadata(PR #6243).What's included
storage/writer.rs (rewritten):
ParquetWriter::new()takes&TableConfig, resolves sort field names to physical columnssort_batch()uses resolved fields with per-column ASC/DESC directiondebug_assertverification: re-sort output and check identity permutationbuild_compaction_key_value_metadata(): embeds sort_fields, window_start, window_duration, num_merge_ops, row_keys (base64+JSON) in Parquet kv_metadataverify_ss5_kv_consistency(): kv entries must match source structwrite_to_file_with_metadata()replaceswrite_to_file()prepare_write()shared prep for both bytes and file write pathsresolve_sort_fields(): parse sort schema, map to ParquetField, skip missing columnsstorage/config.rs:
to_writer_properties_with_metadata(sorting_cols, kv_metadata)accepts dynamic sort columns and optional KV metadatato_writer_properties()delegates with empty defaultssorting_columns()method (now in writer)storage/split_writer.rs:
ParquetSplitWriter::new()takes&TableConfigparameterquickwit-indexing (5 files):
ParquetSplitWriter::new()callers updated with&TableConfig::default()Verification
cargo build -p quickwit-parquet-engine -p quickwit-indexing✅cargo test -p quickwit-parquet-engine -- storage::✅ (23 tests)cargo clippy -p quickwit-parquet-engine --all-features --tests✅Test plan
🤖 Generated with Claude Code