Spec: add extension directory and value trie design
Frozen 32-byte header with self-describing extension directory for
future format features. Repurpose reserved field as data_stream_start.
Tagged sections with optional/required semantics enable backward-
compatible evolution without header changes across all 8 bindings.
Define core extension tags: value trie, value suffix table, fuzzy
index, bloom filter, metadata. Resolve open question #4 (value trie
vs value store) with adaptive encoder strategy.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -289,11 +293,13 @@ The key trie consists of two parts: the **Key Prefix Trie** (stems) and the **Ke
289
293
290
294
When the symbol table is shared (header flag), both key and value tries use the same symbol-to-character mapping -- useful when keys and values draw from the same character set. When not shared, the value trie has its own symbol config.
291
295
292
-
The File Header is byte-aligned so it can be read without a bit-reader. Everything between the header and the integrity footer is a packed data stream. The integrity footer is byte-aligned (padded to the next byte boundary after the stream ends).
296
+
The File Header is byte-aligned so it can be read without a bit-reader. The Extension Directory (if present) is also byte-aligned and immediately follows the header. Everything after the extension directory is the packed data stream. The integrity footer is byte-aligned (padded to the next byte boundary after the stream ends).
297
+
298
+
**Data stream start:** When the `has_extensions` flag (bit 11) is clear, the data stream begins at byte 32, immediately after the header. When the flag is set, the data stream begins at byte `32 + extension_directory_size`. The `data_stream_start` field in the header (bytes 28-31) records this byte offset explicitly, so a reader never has to infer it.
The header is the only byte-aligned structure. It bootstraps the bit-stream reader.
302
+
The header is exactly 32 bytes and **will never change size or layout**. It bootstraps the reader and tells it where to find everything else. All future format extensions use the Extension Directory (Section 3.2.1) rather than modifying the header.
297
303
298
304
| Offset | Size | Field | Description |
299
305
|--------|------|-------|-------------|
@@ -306,9 +312,11 @@ The header is the only byte-aligned structure. It bootstraps the bit-stream read
306
312
| 16 | 4 | value_store_offset | Bit offset to Value Store |
307
313
| 20 | 4 | suffix_table_offset | Bit offset to Suffix Table (0 if none) |
308
314
| 24 | 4 | total_data_bits | Total length of the data stream in bits |
309
-
| 28 | 4 | reserved | Reserved, must be 0 |
315
+
| 28 | 4 | data_stream_start | Byte offset from file start to data stream (0 or 32 = legacy, >32 = extension directory present) |
316
+
317
+
All `*_offset` fields (trie_data_offset, value_store_offset, suffix_table_offset) are **bit offsets** relative to `data_stream_start`, regardless of the section's addressing mode. The reader converts bit offsets to byte/symbol positions as needed based on the declared mode. Using bit offsets universally ensures the header format is mode-independent.
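As a rough illustration of the offset rule above, the following sketch turns a header bit offset into an absolute byte index plus a bit index within that byte. The function name `locate` is hypothetical, and the sketch deliberately makes no claim about bit order within a byte, which this section does not define.

```python
def locate(data_stream_start: int, bit_offset: int) -> tuple[int, int]:
    """Map a bit offset relative to the data stream to
    (absolute byte index in the file, bit index within that byte)."""
    if data_stream_start < 32:
        raise ValueError("data stream cannot start inside the 32-byte header")
    if bit_offset < 0:
        raise ValueError("bit offsets are unsigned")
    byte_in_stream, bit_in_byte = divmod(bit_offset, 8)
    return data_stream_start + byte_in_stream, bit_in_byte


# Example: a section at bit offset 19 in a file whose data stream starts at byte 54.
byte_index, bit_index = locate(54, 19)
```

Keeping the conversion in one helper means the addressing mode only matters after the section has been located.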
310
318
311
-
The data stream begins immediately after the header (byte offset 32). All `*_offset` fields are **bit offsets** relative to this starting point, regardless of the section's addressing mode. The reader converts bit offsets to byte/symbol positions as needed based on the declared mode. Using bit offsets universally ensures the header format is mode-independent.
319
+
**`data_stream_start` (bytes 28-31):** For legacy files (format version 1.0), this field is 0 and the data stream begins at byte 32. For files with an extension directory, this field gives the byte offset where the data stream begins (i.e., `32 + size_of_extension_directory`). Decoders MUST treat a value of 0 as 32 for backward compatibility.
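A minimal sketch of the legacy rule, assuming the header is little-endian (as stated for the extension directory) and that the caller passes the raw 32 header bytes. The name `resolve_data_stream_start` is hypothetical. Rejecting nonzero values below 32 is also an assumption: the spec text only defines 0, 32, and values above 32.

```python
import struct

HEADER_SIZE = 32


def resolve_data_stream_start(header: bytes) -> int:
    """Return the byte offset of the data stream, treating 0 as 32."""
    if len(header) < HEADER_SIZE:
        raise ValueError("header must be at least 32 bytes")
    # data_stream_start occupies bytes 28-31 of the header.
    (raw,) = struct.unpack_from("<I", header, 28)
    if raw == 0:
        return HEADER_SIZE  # legacy file: field was the old reserved zero
    if raw < HEADER_SIZE:
        # Assumption: values 1..31 would point inside the header, so reject.
        raise ValueError(f"data_stream_start {raw} points inside the header")
    return raw
```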
312
320
313
321
**Flags (bit field):**
314
322
@@ -322,7 +330,134 @@ The data stream begins immediately after the header (byte offset 32). All `*_off
| 10 | shared_symbols: key trie and value trie share the same symbol table |
325
-
| 11-15 | reserved |
333
+
| 11 | has_extensions: extension directory present after header (see Section 3.2.1) |
334
+
| 12-15 | reserved, must be 0 |
335
+
336
+
**Header freeze guarantee:** This 32-byte layout is the permanent contract. Decoders for any version of the format parse these 32 bytes identically. All future features are discovered via the extension directory, not by adding header fields.
The extension directory is present when flag bit 11 (`has_extensions`) is set. It begins at byte 32 (immediately after the header) and is byte-aligned throughout. It describes additional data stream sections beyond the three defined in the base header (key trie, value store, suffix table).
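A small sketch of how a reader might test for the directory and find its byte range. It assumes flag bit 0 is the least significant bit of the flags field and that the caller has already read `flags` and a resolved `data_stream_start`. Both function names are hypothetical.

```python
HEADER_SIZE = 32
HAS_EXTENSIONS = 1 << 11  # assumption: bit 0 is the least significant bit


def has_extension_directory(flags: int) -> bool:
    """True when flag bit 11 (has_extensions) is set."""
    return bool(flags & HAS_EXTENSIONS)


def directory_bounds(flags: int, data_stream_start: int):
    """Return (first_byte, end_byte_exclusive) of the directory, or None."""
    if not has_extension_directory(flags):
        return None
    # The directory starts right after the header and ends where the
    # data stream begins.
    return HEADER_SIZE, data_stream_start
```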
341
+
342
+
**Design principle:** Unknown extension tags are safe to skip. A decoder that encounters a tag it does not recognize ignores it and continues. This allows format evolution without version bumps for backward-compatible additions, and enables vendor-specific extensions without polluting the core spec.
343
+
344
+
**Directory format:**
345
+
346
+
| Field | Size | Description |
347
+
|-------|------|-------------|
348
+
| num_extensions | uint16 | Number of extension entries (0-65535) |
349
+
| entries[] | 10 bytes each | Array of `num_extensions` extension entries |
350
+
351
+
Each extension entry:
352
+
353
+
| Field | Size | Description |
354
+
|-------|------|-------------|
355
+
| tag | uint16 | Extension type identifier (see tag registry below) |
356
+
| offset | uint32 | Bit offset from data stream start to this section |
357
+
| size | uint32 | Size of this section in bits (0 = section is empty/placeholder) |
The `data_stream_start` header field (bytes 28-31) MUST equal `32 + 2 + (num_extensions * 10)`, i.e., the data stream begins immediately after the last extension entry.
362
+
363
+
**Byte order:** All multi-byte fields in the extension directory use the same byte order as the file header (little-endian).
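The directory layout above can be sketched as a parser. This is an illustration under stated assumptions, not a reference implementation: `buf` is assumed to hold the file from byte 0, `data_stream_start` is assumed to be already resolved from the header, and the names `ExtensionEntry` and `parse_extension_directory` are hypothetical.

```python
import struct
from typing import NamedTuple

HEADER_SIZE = 32
ENTRY_SIZE = 10  # uint16 tag + uint32 offset + uint32 size


class ExtensionEntry(NamedTuple):
    tag: int
    offset_bits: int  # bit offset from data stream start
    size_bits: int    # 0 means empty/placeholder


def parse_extension_directory(buf: bytes, data_stream_start: int) -> list[ExtensionEntry]:
    if len(buf) < HEADER_SIZE + 2:
        raise ValueError("file too short to hold an extension directory")
    (num_extensions,) = struct.unpack_from("<H", buf, HEADER_SIZE)
    expected_start = HEADER_SIZE + 2 + num_extensions * ENTRY_SIZE
    # The spec requires data_stream_start == 32 + 2 + num_extensions * 10.
    if data_stream_start != expected_start:
        raise ValueError(
            f"data_stream_start is {data_stream_start}, expected {expected_start}"
        )
    if len(buf) < expected_start:
        raise ValueError("extension directory is truncated")
    entries = []
    for i in range(num_extensions):
        # "<HII" is little-endian with no padding: 2 + 4 + 4 = 10 bytes.
        tag, offset_bits, size_bits = struct.unpack_from(
            "<HII", buf, HEADER_SIZE + 2 + i * ENTRY_SIZE
        )
        entries.append(ExtensionEntry(tag, offset_bits, size_bits))
    return entries
```

Checking the size invariant before reading entries lets a decoder fail early on a corrupt header instead of misreading the data stream as directory entries.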
364
+
365
+
**Tag Registry:**
366
+
367
+
Tags are partitioned into ranges:
368
+
369
+
| Range | Purpose |
370
+
|-------|---------|
371
+
| 0x0000 | Reserved (invalid) |
372
+
| 0x0001 - 0x00FF | **Core tags** -- defined by this spec |
373
+
| 0x0100 - 0x0FFF | **Reserved** -- future spec use |
| 0x8000 - 0xFFFF | **Required tags** -- decoder MUST understand or reject the file |
376
+
377
+
**Core tags (0x0001 - 0x00FF):**
378
+
379
+
| Tag | Name | Description |
380
+
|-----|------|-------------|
381
+
| 0x0001 | `EXT_VALUE_TRIE` | Value trie: compressed trie for string/blob values (replaces linear value store for trie-eligible values). Section contains its own trie config + symbol table + trie data. |
382
+
| 0x0002 | `EXT_VALUE_SUFFIX` | Value suffix table: suffix sharing for the value trie. Format matches Section 3.5. |
| 0x0006 | `EXT_BLOOM_FILTER` | Bloom filter for fast negative lookups (key does not exist). |
387
+
| 0x0007 | `EXT_METADATA` | Arbitrary key-value metadata (creation date, encoder version, etc.). Stored as a nested TXZ dict. |
388
+
389
+
**Required tag semantics (0x8000+):**
390
+
391
+
Tags in the required range signal that the section is essential for correct decoding. A decoder that encounters a required tag it does not recognize MUST reject the file with a clear error message rather than silently producing incorrect output. This is the mechanism for introducing breaking format changes without bumping the major version.
392
+
393
+
| Tag | Name | Description |
394
+
|-----|------|-------------|
395
+
| 0x8001 | `EXT_VALUE_TRIE_REQUIRED` | Like `EXT_VALUE_TRIE`, but the linear value store in the base header is absent -- decoder MUST use the value trie. |
use value trie for string/blob lookups instead of linear value store
419
+
```
420
+
421
+
**Ordering:** Extension entries in the directory MAY appear in any order. The decoder searches for tags it needs. Entries SHOULD be ordered by offset for cache-friendly sequential reading, but this is not required.
422
+
423
+
**Duplicate tags:** Each tag MUST appear at most once in the directory. A decoder encountering duplicate tags SHOULD use the first occurrence and ignore subsequent ones.
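One way a decoder could honor the first-occurrence rule is to build a tag index while walking the directory in file order. The sketch below assumes entries arrive as `(tag, offset_bits, size_bits)` tuples. The name `index_extensions` is hypothetical.

```python
def index_extensions(entries):
    """Build {tag: (offset_bits, size_bits)}, keeping the first entry per tag."""
    index = {}
    for tag, offset_bits, size_bits in entries:
        # setdefault leaves an existing key untouched, so later
        # duplicates of the same tag are ignored.
        index.setdefault(tag, (offset_bits, size_bits))
    return index


# Directory order is arbitrary; the duplicate 0x0001 entry is ignored.
index = index_extensions([(0x0006, 128, 64), (0x0001, 0, 128), (0x0001, 999, 8)])
```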
424
+
425
+
### 3.2.2 Versioning and Compatibility Contract
426
+
427
+
The combination of `version_major`, `version_minor`, flags, and the extension directory tag ranges provides a layered compatibility model:
428
+
429
+
**Major version (breaking changes):**
430
+
- A change in `version_major` means the header layout, trie encoding, or value encoding has changed incompatibly.
431
+
- A decoder MUST reject files with a major version it does not support.
432
+
- The 32-byte header layout is frozen, so major version bumps should be extremely rare.
433
+
434
+
**Minor version (backward-compatible additions):**
435
+
- A change in `version_minor` means new optional features are available.
436
+
- A decoder for version 1.0 CAN read a version 1.1 file by ignoring unknown optional extensions.
437
+
- A decoder MUST still reject files with unknown required extensions (tag >= 0x8000), regardless of minor version.
438
+
439
+
**Extension tags (feature-level granularity):**
440
+
- New features are expressed as extension tags, not version bumps.
441
+
- Optional tags (0x0001-0x7FFF): skip if unknown. File is still usable with reduced functionality.
442
+
- Required tags (0x8000-0xFFFF): reject if unknown. File cannot be correctly decoded without this feature.
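The layered rules above can be sketched as a single admission check. This is an illustration only: it assumes a decoder that supports exactly one major version and a known set of tags, and the name `check_compatibility` is hypothetical.

```python
REQUIRED_MIN = 0x8000  # tags at or above this value are required


def check_compatibility(file_major, file_tags, supported_major, known_tags):
    """Return the tags this decoder can use, or raise if the file must be rejected."""
    if file_major != supported_major:
        raise ValueError(f"unsupported major version {file_major}")
    usable = []
    for tag in file_tags:
        if tag in known_tags:
            usable.append(tag)
        elif tag >= REQUIRED_MIN:
            # Unknown required tag: the file cannot be decoded correctly.
            raise ValueError(f"unknown required extension tag 0x{tag:04X}")
        # Unknown optional tag: skip it; the base data remains usable.
    return usable
```

The minor version does not appear in the check because, per the rules above, it never causes a rejection by itself.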
443
+
444
+
**Compatibility matrix:**
445
+
446
+
| File has | Decoder knows | Result |
447
+
|----------|---------------|--------|
448
+
| No extensions | Any decoder | OK (base format) |
449
+
| Optional extension | Old decoder | OK (extension ignored, base data used) |
450
+
| Optional extension | New decoder | OK (extension data used) |
- **Optional (`EXT_VALUE_TRIE`, 0x0001):** The encoder writes BOTH a linear value store (at `value_store_offset`) AND a value trie (in the extension). Old decoders use the linear store. New decoders use the value trie for better compression awareness. The file is larger (contains both), but backward-compatible.
458
+
- **Required (`EXT_VALUE_TRIE_REQUIRED`, 0x8001):** The encoder writes ONLY the value trie. No linear fallback. Smaller file, but old decoders cannot read it. Use when file size matters more than backward compatibility.
459
+
460
+
This gives the encoder (and the user) explicit control over the compatibility/size tradeoff.
326
461
327
462
### 3.3 Trie Config
328
463
@@ -495,11 +630,22 @@ function lookup(prefix_trie, suffix_trie, key):
495
630
496
631
Values are stored in a **separate structure** from the key trie. The key trie's END_VAL terminals reference values by index into this section. Two forms are supported:
497
632
498
-
- **Value store** (simple): a sequential array of typed values, referenced by index. Fast to build, no compression of values against each other.
499
-
- **Value trie** (compressed): string values that share prefixes/suffixes are stored in their own compressed trie (same format as the key trie). The `txz-json` library uses this when it detects significant sharing among string values. Non-string values (ints, floats, bools, blobs) remain in a flat store section.
633
+
- **Value store** (simple, base format): a sequential array of typed values, referenced by index. Fast to build, no compression of values against each other. Located at `value_store_offset` in the header. Always present in backward-compatible files.
634
+
- **Value trie** (compressed, via extension directory): string/blob values that share prefixes/suffixes are stored in their own compressed trie (same format as the key trie). Signaled by the `EXT_VALUE_TRIE` (0x0001, optional) or `EXT_VALUE_TRIE_REQUIRED` (0x8001, required) extension tag (see Section 3.2.1). Non-string values (null, bool, int, uint, float32, float64) remain inline -- they are too small to benefit from trie compression.
635
+
636
+
**Adaptive strategy:** The encoder analyzes values at build time and decides whether a value trie is worthwhile. The heuristic considers:
637
+
638
+
- Number of string/blob values (few strings = no benefit)
639
+
- Degree of prefix/suffix sharing among string values (unique strings = no benefit)
640
+
- Fixed overhead of a value trie config + symbol table vs. savings from sharing
641
+
- Caller preference (API option to force linear-only or force value trie)
642
+
643
+
When the value trie is used with the optional tag (0x0001), the encoder writes BOTH the linear value store and the value trie. Old decoders use the linear store; new decoders use the value trie. When file size matters more than backward compatibility, the encoder uses the required tag (0x8001) and omits the linear store.
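To make the decision concrete, here is a deliberately simplified sketch of such a heuristic. Everything in it is an assumption made for illustration: the cost model counts characters rather than encoded bits, it measures prefix sharing only (not suffix sharing), and the thresholds `fixed_overhead` and `min_strings` are invented defaults. The names `Strategy`, `trie_node_count`, and `choose_strategy` are hypothetical and not part of the spec or any binding.

```python
from enum import Enum
from typing import Optional


class Strategy(Enum):
    LINEAR_ONLY = "linear value store only"
    LINEAR_PLUS_TRIE = "linear store + EXT_VALUE_TRIE (0x0001)"
    TRIE_ONLY = "EXT_VALUE_TRIE_REQUIRED (0x8001)"


def trie_node_count(strings) -> int:
    """Nodes in an uncompressed prefix trie = number of distinct non-empty prefixes."""
    prefixes = set()
    for s in strings:
        for end in range(1, len(s) + 1):
            prefixes.add(s[:end])
    return len(prefixes)


def choose_strategy(
    strings,
    *,
    fixed_overhead: int = 64,       # invented stand-in for config + symbol table cost
    min_strings: int = 8,           # invented cutoff for "few strings = no benefit"
    need_old_decoders: bool = True,
    force: Optional[Strategy] = None,
) -> Strategy:
    if force is not None:
        return force  # caller preference overrides the heuristic
    if len(strings) < min_strings:
        return Strategy.LINEAR_ONLY
    linear_cost = sum(len(s) for s in strings)
    trie_cost = trie_node_count(strings) + fixed_overhead
    if trie_cost >= linear_cost:
        return Strategy.LINEAR_ONLY  # too little sharing to pay for the overhead
    return Strategy.LINEAR_PLUS_TRIE if need_old_decoders else Strategy.TRIE_ONLY
```

A real encoder would presumably estimate encoded bit sizes instead of character counts, but the shape of the decision (count, sharing, overhead, caller override) follows the four considerations listed above.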
500
644
501
645
When the header flag `shared_symbols` is set, the value trie shares the symbol table with the key trie. Otherwise, the value trie has its own symbol config (allowing different character sets or bit widths for keys vs. values).
502
646
647
+
The value trie may also have its own suffix table, signaled by the `EXT_VALUE_SUFFIX` (0x0002) extension tag.
648
+
503
649
Packing follows the `value_addr_mode` declared in the header flags.
504
650
505
651
Each value is encoded as:
@@ -736,6 +882,7 @@ This is version 1.0 -- a clean break. No backward compatibility with the origina
736
882
| Integrity | SHA-256 of whole file | Selectable: CRC-32, SHA-256, or xxHash64 | **Changed** |
737
883
| Versioning | None | Major/minor version in header | **New** |
738
884
| Key/value separation | Keys and values interleaved in one trie | Two separate tries (key trie + value trie), optionally shared symbol table | **Changed** |
885
+
| Extensibility | None | Extension directory with tagged sections; 32-byte header is frozen forever | **New** |
739
886
| JSON support | Not applicable | Separate `txz-json` library using TXZ under the hood | **New** |
740
887
| Implementation | C with C++ wrapper | C with C++ wrapper, 100% test coverage | **Kept** |
@@ -762,7 +909,7 @@ These ideas from the original notes are acknowledged but deferred from this spec
762
909
763
910
3. **Maximum nesting depth**: Should there be a spec-defined limit on nested dict depth for ROM/embedded safety? (e.g. 32 levels)
764
911
765
-
4. **Value trie vs. value store**: When should string values be compressed into a value trie (expensive to build, better compression) vs. stored as a flat value store (simple, fast)? Should this be encoder heuristic or caller-specified?
912
+
4. ~~**Value trie vs. value store**~~: **RESOLVED** (Section 3.2.1). The encoder decides adaptively. It can write a linear value store only (base format, maximum compatibility), a linear store + value trie extension (backward-compatible, larger), or a value trie only via a required extension (smallest, requires new decoder). The choice is an encoder heuristic based on estimated compression savings, with an API option for the caller to force a specific strategy.
766
913
767
914
5. **Update/append semantics**: The current spec is build-once, read-many. Mutation is decode + modify + re-encode (Section 7). Is this sufficient for all 1.0 use cases?