Commit 98cf4eb

deftio and claude committed
Spec: add extension directory and value trie design
Frozen 32-byte header with self-describing extension directory for future format features. Repurpose reserved field as data_stream_start. Tagged sections with optional/required semantics enable backward-compatible evolution without header changes across all 8 bindings. Define core extension tags: value trie, value suffix table, fuzzy index, bloom filter, metadata. Resolve open question #4 (value trie vs value store) with adaptive encoder strategy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent d502ade commit 98cf4eb

1 file changed

Lines changed: 158 additions & 11 deletions

plan/txz-spec.md

@@ -265,8 +265,10 @@ In **bit mode** (original TXZ behavior), symbols, control codes, VarInts, skip d
265265
### 3.1 Overall Layout
266266

267267
```
268-
+------------------------+ byte-aligned (bootstrap)
269-
| File Header |
268+
+------------------------+ byte-aligned, FROZEN (never changes)
269+
| File Header (32 bytes) |
270+
+------------------------+ byte-aligned (present when HAS_EXTENSIONS flag set)
271+
| Extension Directory | self-describing list of (tag, offset, size)
270272
+------------------------+ <-- data stream begins here
271273
| Trie Config | \
272274
+- - - - - - - - - - - - + | post-header metadata
@@ -279,6 +281,8 @@ In **bit mode** (original TXZ behavior), symbols, control codes, VarInts, skip d
279281
| Key Prefix Trie | |
280282
+------------------------+ |
281283
| Value Trie / Store | |
284+
+------------------------+ |
285+
| Value Suffix Table | | (optional, when value trie has suffix sharing)
282286
+------------------------+ /
283287
+------------------------+ byte-aligned (fixed footer)
284288
| Integrity Check |
@@ -289,11 +293,13 @@ The key trie consists of two parts: the **Key Prefix Trie** (stems) and the **Ke
289293

290294
When the symbol table is shared (header flag), both key and value tries use the same symbol-to-character mapping -- useful when keys and values draw from the same character set. When not shared, the value trie has its own symbol config.
291295

292-
The File Header is byte-aligned so it can be read without a bit-reader. Everything between the header and the integrity footer is a packed data stream. The integrity footer is byte-aligned (padded to the next byte boundary after the stream ends).
296+
The File Header is byte-aligned so it can be read without a bit-reader. The Extension Directory (if present) is also byte-aligned and immediately follows the header. Everything after the extension directory is the packed data stream. The integrity footer is byte-aligned (padded to the next byte boundary after the stream ends).
297+
298+
**Data stream start:** When `HAS_EXTENSIONS` is clear, the data stream begins at byte 32 (immediately after the header). When `HAS_EXTENSIONS` is set, the data stream begins at byte `32 + extension_directory_size`. The `data_stream_start` field in the header (bytes 28-31) gives the byte offset where the data stream begins, removing any ambiguity.
293299

294-
### 3.2 File Header (fixed size: 32 bytes, byte-aligned)
300+
### 3.2 File Header (fixed size: 32 bytes, byte-aligned, FROZEN)
295301

296-
The header is the only byte-aligned structure. It bootstraps the bit-stream reader.
302+
The header is exactly 32 bytes and **will never change size or layout**. It bootstraps the reader and tells it where to find everything else. All future format extensions use the Extension Directory (Section 3.2.1) rather than modifying the header.
297303

298304
| Offset | Size | Field | Description |
299305
|--------|------|-------|-------------|
@@ -306,9 +312,11 @@ The header is the only byte-aligned structure. It bootstraps the bit-stream read
306312
| 16 | 4 | value_store_offset | Bit offset to Value Store |
307313
| 20 | 4 | suffix_table_offset | Bit offset to Suffix Table (0 if none) |
308314
| 24 | 4 | total_data_bits | Total length of the data stream in bits |
309-
| 28 | 4 | reserved | Reserved, must be 0 |
315+
| 28 | 4 | data_stream_start | Byte offset from file start to data stream (0 or 32 = legacy, >32 = extension directory present) |
316+
317+
All `*_offset` fields (trie_data_offset, value_store_offset, suffix_table_offset) are **bit offsets** relative to `data_stream_start`, regardless of the section's addressing mode. The reader converts bit offsets to byte/symbol positions as needed based on the declared mode. Using bit offsets universally ensures the header format is mode-independent.
310318

311-
The data stream begins immediately after the header (byte offset 32). All `*_offset` fields are **bit offsets** relative to this starting point, regardless of the section's addressing mode. The reader converts bit offsets to byte/symbol positions as needed based on the declared mode. Using bit offsets universally ensures the header format is mode-independent.
319+
**`data_stream_start` (bytes 28-31):** For legacy files (format version 1.0), this field is 0 and the data stream begins at byte 32. For files with an extension directory, this field gives the byte offset where the data stream begins (i.e., `32 + size_of_extension_directory`). Decoders MUST treat a value of 0 as 32 for backward compatibility.
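
The two rules above (treat 0 as 32, and measure every `*_offset` in bits from `data_stream_start`) can be sketched as a pair of small helpers. This is a minimal illustration, not part of the spec: the function names are hypothetical, and rejecting non-zero values below 32 is an assumption the text does not state. The sketch also makes no claim about bit ordering within a byte.

```python
HEADER_SIZE = 32  # the frozen header is always 32 bytes


def resolve_data_stream_start(field_value: int) -> int:
    """Map the raw data_stream_start field (header bytes 28-31) to a byte offset."""
    if field_value == 0:
        # Legacy files store 0; decoders must treat that as 32.
        return HEADER_SIZE
    if field_value < HEADER_SIZE:
        # Assumption: a non-zero value that points inside the header is malformed.
        raise ValueError(f"data_stream_start {field_value} points inside the header")
    return field_value


def locate_bit(data_stream_start: int, bit_offset: int) -> tuple[int, int]:
    """Convert a bit offset (relative to the data stream start) into
    (absolute file byte, bit index within that byte)."""
    byte_delta, bit_in_byte = divmod(bit_offset, 8)
    return data_stream_start + byte_delta, bit_in_byte


# Usage: a file with a 22-byte extension directory and a section at bit offset 19.
start = resolve_data_stream_start(54)
print(locate_bit(start, 19))
```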
312320

313321
**Flags (bit field):**
314322

@@ -322,7 +330,134 @@ The data stream begins immediately after the header (byte offset 32). All `*_off
322330
| 6-7 | value_addr_mode: 00=bit, 01=byte, 10=symbol-fixed, 11=reserved |
323331
| 8-9 | symbol_encoding: 00=fixed-width, 01=huffman, 10=multi-width, 11=reserved |
324332
| 10 | shared_symbols: key trie and value trie share the same symbol table |
325-
| 11-15 | reserved |
333+
| 11 | has_extensions: extension directory present after header (see Section 3.2.1) |
334+
| 12-15 | reserved, must be 0 |
335+
336+
**Header freeze guarantee:** This 32-byte layout is the permanent contract. Decoders for any version of the format parse these 32 bytes identically. All future features are discovered via the extension directory, not by adding header fields.
337+
338+
### 3.2.1 Extension Directory (byte-aligned, self-describing)
339+
340+
The extension directory is present when flag bit 11 (`has_extensions`) is set. It begins at byte 32 (immediately after the header) and is byte-aligned throughout. It describes additional data stream sections beyond the three defined in the base header (key trie, value store, suffix table).
341+
342+
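**Design principle:** Unknown optional extension tags (below 0x8000) are safe to skip. A decoder that encounters an optional tag it does not recognize ignores it and continues. This allows format evolution without version bumps for backward-compatible additions, and enables vendor-specific extensions without polluting the core spec. The one exception is the required range (0x8000 - 0xFFFF): a decoder MUST reject a file containing a required tag it does not recognize.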
343+
344+
**Directory format:**
345+
346+
| Field | Size | Description |
347+
|-------|------|-------------|
348+
| num_extensions | uint16 | Number of extension entries (0-65535) |
349+
| entries[] | 10 bytes each | Array of `num_extensions` extension entries |
350+
351+
Each extension entry:
352+
353+
| Field | Size | Description |
354+
|-------|------|-------------|
355+
| tag | uint16 | Extension type identifier (see tag registry below) |
356+
| offset | uint32 | Bit offset from data stream start to this section |
357+
| size | uint32 | Size of this section in bits (0 = section is empty/placeholder) |
358+
359+
**Total directory size:** `2 + (num_extensions * 10)` bytes.
360+
361+
The `data_stream_start` header field (bytes 28-31) MUST equal `32 + 2 + (num_extensions * 10)`, i.e., the data stream begins immediately after the last extension entry.
362+
363+
**Byte order:** All multi-byte fields in the extension directory use the same byte order as the file header (little-endian).
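
As a rough illustration of the layout above, the directory can be read with Python's `struct` module: a little-endian `uint16` count at byte 32, followed by 10-byte `(tag, offset, size)` entries. The function name and the choice to raise `ValueError` on a mismatched `data_stream_start` are assumptions made for this sketch.

```python
import struct

HEADER_SIZE = 32
# tag (uint16), offset (uint32), size (uint32); "<" means little-endian, no padding.
ENTRY = struct.Struct("<HII")  # 2 + 4 + 4 = 10 bytes


def parse_extension_directory(buf: bytes, data_stream_start: int):
    """Return [(tag, bit_offset, bit_size), ...] from the directory at byte 32."""
    (num_extensions,) = struct.unpack_from("<H", buf, HEADER_SIZE)
    expected_start = HEADER_SIZE + 2 + num_extensions * ENTRY.size
    if data_stream_start != expected_start:
        # The spec requires data_stream_start == 32 + 2 + num_extensions * 10.
        raise ValueError(
            f"data_stream_start is {data_stream_start}, expected {expected_start}"
        )
    first_entry = HEADER_SIZE + 2
    return [
        ENTRY.unpack_from(buf, first_entry + i * ENTRY.size)
        for i in range(num_extensions)
    ]


# Usage: a zeroed 32-byte header followed by a directory with two entries.
demo = bytes(32) + struct.pack("<H", 2) + ENTRY.pack(0x0001, 0, 128) + ENTRY.pack(0x0007, 128, 64)
print(parse_extension_directory(demo, 54))
```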
364+
365+
**Tag Registry:**
366+
367+
Tags are partitioned into ranges:
368+
369+
| Range | Purpose |
370+
|-------|---------|
371+
| 0x0000 | Reserved (invalid) |
372+
| 0x0001 - 0x00FF | **Core tags** -- defined by this spec |
373+
| 0x0100 - 0x0FFF | **Reserved** -- future spec use |
374+
| 0x1000 - 0x7FFF | **Vendor tags** -- third-party extensions, safe to skip |
375+
| 0x8000 - 0xFFFF | **Required tags** -- decoder MUST understand or reject the file |
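
The range partition in the table above maps directly onto a chain of comparisons. The sketch below is illustrative only; the function name and the returned labels are hypothetical.

```python
def classify_tag(tag: int) -> str:
    """Return the registry range a 16-bit extension tag falls into."""
    if not 0 <= tag <= 0xFFFF:
        raise ValueError("tag must fit in a uint16")
    if tag == 0x0000:
        return "invalid"    # reserved, never a valid tag
    if tag <= 0x00FF:
        return "core"       # defined by the spec
    if tag <= 0x0FFF:
        return "reserved"   # future spec use
    if tag <= 0x7FFF:
        return "vendor"     # third-party, safe to skip
    return "required"       # decoder must understand or reject


# Usage: everything at or above 0x8000 is "required", everything else is skippable.
print(classify_tag(0x0001), classify_tag(0x1234), classify_tag(0x8001))
```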
376+
377+
**Core tags (0x0001 - 0x00FF):**
378+
379+
| Tag | Name | Description |
380+
|-----|------|-------------|
381+
| 0x0001 | `EXT_VALUE_TRIE` | Value trie: compressed trie for string/blob values (replaces linear value store for trie-eligible values). Section contains its own trie config + symbol table + trie data. |
382+
| 0x0002 | `EXT_VALUE_SUFFIX` | Value suffix table: suffix sharing for the value trie. Format matches Section 3.5. |
383+
| 0x0003 | `EXT_KEY_SUFFIX_V2` | Extended key suffix table (future: DAWG-style merged suffix subtrees). |
384+
| 0x0004 | `EXT_HUFFMAN_TABLE` | Standalone Huffman symbol table (when shared between key and value tries). |
385+
| 0x0005 | `EXT_FUZZY_INDEX` | Fuzzy search index: precomputed edit-distance structure for d<=2 queries. |
386+
| 0x0006 | `EXT_BLOOM_FILTER` | Bloom filter for fast negative lookups (key does not exist). |
387+
| 0x0007 | `EXT_METADATA` | Arbitrary key-value metadata (creation date, encoder version, etc.). Stored as a nested TXZ dict. |
388+
389+
**Required tag semantics (0x8000+):**
390+
391+
Tags in the required range signal that the section is essential for correct decoding. A decoder that encounters a required tag it does not recognize MUST reject the file with a clear error message rather than silently producing incorrect output. This is the mechanism for introducing breaking format changes without bumping the major version.
392+
393+
| Tag | Name | Description |
394+
|-----|------|-------------|
395+
| 0x8001 | `EXT_VALUE_TRIE_REQUIRED` | Like `EXT_VALUE_TRIE`, but the linear value store in the base header is absent -- decoder MUST use the value trie. |
396+
397+
**Decoder algorithm:**
398+
399+
```
400+
read 32-byte header
401+
if has_extensions flag is set:
402+
read num_extensions (uint16 at byte 32)
403+
for i in 0..num_extensions:
404+
read tag, offset, size (10 bytes each)
405+
if tag in known_tags:
406+
record (tag, offset, size) for later use
407+
elif tag >= 0x8000:
408+
REJECT: "unsupported required extension 0x{tag:04X}"
409+
else:
410+
SKIP: unknown optional tag, ignore
411+
seek to data_stream_start
412+
else:
413+
data stream starts at byte 32
414+
415+
parse trie config, key trie, value store as usual
416+
if EXT_VALUE_TRIE recorded:
417+
seek to its offset, decode value trie
418+
use value trie for string/blob lookups instead of linear value store
419+
```
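
The record/reject/skip branch of the algorithm above could look like the following in runnable form. This is a sketch under stated assumptions: `KNOWN_TAGS`, `UnsupportedExtensionError`, and `select_extensions` are hypothetical names, and the entries are assumed to be already parsed into `(tag, offset, size)` tuples. Keeping the first occurrence of a duplicated tag follows the duplicate-tag rule in this section.

```python
# Hypothetical: the tags this particular decoder understands.
KNOWN_TAGS = frozenset({0x0001, 0x0002})


class UnsupportedExtensionError(ValueError):
    """Raised when a required extension (tag >= 0x8000) is not understood."""


def select_extensions(entries, known_tags=KNOWN_TAGS):
    """Return {tag: (bit_offset, bit_size)} for known tags.

    Unknown optional tags are skipped; unknown required tags reject the file.
    """
    recorded = {}
    for tag, offset, size in entries:
        if tag in known_tags:
            # setdefault keeps the first occurrence if a tag is duplicated.
            recorded.setdefault(tag, (offset, size))
        elif tag >= 0x8000:
            raise UnsupportedExtensionError(
                f"unsupported required extension 0x{tag:04X}"
            )
        # else: unknown optional tag, ignore it
    return recorded


# Usage: one known tag, one unknown vendor tag, one duplicate of the known tag.
print(select_extensions([(0x0001, 0, 128), (0x1234, 128, 8), (0x0001, 999, 1)]))
```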
420+
421+
**Ordering:** Extension entries in the directory MAY appear in any order. The decoder searches for tags it needs. Entries SHOULD be ordered by offset for cache-friendly sequential reading, but this is not required.
422+
423+
**Duplicate tags:** Each tag MUST appear at most once in the directory. A decoder encountering duplicate tags SHOULD use the first occurrence and ignore subsequent ones.
424+
425+
### 3.2.2 Versioning and Compatibility Contract
426+
427+
The combination of `version_major`, `version_minor`, flags, and the extension directory tag ranges provides a layered compatibility model:
428+
429+
**Major version (breaking changes):**
430+
- A change in `version_major` means the header layout, trie encoding, or value encoding has changed incompatibly.
431+
- A decoder MUST reject files with a major version it does not support.
432+
- The 32-byte header layout is frozen, so major version bumps should be extremely rare.
433+
434+
**Minor version (backward-compatible additions):**
435+
- A change in `version_minor` means new optional features are available.
436+
- A decoder for version 1.0 CAN read a version 1.1 file by ignoring unknown optional extensions.
437+
- A decoder MUST still reject files with unknown required extensions (tag >= 0x8000), regardless of minor version.
438+
439+
**Extension tags (feature-level granularity):**
440+
- New features are expressed as extension tags, not version bumps.
441+
- Optional tags (0x0001-0x7FFF): skip if unknown. File is still usable with reduced functionality.
442+
- Required tags (0x8000-0xFFFF): reject if unknown. File cannot be correctly decoded without this feature.
443+
444+
**Compatibility matrix:**
445+
446+
| File has | Decoder knows | Result |
447+
|----------|---------------|--------|
448+
| No extensions | Any decoder | OK (base format) |
449+
| Optional extension | Old decoder | OK (extension ignored, base data used) |
450+
| Optional extension | New decoder | OK (extension data used) |
451+
| Required extension | Old decoder | REJECT (clear error) |
452+
| Required extension | New decoder | OK |
453+
| Unknown major version | Any decoder | REJECT |
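
The matrix above reduces to two checks: the major version, then any required tag the decoder does not know. The sketch below encodes that; the function name, the `(ok, reason)` return shape, and the assumption that a decoder supports exactly one major version are illustrative choices, not part of the spec.

```python
def can_decode(file_major, file_tags, decoder_major, known_tags):
    """Return (ok, reason) for a file/decoder pair.

    Assumes the decoder supports a single major version.
    """
    if file_major != decoder_major:
        return False, "unsupported major version"
    for tag in sorted(file_tags):
        if tag >= 0x8000 and tag not in known_tags:
            return False, f"unsupported required extension 0x{tag:04X}"
    # Unknown optional tags (< 0x8000) are ignored: the base data is still usable.
    return True, "ok"


# Usage: an old decoder (knows no extensions) against three kinds of file.
print(can_decode(1, set(), 1, set()))
print(can_decode(1, {0x0001}, 1, set()))
print(can_decode(1, {0x8001}, 1, set()))
```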
454+
455+
**When to use optional vs. required extensions:**
456+
457+
- **Optional (`EXT_VALUE_TRIE`, 0x0001):** The encoder writes BOTH a linear value store (at `value_store_offset`) AND a value trie (in the extension). Old decoders use the linear store. New decoders use the value trie for string/blob lookups instead. The file is larger (contains both), but backward-compatible.
458+
- **Required (`EXT_VALUE_TRIE_REQUIRED`, 0x8001):** The encoder writes ONLY the value trie. No linear fallback. Smaller file, but old decoders cannot read it. Use when file size matters more than backward compatibility.
459+
460+
This gives the encoder (and the user) explicit control over the compatibility/size tradeoff.
326461

327462
### 3.3 Trie Config
328463

@@ -495,11 +630,22 @@ function lookup(prefix_trie, suffix_trie, key):
495630

496631
Values are stored in a **separate structure** from the key trie. The key trie's END_VAL terminals reference values by index into this section. Two forms are supported:
497632

498-
- **Value store** (simple): a sequential array of typed values, referenced by index. Fast to build, no compression of values against each other.
499-
- **Value trie** (compressed): string values that share prefixes/suffixes are stored in their own compressed trie (same format as the key trie). The `txz-json` library uses this when it detects significant sharing among string values. Non-string values (ints, floats, bools, blobs) remain in a flat store section.
633+
- **Value store** (simple, base format): a sequential array of typed values, referenced by index. Fast to build, no compression of values against each other. Located at `value_store_offset` in the header. Always present in backward-compatible files.
634+
- **Value trie** (compressed, via extension directory): string/blob values that share prefixes/suffixes are stored in their own compressed trie (same format as the key trie). Signaled by the `EXT_VALUE_TRIE` (0x0001, optional) or `EXT_VALUE_TRIE_REQUIRED` (0x8001, required) extension tag (see Section 3.2.1). Non-string values (null, bool, int, uint, float32, float64) remain inline -- they are too small to benefit from trie compression.
635+
636+
**Adaptive strategy:** The encoder analyzes values at build time and decides whether a value trie is worthwhile. The heuristic considers:
637+
638+
- Number of string/blob values (few strings = no benefit)
639+
- Degree of prefix/suffix sharing among string values (unique strings = no benefit)
640+
- Fixed overhead of a value trie config + symbol table vs. savings from sharing
641+
- Caller preference (API option to force linear-only or force value trie)
642+
643+
When the value trie is used with the optional tag (0x0001), the encoder writes BOTH the linear value store and the value trie. Old decoders use the linear store; new decoders use the value trie. When file size matters more than backward compatibility, the encoder uses the required tag (0x8001) and omits the linear store.
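
One possible shape for the encoder's decision is sketched below. The spec lists the inputs to the heuristic but not a formula, so everything here is an assumption: the `Strategy` enum, the `choose_strategy` function, and the simple "trie must be strictly smaller than the linear store" rule are hypothetical. `trie_bits` is assumed to already include the fixed overhead of the value trie config and symbol table.

```python
from enum import Enum


class Strategy(Enum):
    LINEAR_ONLY = "linear value store only (base format)"
    LINEAR_PLUS_TRIE = "linear store + EXT_VALUE_TRIE (0x0001)"
    TRIE_ONLY = "EXT_VALUE_TRIE_REQUIRED (0x8001), no linear store"


def choose_strategy(linear_bits, trie_bits, need_old_decoders, force=None):
    """Pick a value encoding strategy from estimated section sizes in bits."""
    if force is not None:
        return force                      # caller preference overrides the heuristic
    if trie_bits >= linear_bits:
        return Strategy.LINEAR_ONLY       # no sharing to exploit, trie is not worth it
    if need_old_decoders:
        return Strategy.LINEAR_PLUS_TRIE  # backward-compatible, but the file is larger
    return Strategy.TRIE_ONLY             # smallest file, new decoders only


# Usage: heavy sharing among string values, no need to support old decoders.
print(choose_strategy(linear_bits=80_000, trie_bits=30_000, need_old_decoders=False))
```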
500644

501645
When the header flag `shared_symbols` is set, the value trie shares the symbol table with the key trie. Otherwise, the value trie has its own symbol config (allowing different character sets or bit widths for keys vs. values).
502646

647+
The value trie may also have its own suffix table, signaled by the `EXT_VALUE_SUFFIX` (0x0002) extension tag.
648+
503649
Packing follows the `value_addr_mode` declared in the header flags.
504650

505651
Each value is encoded as:
@@ -736,6 +882,7 @@ This is version 1.0 -- a clean break. No backward compatibility with the origina
736882
| Integrity | SHA-256 of whole file | Selectable: CRC-32, SHA-256, or xxHash64 | **Changed** |
737883
| Versioning | None | Major/minor version in header | **New** |
738884
| Key/value separation | Keys and values interleaved in one trie | Two separate tries (key trie + value trie), optionally shared symbol table | **Changed** |
885+
| Extensibility | None | Extension directory with tagged sections; 32-byte header is frozen forever | **New** |
739886
| JSON support | Not applicable | Separate `txz-json` library using TXZ under the hood | **New** |
740887
| Implementation | C with C++ wrapper | C with C++ wrapper, 100% test coverage | **Kept** |
741888
| Streaming | Mentioned in notes | Deferred | -- |
@@ -762,7 +909,7 @@ These ideas from the original notes are acknowledged but deferred from this spec
762909

763910
3. **Maximum nesting depth**: Should there be a spec-defined limit on nested dict depth for ROM/embedded safety? (e.g. 32 levels)
764911

765-
4. **Value trie vs. value store**: When should string values be compressed into a value trie (expensive to build, better compression) vs. stored as a flat value store (simple, fast)? Should this be encoder heuristic or caller-specified?
912+
4. ~~**Value trie vs. value store**~~: **RESOLVED** (Section 3.2.1). The encoder decides adaptively. It can write a linear value store only (base format, maximum compatibility), a linear store + value trie extension (backward-compatible, larger), or a value trie only via a required extension (smallest, requires new decoder). The choice is encoder heuristic based on estimated compression savings, with an API option for the caller to force a specific strategy.
766913

767914
5. **Update/append semantics**: The current spec is build-once, read-many. Mutation is decode + modify + re-encode (Section 7). Is this sufficient for all 1.0 use cases?
768915
