Date: May 2026 Format Version: 5
This document describes the on-disk binary format of a ZXC compressed file. It formalizes the current reference implementation of format version 5.
- Byte order: all multi-byte integers are little-endian.
- Unit: offsets are in bytes, zero-based from the start of each structure.
- Checksum mode: enabled globally by a flag in the file header.
- Block model: a file is a sequence of blocks terminated by an EOF block, then a footer.
+----------------------+ 16 bytes
| File Header |
+----------------------+
| Block #0 |
| - 8B Block Header |
| - Block Payload |
| - Optional 4B CRC32 |
+----------------------+
| Block #1 |
| ... |
+----------------------+
| EOF Block | 8 bytes (type=255, comp_size=0)
+----------------------+
| SEK Block (Optional) | table of contents for random access
+----------------------+
| File Footer | 12 bytes
+----------------------+
Offset Size Field
0x00 4 Magic Word
0x04 1 Format Version
0x05 1 Chunk Size Code
0x06 1 Flags
0x07 7 Reserved (must be 0)
0x0E 2 Header CRC16
- Magic Word (`u32`): `0x9CB02EF5`.
- Format Version (`u8`): currently `5`.
- Chunk Size Code (`u8`):
  - If the value is in the range `[12, 21]`, it is an exponent: `block_size = 2^code`. `12` = 4 KB, `13` = 8 KB, ..., `19` = 512 KB (default), ..., `21` = 2 MB.
  - The legacy value `64` (from older encoders) is accepted and maps to 256 KB.
  - All other values are rejected (`ZXC_ERROR_BAD_BLOCK_SIZE`).
  - Valid block sizes are powers of 2 in the range 4 KB – 2 MB.
- Flags (`u8`):
  - Bit 7 (`0x80`): `HAS_CHECKSUM`.
  - Bits 0..3: checksum algorithm id (`0` = RapidHash-based folding).
  - Bits 4..6: reserved.
- Reserved: 7 bytes set to zero.
- Header CRC16 (`u16`): computed with `zxc_hash16` on the 16-byte header where bytes `0x0E..0x0F` are zeroed.
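The field layout above can be read with a fixed `struct` format. The following is a minimal Python sketch, not the reference implementation: the function names `block_size_from_code` and `parse_file_header` are hypothetical, and the CRC16 is only extracted, not verified, because `zxc_hash16` is not defined in this document.

```python
import struct

ZXC_MAGIC = 0x9CB02EF5  # magic word from the header table


def block_size_from_code(code):
    """Map the Chunk Size Code byte to a block size in bytes."""
    if 12 <= code <= 21:
        return 1 << code            # exponent encoding: block_size = 2^code
    if code == 64:
        return 256 * 1024           # legacy value from older encoders
    raise ValueError("bad chunk size code: %d" % code)


def parse_file_header(buf):
    """Split a 16-byte file header into its fields (hypothetical helper).

    The stored CRC16 is returned as-is; checking it would require
    zxc_hash16, which this sketch does not implement.
    """
    if len(buf) < 16:
        raise ValueError("truncated file header")
    # u32 magic, u8 version, u8 chunk code, u8 flags, 7 reserved bytes, u16 CRC
    magic, version, code, flags, reserved, crc16 = struct.unpack_from("<IBBB7sH", buf, 0)
    if magic != ZXC_MAGIC:
        raise ValueError("bad magic")
    return {
        "version": version,
        "block_size": block_size_from_code(code),
        "has_checksum": bool(flags & 0x80),   # bit 7
        "checksum_algo": flags & 0x0F,        # bits 0..3
        "reserved": reserved,                 # not rejected when non-zero
        "crc16": crc16,
    }
```

A caller would still compare `version` against the versions it supports before reading any block.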
Each block starts with a fixed 8-byte block header.
Offset Size Field
0x00 1 Block Type
0x01 1 Block Flags
0x02 1 Reserved
0x03 4 Compressed Payload Size (comp_size)
0x07 1 Header CRC8
- Block Type:
  - `0` = RAW
  - `1` = GLO
  - `2` = NUM
  - `3` = GHI
  - `254` = SEK
  - `255` = EOF
- Block Flags: currently not used by implementation (written as `0`).
- Reserved: must be 0.
- comp_size: payload size in bytes (does not include the optional trailing 4-byte block checksum).
- Header CRC8: `zxc_hash8` over the 8-byte header with byte `0x07` forced to zero before hashing.
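As an illustration of the 8-byte layout, the sketch below unpacks one block header in Python. The name `parse_block_header` is hypothetical, and the CRC8 is returned without verification because `zxc_hash8` is not specified here.

```python
import struct

# Block type ids taken from the list above.
BLOCK_TYPES = {0: "RAW", 1: "GLO", 2: "NUM", 3: "GHI", 254: "SEK", 255: "EOF"}


def parse_block_header(buf, offset=0):
    """Unpack the fixed 8-byte block header found at `offset` (hypothetical helper)."""
    if len(buf) - offset < 8:
        raise ValueError("truncated block header")
    # u8 type, u8 flags, u8 reserved, u32 comp_size (little-endian), u8 CRC8
    btype, flags, reserved, comp_size, crc8 = struct.unpack_from("<BBBIB", buf, offset)
    return {
        "type": btype,
        "type_name": BLOCK_TYPES.get(btype, "UNKNOWN"),
        "flags": flags,
        "reserved": reserved,
        "comp_size": comp_size,
        "crc8": crc8,   # stored value only; not checked in this sketch
    }
```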
[8B Block Header] + [comp_size bytes payload] + [optional 4B checksum]
When checksums are enabled at file level, each non-EOF block carries one trailing 4-byte checksum of its compressed payload.
Payload is uncompressed data.
Payload = raw bytes
raw_size = comp_size
No internal sub-header.
Used for numeric data (32-bit integer stream), delta/zigzag + bitpacking.
+--------------------------+
| NUM Header (16 bytes) |
+--------------------------+
| Frame #0 header (16B) |
| Frame #0 packed bits |
+--------------------------+
| Frame #1 header (16B) |
| Frame #1 packed bits |
+--------------------------+
| ... |
+--------------------------+
Offset Size Field
0x00 8 n_values (u64)
0x08 2 frame_size (u16, currently 128)
0x0A 6 reserved
Offset Size Field
0x00 2 nvals in frame (u16)
0x02 2 bits per value (u16)
0x04 8 base/running seed (u64)
0x0C 4 packed_size in bytes (u32)
0x10 ... packed delta bitstream
Notes:
- Values are reconstructed by bit-unpacking, zigzag decode, then prefix accumulation.
- `packed_size` bytes immediately follow each 16-byte frame header.
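The sketch below shows how the two NUM headers could be unpacked and how zigzag decoding and prefix accumulation fit together. It is an illustration under stated assumptions, not the reference decoder: the helper names are hypothetical, the bit-unpacking step is left out because the packed bit order is not defined in this section, and treating the frame's base as the running value *before* the first delta (with 32-bit wrap-around) is an assumption.

```python
import struct


def parse_num_header(buf, offset=0):
    """Return (n_values, frame_size) from the 16-byte NUM header."""
    n_values, frame_size = struct.unpack_from("<QH", buf, offset)
    return n_values, frame_size   # the 6 reserved bytes are not interpreted


def parse_frame_header(buf, offset=0):
    """Return (nvals, bits_per_value, base, packed_size) from a 16-byte frame header."""
    return struct.unpack_from("<HHQI", buf, offset)


def zigzag_decode(u):
    """Map an unsigned zigzag code back to a signed delta: 0,1,2,3 -> 0,-1,1,-2."""
    return (u >> 1) ^ -(u & 1)


def accumulate(base, zigzag_deltas):
    """Prefix-accumulate decoded deltas starting from `base`.

    Assumption: `base` is the running value before the first delta and
    values wrap at 32 bits; the document does not spell this out.
    """
    out = []
    running = base
    for code in zigzag_deltas:
        running = (running + zigzag_decode(code)) & 0xFFFFFFFF
        out.append(running)
    return out
```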
General LZ-style format with separated streams.
+-------------------------------+
| GLO Header (16 bytes) |
+-------------------------------+
| 4 Section Descriptors (32B) |
+-------------------------------+
| Literals stream |
+-------------------------------+
| Tokens stream |
+-------------------------------+
| Offsets stream |
+-------------------------------+
| Extras stream |
+-------------------------------+
Offset Size Field
0x00 4 n_sequences (u32)
0x04 4 n_literals (u32)
0x08 1 enc_lit (0=RAW, 1=RLE, 2=HUFFMAN)
0x09 1 enc_litlen (reserved)
0x0A 1 enc_mlen (reserved)
0x0B 1 enc_off (0=16-bit offsets, 1=8-bit offsets)
0x0C 4 reserved
Descriptor format (packed u64):
- low 32 bits: compressed size
- high 32 bits: raw size
Section order:
- Literals
- Tokens
- Offsets
- Extras
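To make the packed descriptor concrete, here is a small Python sketch that splits a descriptor into its two 32-bit halves and reads the four GLO descriptors in section order. The helper names are hypothetical; the 16-byte header offset follows the GLO layout diagram.

```python
import struct

GLO_SECTIONS = ("literals", "tokens", "offsets", "extras")  # section order


def unpack_descriptor(desc):
    """Split a packed u64 descriptor into (compressed_size, raw_size)."""
    comp_size = desc & 0xFFFFFFFF   # low 32 bits
    raw_size = desc >> 32           # high 32 bits
    return comp_size, raw_size


def read_glo_descriptors(payload):
    """Read the 4 descriptors that follow the 16-byte GLO header."""
    words = struct.unpack_from("<4Q", payload, 16)
    return {name: unpack_descriptor(w) for name, w in zip(GLO_SECTIONS, words)}
```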
- Literals stream:
  - raw literal bytes if `enc_lit=0`, or
  - RLE tokenized if `enc_lit=1`, or
  - Huffman-coded if `enc_lit=2` (see § 5.3.1 Huffman literal section).
- Tokens stream:
  - one byte per sequence: `(LL << 4) | ML`.
  - `LL` and `ML` are 4-bit fields.
- Offsets stream:
  - `n_sequences × 1` byte if `enc_off=1`, else `n_sequences × 2` bytes LE.
  - Values are biased: stored value = `actual_offset - 1`. Decoder adds `+ 1`.
  - This makes `offset == 0` impossible by construction (minimum decoded offset = 1).
- Extras stream:
  - Prefix-varint overflow values for token saturations:
    - if `LL == 15`, read varint and add to LL
    - if `ML == 15`, read varint and add to ML
  - actual match length is `ML + 5` (minimum match = 5).
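The way the four streams combine can be sketched as follows. This is a simplified Python illustration, not the reference decoder. It assumes the literals are already decoded to raw bytes, the offsets are already widened to integers, and the extras are already parsed into a list of varint values. It also assumes that the LL overflow is consumed before the ML overflow for the same sequence, and that literals left over after the last sequence are appended at the end. The name `decode_glo_sequences` is hypothetical.

```python
def decode_glo_sequences(literals, tokens, stored_offsets, extras, out_limit):
    """Replay GLO sequences into an output buffer with bounds checks."""
    out = bytearray()
    lit_pos = 0
    extra_values = iter(extras)
    for token, stored in zip(tokens, stored_offsets):
        ll = token >> 4          # literal length field (high nibble)
        ml = token & 0x0F        # match length field (low nibble)
        if ll == 15:
            ll += next(extra_values)    # saturated LL: add overflow varint
        if ml == 15:
            ml += next(extra_values)    # saturated ML: add overflow varint
        match_len = ml + 5              # minimum match = 5
        offset = stored + 1             # undo the -1 bias

        if lit_pos + ll > len(literals):
            raise ValueError("literal run exceeds literals stream")
        if len(out) + ll + match_len > out_limit:
            raise ValueError("decompressed output exceeds block size")
        out += literals[lit_pos:lit_pos + ll]
        lit_pos += ll

        if offset > len(out):
            raise ValueError("match offset out of bounds")
        start = len(out) - offset
        for i in range(match_len):      # byte-wise copy handles overlapping matches
            out.append(out[start + i])

    out += literals[lit_pos:]           # assumption: trailing literals
    if len(out) > out_limit:
        raise ValueError("decompressed output exceeds block size")
    return bytes(out)
```

The byte-by-byte copy is deliberate: a match whose offset is smaller than its length overlaps its own output, which a single slice copy would not reproduce.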
Selected by the encoder only at compression level ≥ 6, only when at least
ZXC_HUF_MIN_LITERALS = 1024 literals are present, and only when the Huffman
payload is at least ~3 % smaller than the corresponding RAW or RLE encoding of
the same literals. Any block where the heuristic does not pick HUFFMAN keeps
enc_lit ∈ {0, 1}.
The Huffman literal section payload is structured as follows:
Offset Size Field
0x00 128 Code-length header
256 × 4-bit code lengths, packed two-per-byte (low nibble first).
code_len[i] ∈ [0, 8] (0 means symbol absent).
0x80 6 Sub-stream sizes
s1, s2, s3 as little-endian u16 (size of streams 0, 1, 2 in bytes).
The size of stream 3 is implied: s4 = total_payload_size - 134 - s1 - s2 - s3.
0x86 var Stream 0 bit-stream (s1 bytes, LSB-first)
var Stream 1 bit-stream (s2 bytes)
var Stream 2 bit-stream (s3 bytes)
var Stream 3 bit-stream (s4 bytes)
Codes are canonical, length-limited at L = 8, emitted LSB-first.
The n_literals value from the GLO header is split into 4 contiguous regions
of size Q = ceil(n_literals / 4) (the last region may be shorter), each
encoded into its own bit-stream so that 4 decoders can run in parallel.
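The size arithmetic above can be checked with a short Python sketch. Both helper names are hypothetical; the code only computes byte spans and output regions from the formulas given in this section and does not decode any bit-stream.

```python
import struct

HUF_HEADER_SIZE = 128 + 6   # code-length header + three u16 sub-stream sizes


def substream_spans(payload):
    """Return the (start, end) byte span of each of the 4 bit-streams."""
    if len(payload) < HUF_HEADER_SIZE:
        raise ValueError("truncated Huffman section")
    s1, s2, s3 = struct.unpack_from("<3H", payload, 0x80)
    s4 = len(payload) - HUF_HEADER_SIZE - s1 - s2 - s3   # implied size of stream 3
    if s4 < 0:
        raise ValueError("sub-stream sizes exceed payload")
    spans, pos = [], HUF_HEADER_SIZE
    for size in (s1, s2, s3, s4):
        spans.append((pos, pos + size))
        pos += size
    return spans


def literal_regions(n_literals):
    """Split n_literals into 4 contiguous output regions of size Q = ceil(n / 4)."""
    q = -(-n_literals // 4)   # ceiling division
    regions = []
    for i in range(4):
        start = min(i * q, n_literals)
        end = min(start + q, n_literals)   # the last region may be shorter
        regions.append((start, end))
    return regions
```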
The decoder reconstructs the canonical code table from the 128-byte length header, validates the Kraft equality, and decodes each sub-stream into its output region. See WHITEPAPER §5.8 for the multi-symbol 2048-entry lookup table strategy used on the decode hot path.
Decoder validation requirements:
- Every code length must satisfy `code_len[i] ≤ 8`.
- At least one symbol must be present (`code_len[i] != 0` for some `i`).
- The Kraft sum `Σ 2^(8 − code_len[i])` over present symbols must equal `2^8`, except for the single-present-symbol degenerate case where exactly one symbol has `code_len = 1` and the Kraft sum is `2^7`.
- A failure on any of the above results in `ZXC_ERROR_CORRUPT_DATA`.
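These checks translate directly into code. The sketch below is a Python illustration with hypothetical helper names; it raises `ValueError` where the reference implementation would report `ZXC_ERROR_CORRUPT_DATA`.

```python
def unpack_code_lengths(header):
    """Expand the 128-byte header into 256 code lengths (low nibble first)."""
    if len(header) != 128:
        raise ValueError("code-length header must be 128 bytes")
    lengths = []
    for byte in header:
        lengths.append(byte & 0x0F)   # symbol 2k
        lengths.append(byte >> 4)     # symbol 2k + 1
    return lengths


def validate_code_lengths(lengths):
    """Apply the three validation rules; raise ValueError on corrupt input."""
    if any(l > 8 for l in lengths):
        raise ValueError("code length exceeds 8")
    present = [l for l in lengths if l != 0]
    if not present:
        raise ValueError("no symbol present")
    kraft = sum(1 << (8 - l) for l in present)
    single_symbol = len(present) == 1 and present[0] == 1   # degenerate case, sum = 2^7
    if kraft != 256 and not single_symbol:
        raise ValueError("Kraft sum mismatch")
    return kraft
```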
High-throughput LZ format with packed 32-bit sequences.
+-------------------------------+
| GHI Header (16 bytes) |
+-------------------------------+
| 3 Section Descriptors (24B) |
+-------------------------------+
| Literals stream |
+-------------------------------+
| Sequences stream (N * 4B) |
+-------------------------------+
| Extras stream |
+-------------------------------+
Same binary layout as GLO header: `n_sequences`, `n_literals`, `enc_lit`, `enc_litlen`, `enc_mlen`, `enc_off`, reserved.
In practice for GHI:
- `enc_lit = 0` (raw literals)
- `enc_off` is metadata (sequence words always store 16-bit offsets)
Section order:
- Literals
- Sequences
- Extras
Each descriptor uses the same packed size encoding as GLO (u64: comp32|raw32).
Bits 31..24 : LL (literal length, 8 bits)
Bits 23..16 : ML (match length minus 5, 8 bits)
Bits 15..0 : Offset - 1 (16 bits, biased; decode: stored + 1)
Memory order (little-endian word):
byte0 = offset low
byte1 = offset high
byte2 = ML
byte3 = LL
Overflow rules:
- if `LL == 255`, read varint from Extras and add it to LL.
- if `ML == 255`, read varint, then add minimum match (`+5`).
- otherwise decoded match length is `ML + 5`.
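For illustration, one packed sequence word can be split as shown below. This Python sketch uses hypothetical helper names and returns the raw `LL` and `ML` fields, leaving the overflow handling described above to the caller.

```python
import struct


def unpack_ghi_sequence(word):
    """Split a 32-bit GHI sequence word into (LL, ML, decoded_offset)."""
    ll = (word >> 24) & 0xFF        # bits 31..24
    ml = (word >> 16) & 0xFF        # bits 23..16, match length minus 5
    offset = (word & 0xFFFF) + 1    # bits 15..0, stored with a -1 bias
    return ll, ml, offset


def read_ghi_sequences(stream, n_sequences):
    """Read n_sequences little-endian words from the sequences stream."""
    if len(stream) < 4 * n_sequences:
        raise ValueError("sequences stream truncated")
    words = struct.unpack_from("<%dI" % n_sequences, stream, 0)
    return [unpack_ghi_sequence(w) for w in words]
```

Because each word is little-endian, the four bytes in memory appear as offset low, offset high, ML, LL, matching the memory-order listing.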
EOF marks end of block stream.
Constraints:
- block header is present (8 bytes)
- `comp_size` must be 0
- no payload
- no per-block trailing checksum
The optional SEK block, when present, immediately follows the EOF block header; the 12-byte file footer comes last.
The Seek Table block is an optional block appended between the EOF block and the File Footer. It provides O(1) random-access capabilities by recording the compressed size of every data block in the archive; as the seekable worked example shows, each entry counts the whole on-disk block (block header, payload, and optional checksum). Decompressed sizes and block indices are derived from the file header's `block_size` (all blocks decompress to `block_size` bytes except the last, which may be smaller).
Layout of a SEK Block:
Offset Size Field
0x00 8 Block Header (type=254, comp_size=N*4)
0x08 4 Block 0 Compressed Size (u32 LE)
0x0C 4 Block 1 Compressed Size (u32 LE)
... ... ...
8 + (N-1)*4 4 Block N-1 Compressed Size (u32 LE)
Backward Detection Strategy:
- Read the File Header (first 16 bytes) -> extract `block_size`.
- Read the File Footer (last 12 bytes) -> extract `total_decompressed_size` (the footer's `original_source_size` field).
- Derive `num_blocks = ceil(total_decompressed_size / block_size)`.
- Calculate `seek_block_size = 8 + (N × 4)`, where `N = num_blocks`.
- Seek backward by `seek_block_size` bytes from the start of the footer to read the Block Header.
- Validate `block_type == 254 (SEK)` and `comp_size == N × 4`.
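The backward detection steps can be sketched in Python as follows. The function name `locate_seek_table` is hypothetical, and the sketch checks only the block type and `comp_size`. A production decoder would also verify the header CRC8 (`zxc_hash8`, not defined in this document) to rule out a chance match in a non-seekable file.

```python
import struct


def locate_seek_table(data):
    """Return the SEK entries of an in-memory archive, or None if absent."""
    if len(data) < 16 + 8 + 12:          # file header + EOF block + footer
        return None
    code = data[5]                       # chunk size code in the file header
    if 12 <= code <= 21:
        block_size = 1 << code
    elif code == 64:
        block_size = 256 * 1024
    else:
        raise ValueError("bad chunk size code")

    footer_start = len(data) - 12
    (total_size,) = struct.unpack_from("<Q", data, footer_start)
    num_blocks = -(-total_size // block_size)        # ceiling division
    seek_block_size = 8 + num_blocks * 4
    sek_start = footer_start - seek_block_size
    if sek_start < 16:                   # would overlap the file header
        return None

    block_type = data[sek_start]
    (comp_size,) = struct.unpack_from("<I", data, sek_start + 3)
    if block_type != 254 or comp_size != num_blocks * 4:
        return None
    return list(struct.unpack_from("<%dI" % num_blocks, data, sek_start + 8))
```

Summing the returned entries up to index `k`, starting from offset 16, gives the file offset of block `k`.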
ZXC extras use a prefix-length varint.
The length is encoded in unary form in the high bits of the first byte: the
number of leading 1 bits, followed by a terminating 0, indicates how
many additional payload bytes follow. The scheme generalizes to N bytes
(11110xxx = 5, 111110xx = 6, ...), but the current ZXC spec caps the
encoding at 3 bytes because no legitimate value exceeds 21 bits (see below).
Encodings used:
- `0xxxxxxx` -> 1 byte total (7 bits payload, value < 128)
- `10xxxxxx` -> 2 bytes total (14 bits, value < 16384)
- `110xxxxx` -> 3 bytes total (21 bits, value < 2 MiB)
Payload bits from the following bytes are concatenated little-endian style (low bits first). Used by GLO/GHI to carry LL/ML overflows beyond token/sequence inline limits.
Value bound: a varint encodes (LL - MASK) or (ML - MASK).
Since LL/ML are bounded by ZXC_BLOCK_SIZE_MAX = 2 MiB (2^21), every
legitimate varint value is strictly less than 2^21 and therefore fits in
at most 3 bytes.
Any prefix indicating a length >= 4 bytes (first byte >= 0xE0) is out of
spec for this format version: encoders must never emit such a varint, and
conforming decoders reject it as corrupt input. This caps the varint
surface to the format-defined block size limit and neutralizes
integer-overflow attacks in downstream bounds arithmetic. A future version
of the format that raises ZXC_BLOCK_SIZE_MAX would also extend the
accepted prefix lengths.
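A possible encoder/decoder pair for this varint is sketched below in Python. The exact bit placement is an assumption drawn from the "low bits first" wording: the payload bits of the first byte are taken as the lowest bits of the value, and each following byte supplies the next 8 higher bits. The function names are hypothetical.

```python
def encode_varint(value):
    """Encode 0 <= value < 2^21 as a 1-, 2- or 3-byte prefix varint."""
    if value < 0:
        raise ValueError("negative value")
    if value < (1 << 7):
        return bytes([value])                                   # 0xxxxxxx
    if value < (1 << 14):
        return bytes([0x80 | (value & 0x3F), value >> 6])       # 10xxxxxx + 1 byte
    if value < (1 << 21):
        return bytes([0xC0 | (value & 0x1F),                    # 110xxxxx + 2 bytes
                      (value >> 5) & 0xFF,
                      value >> 13])
    raise ValueError("value does not fit in 21 bits")


def decode_varint(buf, pos=0):
    """Decode one varint at buf[pos]; return (value, next_pos)."""
    if pos >= len(buf):
        raise ValueError("truncated varint")
    first = buf[pos]
    if first < 0x80:
        extra, value = 0, first
    elif first < 0xC0:
        extra, value = 1, first & 0x3F
    elif first < 0xE0:
        extra, value = 2, first & 0x1F
    else:
        # Prefix announces 4 or more bytes: out of spec for this version.
        raise ValueError("varint exceeds maximum length")
    if pos + extra >= len(buf):
        raise ValueError("truncated varint")
    shift = 7 - extra            # payload bits held by the first byte: 7, 6 or 5
    for i in range(extra):
        value |= buf[pos + 1 + i] << shift
        shift += 8
    return value, pos + 1 + extra
```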
- File header: 16-bit (`zxc_hash16`).
- Block header: 8-bit (`zxc_hash8`).
These protect metadata/navigation fields.
When the file header has `HAS_CHECKSUM=1`:
- each data block appends a 4-byte checksum after its payload.
- the checksum input is the compressed payload bytes only (not the block header).
- the algorithm id is currently `0` (RapidHash folded to 32-bit).
A rolling global hash is maintained from per-block checksums in stream order:
global = 0
for each data block checksum b:
global = ((global << 1) | (global >> 31)) XOR b
This value is stored in the file footer (or zeroed when checksum mode is disabled).
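The rolling update is a 32-bit rotate-left by one followed by an XOR. In a language with unbounded integers the rotation needs an explicit mask, as in this Python sketch (the name `rolling_global_hash` is hypothetical):

```python
def rolling_global_hash(block_checksums):
    """Fold per-block 32-bit checksums, in stream order, into the footer hash."""
    g = 0
    for b in block_checksums:
        rotated = ((g << 1) | (g >> 31)) & 0xFFFFFFFF   # rotate left by 1, keep 32 bits
        g = rotated ^ (b & 0xFFFFFFFF)
    return g
```

With a single data block the initial zero rotates to zero, so the result equals that block's checksum.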
The footer is mandatory and always occupies the last 12 bytes of the file: it follows the EOF block header, or the SEK block when one is present.
Offset Size Field
0x00 8 original_source_size (u64)
0x08 4 global_hash (u32)
- original_source_size: full uncompressed size of the file.
- global_hash:
  - valid when checksum mode is active;
  - set to zero when checksum mode is disabled.
- Validate file header magic/version/CRC16.
- Parse blocks sequentially:
  - validate block header CRC8,
  - check block bounds using `comp_size`,
  - if enabled, verify trailing block checksum.
- Decode payload according to block type.
- On EOF:
  - require `comp_size == 0`,
  - skip the optional SEK block if present, then read the footer,
  - compare footer `original_source_size` with produced output size,
  - if enabled, compare footer `global_hash` with recomputed rolling hash.
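The structural part of this procedure can be sketched as a block walker that only tracks offsets. The Python code below is an illustration under stated assumptions: `walk_blocks` is a hypothetical name, the input is a whole archive in memory, payloads are not decoded, and no CRC16, CRC8, or block checksum is verified because the hash functions are not defined in this document. The footer is taken from the last 12 bytes, so a SEK block between EOF and footer is simply skipped.

```python
import struct


def walk_blocks(data):
    """Return ([(type, offset, comp_size), ...], original_size, global_hash)."""
    if len(data) < 16 + 8 + 12:
        raise ValueError("file too short")
    if struct.unpack_from("<I", data, 0)[0] != 0x9CB02EF5:
        raise ValueError("bad magic")
    has_checksum = bool(data[6] & 0x80)      # HAS_CHECKSUM flag in the file header
    trailer = 4 if has_checksum else 0

    blocks = []
    pos = 16                                 # first block follows the file header
    while True:
        if pos + 8 > len(data):
            raise ValueError("truncated block header")
        btype, _flags, _reserved, comp_size, _crc8 = struct.unpack_from("<BBBIB", data, pos)
        blocks.append((btype, pos, comp_size))
        if btype == 255:                     # EOF: header only, no payload, no checksum
            if comp_size != 0:
                raise ValueError("EOF block with non-zero comp_size")
            break
        end = pos + 8 + comp_size + trailer
        if end > len(data):
            raise ValueError("block payload truncated")
        pos = end

    footer_start = len(data) - 12            # footer is always the last 12 bytes
    if footer_start < pos + 8:
        raise ValueError("truncated footer")
    original_size, global_hash = struct.unpack_from("<QI", data, footer_start)
    return blocks, original_size, global_hash
```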
The format version is a single byte at offset 0x04 of the file header.
A conforming decoder MUST reject any file whose version it does not support.
| Change class | Version action | Example |
|---|---|---|
| New block type added | No bump (forward-compatible) | Adding a hypothetical GLR block type |
| New flag bit defined | No bump (forward-compatible) | Using a reserved flag bit |
| Existing block encoding changed | Major bump | Changing GLO token layout |
| Header/footer layout changed | Major bump | Resizing the file header |
| Checksum algorithm changed | Major bump | Replacing RapidHash with Komihash |
- Backward compatibility: a decoder supporting version N MUST decode all files produced by encoders of version N. It MAY also accept earlier versions.
- Forward compatibility: a decoder encountering an unknown block type (not RAW, GLO, NUM, GHI, SEK, or EOF) SHOULD skip it using `comp_size` to advance past its payload (and optional checksum), rather than rejecting the file outright. This allows older decoders to partially process files from newer encoders that introduce additive block types.
- Reserved fields: all reserved bytes and flag bits MUST be written as zero by encoders. Decoders MUST ignore reserved fields (not reject non-zero values), unless a future version assigns them meaning.
A minimal conforming decoder for version 5 MUST support:
- File header parsing and CRC16 validation.
- RAW blocks (type 0) - passthrough copy.
- GLO blocks (type 1) - full LZ decode with extras varint.
- GHI blocks (type 3) - full LZ decode with extras varint.
- EOF block (type 255) - stream termination.
- File footer validation (source size check).
Support for NUM (type 2) and checksum verification is RECOMMENDED but not strictly required for a minimal implementation.
Decoders MUST detect and handle the following error conditions. The recommended behavior for each class is specified below.
| Error | Detection point | Required behavior |
|---|---|---|
| Bad magic | File header, offset 0x00 | Reject immediately. Not a ZXC file. |
| Unsupported version | File header, offset 0x04 | Reject immediately. Version not supported. |
| Header CRC16 mismatch | File header, offset 0x0E | Reject. Header is corrupt or truncated. |
| Invalid chunk size code | File header, offset 0x05 | Reject. Code outside valid range [12..21] and not legacy 64. |
| Block header CRC8 mismatch | Block header, offset 0x07 | Reject block. Stream is corrupt. |
| Unknown block type | Block header, offset 0x00 | Skip block using comp_size (see §10.3), or reject. |
| Block payload truncated | During `fread` of `comp_size` bytes | Reject. Unexpected end of stream. |
| Block checksum mismatch | Trailing 4-byte checksum | Reject block. Payload is corrupt. |
| EOF block with non-zero comp_size | EOF block header | Reject. Malformed EOF marker. |
| Footer source size mismatch | File footer, offset 0x00 | Reject. Output size does not match declared original size. |
| Footer global hash mismatch | File footer, offset 0x08 | Reject (if checksum mode active). Integrity failure. |
| Decompressed output exceeds chunk size | During LZ decode | Reject. Corrupt or malicious payload. |
| Match offset out of bounds | During LZ copy | Reject. Offset references data before output start. |
| Varint exceeds maximum length | Extras stream | Reject. Overflow or corrupt extras data. |
- Fatal: the decoder MUST stop processing and report an error. All errors in the table above are fatal by default.
- Warning: not currently defined. Future versions may introduce non-fatal conditions (e.g. unknown flag bits set in reserved positions).
When a fatal error occurs mid-stream, the decoder SHOULD:
- Stop producing output immediately.
- Report the specific error condition (see `zxc_error_name` in the reference implementation).
- Not return partially decompressed data as a valid result.
Buffer-mode decoders MUST return a negative error code. Stream-mode decoders MUST signal the error and cease writing to the output.
For decoders processing untrusted input (e.g. network data, user uploads):
- Validate all header checksums before processing payloads.
- Enforce maximum allocation limits based on `comp_size` and chunk size code.
- Reject files where `comp_size` exceeds `zxc_compress_bound(chunk_size)`.
- Use bounded memory copies: never trust decoded lengths without cross-checking against output buffer capacity.
- File header: 16 bytes
- Block header: 8 bytes
- Block checksum (optional): 4 bytes
- NUM header: 16 bytes
- GLO header: 16 bytes
- GHI header: 16 bytes
- Section descriptor: 8 bytes
- GLO descriptors total: 32 bytes
- GHI descriptors total: 24 bytes
- File footer: 12 bytes
This example was produced with the CLI from a 10-byte input (Hello ZXC\n) using:
zxc -z -C -1 sample.txt
Generated archive size: 58 bytes.
00000000: F5 2E B0 9C 05 13 80 00 00 00 00 00 00 00 B8 90
00000010: 00 00 00 0A 00 00 00 69 48 65 6C 6C 6F 20 5A 58
00000020: 43 0A 90 BB A1 75 FF 00 00 00 00 00 00 02 0A 00
00000030: 00 00 00 00 00 00 90 BB A1 75
F5 2E B0 9C | 05 | 13 | 80 | 00 00 00 00 00 00 00 | B8 90
- `F5 2E B0 9C` -> magic word (LE) = `0x9CB02EF5`.
- `05` -> format version 5.
- `13` -> chunk-size code 19 (exponent encoding: `2^19 = 524288` bytes, i.e. 512 KiB, the default).
- `80` -> checksum enabled (`HAS_CHECKSUM=1`, algo id 0).
- next 7 bytes are reserved zeros.
- `B8 90` -> header CRC16.
Block header at offset 0x10:
00 | 00 | 00 | 0A 00 00 00 | 69
- type `00` = RAW.
- flags `00`, reserved `00`.
- `comp_size = 0x0000000A = 10` bytes.
- header CRC8 = `0x69`.
Payload at 0x18..0x21 (10 bytes):
48 65 6C 6C 6F 20 5A 58 43 0A
ASCII: Hello ZXC\n.
Trailing block checksum at 0x22..0x25:
90 BB A1 75
LE value: 0x75A1BB90.
FF | 00 | 00 | 00 00 00 00 | 02
- type `FF` = EOF.
- `comp_size = 0` (mandatory).
- header CRC8 = `0x02`.
0A 00 00 00 00 00 00 00 | 90 BB A1 75
- original source size = `10` bytes.
- global hash = `0x75A1BB90`.
Since there is exactly one data block, the global hash equals that block checksum:
global0 = 0
global1 = rotl1(global0) XOR block_crc = block_crc
0x00..0x0F File Header (16)
0x10..0x17 RAW Block Header (8)
0x18..0x21 RAW Payload (10)
0x22..0x25 RAW Block Checksum (4)
0x26..0x2D EOF Block Header (8)
0x2E..0x39 File Footer (12)
Same 10-byte input (Hello ZXC\n), compressed with seekable mode enabled:
zxc -z -C -1 -S sample.txt
Generated archive size: 70 bytes (12 bytes larger than the non-seekable variant).
00000000: F5 2E B0 9C 05 13 80 00 00 00 00 00 00 00 B8 90
00000010: 00 00 00 0A 00 00 00 69 48 65 6C 6C 6F 20 5A 58
00000020: 43 0A 90 BB A1 75 FF 00 00 00 00 00 00 02 FE 00
00000030: 00 04 00 00 00 D2 16 00 00 00 0A 00 00 00 00 00
00000040: 00 00 90 BB A1 75
A) File Header (offset 0x00, 16 bytes) - identical to non-seekable.
B) Data Block #0 (RAW) (offset 0x10, 22 bytes) - identical to non-seekable.
C) EOF Block (offset 0x26, 8 bytes) - identical to non-seekable.
D) SEK Block (offset 0x2E, 12 bytes)
Block header at 0x2E:
FE | 00 | 00 | 04 00 00 00 | D2
- `FE` -> type 254 = SEK (Seek Table).
- flags `00`, reserved `00`.
- `comp_size = 0x00000004 = 4` bytes (one entry x 4 bytes/entry).
- header CRC8 = `0xD2`.
Seek table entry at 0x36:
16 00 00 00
- Entry #0: compressed block size = `0x00000016 = 22` bytes. This is the total size of data block #0 including its header (8) + payload (10) + checksum (4) = 22. ✓
E) File Footer (offset 0x3A, 12 bytes)
0A 00 00 00 00 00 00 00 | 90 BB A1 75
- original source size = `10` bytes.
- global hash = `0x75A1BB90`.
0x00..0x0F File Header (16)
0x10..0x17 RAW Block Header (8)
0x18..0x21 RAW Payload (10)
0x22..0x25 RAW Block Checksum (4)
0x26..0x2D EOF Block Header (8)
0x2E..0x35 SEK Block Header (8) <- seek table
0x36..0x39 SEK Entry #0 (4) <- total on-disk size of block #0 (22)
0x3A..0x45 File Footer (12)
Compatibility note: The SEK block is inserted between the EOF block and the file footer. The footer always remains the last 12 bytes of the file, so decoders that locate the footer from the end of the file (e.g. `src + src_size - 12` for buffer APIs, or `fseek(END - 12)` for file APIs) work unchanged with seekable archives. However, streaming decoders that read the footer sequentially immediately after the EOF block must be updated to detect and skip the SEK block. In practice, all ZXC decoders since v0.9.0 handle both seekable and non-seekable archives transparently.