Commit de566f6
feat(quasicryth-research): phase 4 — compress/decompress pipeline
End-to-end pipeline wiring phases 1-3 into a working compress() →
decompress() round-trip for BOTH codebook variants.
New module src/pipeline.rs (~460 LOC):
Public API
- Variant enum: Flat | CowRadix — selects which codebook backs
the pipeline
- compress(text: &[u8], variant) -> Result<Vec<u8>, PipelineError>
- decompress(bytes: &[u8]) -> Result<Vec<u8>, PipelineError>
- PipelineError: OutOfVocabulary, BadMagic, Truncated, DecodeRange
Compressed stream format (v1, "QRS1" magic):
- magic [4] || orig_size [u64] || n_tokens [u32] || n_words [u32]
|| n_unique [u32]
- lowered byte stream (length-prefixed)
- per-token spans: (offset u32, len u32, case_flag u8)
- case-flag payload (AC over Model256, length-prefixed)
- word-ID payload (AC over VModel with codebook alphabet,
length-prefixed; round-trip witness for the codebook variant)
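A minimal sketch of the fixed-size header described above, to make the field order concrete. The commit message does not state byte order, so little-endian is an assumption here, and the names Header, HeaderError, write_header and read_header are hypothetical, not taken from src/pipeline.rs.

```rust
/// Magic bytes for the v1 stream, as named in the commit message.
const MAGIC: [u8; 4] = *b"QRS1";
/// magic [4] + orig_size [u64] + n_tokens [u32] + n_words [u32] + n_unique [u32]
const HEADER_LEN: usize = 4 + 8 + 4 + 4 + 4;

/// Hypothetical error type covering only the two header failure modes.
#[derive(Debug, PartialEq, Eq)]
enum HeaderError {
    BadMagic,
    Truncated,
}

/// Hypothetical in-memory form of the header fields.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Header {
    orig_size: u64,
    n_tokens: u32,
    n_words: u32,
    n_unique: u32,
}

/// Appends the header to `out`. Little-endian is an ASSUMPTION.
fn write_header(h: &Header, out: &mut Vec<u8>) {
    out.extend_from_slice(&MAGIC);
    out.extend_from_slice(&h.orig_size.to_le_bytes());
    out.extend_from_slice(&h.n_tokens.to_le_bytes());
    out.extend_from_slice(&h.n_words.to_le_bytes());
    out.extend_from_slice(&h.n_unique.to_le_bytes());
}

/// Reads a little-endian u32 from the first four bytes of `b`.
fn le_u32(b: &[u8]) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(&b[..4]);
    u32::from_le_bytes(a)
}

/// Parses the header, rejecting short input and a wrong magic.
fn read_header(bytes: &[u8]) -> Result<Header, HeaderError> {
    if bytes.len() < MAGIC.len() {
        return Err(HeaderError::Truncated);
    }
    if !bytes.starts_with(&MAGIC) {
        return Err(HeaderError::BadMagic);
    }
    if bytes.len() < HEADER_LEN {
        return Err(HeaderError::Truncated);
    }
    let mut size = [0u8; 8];
    size.copy_from_slice(&bytes[4..12]);
    Ok(Header {
        orig_size: u64::from_le_bytes(size),
        n_tokens: le_u32(&bytes[12..16]),
        n_words: le_u32(&bytes[16..20]),
        n_unique: le_u32(&bytes[20..24]),
    })
}
```

Under these assumptions the header occupies 24 bytes, and the same two checks (magic, then length) are what tests such as bad_magic_is_rejected and truncated_stream_is_rejected would exercise.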
Pipeline shape
1. tokenize(text) → TokenStream + lowered byte stream + case flags
2. Intern token byte slices → word_ids + unique pool
3. Build codebook via the Codebook trait (Flat OR CowRadix)
4. Verify every word is in the unigram tier (otherwise fail with OutOfVocabulary)
5. Encode word_ids stream via VModel + Encoder
6. Encode case flags via Model256 + Encoder
7. Serialize header + spans + lowered + AC payloads
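Step 2 above can be sketched as follows. This is an illustration of interning in general, assuming ids are assigned densely in first-seen order; the function name intern and its signature are hypothetical and may not match the module.

```rust
use std::collections::HashMap;

/// Assigns each distinct token a dense id in first-seen order (ASSUMPTION).
/// Returns (word_ids, unique), where unique[id] holds that token's bytes.
fn intern<'a>(tokens: &[&'a [u8]]) -> (Vec<u32>, Vec<&'a [u8]>) {
    let mut ids: HashMap<&'a [u8], u32> = HashMap::new();
    let mut unique: Vec<&'a [u8]> = Vec::new();
    let mut word_ids = Vec::with_capacity(tokens.len());
    for &tok in tokens {
        // First sighting: append to the unique pool and hand out the next id.
        let id = *ids.entry(tok).or_insert_with(|| {
            unique.push(tok);
            (unique.len() - 1) as u32
        });
        word_ids.push(id);
    }
    (word_ids, unique)
}
```

For the tokens "the cat the hat cat" this yields word_ids [0, 1, 0, 2, 1] and a unique pool of three entries. The word_ids sequence is what step 5 would encode, and the unique pool is what step 3 would build the codebook from.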
Deliberate simplifications (documented in module-level doc + README)
- SINGLE-TIER codebook (unigrams only). The Fibonacci tiling
+ substitution hierarchy + deep-position detection from phase 1
are still verified against the paper's theorems in
tests/paper_theorems.rs, but the bit-stream itself is single-tier.
Multi-tier n-gram encoding is a phase 5+ extension.
- NO LZMA escape stream (OOV → error). Reference C compressor has
a parallel LZMA stream for OOV words.
- NO multi-tile selection (the 36-tiling greedy engine isn't
wired into the bit-stream).
- NOT byte-identical to the C reference output. Round-trip
correctness within the Rust pipeline is the property tested;
byte-compat with the upstream .qm56 is out of scope.
Tests added (9):
- round_trips_empty
- round_trips_simple_lowercase — "the quick brown fox..."
- round_trips_mixed_case — "Hello WORLD foo Bar..."
- round_trips_punctuation_and_newlines — "Hi, world!\nFoo bar..."
- round_trips_repeated_phrase — 2000-byte cyclic phrase
- round_trips_pseudo_random_text — 500 random English words
- round_trips_utf8_high_bit — "café naïve façade"
- variants_produce_same_decompressed_output
— Flat and COW agree
- bad_magic_is_rejected
- truncated_stream_is_rejected
Every round-trip test runs against BOTH variants — the assert_round_trips
helper iterates Variant::{Flat, CowRadix} and verifies compress→
decompress is identity for both.
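A rough sketch of what such a helper could look like. The Variant and PipelineError shapes follow the Public API list above, but compress and decompress here are STAND-INS that store the input verbatim behind the magic, purely so the example is self-contained; they are not the real codec.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Variant {
    Flat,
    CowRadix,
}

// Two variants are unused by the stand-in codec below.
#[allow(dead_code)]
#[derive(Debug, PartialEq, Eq)]
pub enum PipelineError {
    OutOfVocabulary,
    BadMagic,
    Truncated,
    DecodeRange,
}

/// STAND-IN: writes "QRS1" followed by the raw input. Not the real pipeline.
pub fn compress(text: &[u8], _variant: Variant) -> Result<Vec<u8>, PipelineError> {
    let mut out = b"QRS1".to_vec();
    out.extend_from_slice(text);
    Ok(out)
}

/// STAND-IN: checks the magic and returns the remaining bytes.
pub fn decompress(bytes: &[u8]) -> Result<Vec<u8>, PipelineError> {
    if bytes.len() < 4 {
        return Err(PipelineError::Truncated);
    }
    if !bytes.starts_with(b"QRS1") {
        return Err(PipelineError::BadMagic);
    }
    Ok(bytes[4..].to_vec())
}

/// Runs compress -> decompress for both variants and checks identity.
fn assert_round_trips(input: &[u8]) {
    for variant in [Variant::Flat, Variant::CowRadix] {
        let packed = compress(input, variant).expect("compress failed");
        let unpacked = decompress(&packed).expect("decompress failed");
        assert_eq!(unpacked.as_slice(), input, "variant {:?}", variant);
    }
}
```

Looping over the variants inside the helper means each round-trip test only has to supply an input, and adding a third codebook variant later would extend every test at once.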
Bug caught during phase 4 (recorded for posterity): initial
implementation conflated two distinct "lowered" byte streams — the
full TokenStream.lowered vs a per-unique-word pool built during
interning. Token spans index into the former; I was indexing them
into the latter. Fixed by serializing TokenStream.lowered directly
and treating the per-unique pool as a build-only intermediate.
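The mix-up can be illustrated with a small sketch. Span and token_at are hypothetical names, the case_flag field is omitted, and the layout of the per-unique pool (unique words concatenated back to back) is an assumption made only for this example.

```rust
/// Hypothetical span record: byte offset and length into a lowered stream.
#[derive(Debug, Clone, Copy)]
struct Span {
    offset: u32,
    len: u32,
}

/// Resolves a span against a byte buffer, returning None if it falls
/// outside the buffer instead of panicking.
fn token_at<'a>(buf: &'a [u8], span: Span) -> Option<&'a [u8]> {
    let start = span.offset as usize;
    let end = start.checked_add(span.len as usize)?;
    buf.get(start..end)
}
```

With the full lowered stream "the cat the hat", the span (offset 8, len 3) resolves to "the". Resolved against an assumed unique pool "thecathat", the same span is out of range, and the span (offset 4, len 3) silently yields "ath" instead of "cat". Spans are only meaningful against the buffer they were computed from.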
Total tests: 76 (was 65). +9 from pipeline + 2 error-path tests.
Verification:
cargo test → 67 unit + 9 integration = 76 passed, 0 failed
cargo clippy --all-targets -- -D warnings clean
(added 1 doc allow: doc_lazy_continuation)
cargo fmt clean
Zero-dep preserved. No unsafe. Stable Rust.
1 parent afd7969
2 files changed
Lines changed: 455 additions & 0 deletions