Commit de566f6

feat(quasicryth-research): phase 4 — compress/decompress pipeline
End-to-end pipeline wiring phases 1-3 into a working compress() → decompress()
round-trip for BOTH codebook variants.

New module src/pipeline.rs (~460 LOC):

Public API
- Variant enum: Flat | CowRadix (selects which codebook backs the pipeline)
- compress(text: &[u8], variant) -> Result<Vec<u8>, PipelineError>
- decompress(bytes: &[u8]) -> Result<Vec<u8>, PipelineError>
- PipelineError: OutOfVocabulary, BadMagic, Truncated, DecodeRange

Compressed stream format (v1, "QRS1" magic):
- magic [4] || orig_size [u64] || n_tokens [u32] || n_words [u32] || n_unique [u32]
- lowered byte stream (length-prefixed)
- per-token spans: (offset u32, len u32, case_flag u8)
- case-flag payload (AC over Model256, length-prefixed)
- word-ID payload (AC over VModel with codebook alphabet, length-prefixed;
  round-trip witness for the codebook variant)

Pipeline shape
1. tokenize(text) → TokenStream + lowered byte stream + case flags
2. Intern token byte slices → word_ids + unique pool
3. Build codebook via the Codebook trait (Flat OR CowRadix)
4. Verify every word is in the unigram tier (otherwise fail with OutOfVocabulary)
5. Encode word_ids stream via VModel + Encoder
6. Encode case flags via Model256 + Encoder
7. Serialize header + spans + lowered + AC payloads

Deliberate simplifications (documented in module-level doc + README)
- SINGLE-TIER codebook (unigrams only). The Fibonacci tiling, substitution
  hierarchy, and deep-position detection from phase 1 remain verified against
  the paper's theorems via tests/paper_theorems.rs, but the bit-stream itself
  is single-tier. Multi-tier n-gram encoding is a phase 5+ extension.
- NO LZMA escape stream (OOV → error). The reference C compressor has a
  parallel LZMA stream for OOV words.
- NO multi-tile selection (the 36-tiling greedy engine isn't wired into the
  bit-stream).
- NOT byte-identical to the C reference output. Round-trip correctness within
  the Rust pipeline is the property tested; byte-compat with the upstream
  .qm56 is out of scope.
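As an illustration of the v1 header listed under "Compressed stream format", here is a minimal sketch of writing and parsing the fixed-size prefix. It is not the code in src/pipeline.rs: the names Header, HeaderError, write_header and read_header are hypothetical, and little-endian integer encoding is an assumption, since the commit message gives field widths but not byte order.

```rust
/// Hypothetical mirror of the v1 header fields, in the order the commit lists them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Header {
    pub orig_size: u64,
    pub n_tokens: u32,
    pub n_words: u32,
    pub n_unique: u32,
}

/// Hypothetical error type; the real PipelineError also has other variants.
#[derive(Debug, PartialEq, Eq)]
pub enum HeaderError {
    BadMagic,
    Truncated,
}

pub const MAGIC: [u8; 4] = *b"QRS1";
/// magic [4] + orig_size [8] + three u32 counters [12].
pub const HEADER_LEN: usize = 4 + 8 + 4 + 4 + 4;

/// Appends the header to `out`. Little-endian is an ASSUMPTION of this sketch.
pub fn write_header(h: &Header, out: &mut Vec<u8>) {
    out.extend_from_slice(&MAGIC);
    out.extend_from_slice(&h.orig_size.to_le_bytes());
    out.extend_from_slice(&h.n_tokens.to_le_bytes());
    out.extend_from_slice(&h.n_words.to_le_bytes());
    out.extend_from_slice(&h.n_unique.to_le_bytes());
}

/// Reads a little-endian u32 from the first four bytes of `b`.
fn le_u32(b: &[u8]) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(&b[..4]);
    u32::from_le_bytes(a)
}

/// Reads a little-endian u64 from the first eight bytes of `b`.
fn le_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_le_bytes(a)
}

/// Parses the header and returns it together with the remaining bytes.
pub fn read_header(bytes: &[u8]) -> Result<(Header, &[u8]), HeaderError> {
    if bytes.len() < MAGIC.len() {
        return Err(HeaderError::Truncated);
    }
    if !bytes.starts_with(&MAGIC) {
        return Err(HeaderError::BadMagic);
    }
    if bytes.len() < HEADER_LEN {
        return Err(HeaderError::Truncated);
    }
    let header = Header {
        orig_size: le_u64(&bytes[4..12]),
        n_tokens: le_u32(&bytes[12..16]),
        n_words: le_u32(&bytes[16..20]),
        n_unique: le_u32(&bytes[20..24]),
    };
    Ok((header, &bytes[HEADER_LEN..]))
}
```

Checking the magic before the full length means a wrong-format file is reported as BadMagic rather than Truncated, which is one plausible way to keep the two error paths distinct.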
Tests added (9):
- round_trips_empty
- round_trips_simple_lowercase: "the quick brown fox..."
- round_trips_mixed_case: "Hello WORLD foo Bar..."
- round_trips_punctuation_and_newlines: "Hi, world!\nFoo bar..."
- round_trips_repeated_phrase: 2000-byte cyclic phrase
- round_trips_pseudo_random_text: 500 random English words
- round_trips_utf8_high_bit: "café naïve façade"
- variants_produce_same_decompressed_output: Flat and COW agree
- bad_magic_is_rejected
- truncated_stream_is_rejected

Every round-trip test runs against BOTH variants: the assert_round_trips
helper iterates Variant::{Flat, CowRadix} and verifies that
compress → decompress is the identity for both.

Bug caught during phase 4 (recorded for posterity): the initial implementation
conflated two distinct "lowered" byte streams, the full TokenStream.lowered and
a per-unique-word pool built during interning. Token spans index into the
former; I was indexing them into the latter. Fixed by serializing
TokenStream.lowered directly and treating the per-unique pool as a build-only
intermediate.

Total tests: 76 (was 65). +9 from pipeline + 2 error-path tests.

Verification:
  cargo test → 67 unit + 9 integration = 76 passed, 0 failed
  cargo clippy --all-targets -- -D warnings clean (added 1 doc allow: doc_lazy_continuation)
  cargo fmt clean

Zero-dep preserved. No unsafe. Stable Rust.
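To make the phase 4 bug concrete, the sketch below interns spanned slices of a lowered byte stream into word IDs plus a unique pool, then shows why a span is only meaningful against the full stream. It is a simplified illustration, not the crate's code: Span, intern and resolve are hypothetical names, and the case_flag field of the real span record is omitted.

```rust
use std::collections::HashMap;

/// Hypothetical span: byte offset and length into the FULL lowered stream.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub offset: u32,
    pub len: u32,
}

/// Returns the bytes a span covers in `stream`, or None if it is out of range.
pub fn resolve(stream: &[u8], span: Span) -> Option<&[u8]> {
    let start = span.offset as usize;
    let end = start.checked_add(span.len as usize)?;
    stream.get(start..end)
}

/// Interns every spanned word: returns one ID per token plus the pool of
/// unique words in first-seen order. Panics if a span is out of range.
pub fn intern(lowered: &[u8], spans: &[Span]) -> (Vec<u32>, Vec<Vec<u8>>) {
    let mut ids: HashMap<&[u8], u32> = HashMap::new();
    let mut pool: Vec<Vec<u8>> = Vec::new();
    let mut word_ids = Vec::with_capacity(spans.len());
    for &span in spans {
        let word = resolve(lowered, span).expect("span outside lowered stream");
        let id = *ids.entry(word).or_insert_with(|| {
            // First occurrence: the new ID is the word's index in the pool.
            pool.push(word.to_vec());
            (pool.len() - 1) as u32
        });
        word_ids.push(id);
    }
    (word_ids, pool)
}
```

With the input "the cat the dog", the third span (offset 8, length 3) resolves to "the" in the full stream, but the concatenated unique pool "thecatdog" is only 9 bytes long, so the same span falls outside it. That mismatch is the class of error described above, and it is why only the full stream needs to be serialized alongside the spans.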
1 parent afd7969 commit de566f6

2 files changed

Lines changed: 455 additions & 0 deletions

crates/quasicryth-research/src/lib.rs

Lines changed: 3 additions & 0 deletions
@@ -70,12 +70,14 @@
 #![allow(clippy::assigning_clones)] // clone-into would obscure ownership intent
 #![allow(clippy::single_match_else)] // explicit match reads cleaner here
 #![allow(clippy::only_used_in_recursion)] // self-recursive insert keeps trie context
+#![allow(clippy::doc_lazy_continuation)] // module-level docs use multi-line list items
 
 pub mod arith_coder;
 pub mod codebook;
 pub mod constants;
 pub mod hierarchy;
 pub mod md5;
+pub mod pipeline;
 pub mod tiling;
 pub mod tok;
 pub mod types;
@@ -89,6 +91,7 @@ pub use codebook::{Codebook, CodebookSizes, CowArt, CowRadixCodebook, FlatCodebo
 pub use constants::{tiling_descs, HIER_WORD_LENS, INV_PHI, MAX_HIER, N_TILINGS, PHI};
 pub use hierarchy::{build_hierarchy, deep_counts, detect_deep_positions, hier_context};
 pub use md5::{md5, Md5};
+pub use pipeline::{compress, decompress, PipelineError, Variant};
 pub use tiling::{
     gen_from_desc, period5_tiling, period_doubling_tiling, qc_word_tiling, qc_word_tiling_alpha,
     rudin_shapiro_tiling, sanddrift_tiling, thue_morse_tiling, verify_no_adjacent_s,