|
| 1 | +# quasicryth-research |
| 2 | + |
| 3 | +Direct Rust transcode of **Quasicryth** (Tacconelli 2026, |
| 4 | +[arxiv 2603.14999](https://arxiv.org/abs/2603.14999), upstream |
| 5 | +[github.com/robtacconelli/quasicryth](https://github.com/robtacconelli/quasicryth) |
| 6 | +v5.6.0). |
| 7 | + |
| 8 | +**Purpose:** research and testing. Two goals: |
| 9 | + |
| 10 | +1. Validate the workspace's φ-substrate decisions (bgz17's `17φ/11`, helix's |
| 11 | + golden-spiral hemisphere) against the reference algebra — without |
| 12 | + depending on the upstream C build. |
| 13 | +2. Demonstrate the **codebook architecture in two variants** behind one |
| 14 | + trait: the original flat storage shape from the C reference, and a |
| 15 | + COW Adaptive Radix Tree variant that fits the workspace's append-only |
| 16 | + substrate doctrine. |
| 17 | + |
| 18 | +## What's transcoded |
| 19 | + |
| 20 | +| Phase | Upstream C | Rust module | Status | Tests | |
| 21 | +|---|---|---|---|---| |
| 22 | +| 0 | `fib.c` + types from `qtc.h` | `tiling`, `hierarchy`, `constants`, `types` | shipped | 19 unit + 9 paper-theorem integration | |
| 23 | +| 1 | `md5.c` + `tok.c` (partial) | `md5`, `tok` | shipped | 8 RFC-1321 vectors + 12 tokenizer round-trips | |
| 24 | +| 2 | `cb.c` (algorithmic shape) | `codebook` (FlatCodebook + CowRadixCodebook) | shipped | 8 codebook tests, cross-variant validation | |
| 25 | +| 3 | `ac.c` | `arith_coder` (Model256, VModel, Encoder, Decoder) | shipped | 9 round-trip tests at multiple scales | |
| 26 | +| 4 | `compress.c` + `decompress.c` (simplified) | `pipeline` (compress/decompress) | shipped | 11 round-trip tests covering both variants | |
| 27 | +| 5 | integration | `tests/round_trip.rs` + `tests/paper_theorems.rs` | shipped | 7 cross-variant integration tests | |
| 28 | +| 6 | `main.c` | `bin/qresearch.rs` | shipped | CLI: compress / decompress / round-trip | |
| 29 | + |
| 30 | +**Total tests: 83 passing.** Zero dependencies. No `unsafe`. Stable Rust. |
| 31 | + |
| 32 | +## What this is NOT |
| 33 | + |
| 34 | +The Rust pipeline is **NOT byte-compatible** with the upstream `.qm56` |
| 35 | +format. The full v5.6 production compressor ships: |
| 36 | + |
| 37 | +- multi-level adaptive arithmetic coding with 144 specialised level |
| 38 | + context models + 12 per-index models + recency caches + two-tier |
| 39 | + unigram model + word-level LZ77, |
| 40 | +- 36-tiling greedy selection per block, |
| 41 | +- LZMA escape stream for OOV words, |
| 42 | +- frequency-counter pruning for memory bounds on large inputs, |
| 43 | +- multi-level n-gram codebooks (3-gram through 144-gram). |
| 44 | + |
| 45 | +The Rust pipeline here is **simplified to a single-tier (unigram) |
| 46 | +encoding** so the codebook trait abstraction is the clean variation |
| 47 | +point and the COW radix trie variant is exercised end-to-end. Multi-tier |
| 48 | +n-gram encoding is a phase 5+ extension. The Fibonacci tiling + |
| 49 | +substitution hierarchy + deep-position detection are verified to |
| 50 | +satisfy the paper's five core theorems (see |
| 51 | +`tests/paper_theorems.rs`), but the bit-stream itself only encodes |
| 52 | +word-ID symbols at the unigram tier. |
| 53 | + |
| 54 | +The Rust pipeline **round-trips with itself**: `decompress(compress(x)) |
| 55 | +== x` for every test input under both `Variant::Flat` and |
| 56 | +`Variant::CowRadix`. This is the property the integration tests verify. |
| 57 | + |
| 58 | +## Verification |
| 59 | + |
| 60 | +```bash |
| 61 | +# All 83 tests: |
| 62 | +cargo test --manifest-path crates/quasicryth-research/Cargo.toml |
| 63 | + |
| 64 | +# Paper-theorem suite only (the original 9 algebraic claims): |
| 65 | +cargo test --manifest-path crates/quasicryth-research/Cargo.toml \ |
| 66 | + --test paper_theorems |
| 67 | + |
| 68 | +# Cross-variant integration suite (Phase 5): |
| 69 | +cargo test --manifest-path crates/quasicryth-research/Cargo.toml \ |
| 70 | + --test round_trip |
| 71 | +``` |
| 72 | + |
| 73 | +Lint discipline: |
| 74 | + |
| 75 | +```bash |
| 76 | +cargo clippy --manifest-path crates/quasicryth-research/Cargo.toml \ |
| 77 | + --all-targets -- -D warnings |
| 78 | +cargo fmt --manifest-path crates/quasicryth-research/Cargo.toml --check |
| 79 | +``` |
| 80 | + |
| 81 | +Both clean. |
| 82 | + |
| 83 | +## CLI |
| 84 | + |
| 85 | +```bash |
| 86 | +cargo build --release --manifest-path crates/quasicryth-research/Cargo.toml \ |
| 87 | + --bin qresearch |
| 88 | +./crates/quasicryth-research/target/release/qresearch round-trip /path/to/file.txt |
| 89 | +./crates/quasicryth-research/target/release/qresearch round-trip -v cow /path/to/file.txt |
| 90 | +./crates/quasicryth-research/target/release/qresearch compress -v flat in.txt out.qrs1 |
| 91 | +./crates/quasicryth-research/target/release/qresearch decompress out.qrs1 in.txt.recovered |
| 92 | +``` |
| 93 | + |
| 94 | +The `round-trip` subcommand compresses to memory, decompresses, and |
| 95 | +verifies identity — useful for quick fuzz-style validation on real text. |
| 96 | + |
| 97 | +## Paper-theorem verification (algebraic substrate) |
| 98 | + |
| 99 | +Tests in `tests/paper_theorems.rs` verify, on synthetic L/S sequences: |
| 100 | + |
| 101 | +- **Thm 2** Fibonacci hierarchy never collapses (both L and S supertiles persist). |
| 102 | +- **Cor 4** Period-5 collapses by level 4 or 5 (vs Fibonacci's unbounded depth). |
| 103 | +- **Thm 9** Golden Compensation: L:S ratio = φ at every level. |
| 104 | +- **Thm 13/Cor 15** Aperiodic advantage grows with scale. |
| 105 | +- **Sturmian** Factor complexity ≤ n+1 (the minimality property that |
| 106 | + gives maximal codebook efficiency, Thm 7). |
| 107 | + |
| 108 | +Plus algebraic and structural invariants: PV-property (φ² = φ+1), |
| 109 | +`HIER_WORD_LENS = F_3..F_12`, no-adjacent-S on all 36 canonical tilings. |
| 110 | + |
| 111 | +## Codebook variants |
| 112 | + |
| 113 | +| Property | `FlatCodebook` | `CowRadixCodebook` | |
| 114 | +|---|---|---| |
| 115 | +| Storage | flat `Vec<u32>` per tier + `HashMap` for lookup | Adaptive Radix Tree (Node4 / Node16 / Node256) per tier | |
| 116 | +| Build cost | O(n log n) — frequency sort | O(n log n) sort + O(n · key_len) trie inserts | |
| 117 | +| Lookup | O(1) average (HashMap) | O(key_len) tree walk | |
| 118 | +| Memory | dense | sparse, shared across versions (Arc) | |
| 119 | +| Versioning | no | **path-copy COW** — every insert returns a new root, prior roots stay valid | |
| 120 | +| Append-only | no | **yes** — fits the workspace's substrate doctrine | |
| 121 | +| Threading | `Send + Sync` (immutable post-build) | `Send + Sync` (Arc-shared subtrees, immutable) | |
| 122 | + |
| 123 | +Both implement the `Codebook` trait. The `pipeline::Variant` enum |
| 124 | +picks between them. Tests validate equivalence: under identical |
| 125 | +inputs, both produce the same `compress → decompress` output. |
| 126 | + |
| 127 | +The COW property is **explicitly exercised** in |
| 128 | +`codebook::tests::cow_art_path_copy_preserves_old_root` — `art_v0` |
| 129 | +stays empty after `art_v1.insert(...)` and `art_v2.insert(...)`; |
| 130 | +each version sees only its own inserts. This is the architectural |
| 131 | +property the workspace's append-only doctrine requires. |
| 132 | + |
| 133 | +## Compressed stream format (v1) |
| 134 | + |
| 135 | +The Rust pipeline writes a self-contained format with the magic |
| 136 | +`QRS1` ("Quasicryth Research Simplified v1"): |
| 137 | + |
| 138 | +``` |
| 139 | +magic : [4] "QRS1" |
| 140 | +orig_size : u64 little-endian |
| 141 | +n_tokens : u32 |
| 142 | +n_words : u32 |
| 143 | +n_unique : u32 |
| 144 | +lowered_size: u32 |
| 145 | +lowered : [u8; lowered_size] the lowered byte stream |
| 146 | +spans : [(u32 offset, u32 len, u8 case_flag); n_tokens] |
| 147 | +case_size : u32 |
| 148 | +case_data : [u8; case_size] AC over Model256 (token case flags) |
| 149 | +word_size : u32 |
| 150 | +word_data : [u8; word_size] AC over VModel (codebook indices) |
| 151 | +``` |
| 152 | + |
| 153 | +This is **not** the upstream `.qm56` format — by design, see the |
| 154 | +"What this is NOT" section above. |
| 155 | + |
| 156 | +## Relationship to workspace crates |
| 157 | + |
| 158 | +- **bgz17** — uses `17φ/11 ≈ 5/2` (major tenth) as octave-stacking |
| 159 | + constant for codebook hierarchy depth; this crate verifies the |
| 160 | + non-collapse theorem that justifies φ over rational stacking |
| 161 | + approximations. |
| 162 | +- **helix** — uses pure φ for golden-angle azimuth and `√u` for |
| 163 | + equal-area hemisphere placement; this crate verifies the Sturmian |
| 164 | + minimality that makes φ optimal among irrational slopes. |
| 165 | +- **jc::weyl** — proves 1-D `{k·φ⁻¹ mod 1}` star-discrepancy is |
| 166 | + minimal at N=144 and N=1000; this crate's `qc_word_tiling` |
| 167 | + exercises the same φ-stride at hierarchy scale. |
| 168 | + |
| 169 | +## Upstream |
| 170 | + |
| 171 | +`https://github.com/robtacconelli/quasicryth` — v5.6.0 as of the |
| 172 | +transcode date. The upstream is the canonical reference; this crate |
| 173 | +tracks its algebraic surface and pipeline shape only and does NOT |
| 174 | +attempt byte-for-byte compatibility with its compressed output. |
| 175 | + |
| 176 | +## Crate policy |
| 177 | + |
| 178 | +Standalone, zero-dependency, `exclude`d from the lance-graph |
| 179 | +workspace — same convention as `bgz17`, `deepnsm`, `helix`, |
| 180 | +`bgz-tensor`. Verified via `cargo test --manifest-path`. |
0 commit comments