forked from lance-format/lance-graph
-
Notifications
You must be signed in to change notification settings - Fork 0
feat(quasicryth-research): direct C→Rust transcode + COW radix trie variant #461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 6 commits
Commits
Show all changes
7 commits
Select commit
Hold shift + click to select a range
f0dfe88
feat(quasicryth-research): direct Rust transcode of the algebraic core
claude 68f754e
feat(quasicryth-research): phase 1 — tok + md5
claude 9e229d5
feat(quasicryth-research): phase 2 — codebook + COW radix variant
claude afd7969
feat(quasicryth-research): phase 3 — arithmetic coder
claude de566f6
feat(quasicryth-research): phase 4 — compress/decompress pipeline
claude 7fed9b9
feat(quasicryth-research): phase 5+6 — cross-variant integration + CLI
claude bd628e3
fix(quasicryth-research): codex P2 + coderabbit review — bug + 4 nits
claude File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| /target | ||
| Cargo.lock |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,26 @@ | ||
| [package] | ||
| name = "quasicryth-research" | ||
| version = "0.1.0" | ||
| edition = "2021" | ||
| license = "Apache-2.0" | ||
| publish = false | ||
| description = "Direct Rust transcode of the algebraic core of Quasicryth (Tacconelli 2026, arxiv 2603.14999) — Fibonacci quasicrystal tilings, substitution hierarchy, deep n-gram position detection. Research/testing crate, NOT a production compressor." | ||
|
|
||
| # Standalone codec constitution (matches helix / bgz17 / deepnsm): the default | ||
| # build is ZERO dependencies. Algebraic core only — no arithmetic coding, no | ||
| # LZMA escape, no tokenization, no codebook construction (those live in the | ||
| # upstream C reference implementation and are out of scope for this transcode). | ||
|
coderabbitai[bot] marked this conversation as resolved.
Outdated
|
||
| # | ||
| # Upstream: https://github.com/robtacconelli/quasicryth (MIT/Apache, v5.6.0). | ||
| # This transcode covers fib.h / fib.c (598 LOC) + the algebraic types from | ||
| # qtc.h. Paper proves five theorems (non-collapse, PV-property, Sturmian | ||
| # minimality, Golden Compensation, bounded overhead) — the included tests | ||
| # verify all five on synthetic data. | ||
| [dependencies] | ||
|
|
||
| [dev-dependencies] | ||
|
|
||
| # Empty [workspace] so cargo treats this crate as standalone when invoked via | ||
| # --manifest-path. Listed under the root Cargo.toml `exclude` so workspace-wide | ||
| # commands never pull it into the deterministic main compile graph. | ||
| [workspace] | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,180 @@ | ||
| # quasicryth-research | ||
|
|
||
| Direct Rust transcode of **Quasicryth** (Tacconelli 2026, | ||
| [arxiv 2603.14999](https://arxiv.org/abs/2603.14999), upstream | ||
| [github.com/robtacconelli/quasicryth](https://github.com/robtacconelli/quasicryth) | ||
| v5.6.0). | ||
|
|
||
| **Purpose:** research and testing. Two goals: | ||
|
|
||
| 1. Validate the workspace's φ-substrate decisions (bgz17's `17φ/11`, helix's | ||
| golden-spiral hemisphere) against the reference algebra — without | ||
| depending on the upstream C build. | ||
| 2. Demonstrate the **codebook architecture in two variants** behind one | ||
| trait: the original flat storage shape from the C reference, and a | ||
| COW Adaptive Radix Tree variant that fits the workspace's append-only | ||
| substrate doctrine. | ||
|
|
||
| ## What's transcoded | ||
|
|
||
| | Phase | Upstream C | Rust module | Status | Tests | | ||
| |---|---|---|---|---| | ||
| | 0 | `fib.c` + types from `qtc.h` | `tiling`, `hierarchy`, `constants`, `types` | shipped | 19 unit + 9 paper-theorem integration | | ||
| | 1 | `md5.c` + `tok.c` (partial) | `md5`, `tok` | shipped | 8 RFC-1321 vectors + 12 tokenizer round-trips | | ||
| | 2 | `cb.c` (algorithmic shape) | `codebook` (FlatCodebook + CowRadixCodebook) | shipped | 8 codebook tests, cross-variant validation | | ||
| | 3 | `ac.c` | `arith_coder` (Model256, VModel, Encoder, Decoder) | shipped | 9 round-trip tests at multiple scales | | ||
| | 4 | `compress.c` + `decompress.c` (simplified) | `pipeline` (compress/decompress) | shipped | 11 round-trip tests covering both variants | | ||
| | 5 | integration | `tests/round_trip.rs` + `tests/paper_theorems.rs` | shipped | 7 cross-variant integration tests | | ||
| | 6 | `main.c` | `bin/qresearch.rs` | shipped | CLI: compress / decompress / round-trip | | ||
|
|
||
| **Total tests: 83 passing.** Zero dependencies. No `unsafe`. Stable Rust. | ||
|
|
||
| ## What this is NOT | ||
|
|
||
| The Rust pipeline is **NOT byte-compatible** with the upstream `.qm56` | ||
| format. The full v5.6 production compressor ships: | ||
|
|
||
| - multi-level adaptive arithmetic coding with 144 specialised level | ||
| context models + 12 per-index models + recency caches + two-tier | ||
| unigram model + word-level LZ77, | ||
| - 36-tiling greedy selection per block, | ||
| - LZMA escape stream for OOV words, | ||
| - frequency-counter pruning for memory bounds on large inputs, | ||
| - multi-level n-gram codebooks (3-gram through 144-gram). | ||
|
|
||
| The Rust pipeline here is **simplified to a single-tier (unigram) | ||
| encoding** so the codebook trait abstraction is the clean variation | ||
| point and the COW radix trie variant is exercised end-to-end. Multi-tier | ||
| n-gram encoding is a phase 5+ extension. The Fibonacci tiling + | ||
| substitution hierarchy + deep-position detection are verified to | ||
| satisfy the paper's five core theorems (see | ||
| `tests/paper_theorems.rs`), but the bit-stream itself only encodes | ||
| word-ID symbols at the unigram tier. | ||
|
|
||
| The Rust pipeline **round-trips with itself**: `decompress(compress(x)) | ||
| == x` for every test input under both `Variant::Flat` and | ||
| `Variant::CowRadix`. This is the property the integration tests verify. | ||
|
|
||
| ## Verification | ||
|
|
||
| ```bash | ||
| # All 83 tests: | ||
| cargo test --manifest-path crates/quasicryth-research/Cargo.toml | ||
|
|
||
| # Paper-theorem suite only (the original 9 algebraic claims): | ||
| cargo test --manifest-path crates/quasicryth-research/Cargo.toml \ | ||
| --test paper_theorems | ||
|
|
||
| # Cross-variant integration suite (Phase 5): | ||
| cargo test --manifest-path crates/quasicryth-research/Cargo.toml \ | ||
| --test round_trip | ||
| ``` | ||
|
|
||
| Lint discipline: | ||
|
|
||
| ```bash | ||
| cargo clippy --manifest-path crates/quasicryth-research/Cargo.toml \ | ||
| --all-targets -- -D warnings | ||
| cargo fmt --manifest-path crates/quasicryth-research/Cargo.toml --check | ||
| ``` | ||
|
|
||
| Both clean. | ||
|
|
||
| ## CLI | ||
|
|
||
| ```bash | ||
| cargo build --release --manifest-path crates/quasicryth-research/Cargo.toml \ | ||
| --bin qresearch | ||
| ./crates/quasicryth-research/target/release/qresearch round-trip /path/to/file.txt | ||
| ./crates/quasicryth-research/target/release/qresearch round-trip -v cow /path/to/file.txt | ||
| ./crates/quasicryth-research/target/release/qresearch compress -v flat in.txt out.qrs1 | ||
| ./crates/quasicryth-research/target/release/qresearch decompress out.qrs1 in.txt.recovered | ||
| ``` | ||
|
|
||
| The `round-trip` subcommand compresses to memory, decompresses, and | ||
| verifies identity — useful for quick fuzz-style validation on real text. | ||
|
|
||
| ## Paper-theorem verification (algebraic substrate) | ||
|
|
||
| Tests in `tests/paper_theorems.rs` verify, on synthetic L/S sequences: | ||
|
|
||
| - **Thm 2** Fibonacci hierarchy never collapses (both L and S supertiles persist). | ||
| - **Cor 4** Period-5 collapses by level 4 or 5 (vs Fibonacci's unbounded depth). | ||
| - **Thm 9** Golden Compensation: L:S ratio = φ at every level. | ||
| - **Thm 13/Cor 15** Aperiodic advantage grows with scale. | ||
| - **Sturmian** Factor complexity ≤ n+1 (the minimality property that | ||
| gives maximal codebook efficiency, Thm 7). | ||
|
|
||
| Plus algebraic and structural invariants: PV-property (φ² = φ+1), | ||
| `HIER_WORD_LENS = F_3..F_12`, no-adjacent-S on all 36 canonical tilings. | ||
|
|
||
| ## Codebook variants | ||
|
|
||
| | Property | `FlatCodebook` | `CowRadixCodebook` | | ||
| |---|---|---| | ||
| | Storage | flat `Vec<u32>` per tier + `HashMap` for lookup | Adaptive Radix Tree (Node4 / Node16 / Node256) per tier | | ||
| | Build cost | O(n log n) — frequency sort | O(n log n) sort + O(n · key_len) trie inserts | | ||
| | Lookup | O(1) average (HashMap) | O(key_len) tree walk | | ||
| | Memory | dense | sparse, shared across versions (Arc) | | ||
| | Versioning | no | **path-copy COW** — every insert returns a new root, prior roots stay valid | | ||
| | Append-only | no | **yes** — fits the workspace's substrate doctrine | | ||
| | Threading | `Send + Sync` (immutable post-build) | `Send + Sync` (Arc-shared subtrees, immutable) | | ||
|
|
||
| Both implement the `Codebook` trait. The `pipeline::Variant` enum | ||
| picks between them. Tests validate equivalence: under identical | ||
| inputs, both produce the same `compress → decompress` output. | ||
|
|
||
| The COW property is **explicitly exercised** in | ||
| `codebook::tests::cow_art_path_copy_preserves_old_root` — `art_v0` | ||
| stays empty after `art_v1.insert(...)` and `art_v2.insert(...)`; | ||
| each version sees only its own inserts. This is the architectural | ||
| property the workspace's append-only doctrine requires. | ||
|
|
||
| ## Compressed stream format (v1) | ||
|
|
||
| The Rust pipeline writes a self-contained format with the magic | ||
| `QRS1` ("Quasicryth Research Simplified v1"): | ||
|
|
||
| ``` | ||
| magic : [4] "QRS1" | ||
| orig_size : u64 little-endian | ||
| n_tokens : u32 | ||
| n_words : u32 | ||
| n_unique : u32 | ||
| lowered_size: u32 | ||
| lowered : [u8; lowered_size] the lowered byte stream | ||
| spans : [(u32 offset, u32 len, u8 case_flag); n_tokens] | ||
| case_size : u32 | ||
| case_data : [u8; case_size] AC over Model256 (token case flags) | ||
| word_size : u32 | ||
| word_data : [u8; word_size] AC over VModel (codebook indices) | ||
| ``` | ||
|
|
||
| This is **not** the upstream `.qm56` format — by design, see the | ||
| "What this is NOT" section above. | ||
|
|
||
| ## Relationship to workspace crates | ||
|
|
||
| - **bgz17** — uses `17φ/11 ≈ 5/2` (major tenth) as octave-stacking | ||
| constant for codebook hierarchy depth; this crate verifies the | ||
| non-collapse theorem that justifies φ over rational stacking | ||
| approximations. | ||
| - **helix** — uses pure φ for golden-angle azimuth and `√u` for | ||
| equal-area hemisphere placement; this crate verifies the Sturmian | ||
| minimality that makes φ optimal among irrational slopes. | ||
| - **jc::weyl** — proves 1-D `{k·φ⁻¹ mod 1}` star-discrepancy is | ||
| minimal at N=144 and N=1000; this crate's `qc_word_tiling` | ||
| exercises the same φ-stride at hierarchy scale. | ||
|
|
||
| ## Upstream | ||
|
|
||
| `https://github.com/robtacconelli/quasicryth` — v5.6.0 as of the | ||
| transcode date. The upstream is the canonical reference; this crate | ||
| tracks its algebraic surface and pipeline shape only and does NOT | ||
| attempt byte-for-byte compatibility with its compressed output. | ||
|
|
||
| ## Crate policy | ||
|
|
||
| Standalone, zero-dependency, `exclude`d from the lance-graph | ||
| workspace — same convention as `bgz17`, `deepnsm`, `helix`, | ||
| `bgz-tensor`. Verified via `cargo test --manifest-path`. |
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.