feat(quasicryth-research): direct C→Rust transcode + COW radix trie variant #461
Standalone, zero-dep research/testing crate transcoding fib.h + fib.c +
the algebraic types from qtc.h of the upstream Quasicryth v5.6.0 C
reference (Tacconelli 2026, arxiv 2603.14999, upstream
github.com/robtacconelli/quasicryth).
Scope: the algebra the paper proves theorems about, not the compressor.
What's transcoded
- types.rs — Tile, HLevel, ParentMap, Hierarchy, DeepPositions,
TilingDesc (idiomatic Rust ownership; no unsafe).
- constants.rs — PHI, INV_PHI, HIER_WORD_LENS = {2,3,5,8,13,21,34,55,
89,144} = F_3..F_12, MAX_HIER=10, the 36-tiling
descriptor table (12 golden phases + sqrt(58)-7
+ noble-5 + sqrt(13)-3 + 18 greedy-discovered alphas
including the far-out alpha=0.502).
- tiling.rs — cut-and-project (qc_word_tiling[_alpha]) + five
substitution-rule families (Thue-Morse, Rudin-Shapiro,
period-doubling, Period-5, Sanddrift).
- hierarchy.rs — build_hierarchy (iterative deflation
(L,S)->super-L, L->super-S), hier_context,
detect_deep_positions, deep_counts.
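The deflation rule named for hierarchy.rs can be illustrated on a plain symbol string. This is a minimal sketch, not the crate's build_hierarchy: it assumes the Fibonacci substitution L→LS, S→L as the generator, and the names `inflate` and `deflate` are hypothetical.

```rust
/// One step of the Fibonacci substitution: L -> LS, S -> L.
fn inflate(word: &str) -> String {
    word.chars()
        .map(|c| if c == 'L' { "LS" } else { "L" })
        .collect()
}

/// One deflation step: the pair (L,S) becomes a super-L, a lone L
/// becomes a super-S. Returns None when a stray S (one that is not
/// preceded by an L) is met, since such a word is not an inflation.
fn deflate(word: &str) -> Option<String> {
    let syms: Vec<char> = word.chars().collect();
    let mut out = String::new();
    let mut i = 0;
    while i < syms.len() {
        if syms[i] != 'L' {
            return None;
        }
        if syms.get(i + 1) == Some(&'S') {
            out.push('L');
            i += 2;
        } else {
            out.push('S');
            i += 1;
        }
    }
    Some(out)
}
```

Under these assumptions `deflate` is the exact inverse of `inflate`, which is the sense in which the Fibonacci hierarchy can be deflated level after level without collapsing.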
What's NOT transcoded
The full v5.6 production compressor pipeline (ac.c arithmetic coding,
cb.c codebook construction, compress.c / decompress.c, tok.c
tokenization, md5.c, LZMA escape). Out of scope for "research and
testing" — the goal is verifying the workspace's phi-substrate
decisions against the reference algebra, not byte-compatibility with
the upstream compressed output.
Verification (28 tests, all passing)
- 19 unit tests covering each module's invariants
- 9 integration tests in tests/paper_theorems.rs verifying:
* Thm 2 Fibonacci hierarchy never collapses
* Cor 4 Period-5 collapses by level ~3.3 = log(5)/log(phi)
* Thm 9 Golden Compensation (L:S ratio = phi at every level)
* Thm 13/Cor 15 Aperiodic advantage grows with corpus scale
* Sturmian factor complexity <= n+1 (Thm 7 root)
* PV-property phi^2 = phi + 1
* HIER_WORD_LENS = Fibonacci F_3..F_12
* No-adjacent-S on all 36 canonical tilings
cargo clippy --all-targets -- -D warnings clean (pedantic+all). rustfmt
clean. Zero-dependency default build.
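Several of the listed checks are simple enough to sketch standalone. The code below is an illustration under stated assumptions, not the contents of tests/paper_theorems.rs: the Fibonacci word is generated by the substitution L→LS, S→L, and the helper names (`fib`, `fibonacci_word`, `factor_count`, `no_adjacent_s`) are hypothetical.

```rust
use std::collections::HashSet;

const PHI: f64 = 1.618_033_988_749_895;
const HIER_WORD_LENS: [usize; 10] = [2, 3, 5, 8, 13, 21, 34, 55, 89, 144];

/// Fibonacci numbers with F_1 = F_2 = 1.
fn fib(n: u32) -> usize {
    let (mut a, mut b) = (0usize, 1usize);
    for _ in 0..n {
        let t = a + b;
        a = b;
        b = t;
    }
    a
}

/// Prefix of the Fibonacci word after `iterations` rounds of L -> LS, S -> L.
fn fibonacci_word(iterations: u32) -> Vec<u8> {
    let mut w = vec![b'L'];
    for _ in 0..iterations {
        w = w
            .iter()
            .flat_map(|&c| if c == b'L' { &b"LS"[..] } else { &b"L"[..] })
            .copied()
            .collect();
    }
    w
}

/// Number of distinct factors (contiguous substrings) of length n, n >= 1.
fn factor_count(word: &[u8], n: usize) -> usize {
    word.windows(n).collect::<HashSet<_>>().len()
}

/// True when no two S symbols are adjacent.
fn no_adjacent_s(word: &[u8]) -> bool {
    word.windows(2).all(|w| !(w[0] == b'S' && w[1] == b'S'))
}
```

A caller would compare `HIER_WORD_LENS[i]` with `fib(i + 3)`, check `PHI * PHI - PHI - 1.0` against a small tolerance, and count factors of a long prefix for small n.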
Relationship to workspace crates
- bgz17 (17*phi/11 ≈ 5/2 = octave + major third) — this crate verifies
the non-collapse theorem that justifies phi over rational stacking
approximations.
- helix (golden-spiral hemisphere, Fisher-Z aligned) — Sturmian
minimality theorem here is the optimality argument for phi as the
azimuth stride.
- jc::weyl (1-D Weyl discrepancy at N=144, N=1000) — this crate's
qc_word_tiling exercises the same phi-stride at hierarchy scale.
Listed under root Cargo.toml `exclude` so it never enters the main
compile graph. Verified via cargo test --manifest-path
crates/quasicryth-research/Cargo.toml. Follows the helix convention:
Cargo.lock gitignored; the crate stays standalone-verifiable.
Phase 1 of the full-pipeline transcode plan. Two new modules:
- src/md5.rs (RFC 1321 / md5.c transcode, 196 LOC)
* Md5 incremental hasher + one-shot md5() function
* Direct port of upstream md5.c; bit-exact match
* 8 tests covering the full RFC 1321 §A.5 test suite
(empty, "a", "abc", "message digest", alphabet,
alphanumeric, 80-digit long input, incremental==one-shot)
- src/tok.rs (tok.c transcode, 377 LOC, partial)
* tokenize() — split raw bytes into Token spans with case
separation; lowered byte stream + per-token (offset, len,
case_flag) tracking
* word_split() — pre-lowered byte stream → word offsets, no
case work (lighter path)
* apply_case() — reverse the case lowering for a token
* TokenStream::round_trip() — the round-trip the C reference
verifies internally via case_roundtrips
* 12 tests covering case detection (lower/Cap/UPPER),
round-trip on lowercase / mixed-case / punctuation / empty
/ UTF-8 high-bit; word_split byte-order preservation
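The case separation described for tok.rs can be sketched as follows. This is a hedged illustration, not the transcoded code: the `CaseFlag` enum, `lower_with_flag`, and this `apply_case` signature are assumptions, and only ASCII letters are treated as having case (high-bit UTF-8 bytes pass through untouched).

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CaseFlag {
    Lower,
    Cap,
    Upper,
}

/// Re-apply a case flag to a lowered word.
fn apply_case(lowered: &[u8], flag: CaseFlag) -> Vec<u8> {
    match flag {
        CaseFlag::Lower => lowered.to_vec(),
        CaseFlag::Upper => lowered.to_ascii_uppercase(),
        CaseFlag::Cap => {
            let mut out = lowered.to_vec();
            if let Some(first) = out.first_mut() {
                first.make_ascii_uppercase();
            }
            out
        }
    }
}

/// Lower a word and pick the flag that restores it exactly.
/// Returns None for shapes such as "hELLo" that no single flag
/// can reproduce, i.e. the word would not round-trip.
fn lower_with_flag(word: &[u8]) -> Option<(Vec<u8>, CaseFlag)> {
    let lowered = word.to_ascii_lowercase();
    for flag in [CaseFlag::Lower, CaseFlag::Cap, CaseFlag::Upper] {
        if apply_case(&lowered, flag) == word {
            return Some((lowered, flag));
        }
    }
    None
}
```

Choosing the flag by testing the round-trip directly keeps detection and restoration from drifting apart; how the reference handles irregular casing is not stated here, so the `None` branch is only a placeholder.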
NOT in this phase (deferred):
- enc_case / dec_case — depend on the arithmetic coder
(phase 3, ac.c transcode)
Total tests: 48 (was 28). +20 from md5 (8) and tok (12).
Verification:
- cargo test --manifest-path crates/quasicryth-research/Cargo.toml
→ 39 unit + 9 integration = 48 passed, 0 failed
- cargo clippy --all-targets -- -D warnings clean
(added 4 pedantic-lint allows for legibility against upstream:
many_single_char_names, too_many_lines, format_push_string,
bool_to_int_with_if — all stylistic, no correctness impact)
- cargo fmt clean
Zero-dep preserved. No unsafe.
Phase 2 adds the codebook tier of the upstream compressor, in TWO
variants behind one trait — this is the architectural split the user
asked for: original-shape + COW radix trie.
New module src/codebook.rs (~700 LOC):
Codebook trait
- n_unique / n_uni / n_bi / n_ngram(level)
- unigram_index / bigram_index / ngram_index (forward lookups)
- unigram_word / bigram_words / ngram_words (reverse lookups)
- both variants satisfy Send + Sync (immutable post-construction)
CodebookSizes (port of qtc_cb_sizes_t)
- 11 tier budgets: uni, bi, tri, fg, eg, tg, vg, tfg, ffg, efg, ofg
- auto(nw) — 7-tier corpus-size table matching auto_codebook_sizes
in cb.c
Variant A — FlatCodebook
- direct port of cb.c storage shape
- Vec<u32> per tier for forward storage + HashMap for lookup
- sorts entries by descending frequency (with deterministic tie-break)
- filters n-gram candidates to those whose every word is in the
unigram codebook (matches the cb.c filtering pass)
- per-tier budgeting matches cb.c
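A minimal sketch of the flat storage shape for one tier, assuming u32 word IDs. The tie-break shown (ascending word ID) is an assumption, since the description only says the tie-break is deterministic; `FlatTier` and `build_unigram_tier` are hypothetical names.

```rust
use std::collections::HashMap;

struct FlatTier {
    /// index -> word id (reverse lookup)
    words: Vec<u32>,
    /// word id -> index (forward lookup)
    index: HashMap<u32, u32>,
}

fn build_unigram_tier(word_ids: &[u32], budget: usize) -> FlatTier {
    let mut freq: HashMap<u32, u64> = HashMap::new();
    for &w in word_ids {
        *freq.entry(w).or_insert(0) += 1;
    }
    let mut entries: Vec<(u32, u64)> = freq.into_iter().collect();
    // Descending frequency; ties broken by ascending word id so the
    // result does not depend on HashMap iteration order.
    entries.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    entries.truncate(budget);
    let words: Vec<u32> = entries.iter().map(|e| e.0).collect();
    let index = words
        .iter()
        .enumerate()
        .map(|(i, &w)| (w, i as u32))
        .collect();
    FlatTier { words, index }
}
```

Without an explicit tie-break the codebook order would vary between runs, and with it the encoded word indices.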
Variant B — CowRadixCodebook
- the architectural variant the user asked for
- backed by CowArt: a Copy-on-Write Adaptive Radix Trie
- three node variants: Node4 (4 children, low fan-out),
Node16 (medium fan-out), Node256 (full byte/dword fan-out).
Node48 omitted as a deliberate simplification — Node16 grows
straight to Node256.
- insert() returns a NEW root via path-copy; old roots remain
valid for prior consumers (Arc-shared subtrees).
- one trie per tier; reverse direction uses the same Vec storage
as FlatCodebook (the trie owns the forward direction only).
The two variants are validated against EACH OTHER in test
cow_radix_codebook_agrees_with_flat_on_lookups: identical inputs
produce identical lookup results on unigrams and bigrams. This is
the cross-validation contract that makes the COW variant a drop-in.
COW semantics are explicitly tested in cow_art_path_copy_preserves_old_root:
the v0 root stays empty after v1/v2 inserts; v1 sees only its insert,
v2 sees both — exactly the property the workspace's append-only
substrate doctrine requires.
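The path-copy behaviour can be shown with a much smaller structure than an adaptive radix trie. The sketch below is not CowArt: it assumes u32 keys split into four big-endian bytes, uses a single unsorted child list instead of Node4/Node16/Node256, and all names are hypothetical.

```rust
use std::sync::Arc;

#[derive(Clone, Default)]
struct Node {
    value: Option<u32>,
    /// Small unsorted child list; cloning it copies Arc handles only.
    children: Vec<(u8, Arc<Node>)>,
}

#[derive(Clone, Default)]
struct CowTrie {
    root: Arc<Node>,
}

impl CowTrie {
    /// Returns a NEW trie; `self` is left untouched.
    fn insert(&self, key: u32, value: u32) -> CowTrie {
        let root = insert_rec(&self.root, &key.to_be_bytes(), value);
        CowTrie { root: Arc::new(root) }
    }

    fn get(&self, key: u32) -> Option<u32> {
        let mut node: &Node = &self.root;
        for b in key.to_be_bytes() {
            let (_, child) = node.children.iter().find(|(k, _)| *k == b)?;
            node = &**child;
        }
        node.value
    }
}

/// Copy only the nodes on the path to `key`; every other subtree is
/// shared with the previous version through its Arc.
fn insert_rec(node: &Node, key: &[u8], value: u32) -> Node {
    let mut copy = node.clone();
    match key.split_first() {
        None => copy.value = Some(value),
        Some((&b, rest)) => {
            match copy.children.iter().position(|(k, _)| *k == b) {
                Some(i) => {
                    let child = insert_rec(&copy.children[i].1, rest, value);
                    copy.children[i].1 = Arc::new(child);
                }
                None => {
                    let child = insert_rec(&Node::default(), rest, value);
                    copy.children.push((b, Arc::new(child)));
                }
            }
        }
    }
    copy
}
```

Each insert allocates at most five nodes here (one per key byte plus the root), while older roots keep answering lookups exactly as before.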
Tests added (8): codebook_sizes_auto_increases_with_corpus,
flat_codebook_roundtrips_{unigrams,bigrams},
cow_radix_codebook_roundtrips_{unigrams,bigrams},
cow_radix_codebook_agrees_with_flat_on_lookups,
cow_art_path_copy_preserves_old_root,
cow_art_grows_node_variants.
Total tests: 56 (was 48). +8 from codebook.
Verification:
cargo test → 47 unit + 9 integration = 56 passed, 0 failed
cargo clippy --all-targets -- -D warnings clean
(added 3 pedantic allows: assigning_clones, single_match_else,
only_used_in_recursion — all stylistic)
cargo fmt clean
No new deps; zero-dep ethos preserved (std HashMap/Arc only).
Phase 3 adds the entropy-coding layer that wraps both codebooks.
Direct transcode of ac.c.
New module src/arith_coder.rs (~640 LOC):
Constants
AC_PREC = 24 precision (bits)
AC_FULL = 1 << 24 full range
AC_HALF / AC_QTR E2 / E3 renormalization thresholds
AC_MAX_FREQ = 1 << 20 rescale trigger
Model256
- adaptive 256-symbol byte alphabet (port of qtc_model_t)
- freq[256], total; halve-on-cap rescaling (freq[i] = (f>>1) | 1)
- cdf() writes a 257-entry cumulative table for the coder
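A sketch of the adaptive byte model, assuming an increment of 1 per update and a rescale as soon as the total exceeds the cap. Both choices are assumptions for illustration, as is the name `MAX_FREQ`; only the halving rule `(f >> 1) | 1` and the 257-entry table come from the description.

```rust
const MAX_FREQ: u32 = 1 << 20; // same value as AC_MAX_FREQ above

struct Model256 {
    freq: [u32; 256],
    total: u32,
}

impl Model256 {
    /// Uniform start: every byte has frequency 1.
    fn new() -> Self {
        Self { freq: [1; 256], total: 256 }
    }

    fn update(&mut self, sym: u8) {
        self.freq[sym as usize] += 1; // assumed increment
        self.total += 1;
        if self.total > MAX_FREQ {
            self.rescale();
        }
    }

    /// Halve every count; the `| 1` keeps each symbol codable.
    fn rescale(&mut self) {
        self.total = 0;
        for f in self.freq.iter_mut() {
            *f = (*f >> 1) | 1;
            self.total += *f;
        }
    }

    /// Cumulative table: cdf[s] is the sum of freq[0..s], cdf[256] == total.
    fn cdf(&self) -> [u32; 257] {
        let mut out = [0u32; 257];
        for i in 0..256 {
            out[i + 1] = out[i] + self.freq[i];
        }
        out
    }
}
```

The coder then encodes symbol `s` with the interval `(cdf[s], cdf[s + 1], total)`.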
VModel (variable alphabet, Fenwick-tree accelerated)
- port of qtc_vmodel_t — O(log n) cum_lo and find
- fenwick tree 1-indexed under the hood; 0-indexed public API
- rescale rebuilds the tree from halved frequencies
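The Fenwick-tree operations behind cum_lo and find can be sketched as below. This is a generic binary indexed tree, not the transcoded VModel: the struct name and method signatures are assumptions, and `find` presumes every frequency is at least 1 and the target is below the total.

```rust
struct Fenwick {
    /// 1-indexed internally; tree[0] is unused.
    tree: Vec<u32>,
}

impl Fenwick {
    fn new(n: usize) -> Self {
        Self { tree: vec![0; n + 1] }
    }

    /// Add `delta` to the frequency of 0-indexed symbol `sym`.
    fn add(&mut self, sym: usize, delta: u32) {
        let mut i = sym + 1;
        while i < self.tree.len() {
            self.tree[i] += delta;
            i += i & i.wrapping_neg(); // step to the next covering node
        }
    }

    /// Sum of the frequencies of all symbols strictly below `sym`.
    fn cum_lo(&self, sym: usize) -> u32 {
        let (mut i, mut sum) = (sym, 0);
        while i > 0 {
            sum += self.tree[i];
            i &= i - 1; // clear the lowest set bit
        }
        sum
    }

    /// Symbol s with cum_lo(s) <= target < cum_lo(s + 1).
    fn find(&self, mut target: u32) -> usize {
        let n = self.tree.len() - 1;
        let mut pos = 0;
        let mut step = n.next_power_of_two();
        while step > 0 {
            let next = pos + step;
            if next <= n && self.tree[next] <= target {
                pos = next;
                target -= self.tree[next];
            }
            step >>= 1;
        }
        pos
    }
}
```

Both lookups touch O(log n) nodes, which is what makes large codebook alphabets practical for the decoder.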
Encoder
- 24-bit precision range coder with pending-bits underflow handling
- encode(cum_lo, cum_hi, total) drives the (lo, hi) range
- state machine bit-exact with ac.c:
* E1 (hi < HALF) output 0
* E2 (lo >= HALF) output 1, subtract HALF
* E3 (lo>=QTR && hi<3*QTR) pending++, subtract QTR
- finish() flushes pending state and packs the bit buffer to bytes
Decoder
- symmetric to Encoder; reads MSB-first bits from the input byte stream
- decode_256(cdf, total): binary-search the 256-entry CDF
- decode_v(model): VModel.find() drives Fenwick-tree symbol search
- advance() applies the same E1/E2/E3 transitions to (lo, hi, val)
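The E1/E2/E3 transitions can be seen end to end in a compact coder. The sketch below follows the classic Witten-Neal-Cleary scheme with 24-bit precision and a static cumulative table; it is an illustration under those assumptions and is not claimed to be bit-exact with ac.c or with the crate's Encoder/Decoder.

```rust
const PREC: u32 = 24;
const FULL: u64 = 1 << PREC;
const HALF: u64 = FULL / 2;
const QTR: u64 = FULL / 4;

struct Encoder { lo: u64, hi: u64, pending: u32, bits: Vec<bool> }

impl Encoder {
    fn new() -> Self {
        Self { lo: 0, hi: FULL - 1, pending: 0, bits: Vec::new() }
    }
    /// Emit a bit followed by the pending opposite bits (E3 debt).
    fn emit(&mut self, bit: bool) {
        self.bits.push(bit);
        for _ in 0..self.pending { self.bits.push(!bit); }
        self.pending = 0;
    }
    fn encode(&mut self, cum_lo: u64, cum_hi: u64, total: u64) {
        let range = self.hi - self.lo + 1;
        self.hi = self.lo + range * cum_hi / total - 1;
        self.lo += range * cum_lo / total;
        loop {
            if self.hi < HALF {
                self.emit(false); // E1
            } else if self.lo >= HALF {
                self.emit(true); // E2
                self.lo -= HALF; self.hi -= HALF;
            } else if self.lo >= QTR && self.hi < 3 * QTR {
                self.pending += 1; // E3
                self.lo -= QTR; self.hi -= QTR;
            } else { break; }
            self.lo *= 2;
            self.hi = self.hi * 2 + 1;
        }
    }
    /// Flush the final interval and pack bits MSB-first.
    fn finish(mut self) -> Vec<u8> {
        self.pending += 1;
        let bit = self.lo >= QTR;
        self.emit(bit);
        let mut out = vec![0u8; (self.bits.len() + 7) / 8];
        for (i, &b) in self.bits.iter().enumerate() {
            if b { out[i / 8] |= 0x80 >> (i % 8); }
        }
        out
    }
}

struct Decoder<'a> { lo: u64, hi: u64, val: u64, data: &'a [u8], pos: usize }

impl<'a> Decoder<'a> {
    fn new(data: &'a [u8]) -> Self {
        let mut d = Self { lo: 0, hi: FULL - 1, val: 0, data, pos: 0 };
        for _ in 0..PREC { d.val = d.val * 2 + d.next_bit(); }
        d
    }
    /// Next input bit, MSB-first; reads past the end yield zeros.
    fn next_bit(&mut self) -> u64 {
        let byte = self.data.get(self.pos / 8).copied().unwrap_or(0);
        let bit = (byte >> (7 - self.pos % 8)) & 1;
        self.pos += 1;
        u64::from(bit)
    }
    /// `cdf` is strictly increasing, starts at 0 and ends at the total.
    fn decode(&mut self, cdf: &[u64]) -> usize {
        let total = cdf[cdf.len() - 1];
        let range = self.hi - self.lo + 1;
        let target = ((self.val - self.lo + 1) * total - 1) / range;
        let sym = cdf.partition_point(|&c| c <= target) - 1;
        self.hi = self.lo + range * cdf[sym + 1] / total - 1;
        self.lo += range * cdf[sym] / total;
        loop {
            if self.hi < HALF {
                // E1: nothing to subtract
            } else if self.lo >= HALF {
                self.lo -= HALF; self.hi -= HALF; self.val -= HALF;
            } else if self.lo >= QTR && self.hi < 3 * QTR {
                self.lo -= QTR; self.hi -= QTR; self.val -= QTR;
            } else { break; }
            self.lo *= 2;
            self.hi = self.hi * 2 + 1;
            self.val = self.val * 2 + self.next_bit();
        }
        sym
    }
}
```

The decoder must be told how many symbols to read, which is one reason the stream header carries explicit counts.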
High-level helpers
- ac_enc_sym / ac_dec_sym (Model256 + update)
- ac_enc_v / ac_dec_v (VModel + update)
Tests added (10):
- model256_initial_state_is_uniform
- model256_cdf_sums_to_total
- vmodel_initial_state_is_uniform
- vmodel_cum_lo_is_prefix_sum
- vmodel_find_is_inverse_of_cum_lo
- round_trip_256_alphabet — all 256 bytes
- round_trip_repeated_byte_compresses — 10K of one byte → strong
compression + round-trip
- round_trip_variable_alphabet — VModel symbols 0..50
- round_trip_pseudo_random_sequence — 5000-byte xorshift stream
- vmodel_round_trip_with_rescaling_pressure — forces AC_MAX_FREQ rescale
Total tests: 66 (was 56). +10 from arith_coder. All five round-trip
tests pass: encode(input) → decode reproduces the input, demonstrating
that the coder is internally consistent (this is the load-bearing
correctness property for phase 4's compress/decompress pipeline).
Verification:
cargo test → 57 unit + 9 integration = 66 passed, 0 failed
cargo clippy --all-targets -- -D warnings clean
(added 1 doc-only fix in codebook.rs and 1 op-style fix here)
cargo fmt clean
Zero-dep preserved.
Honest scope flag (will appear in README at phase 4):
The Rust encoder/decoder round-trips with itself bit-exact.
It is NOT guaranteed byte-identical to the C reference output — the
C reference's output depends on multiple internal Model256/VModel
initializations across contexts (144 per-level models, 12
per-index models, recency caches, two-tier unigram). Matching that
exactly is a separate engineering task out of scope for "research
and testing." Round-trip identity within the Rust pipeline is the
property phase 4 will verify end-to-end.
End-to-end pipeline wiring phases 1-3 into a working compress() →
decompress() round-trip for BOTH codebook variants.
New module src/pipeline.rs (~460 LOC):
Public API
- Variant enum: Flat | CowRadix — selects which codebook backs
the pipeline
- compress(text: &[u8], variant) -> Result<Vec<u8>, PipelineError>
- decompress(bytes: &[u8]) -> Result<Vec<u8>, PipelineError>
- PipelineError: OutOfVocabulary, BadMagic, Truncated, DecodeRange
Compressed stream format (v1, "QRS1" magic):
- magic [4] || orig_size [u64] || n_tokens [u32] || n_words [u32]
|| n_unique [u32]
- lowered byte stream (length-prefixed)
- per-token spans: (offset u32, len u32, case_flag u8)
- case-flag payload (AC over Model256, length-prefixed)
- word-ID payload (AC over VModel with codebook alphabet,
length-prefixed; round-trip witness for the codebook variant)
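The fixed-size part of the stream can be sketched as follows. Field order and widths come from the layout above; little-endian byte order is an assumption (the description does not state it), and `Header`, `HeaderError`, `write_header`, and `read_header` are hypothetical names rather than the crate's API.

```rust
#[derive(Debug, PartialEq, Eq)]
enum HeaderError {
    BadMagic,
    Truncated,
}

#[derive(Debug, PartialEq, Eq)]
struct Header {
    orig_size: u64,
    n_tokens: u32,
    n_words: u32,
    n_unique: u32,
}

const MAGIC: &[u8; 4] = b"QRS1";
const HEADER_LEN: usize = 4 + 8 + 4 + 4 + 4;

fn write_header(h: &Header) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN);
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&h.orig_size.to_le_bytes()); // assumed little-endian
    out.extend_from_slice(&h.n_tokens.to_le_bytes());
    out.extend_from_slice(&h.n_words.to_le_bytes());
    out.extend_from_slice(&h.n_unique.to_le_bytes());
    out
}

fn le_u32(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

fn read_header(bytes: &[u8]) -> Result<Header, HeaderError> {
    if bytes.len() < 4 {
        return Err(HeaderError::Truncated);
    }
    if &bytes[..4] != MAGIC {
        return Err(HeaderError::BadMagic);
    }
    if bytes.len() < HEADER_LEN {
        return Err(HeaderError::Truncated);
    }
    let mut size = [0u8; 8];
    size.copy_from_slice(&bytes[4..12]);
    Ok(Header {
        orig_size: u64::from_le_bytes(size),
        n_tokens: le_u32(&bytes[12..16]),
        n_words: le_u32(&bytes[16..20]),
        n_unique: le_u32(&bytes[20..24]),
    })
}
```

Checking lengths before every slice is what lets a short or corrupted stream surface as an error value instead of a panic.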
Pipeline shape
1. tokenize(text) → TokenStream + lowered byte stream + case flags
2. Intern token byte slices → word_ids + unique pool
3. Build codebook via the Codebook trait (Flat OR CowRadix)
4. Verify every word is in the unigram tier (OutOfVocabulary fails)
5. Encode word_ids stream via VModel + Encoder
6. Encode case flags via Model256 + Encoder
7. Serialize header + spans + lowered + AC payloads
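Step 2 of the pipeline (interning) can be sketched as below, assuming tokens arrive as byte slices into the lowered stream. The function name `intern` and its signature are hypothetical.

```rust
use std::collections::HashMap;

/// Map each token to a dense word id in first-seen order.
/// Returns the id stream plus the pool of unique words.
fn intern<'a>(tokens: &[&'a [u8]]) -> (Vec<u32>, Vec<&'a [u8]>) {
    let mut ids = Vec::with_capacity(tokens.len());
    let mut pool: Vec<&'a [u8]> = Vec::new();
    let mut seen: HashMap<&'a [u8], u32> = HashMap::new();
    for &tok in tokens {
        let id = *seen.entry(tok).or_insert_with(|| {
            pool.push(tok);
            (pool.len() - 1) as u32
        });
        ids.push(id);
    }
    (ids, pool)
}
```

In this sketch the pool holds borrowed slices and is only a lookup aid; token spans still refer to offsets in the full lowered stream, not to positions in the pool.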
Deliberate simplifications (documented in module-level doc + README)
- SINGLE-TIER codebook (unigrams only). The Fibonacci tiling
+ substitution hierarchy + deep-position detection from phase 1
remain verified-against-paper-theorems via tests/paper_theorems.rs,
but the bit-stream itself is single-tier. Multi-tier n-gram
encoding is a phase 5+ extension.
- NO LZMA escape stream (OOV → error). Reference C compressor has
a parallel LZMA stream for OOV words.
- NO multi-tile selection (the 36-tiling greedy engine isn't
wired into the bit-stream).
- NOT byte-identical to the C reference output. Round-trip
correctness within the Rust pipeline is the property tested;
byte-compat with the upstream .qm56 is out of scope.
Tests added (10):
- round_trips_empty
- round_trips_simple_lowercase — "the quick brown fox..."
- round_trips_mixed_case — "Hello WORLD foo Bar..."
- round_trips_punctuation_and_newlines — "Hi, world!\nFoo bar..."
- round_trips_repeated_phrase — 2000-byte cyclic phrase
- round_trips_pseudo_random_text — 500 random English words
- round_trips_utf8_high_bit — "café naïve façade"
- variants_produce_same_decompressed_output
— Flat and COW agree
- bad_magic_is_rejected
- truncated_stream_is_rejected
Every round-trip test runs against BOTH variants — the assert_round_trips
helper iterates Variant::{Flat, CowRadix} and verifies compress→
decompress is identity for both.
Bug caught during phase 4 (recorded for posterity): initial
implementation conflated two distinct "lowered" byte streams — the
full TokenStream.lowered vs a per-unique-word pool built during
interning. Token spans index into the former; I was indexing them
into the latter. Fixed by serializing TokenStream.lowered directly
and treating the per-unique pool as a build-only intermediate.
Total tests: 76 (was 66). +10 from pipeline (8 round-trip + 2 error-path tests).
Verification:
cargo test → 67 unit + 9 integration = 76 passed, 0 failed
cargo clippy --all-targets -- -D warnings clean
(added 1 doc allow: doc_lazy_continuation)
cargo fmt clean
Zero-dep preserved. No unsafe. Stable Rust.
Phase 5 (integration tests) + Phase 6 (CLI binary), bundled.
Phase 5 — cross-variant integration tests
=========================================
New test file tests/round_trip.rs (7 tests):
- variants_agree_on_long_natural_text — 600-char Fibonacci-theory
paragraph round-trips under BOTH variants AND the decompressed
outputs are identical
- round_trip_at_5kb_scale — cyclic phrase to 5 KB, both variants
- round_trip_single_word
- round_trip_only_whitespace
- round_trip_mixed_punctuation_lines (parens, hyphens, semicolons,
quotes, tabs)
- round_trip_repeated_uppercase_word
- cross_variant_independence — compress with Flat, decompress;
compress with CowRadix, decompress; both equal original.
(Compressed bytes between variants MAY differ; decoded output
MUST match.)
This is the architectural property the codebook trait contract
guarantees and the workspace's substrate doctrine requires: the
COW radix trie variant is a drop-in alternative to the flat
storage variant at the compress/decompress boundary.
Phase 6 — CLI binary
====================
New src/bin/qresearch.rs (~170 LOC):
qresearch compress [-v flat|cow] <input> <output>
qresearch decompress <input> <output>
qresearch round-trip [-v flat|cow] <input>
qresearch --help / -h
Standard library only. Returns ExitCode::SUCCESS / ExitCode::FAILURE
with clean error messages on read/write/codec failures. The
`round-trip` subcommand reports compression ratio AND verifies
identity for quick validation on arbitrary text files.
Live test:
$ echo "The Fibonacci substitution..." > /tmp/sample.txt
$ qresearch round-trip /tmp/sample.txt
round-trip OK: 95 bytes → 329 compressed (346.32%) → identical, variant=Flat
$ qresearch round-trip -v cow /tmp/sample.txt
round-trip OK: 95 bytes → 329 compressed (346.32%) → identical, variant=CowRadix
(The >100% ratio on 95-byte inputs is expected: v1 simplifications
mean headers + per-token spans dominate at small sizes. The C
reference's per-byte overhead amortizes over much larger inputs
and uses multi-tier n-grams + LZMA escape + word-LZ77 to get
≤25% on enwik9. The Rust pipeline here demonstrates correctness,
not benchmark-competitive compression.)
README rewrite
==============
New README.md (180 lines) documents:
- the 7-phase transcode map (which C file → which Rust module)
- test counts per phase (total: 83)
- what's NOT byte-compatible with the upstream qm56 format
- CLI usage examples
- both codebook variants compared in a table
- the compressed stream format (v1 QRS1 magic) field by field
- relationships to bgz17 / helix / jc::weyl in the workspace
- paper-theorem verification list (Thm 2, Cor 4, Thm 9, Thm 13/Cor 15,
Sturmian minimality, PV property)
Final totals
============
Tests: 83 (was 76)
- 67 unit (no change)
- 9 paper-theorem integration
- 7 cross-variant integration (NEW)
Verification:
cargo test → 83 passed, 0 failed
cargo clippy --all-targets -- -D warnings clean
cargo fmt clean
cargo build --bin qresearch builds, CLI exercised
Zero dependencies. No unsafe. Stable Rust.
Full crate inventory
====================
Module         LOC  Role
─────────────  ───  ────
types           97  Tile, HLevel, ParentMap, Hierarchy, DeepPositions
constants      192  PHI, INV_PHI, MAX_HIER, HIER_WORD_LENS, 36 tilings
tiling         388  cut-and-project + 5 substitution-rule families
hierarchy      308  build_hierarchy, hier_context, detect_deep_positions
md5            196  RFC 1321 (~85 LOC C transcoded)
tok            377  tokenize, word_split, apply_case, TokenStream
codebook       744  Codebook trait + FlatCodebook + CowRadixCodebook +
                    CowArt (ART with Node4/Node16/Node256, path-copy)
arith_coder    640  Model256, VModel (Fenwick), Encoder, Decoder
pipeline       460  compress, decompress, Variant, PipelineError
bin/qresearch  170  CLI (compress/decompress/round-trip)
tests/...      310  paper_theorems + round_trip integration
lib + README   280
Total: ~4,160 LOC Rust (was ~1,300 after phase 0).
📝 Walkthrough
This pull request adds a complete standalone Rust crate (
Changes
Quasicryth-Research Algebraic Transcode
Sequence Diagram
sequenceDiagram
participant User
participant qresearch as qresearch CLI
participant pipeline as Pipeline
participant tokenizer as Tokenizer
participant codebook as Codebook
participant arith as Arithmetic Coder
User->>qresearch: compress input.txt -v flat
qresearch->>pipeline: compress(data, Flat)
pipeline->>tokenizer: tokenize(text)
tokenizer->>tokenizer: lowercase + case_flags
pipeline->>codebook: intern words
pipeline->>codebook: build FlatCodebook
pipeline->>arith: encode case_flags
pipeline->>arith: encode word_ids
arith-->>pipeline: compressed bytes
pipeline-->>qresearch: output stream
qresearch-->>User: compressed file
User->>qresearch: decompress compressed.qrs
qresearch->>pipeline: decompress(data)
pipeline->>arith: decode case_flags
pipeline->>arith: decode word_ids
arith-->>pipeline: decoded tokens
pipeline->>tokenizer: apply_case
tokenizer-->>pipeline: original text
pipeline-->>qresearch: output bytes
qresearch-->>User: original file
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 5 passed
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7fed9b9f19
    Self::Node256 { children, .. } => {
        if key < 256 {
            children[key as usize] = Some(child);
        }
        return;
The COW radix trie is keyed by u32 word IDs, but once a node has grown to Node256 this branch silently ignores any child key >= 256. A corpus with 257 distinct alphabetic tokens triggers this in the unigram trie: Variant::CowRadix drops word ID 256 during codebook construction, so compress reports OutOfVocabulary even though the flat variant round-trips the same input and the codebook was sized to include every unique word.
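A minimal sketch of the failure mode and of one way to make it visible, using a plain 256-slot array as a stand-in for the node. The names `put_child`, `KeyOutOfRange`, and `key_path` are hypothetical, and the byte-splitting remedy is only one option; it is not the fix adopted in the PR.

```rust
#[derive(Debug, PartialEq, Eq)]
struct KeyOutOfRange(u32);

/// Store `child` in a 256-slot node, reporting keys that do not fit
/// instead of silently dropping them.
fn put_child(
    children: &mut [Option<u32>; 256],
    key: u32,
    child: u32,
) -> Result<(), KeyOutOfRange> {
    match children.get_mut(key as usize) {
        Some(slot) => {
            *slot = Some(child);
            Ok(())
        }
        None => Err(KeyOutOfRange(key)),
    }
}

/// Split a u32 word id into four big-endian bytes, one per trie level,
/// so that every level has a fan-out of at most 256.
fn key_path(key: u32) -> [u8; 4] {
    key.to_be_bytes()
}
```

Returning an error turns the silent drop into a test failure at insertion time, while keying each level by one byte removes the out-of-range case altogether.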
Actionable comments posted: 6
🧹 Nitpick comments (4)
crates/quasicryth-research/tests/round_trip.rs (1)
45-55: ⚡ Quick win: Add an explicit empty-input round-trip case.
Current cases are good, but a zero-byte payload is a common framing edge and worth pinning with a dedicated test.
Suggested test
    +#[test]
    +fn round_trip_empty_input() {
    +    round_trip(b"", Variant::Flat);
    +    round_trip(b"", Variant::CowRadix);
    +}
crates/quasicryth-research/src/md5.rs (1)
158-170: 💤 Low value: Optional: bulk-copy optimization for `update()`.
The byte-by-byte loop is correct but suboptimal for larger inputs. For a research crate this is acceptable, but if you later need better throughput, consider copying full chunks via `copy_from_slice` when the buffer is empty and data contains complete blocks.
♻️ Sketch of bulk-copy approach
    pub fn update(&mut self, mut data: &[u8]) {
        let mut idx = (self.count & 63) as usize;
        self.count = self.count.wrapping_add(data.len() as u64);
        // Fill partial buffer first
        if idx != 0 {
            let fill = (64 - idx).min(data.len());
            self.buffer[idx..idx + fill].copy_from_slice(&data[..fill]);
            idx += fill;
            data = &data[fill..];
            if idx == 64 {
                transform(&mut self.state, &self.buffer);
                idx = 0;
            }
        }
        // Process full blocks directly
        while data.len() >= 64 {
            let block: &[u8; 64] = data[..64].try_into().unwrap();
            transform(&mut self.state, block);
            data = &data[64..];
        }
        // Buffer remainder
        self.buffer[..data.len()].copy_from_slice(data);
    }
crates/quasicryth-research/src/pipeline.rs (1)
205-210: 💤 Low value: Slice indexing may panic on malformed compressed input.
If a malformed/corrupted compressed stream contains `offset + len` values that exceed `lowered_pool.len()`, line 207 will panic. For a research crate this is acceptable, but consider adding bounds validation for robustness:
    if (offset + len) as usize > lowered_pool.len() {
        return Err(PipelineError::Truncated);
    }
This is a minor hardening suggestion since the crate is documented as research-grade.
crates/quasicryth-research/src/bin/qresearch.rs (1)
80-80: 💤 Low value: Division by zero for empty input files.
If the input file is empty, `data.len()` is 0 and the ratio calculation produces infinity. Consider guarding:
    let ratio = if data.is_empty() { 0.0 } else { 100.0 * compressed.len() as f64 / data.len() as f64 };
The same applies to line 145 in `run_round_trip`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@Cargo.toml`:
- Around line 50-56: Update the Cargo.toml crate description to accurately
reflect that this research crate includes more than just the algebraic core:
mention the presence of arithmetic coding (arith_coder.rs —
Model256/VModel/Encoder/Decoder), tokenization (tok.rs), codebook construction
(codebook.rs: FlatCodebook/CowRadixCodebook), MD5 hashing (md5.rs), the
compression pipeline (pipeline.rs: compress/decompress), and the qresearch CLI
binary; replace the incorrect "only the algebraic core" phrasing with a concise
note that the crate implements a simplified full pipeline relative to the
upstream reference (single-tier unigram encoding, no multi-level n-grams, no
LZMA escape) rather than claiming it omits these components.
In `@crates/quasicryth-research/Cargo.toml`:
- Around line 9-12: The top-level comment in Cargo.toml incorrectly states the
crate is "Algebraic core only" and omits features actually implemented; update
that comment to reflect the real scope by listing included components such as
arithmetic coding (arith_coder.rs), tokenization (tok.rs), and codebook
construction (codebook.rs) and remove the claim that those live only in the
upstream C reference; keep the note about default zero-deps if still true and
ensure the wording matches the README/PR summary about what this crate provides.
In `@crates/quasicryth-research/src/codebook.rs`:
- Around line 504-516: The Node256 branch currently drops keys >= 256; update
the ART node representation so Node256 no longer assumes keys fit in a 0..255
slot: replace the fixed array children in the Node256 variant with a
HashMap<u32, Arc<ArtNode>> (or another dynamic map) and then update all helpers
that touch it — specifically change put_child, child (lookup), replace_child,
and grow_to_256 to insert/lookup/replace entries in that map and ensure
grow_to_256 moves all existing Node16 children into the new HashMap (preserving
keys >= 256 instead of dropping them); keep Node16, grow_to_256, Node256,
put_child, child, and replace_child identifiers to locate the changes.
In `@crates/quasicryth-research/src/tiling.rs`:
- Around line 366-370: The test sanddrift_generates_nonempty currently only
checks non-empty output; add the missing invariant assertion by calling
verify_no_adjacent_s on the tiles produced by sanddrift_tiling(100) (i.e., after
the existing assert!(!tiles.is_empty()) add verify_no_adjacent_s(&tiles)). This
uses the existing helper verify_no_adjacent_s to ensure no adjacent 'S' tiles
and keeps the test consistent with other generator tests.
- Around line 239-281: sanddrift_tiling currently emits raw symbols with SS
pairs (from L→LSSL) and tiles them directly, violating the module invariant
checked by verify_no_adjacent_s; fix by routing the produced symbol sequence
through the existing symbols_to_tiles merger (or otherwise performing the SS→L
merge) instead of directly constructing Tile entries—specifically, in
sanddrift_tiling replace the direct tiling loop that builds Tile { wpos, nwords,
is_l } from seq with a call to symbols_to_tiles(seq[..need]) (or an equivalent
merge step) so adjacent S symbols are collapsed to L as other generators expect,
or else update the module docstring and add sanddrift_tiling to the explicit
exception list if you intend to keep adjacent S behavior.
In `@crates/quasicryth-research/tests/paper_theorems.rs`:
- Around line 171-176: The current test assertion for Sturmian bound only checks
factors.len() <= n + 1 which can hide regressions; change the check in the test
to assert exact equality (factors.len() == n + 1) for the given long prefix and
small n, updating the assertion message to reflect expected equality and include
n and actual factors.len() for debugging; locate the assertion using the symbols
factors and n in this test and replace the <= check with an equality check (and
adjust the formatted message accordingly).
---
Nitpick comments:
In `@crates/quasicryth-research/src/bin/qresearch.rs`:
- Line 80: The ratio calculation uses data.len() as divisor and will divide by
zero for empty inputs; update the computation (the line that sets let ratio =
100.0 * compressed.len() as f64 / data.len() as f64) to guard for empty data
(e.g., set ratio = 0.0 when data.is_empty()) and apply the same guarded logic
inside the run_round_trip function where a similar ratio is computed; change
only the ratio expression to a conditional based on data.is_empty() using the
existing variables compressed and data.
In `@crates/quasicryth-research/src/md5.rs`:
- Around line 158-170: The update() method currently copies input one byte at a
time which is correct but slow; refactor update(&mut self, data: &[u8]) to
handle bulk copies: compute idx = (self.count & 63) as usize and increment
count, first fill a partial buffer if idx != 0 using slice copy_from_slice, call
transform(&mut self.state, &self.buffer) if that fills to 64, then process any
complete 64-byte blocks directly by taking 64-byte slices (convert to &[u8;64]
for transform) in a loop, and finally copy any remaining tail into self.buffer;
keep the same semantics for self.count, self.buffer and transform() calls and
ensure bounds/slice lengths are handled with try_into()/unwrap or appropriate
checks.
In `@crates/quasicryth-research/src/pipeline.rs`:
- Around line 205-210: The loop over spans reads slices from lowered_pool using
(offset + len) and can panic if the compressed input is malformed; in the loop
that iterates spans (the block referencing lowered_pool, apply_case, and
out.extend_from_slice), validate that (offset + len) as usize <=
lowered_pool.len() before slicing and return an Err(PipelineError::Truncated)
(or appropriate error) when the check fails; this prevents out-of-bounds access
while keeping the rest of the logic (apply_case and extending out) unchanged.
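
A minimal sketch of the bounds check, assuming `offset` and `len` are u32 fields of a span and that `PipelineError` has a `Truncated` variant as the comment suggests; the helper name `span_slice` is hypothetical. If the span fields are narrow integers, the sum `offset + len` can itself overflow before the cast, so this version widens first and uses `checked_add`.

```rust
#[derive(Debug, PartialEq)]
enum PipelineError {
    Truncated,
}

/// Return pool[offset .. offset + len], or Truncated if the span is out of range.
fn span_slice(pool: &[u8], offset: u32, len: u32) -> Result<&[u8], PipelineError> {
    let start = offset as usize;
    // checked_add guards against overflow on targets where usize is 32 bits.
    let end = start
        .checked_add(len as usize)
        .ok_or(PipelineError::Truncated)?;
    // slice::get returns None instead of panicking on an out-of-range span.
    pool.get(start..end).ok_or(PipelineError::Truncated)
}
```

Inside the span loop, the result would be propagated with `?` before apply_case and out.extend_from_slice run, leaving the rest of the logic unchanged.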
In `@crates/quasicryth-research/tests/round_trip.rs`:
- Around line 45-55: Add a dedicated zero-byte payload test that calls the
existing test helper round_trip with an empty slice for both variants to pin the
framing edge-case; implement a new #[test] fn (e.g., round_trip_empty_input)
that invokes round_trip(b"", Variant::Flat) and round_trip(b"",
Variant::CowRadix) so both code paths are exercised.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 7a470a2b-f70f-4275-b261-b65b5a435c8a
📒 Files selected for processing (17)
- Cargo.toml
- crates/quasicryth-research/.gitignore
- crates/quasicryth-research/Cargo.toml
- crates/quasicryth-research/README.md
- crates/quasicryth-research/src/arith_coder.rs
- crates/quasicryth-research/src/bin/qresearch.rs
- crates/quasicryth-research/src/codebook.rs
- crates/quasicryth-research/src/constants.rs
- crates/quasicryth-research/src/hierarchy.rs
- crates/quasicryth-research/src/lib.rs
- crates/quasicryth-research/src/md5.rs
- crates/quasicryth-research/src/pipeline.rs
- crates/quasicryth-research/src/tiling.rs
- crates/quasicryth-research/src/tok.rs
- crates/quasicryth-research/src/types.rs
- crates/quasicryth-research/tests/paper_theorems.rs
- crates/quasicryth-research/tests/round_trip.rs
    #[test]
    fn sanddrift_generates_nonempty() {
        let tiles = sanddrift_tiling(100);
        assert!(!tiles.is_empty());
    }
Test does not verify the no-adjacent-S invariant.
Unlike tests for other generators (thue_morse_alternates_at_low_indices, rudin_shapiro_generates_nonempty, period_doubling_generates_nonempty), this test omits the verify_no_adjacent_s assertion. Once the bug in sanddrift_tiling is fixed, add the invariant check here.
💚 Proposed fix
     #[test]
     fn sanddrift_generates_nonempty() {
         let tiles = sanddrift_tiling(100);
         assert!(!tiles.is_empty());
    +    assert!(verify_no_adjacent_s(&tiles));
     }
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    #[test]
    fn sanddrift_generates_nonempty() {
        let tiles = sanddrift_tiling(100);
        assert!(!tiles.is_empty());
        assert!(verify_no_adjacent_s(&tiles));
    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/quasicryth-research/src/tiling.rs` around lines 366 - 370, The test
sanddrift_generates_nonempty currently only checks non-empty output; add the
missing invariant assertion by calling verify_no_adjacent_s on the tiles
produced by sanddrift_tiling(100) (i.e., after the existing
assert!(!tiles.is_empty()) add verify_no_adjacent_s(&tiles)). This uses the
existing helper verify_no_adjacent_s to ensure no adjacent 'S' tiles and keeps
the test consistent with other generator tests.
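
The body of verify_no_adjacent_s is not shown in this review, so the following is only a sketch of what such an adjacency invariant could look like. It assumes a two-variant `Tile` enum and a hypothetical helper `no_adjacent` that takes the forbidden tile as a parameter.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Tile {
    L,
    S,
}

/// True when no two consecutive tiles are both `forbidden`.
/// Empty and single-tile sequences trivially satisfy the invariant.
fn no_adjacent(tiles: &[Tile], forbidden: Tile) -> bool {
    tiles
        .windows(2)
        .all(|w| !(w[0] == forbidden && w[1] == forbidden))
}
```

Parameterising on the forbidden tile lets the same check express either "no SS" or "no LL", depending on which invariant a given generator is meant to satisfy.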
Addresses PR #461 review feedback.

LOAD-BEARING BUG (codex P2 / coderabbit Critical): CowArt silently dropped keys ≥ 256
==================================================

The original three-variant ART (Node4 / Node16 / Node256) was byte-keyed at the leaf level: Node256 only handled values 0..255. With u32 word-IDs, any corpus of 257+ unique words would silently lose entries from the unigram trie. Result:
- Variant::Flat round-tripped correctly (HashMap-based)
- Variant::CowRadix produced OutOfVocabulary on word_id ≥ 256 even though the codebook was sized to include every unique word

Tests masked the bug because they used 5-word vocabularies.

Fix: replace the three-variant ArtNode enum with a single sparse-children node:

    struct ArtNode {
        children: BTreeMap<u32, Arc<ArtNode>>,
        leaf: Option<u32>,
    }

- Loses the ART byte-keyed Node4/Node16/Node256 branch-free optimization. The optimization assumed byte keys; u32 keys don't fit it without per-byte decomposition (which would be a much bigger refactor).
- Gains correctness for arbitrary u32 keys including word IDs ≥ 256 (which is most real text).
- Preserves the COW property: every insert returns a new root via path-copy, prior roots stay valid. This is the architectural point of the variant, and it's what the workspace's append-only doctrine needs.
- BTreeMap (not HashMap) for deterministic iteration order, useful for any future serialization or cross-impl comparison.

Two regression tests added so this bug can't recur silently:
- cow_art_handles_arbitrary_u32_keys
  Inserts 302 keys spanning 0..300 + 1_000_000 + u32::MAX; verifies every one round-trips. The original implementation would have dropped 1_000_000 and u32::MAX silently.
- cow_radix_codebook_handles_large_vocabulary
  Builds a 300-unique-word codebook via CowRadixCodebook; asserts every word ID (including 256..299) is findable via unigram_index(). This is the exact codex P2 scenario.

Total tests: 84 (was 83). +2 from the regression tests, +1 from a renamed-and-tightened existing test.

SECONDARY FINDINGS
==================

coderabbit Critical: sanddrift_tiling docstring
  The module docstring claimed all generators satisfy the no-adjacent-S invariant, but sanddrift's substitution L→LSSL produces SS pairs by design (LL forbidden, not SS). The upstream gen_sanddrift_tiles in fib.c also bypasses the SS→L merge for the same reason: preserving the substitution structure.
  Fix: update module docstring to name sanddrift as the documented exception; rename + strengthen the sanddrift test to assert the ACTUAL invariant (LL forbidden), not the wrong one (no-adjacent-S). Behaviour unchanged; matches the C reference.

coderabbit Minor: Cargo.toml comments misrepresent crate scope
  Both workspace Cargo.toml and crate Cargo.toml had stale "algebraic core only" comments from phase 0. Updated to reflect the full pipeline shipped in phases 1-6 (arithmetic coder, tokenization, codebook variants, compress/decompress).

coderabbit Minor: Sturmian assertion too loose
  tests/paper_theorems.rs::sturmian_factor_complexity_is_n_plus_1 asserted `factors.len() <= n + 1`, which would pass for degenerate (sub-Sturmian, periodic) streams. Sturmian minimality (Paper §4.10, Thm 7 corollary) requires EXACTLY n+1 distinct length-n factors. Strengthened to assert_eq! with a clearer error message. This catches drift toward either degenerate or super-Sturmian streams.

Verification:
  cargo test --manifest-path crates/quasicryth-research/Cargo.toml
    → 68 unit + 9 paper-theorem + 7 cross-variant = 84 passed
  cargo clippy --all-targets -- -D warnings   clean
  cargo fmt                                   clean

Zero deps preserved. No unsafe.
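
The path-copy behaviour described in the commit message can be sketched with the node layout it quotes. This is a minimal illustration, not the crate's codebook.rs: the free functions `insert` and `get`, the use of a `&[u32]` key path, and the `Clone`/`Default` derives are assumptions made for the example.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

#[derive(Clone, Default)]
struct ArtNode {
    children: BTreeMap<u32, Arc<ArtNode>>,
    leaf: Option<u32>,
}

/// Insert `value` at `key` and return a NEW root. Only the nodes along the
/// key path are copied; every other subtree is shared with the old root.
fn insert(node: &Arc<ArtNode>, key: &[u32], value: u32) -> Arc<ArtNode> {
    // Shallow clone: copies this node's map of Arc pointers, not the subtrees.
    let mut copy = ArtNode::clone(node);
    match key.split_first() {
        None => copy.leaf = Some(value),
        Some((&head, rest)) => {
            // Reuse the existing child if present, else start from an empty node.
            let child = copy.children.get(&head).cloned().unwrap_or_default();
            copy.children.insert(head, insert(&child, rest, value));
        }
    }
    Arc::new(copy)
}

/// Look up `key` starting from `node`.
fn get(node: &ArtNode, key: &[u32]) -> Option<u32> {
    match key.split_first() {
        None => node.leaf,
        Some((head, rest)) => get(node.children.get(head)?, rest),
    }
}
```

Because children are keyed by full u32 values, a word ID such as 1_000_000 or u32::MAX needs no per-byte decomposition, and an older root keeps answering lookups exactly as it did before later inserts.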
Summary
Direct Rust transcode of Quasicryth (Tacconelli 2026, arxiv 2603.14999, upstream github.com/robtacconelli/quasicryth v5.6.0) in two architectural variants behind one trait: the original flat-storage codebook from the C reference, and a Copy-on-Write Adaptive Radix Tree variant that fits this workspace's append-only substrate doctrine.
New excluded crate
crates/quasicryth-research/: standalone, zero-dep, follows the helix / bgz17 / deepnsm convention.
6 phases, 6 commits
Commits: f0dfe88 (… fib.c), 68f754e, 9e229d5, afd7969, de566f6, 7fed9b9
Total: ~4,160 LOC Rust, 83 tests passing, cargo clippy -- -D warnings clean (pedantic + all), cargo fmt clean. Zero dependencies. No unsafe. Stable Rust.
Two variants behind one trait
FlatCodebook: Vec<u32> + HashMap; Send + Sync
CowRadixCodebook: Send + Sync (Arc-shared subtrees)
The COW property is explicitly tested in codebook::tests::cow_art_path_copy_preserves_old_root: art_v0 stays empty after art_v1.insert(...) and art_v2.insert(...). Tests also verify that the two variants agree on lookups (cow_radix_codebook_agrees_with_flat_on_lookups) and on end-to-end decompressed output (variants_produce_same_decompressed_output, cross_variant_independence).
What round-trips end-to-end
pipeline::compress(text, Variant::Flat | Variant::CowRadix) → bytes and pipeline::decompress(bytes) → text round-trip on every test input, including:
- … (Hello WORLD foo)
- … (café naïve façade)
Both variants produce identical decompressed output (compressed bytes may differ).
Paper-theorem verification (algebraic substrate)
tests/paper_theorems.rs verifies, on synthetic L/S sequences:
- …
- HIER_WORD_LENS = F_3..F_12, no-adjacent-S on all 36 canonical tilings
This is the mathematical underpinning that the workspace's φ-substrate decisions (bgz17's 17φ/11, helix's golden-spiral hemisphere, jc::weyl's 1-D star-discrepancy) inherit. The transcode lets the workspace cross-check those decisions against the reference algebra without depending on the upstream C build.
CLI binary
Tested live in the commit message of 7fed9b9.
Deliberate simplifications (NOT a production compressor)
Documented in module-level docs + README. The Rust pipeline is research-grade, NOT byte-compatible with the upstream .qm56 format:
- … tests/paper_theorems.rs, but the bit-stream itself only encodes word-ID symbols at the unigram tier. Multi-tier n-gram encoding is a phase 5+ extension.
- … (pipeline::build_codebook caps the unigram tier at n_unique).
- … .qm56 exactly would require porting hundreds of model-initialization details and is out of scope for "research and testing."
Implication: small inputs (sub-KB) currently produce >100% "compressed" output because headers + per-token spans dominate. This is expected and called out in the README. The architectural property (codebook + AC working end-to-end across both variants) is what's being demonstrated.
Crate policy
- … std)
- excluded from the lance-graph workspace per the helix / bgz17 / deepnsm convention
- cargo test --manifest-path crates/quasicryth-research/Cargo.toml: 83 passing
- cargo clippy --all-targets -- -D warnings: clean (pedantic + all)
- cargo fmt: clean
- Cargo.lock gitignored per helix convention
What this PR is, in one sentence
A research transcode that demonstrates the architectural variation point (FlatCodebook vs CowRadixCodebook) the workspace cares about, working end-to-end through tokenize → codebook → arithmetic-code → bytes → decode, verified by 83 tests against both variants and against the paper's five core theorems.
🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
- quasicryth-research crate providing Quasicryth v5.6.0 compression and decompression capabilities.
- qresearch command-line tool supporting compress, decompress, and round-trip verification operations.
- … Flat and CowRadix for different use cases.
Documentation