Commit bd628e3

fix(quasicryth-research): codex P2 + coderabbit review — bug + 4 nits
Addresses PR #461 review feedback.

LOAD-BEARING BUG (codex P2 / coderabbit Critical): CowArt silently dropped keys ≥ 256
==================================================

The original three-variant ART (Node4 / Node16 / Node256) was byte-keyed at the leaf level: Node256 only handled values 0..255. With u32 word-IDs, any corpus of 257+ unique words would silently lose entries from the unigram trie. Result:

- Variant::Flat round-tripped correctly (HashMap-based)
- Variant::CowRadix produced OutOfVocabulary on word_id ≥ 256 even though the codebook was sized to include every unique word

Tests masked the bug because they used 5-word vocabularies.

Fix: replace the three-variant ArtNode enum with a single sparse-children node:

    struct ArtNode {
        children: BTreeMap<u32, Arc<ArtNode>>,
        leaf: Option<u32>,
    }

- Loses the ART byte-keyed Node4/Node16/Node256 branch-free optimization. The optimization assumed byte keys; u32 keys don't fit it without per-byte decomposition (which would be a much bigger refactor).
- Gains correctness for arbitrary u32 keys including word IDs ≥ 256 (which is most real text).
- Preserves the COW property: every insert returns a new root via path-copy, prior roots stay valid. This is the architectural point of the variant, and it's what the workspace's append-only doctrine needs.
- BTreeMap (not HashMap) for deterministic iteration order, useful for any future serialization or cross-impl comparison.

Two regression tests added so this bug can't recur silently:

- cow_art_handles_arbitrary_u32_keys
  Inserts 302 keys spanning 0..300 + 1_000_000 + u32::MAX; verifies every one round-trips. The original implementation would have dropped 1_000_000 and u32::MAX silently.
- cow_radix_codebook_handles_large_vocabulary
  Builds a 300-unique-word codebook via CowRadixCodebook; asserts every word ID (including 256..299) is findable via unigram_index(). This is the exact codex P2 scenario.

Total tests: 84 (was 83). +2 from the regression tests, +1 from a renamed-and-tightened existing test.

SECONDARY FINDINGS
==================

coderabbit Critical - sanddrift_tiling docstring:
The module docstring claimed all generators satisfy the no-adjacent-S invariant, but sanddrift's substitution L→LSSL produces SS pairs by design (LL forbidden, not SS). The upstream gen_sanddrift_tiles in fib.c also bypasses the SS→L merge for the same reason: preserving the substitution structure.
Fix: update module docstring to name sanddrift as the documented exception; rename + strengthen the sanddrift test to assert the ACTUAL invariant (LL forbidden), not the wrong one (no-adjacent-S). Behaviour unchanged; matches the C reference.

coderabbit Minor - Cargo.toml comments misrepresent crate scope:
Both workspace Cargo.toml and crate Cargo.toml had stale "algebraic core only" comments from phase 0. Updated to reflect the full pipeline shipped in phases 1-6 (arithmetic coder, tokenization, codebook variants, compress/decompress).

coderabbit Minor - Sturmian assertion too loose:
tests/paper_theorems.rs::sturmian_factor_complexity_is_n_plus_1 asserted `factors.len() <= n + 1`, which would pass for degenerate (sub-Sturmian, periodic) streams. Sturmian minimality (Paper §4.10, Thm 7 corollary) requires EXACTLY n+1 distinct length-n factors. Strengthened to assert_eq! with a clearer error message. This catches drift toward either degenerate or super-Sturmian streams.

Verification:

    cargo test --manifest-path crates/quasicryth-research/Cargo.toml
      → 68 unit + 9 paper-theorem + 7 cross-variant = 84 passed
    cargo clippy --all-targets -- -D warnings   clean
    cargo fmt                                    clean

Zero deps preserved. No unsafe.
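The sparse-children node and its path-copy insert can be illustrated with a minimal sketch. Only the `ArtNode` struct shape comes from the commit message; the `insert` and `get` methods, their signatures, and the use of a `&[u32]` path as the key are assumptions made for illustration and may differ from the crate's actual API.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

// Struct shape as quoted in the commit message.
#[derive(Clone, Default, Debug)]
struct ArtNode {
    children: BTreeMap<u32, Arc<ArtNode>>,
    leaf: Option<u32>,
}

impl ArtNode {
    /// Hypothetical copy-on-write insert: returns a NEW root and leaves
    /// `self` untouched. Only nodes on the path are copied; every other
    /// subtree is shared through its `Arc`.
    fn insert(&self, path: &[u32], value: u32) -> ArtNode {
        // Shallow copy: clones the map of Arc pointers, not the subtrees.
        let mut copy = self.clone();
        match path.split_first() {
            None => copy.leaf = Some(value),
            Some((&head, rest)) => {
                let child = match self.children.get(&head) {
                    Some(existing) => existing.insert(rest, value),
                    None => ArtNode::default().insert(rest, value),
                };
                copy.children.insert(head, Arc::new(child));
            }
        }
        copy
    }

    /// Hypothetical lookup: walks one child per path element.
    fn get(&self, path: &[u32]) -> Option<u32> {
        let mut node = self;
        for key in path {
            node = &**node.children.get(key)?;
        }
        node.leaf
    }
}

fn main() {
    // Keys well past 255, mirroring the regression scenario.
    let mut root = ArtNode::default();
    for id in 0u32..300 {
        root = root.insert(&[id], id);
    }
    let before = root.clone();
    let after = before.insert(&[u32::MAX], 1);

    // Every key round-trips, including those >= 256.
    for id in 0u32..300 {
        assert_eq!(after.get(&[id]), Some(id));
    }
    assert_eq!(after.get(&[u32::MAX]), Some(1));

    // The prior root stays valid and does not see the new key.
    assert_eq!(before.get(&[u32::MAX]), None);

    // Untouched subtrees are shared between versions, not copied.
    assert!(Arc::ptr_eq(
        &before.children[&256u32],
        &after.children[&256u32]
    ));
}
```

In this sketch each insert clones the `BTreeMap` of every node on the path, so the cost grows with the fan-out of those nodes. That is the price of dropping the fixed-size Node4/Node16/Node256 layouts in exchange for arbitrary u32 keys.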
1 parent 7fed9b9 commit bd628e3

5 files changed

Lines changed: 134 additions & 262 deletions

Cargo.toml

Lines changed: 10 additions & 6 deletions
@@ -47,12 +47,16 @@ exclude = [
     # Kept out of the workspace so the nondeterministic autoencoder never
     # enters the deterministic lance-graph compile path (determinism boundary).
     "crates/lance-graph-arm-discovery",
-    # Quasicryth algebraic-core transcode (Tacconelli 2026, arxiv 2603.14999) —
-    # standalone zero-dep research/testing crate. Verifies the workspace's
-    # φ-substrate decisions (bgz17 17φ/11, helix golden-spiral, jc::weyl) against
-    # the reference algebra. NOT a production compressor: covers only the
-    # algebraic core (tilings + hierarchy + deep-position detection), not the
-    # arithmetic-coding / LZMA-escape / tokenizer pipeline.
+    # Quasicryth research transcode (Tacconelli 2026, arxiv 2603.14999) —
+    # standalone zero-dep research crate. Verifies the workspace's φ-substrate
+    # decisions (bgz17 17φ/11, helix golden-spiral, jc::weyl) against the
+    # reference algebra. Covers tilings + hierarchy + deep-position detection
+    # PLUS arithmetic coding + tokenization + codebook construction
+    # (FlatCodebook + CowRadixCodebook variants) + an end-to-end
+    # compress/decompress pipeline that round-trips under both variants.
+    # NOT byte-compatible with the upstream .qm56 output — simplifies the
+    # v5.6 multi-tier n-gram + LZMA-escape + word-LZ77 + per-context-model
+    # machinery to a single-tier unigram pipeline (see crate README).
     # Verified via `cargo test --manifest-path crates/quasicryth-research/Cargo.toml`.
     "crates/quasicryth-research",
 ]

crates/quasicryth-research/Cargo.toml

Lines changed: 15 additions & 9 deletions
@@ -4,18 +4,24 @@ version = "0.1.0"
 edition = "2021"
 license = "Apache-2.0"
 publish = false
-description = "Direct Rust transcode of the algebraic core of Quasicryth (Tacconelli 2026, arxiv 2603.14999) — Fibonacci quasicrystal tilings, substitution hierarchy, deep n-gram position detection. Research/testing crate, NOT a production compressor."
+description = "Direct Rust transcode of Quasicryth (Tacconelli 2026, arxiv 2603.14999) — Fibonacci quasicrystal tilings + substitution hierarchy + deep n-gram position detection + arithmetic coding + tokenization + codebook construction (FlatCodebook + CowRadixCodebook variants) + end-to-end compress/decompress pipeline. Research/testing crate, NOT byte-compatible with the upstream .qm56 production format."
 
 # Standalone codec constitution (matches helix / bgz17 / deepnsm): the default
-# build is ZERO dependencies. Algebraic core only — no arithmetic coding, no
-# LZMA escape, no tokenization, no codebook construction (those live in the
-# upstream C reference implementation and are out of scope for this transcode).
+# build is ZERO dependencies. Covers the algebraic core (fib.c) PLUS the
+# compression layers (ac.c arithmetic coder, tok.c tokenization, cb.c codebook
+# construction, md5.c) PLUS an end-to-end pipeline that round-trips under both
+# the original FlatCodebook AND the COW radix trie variant.
 #
-# Upstream: https://github.com/robtacconelli/quasicryth (MIT/Apache, v5.6.0).
-# This transcode covers fib.h / fib.c (598 LOC) + the algebraic types from
-# qtc.h. Paper proves five theorems (non-collapse, PV-property, Sturmian
-# minimality, Golden Compensation, bounded overhead) — the included tests
-# verify all five on synthetic data.
+# Pipeline simplifications vs. the upstream v5.6 compressor are deliberate
+# and documented in the crate README: single-tier unigram encoding (no
+# multi-level n-grams), no LZMA escape stream (OOV → error), no word-level
+# LZ77, no per-level context models. The Rust pipeline round-trips with
+# itself; it is NOT byte-identical to the upstream .qm56 format.
+#
+# Upstream: https://github.com/robtacconelli/quasicryth (v5.6.0).
+# Paper proves five theorems (non-collapse, PV-property, Sturmian minimality,
+# Golden Compensation, bounded overhead) — all five verified on synthetic
+# data in tests/paper_theorems.rs.
 [dependencies]
 
 [dev-dependencies]
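
The Sturmian minimality check named in the crate comment, and tightened from `<=` to an exact equality in this commit, can be sketched on the classic Fibonacci word. The generator below uses the textbook substitution L→LS, S→L, which is an assumption standing in for the crate's own tiling generators. The function names `fibonacci_word` and `factor_count` are hypothetical and do not refer to the code in tests/paper_theorems.rs.

```rust
use std::collections::HashSet;

/// Fibonacci word built by iterating the substitution L -> LS, S -> L.
/// (Assumed stand-in; not the crate's generator.)
fn fibonacci_word(iterations: usize) -> Vec<u8> {
    let mut word = vec![b'L'];
    for _ in 0..iterations {
        let mut next = Vec::with_capacity(word.len() * 2);
        for &c in &word {
            if c == b'L' {
                next.extend_from_slice(b"LS");
            } else {
                next.push(b'L');
            }
        }
        word = next;
    }
    word
}

/// Number of distinct length-n factors (contiguous substrings) of `word`.
/// Requires n >= 1, since `windows(0)` panics.
fn factor_count(word: &[u8], n: usize) -> usize {
    word.windows(n).collect::<HashSet<_>>().len()
}

fn main() {
    let word = fibonacci_word(20);

    // Sturmian minimality: exactly n + 1 distinct factors of each length n.
    for n in 1..=20 {
        assert_eq!(
            factor_count(&word, n),
            n + 1,
            "expected exactly n+1 factors of length {n}"
        );
    }

    // A periodic stream passes the old `<= n + 1` check but fails equality:
    // "LSLSLS..." has only two distinct factors of length 5.
    let periodic: Vec<u8> = b"LS".iter().copied().cycle().take(1000).collect();
    assert!(factor_count(&periodic, 5) <= 5 + 1);
    assert_ne!(factor_count(&periodic, 5), 5 + 1);
}
```

The periodic case at the end is why the looser bound was insufficient: a degenerate stream stays below n+1 factors and would never trip an upper-bound assertion.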
