Commit 7fed9b9

feat(quasicryth-research): phase 5+6 — cross-variant integration + CLI
Phase 5 (integration tests) + Phase 6 (CLI binary), bundled.

Phase 5 — cross-variant integration tests
=========================================

New test file tests/round_trip.rs (7 tests):
- variants_agree_on_long_natural_text — 600-char Fibonacci-theory
  paragraph round-trips under BOTH variants AND the decompressed
  outputs are identical
- round_trip_at_5kb_scale — cyclic phrase to 5 KB, both variants
- round_trip_single_word
- round_trip_only_whitespace
- round_trip_mixed_punctuation_lines (parens, hyphens, semicolons,
  quotes, tabs)
- round_trip_repeated_uppercase_word
- cross_variant_independence — compress with Flat, decompress;
  compress with CowRadix, decompress; both equal original.
  (Compressed bytes between variants MAY differ; decoded output
  MUST match.)

This is the architectural property the codebook trait contract
guarantees and the workspace's substrate doctrine requires: the COW
radix trie variant is a drop-in alternative to the flat storage
variant at the compress/decompress boundary.

Phase 6 — CLI binary
====================

New src/bin/qresearch.rs (~170 LOC):

  qresearch compress   [-v flat|cow] <input> <output>
  qresearch decompress <input> <output>
  qresearch round-trip [-v flat|cow] <input>
  qresearch --help / -h

Standard library only. Returns ExitCode::SUCCESS / ExitCode::FAILURE
with clean error messages on read/write/codec failures. The
`round-trip` subcommand reports compression ratio AND verifies
identity for quick validation on arbitrary text files.

Live test:

  $ echo "The Fibonacci substitution..." > /tmp/sample.txt
  $ qresearch round-trip /tmp/sample.txt
  round-trip OK: 95 bytes → 329 compressed (346.32%) → identical, variant=Flat
  $ qresearch round-trip -v cow /tmp/sample.txt
  round-trip OK: 95 bytes → 329 compressed (346.32%) → identical, variant=CowRadix

(The >100% ratio on 95-byte inputs is expected: v1 simplifications
mean headers + per-token spans dominate at small sizes. The C
reference's per-byte overhead amortizes over much larger inputs and
uses multi-tier n-grams + LZMA escape + word-LZ77 to get ≤25% on
enwik9. The Rust pipeline here demonstrates correctness, not
benchmark-competitive compression.)

README rewrite
==============

New README.md (180 lines) documents:
- the 7-phase transcode map (which C file → which Rust module)
- test counts per phase (total: 83)
- what's NOT byte-compatible with the upstream qm56 format
- CLI usage examples
- both codebook variants compared in a table
- the compressed stream format (v1 QRS1 magic) field by field
- relationships to bgz17 / helix / jc::weyl in the workspace
- paper-theorem verification list (Thm 2, Cor 4, Thm 9, Thm 13/Cor 15,
  Sturmian minimality, PV property)

Final totals
============

Tests: 83 (was 76)
- 67 unit (no change)
- 9 paper-theorem integration
- 7 cross-variant integration (NEW)

Verification:
  cargo test                                   → 83 passed, 0 failed
  cargo clippy --all-targets -- -D warnings    clean
  cargo fmt                                    clean
  cargo build --bin qresearch                  builds, CLI exercised

Zero dependencies. No unsafe. Stable Rust.

Full crate inventory
====================

Modules         LOC  Role
─────           ───  ────
types            97  Tile, HLevel, ParentMap, Hierarchy, DeepPositions
constants       192  PHI, INV_PHI, MAX_HIER, HIER_WORD_LENS, 36 tilings
tiling          388  cut-and-project + 5 substitution-rule families
hierarchy       308  build_hierarchy, hier_context, detect_deep_positions
md5             196  RFC 1321 (~85 LOC C transcoded)
tok             377  tokenize, word_split, apply_case, TokenStream
codebook        744  Codebook trait + FlatCodebook + CowRadixCodebook +
                     CowArt (ART with Node4/Node16/Node256, path-copy)
arith_coder     640  Model256, VModel (Fenwick), Encoder, Decoder
pipeline        460  compress, decompress, Variant, PipelineError
bin/qresearch   170  CLI (compress/decompress/round-trip)
tests/...       310  paper_theorems + round_trip integration
lib + README    280

Total: ~4,160 LOC Rust (was ~1,300 after phase 0).
1 parent de566f6 commit 7fed9b9

3 files changed

Lines changed: 396 additions & 40 deletions

README.md: 150 additions & 40 deletions
Original file line number | Diff line number | Diff line change
@@ -1,70 +1,180 @@
11
# quasicryth-research
22

3-
Direct Rust transcode of the **algebraic core** of Quasicryth (Tacconelli 2026,
3+
Direct Rust transcode of **Quasicryth** (Tacconelli 2026,
44
[arxiv 2603.14999](https://arxiv.org/abs/2603.14999), upstream
55
[github.com/robtacconelli/quasicryth](https://github.com/robtacconelli/quasicryth)
66
v5.6.0).
77

8-
**Purpose:** research and testing. Validates the workspace's φ-substrate
9-
decisions (bgz17's `17φ/11`, helix's golden-spiral hemisphere) against the
10-
reference algebra without depending on the upstream C build.
8+
**Purpose:** research and testing. Two goals:
9+
10+
1. Validate the workspace's φ-substrate decisions (bgz17's `17φ/11`, helix's
11+
golden-spiral hemisphere) against the reference algebra — without
12+
depending on the upstream C build.
13+
2. Demonstrate the **codebook architecture in two variants** behind one
14+
trait: the original flat storage shape from the C reference, and a
15+
COW Adaptive Radix Tree variant that fits the workspace's append-only
16+
substrate doctrine.
1117

1218
## What's transcoded
1319

14-
| Reference file | Rust module | What |
15-
|---|---|---|
16-
| `qtc.h` (types + constants) | `src/types.rs`, `src/constants.rs` | `Tile`, `HLevel`, `ParentMap`, `Hierarchy`, `DeepPositions`, `TilingDesc` + `PHI`, `INV_PHI`, `HIER_WORD_LENS`, the 36-tiling table |
17-
| `fib.c` (tiling generators) | `src/tiling.rs` | Cut-and-project (`qc_word_tiling[_alpha]`), Thue-Morse, Rudin-Shapiro, period-doubling, Period-5, Sanddrift |
18-
| `fib.c` (hierarchy) | `src/hierarchy.rs` | `build_hierarchy`, `hier_context`, `detect_deep_positions`, `deep_counts` |
20+
| Phase | Upstream C | Rust module | Status | Tests |
21+
|---|---|---|---|---|
22+
| 0 | `fib.c` + types from `qtc.h` | `tiling`, `hierarchy`, `constants`, `types` | shipped | 19 unit + 9 paper-theorem integration |
23+
| 1 | `md5.c` + `tok.c` (partial) | `md5`, `tok` | shipped | 8 RFC-1321 vectors + 12 tokenizer round-trips |
24+
| 2 | `cb.c` (algorithmic shape) | `codebook` (FlatCodebook + CowRadixCodebook) | shipped | 8 codebook tests, cross-variant validation |
25+
| 3 | `ac.c` | `arith_coder` (Model256, VModel, Encoder, Decoder) | shipped | 9 round-trip tests at multiple scales |
26+
| 4 | `compress.c` + `decompress.c` (simplified) | `pipeline` (compress/decompress) | shipped | 11 round-trip tests covering both variants |
27+
| 5 | integration | `tests/round_trip.rs` + `tests/paper_theorems.rs` | shipped | 7 cross-variant integration tests |
28+
| 6 | `main.c` | `bin/qresearch.rs` | shipped | CLI: compress / decompress / round-trip |
29+
30+
**Total tests: 83 passing.** Zero dependencies. No `unsafe`. Stable Rust.
31+
32+
## What this is NOT
33+
34+
The Rust pipeline is **NOT byte-compatible** with the upstream `.qm56`
35+
format. The full v5.6 production compressor ships:
36+
37+
- multi-level adaptive arithmetic coding with 144 specialised level
38+
context models + 12 per-index models + recency caches + two-tier
39+
unigram model + word-level LZ77,
40+
- 36-tiling greedy selection per block,
41+
- LZMA escape stream for OOV words,
42+
- frequency-counter pruning for memory bounds on large inputs,
43+
- multi-level n-gram codebooks (3-gram through 144-gram).
44+
45+
The Rust pipeline here is **simplified to a single-tier (unigram)
46+
encoding** so the codebook trait abstraction is the clean variation
47+
point and the COW radix trie variant is exercised end-to-end. Multi-tier
48+
n-gram encoding is left as a post-phase-6 extension. The Fibonacci tiling +
49+
substitution hierarchy + deep-position detection are verified to
50+
satisfy the paper's five core theorems (see
51+
`tests/paper_theorems.rs`), but the bit-stream itself only encodes
52+
word-ID symbols at the unigram tier.
53+
54+
The Rust pipeline **round-trips with itself**: `decompress(compress(x))
55+
== x` for every test input under both `Variant::Flat` and
56+
`Variant::CowRadix`. This is the property the integration tests verify.
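
The round-trip property can be sketched as a small generic harness. This is an illustration only, not the crate's test code: `RoundTrip`, `round_trip`, and `ratio_percent` are hypothetical names, and the codec is passed in as closures so the snippet does not depend on the exact signatures of the crate's `compress` and `decompress`.

```rust
/// Outcome of one compress -> decompress cycle.
#[derive(Debug, PartialEq)]
struct RoundTrip {
    original_len: usize,
    compressed_len: usize,
    identical: bool,
}

impl RoundTrip {
    /// Compressed size as a percentage of the original.
    /// Returns `None` for empty input, where the ratio is undefined.
    fn ratio_percent(&self) -> Option<f64> {
        if self.original_len == 0 {
            return None;
        }
        Some(100.0 * self.compressed_len as f64 / self.original_len as f64)
    }
}

/// Runs `compress` then `decompress` and records whether the bytes survived.
/// Both closures are stand-ins (assumption) for a real codec.
fn round_trip<E>(
    data: &[u8],
    compress: impl Fn(&[u8]) -> Result<Vec<u8>, E>,
    decompress: impl Fn(&[u8]) -> Result<Vec<u8>, E>,
) -> Result<RoundTrip, E> {
    let compressed = compress(data)?;
    let decoded = decompress(&compressed)?;
    Ok(RoundTrip {
        original_len: data.len(),
        compressed_len: compressed.len(),
        identical: decoded == data,
    })
}
```

With the figures quoted in the commit message (95 input bytes, 329 compressed bytes), `ratio_percent` gives roughly 346.32%. Returning `None` for empty input avoids printing a meaningless `inf` or `NaN` ratio.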
57+
58+
## Verification
1959

20-
## What's NOT transcoded
60+
```bash
61+
# All 83 tests:
62+
cargo test --manifest-path crates/quasicryth-research/Cargo.toml
2163

22-
The reference C compressor ships a full pipeline above the algebraic core:
23-
multi-level adaptive arithmetic coding, two-tier unigram model, word-level
24-
LZ77, codebook construction, LZMA escape stream, tokenization, case separation,
25-
header assembly. **None of those are in this crate.** This is the algebra, not
26-
the compressor.
64+
# Paper-theorem suite only (the original 9 algebraic claims):
65+
cargo test --manifest-path crates/quasicryth-research/Cargo.toml \
66+
--test paper_theorems
2767

28-
## Verification
68+
# Cross-variant integration suite (Phase 5):
69+
cargo test --manifest-path crates/quasicryth-research/Cargo.toml \
70+
--test round_trip
71+
```
72+
73+
Lint discipline:
74+
75+
```bash
76+
cargo clippy --manifest-path crates/quasicryth-research/Cargo.toml \
77+
--all-targets -- -D warnings
78+
cargo fmt --manifest-path crates/quasicryth-research/Cargo.toml --check
79+
```
80+
81+
Both clean.
82+
83+
## CLI
84+
85+
```bash
86+
cargo build --release --manifest-path crates/quasicryth-research/Cargo.toml \
87+
--bin qresearch
88+
./crates/quasicryth-research/target/release/qresearch round-trip /path/to/file.txt
89+
./crates/quasicryth-research/target/release/qresearch round-trip -v cow /path/to/file.txt
90+
./crates/quasicryth-research/target/release/qresearch compress -v flat in.txt out.qrs1
91+
./crates/quasicryth-research/target/release/qresearch decompress out.qrs1 in.txt.recovered
92+
```
93+
94+
The `round-trip` subcommand compresses to memory, decompresses, and
95+
verifies identity — useful for quick fuzz-style validation on real text.
2996

30-
Tests cover the five core theorems of the paper:
97+
## Paper-theorem verification (algebraic substrate)
98+
99+
Tests in `tests/paper_theorems.rs` verify, on synthetic L/S sequences:
31100

32101
- **Thm 2** Fibonacci hierarchy never collapses (both L and S supertiles persist).
33102
- **Cor 4** Period-5 collapses by level 4 or 5 (vs Fibonacci's unbounded depth).
34103
- **Thm 9** Golden Compensation: L:S ratio = φ at every level.
35104
- **Thm 13/Cor 15** Aperiodic advantage grows with scale.
36-
- **Sturmian** Factor complexity ≤ n+1 (the minimality property that gives
37-
maximal codebook efficiency, Thm 7).
105+
- **Sturmian** Factor complexity ≤ n+1 (the minimality property that
106+
gives maximal codebook efficiency, Thm 7).
38107

39108
Plus algebraic and structural invariants: PV-property (φ² = φ+1),
40-
`HIER_WORD_LENS` = Fibonacci numbers `F_3..F_12`, no-adjacent-S on all 36
41-
canonical tilings.
109+
`HIER_WORD_LENS = F_3..F_12`, no-adjacent-S on all 36 canonical tilings.
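
A minimal sketch of how invariants of this kind can be checked on a synthetic L/S sequence. It is not the code in `tests/paper_theorems.rs`; the function names are hypothetical, and it assumes the standard Fibonacci substitution L → LS, S → L, which may differ in convention from the crate's tiling generators.

```rust
use std::collections::HashSet;

/// Applies the substitution L -> LS, S -> L `iterations` times, starting from "L".
fn fibonacci_word(iterations: usize) -> Vec<u8> {
    let mut word = vec![b'L'];
    for _ in 0..iterations {
        let mut next = Vec::with_capacity(word.len() * 2);
        for &tile in &word {
            match tile {
                b'L' => next.extend_from_slice(b"LS"),
                _ => next.push(b'L'),
            }
        }
        word = next;
    }
    word
}

/// True if two S tiles ever sit next to each other.
fn has_adjacent_s(word: &[u8]) -> bool {
    word.windows(2).any(|w| w == b"SS")
}

/// Returns (number of L tiles, number of S tiles).
fn tile_counts(word: &[u8]) -> (usize, usize) {
    let l = word.iter().filter(|&&t| t == b'L').count();
    (l, word.len() - l)
}

/// Number of distinct length-`n` factors (substrings) of `word`; `n` must be >= 1.
fn factor_count(word: &[u8], n: usize) -> usize {
    word.windows(n).collect::<HashSet<_>>().len()
}
```

After ten iterations the word has 144 tiles (89 L and 55 S), so the L:S ratio 89/55 is already within 0.001 of φ, no two S tiles are adjacent, and the number of distinct factors of length n is n + 1 for small n, which is the Sturmian bound.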
110+
111+
## Codebook variants
112+
113+
| Property | `FlatCodebook` | `CowRadixCodebook` |
114+
|---|---|---|
115+
| Storage | flat `Vec<u32>` per tier + `HashMap` for lookup | Adaptive Radix Tree (Node4 / Node16 / Node256) per tier |
116+
| Build cost | O(n log n) — frequency sort | O(n log n) sort + O(n · key_len) trie inserts |
117+
| Lookup | O(1) average (HashMap) | O(key_len) tree walk |
118+
| Memory | dense | sparse, shared across versions (Arc) |
119+
| Versioning | no | **path-copy COW** — every insert returns a new root, prior roots stay valid |
120+
| Append-only | no | **yes** — fits the workspace's substrate doctrine |
121+
| Threading | `Send + Sync` (immutable post-build) | `Send + Sync` (Arc-shared subtrees, immutable) |
122+
123+
Both implement the `Codebook` trait. The `pipeline::Variant` enum
124+
picks between them. Tests validate equivalence: under identical
125+
inputs, both produce the same `compress → decompress` output.
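
The flat variant's build and lookup costs from the table can be illustrated with a small sketch. This is a hedged stand-in, not the crate's `Codebook` trait or `FlatCodebook`: the trait `WordBook`, the type `FlatBook`, and their method names are hypothetical, and only a single tier is modelled.

```rust
use std::collections::HashMap;

/// Hypothetical minimal codebook interface (assumption, not the crate's trait).
trait WordBook {
    fn id_of(&self, word: &[u8]) -> Option<u32>;
    fn word_of(&self, id: u32) -> Option<&[u8]>;
}

/// Flat storage: a dense vector indexed by ID plus a hash map for reverse lookup.
struct FlatBook {
    words: Vec<Vec<u8>>,
    index: HashMap<Vec<u8>, u32>,
}

impl FlatBook {
    /// Counts token frequencies, then assigns IDs in descending frequency
    /// order (the O(n log n) sort). Ties are broken by byte order so the
    /// build is deterministic.
    fn build<'a>(tokens: impl IntoIterator<Item = &'a [u8]>) -> Self {
        let mut counts: HashMap<&[u8], usize> = HashMap::new();
        for token in tokens {
            *counts.entry(token).or_insert(0) += 1;
        }
        let mut ranked: Vec<(&[u8], usize)> = counts.into_iter().collect();
        ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(b.0)));
        let words: Vec<Vec<u8>> = ranked.iter().map(|(w, _)| w.to_vec()).collect();
        let index = words
            .iter()
            .enumerate()
            .map(|(i, w)| (w.clone(), i as u32))
            .collect();
        FlatBook { words, index }
    }
}

impl WordBook for FlatBook {
    /// O(1) average lookup through the hash map.
    fn id_of(&self, word: &[u8]) -> Option<u32> {
        self.index.get(word).copied()
    }

    /// Direct indexing into the dense vector.
    fn word_of(&self, id: u32) -> Option<&[u8]> {
        self.words.get(id as usize).map(Vec::as_slice)
    }
}
```

Assigning the smallest IDs to the most frequent words is a common choice for entropy-coded indices; whether the crate orders IDs exactly this way is an assumption here.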
126+
127+
The COW property is **explicitly exercised** in
128+
`codebook::tests::cow_art_path_copy_preserves_old_root`: `art_v0`
129+
stays empty after `art_v1.insert(...)` and `art_v2.insert(...)`;
130+
each version sees only its own inserts. This is the architectural
131+
property the workspace's append-only doctrine requires.
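
Path-copy COW can be sketched with a much simpler structure than an ART. The snippet below is an assumption-laden illustration, not the crate's `CowArt`: `Trie`, `Node`, and `insert_node` are hypothetical names, and each node keeps its children in a `BTreeMap` instead of Node4 / Node16 / Node256 layouts. What it shares with the real design is the idea that an insert copies only the nodes on the path to the key and reuses every other subtree through `Arc`.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// One trie node: an optional value plus shared child pointers.
#[derive(Clone, Default)]
struct Node {
    value: Option<u32>,
    children: BTreeMap<u8, Arc<Node>>,
}

/// An immutable trie version. Cloning or inserting never mutates older versions.
#[derive(Clone, Default)]
struct Trie {
    root: Arc<Node>,
}

impl Trie {
    /// Returns a NEW version containing `key`; `self` is left untouched.
    fn insert(&self, key: &[u8], value: u32) -> Trie {
        Trie { root: Arc::new(insert_node(&self.root, key, value)) }
    }

    /// Walks the key byte by byte: O(key_len).
    fn get(&self, key: &[u8]) -> Option<u32> {
        let mut node = &self.root;
        for byte in key {
            node = node.children.get(byte)?;
        }
        node.value
    }
}

/// Copies only the nodes along the path to `key`.
fn insert_node(node: &Node, key: &[u8], value: u32) -> Node {
    // Shallow copy: clones the Arc pointers, not the subtrees behind them.
    let mut copy = node.clone();
    match key.split_first() {
        None => copy.value = Some(value),
        Some((first, rest)) => {
            let child = match node.children.get(first) {
                Some(existing) => insert_node(existing, rest, value),
                None => insert_node(&Node::default(), rest, value),
            };
            copy.children.insert(*first, Arc::new(child));
        }
    }
    copy
}
```

Starting from an empty `v0`, `v1 = v0.insert(b"the", 1)` and `v2 = v1.insert(b"then", 2)` leave `v0` empty and `v1` without `"then"`, which mirrors the behaviour the test named above checks for `art_v0`, `art_v1`, and `art_v2`.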
132+
133+
## Compressed stream format (v1)
134+
135+
The Rust pipeline writes a self-contained format with the magic
136+
`QRS1` ("Quasicryth Research Simplified v1"):
42137

43138
```
44-
cargo test --manifest-path crates/quasicryth-research/Cargo.toml
139+
magic : [4] "QRS1"
140+
orig_size : u64 little-endian
141+
n_tokens : u32
142+
n_words : u32
143+
n_unique : u32
144+
lowered_size: u32
145+
lowered : [u8; lowered_size] the lowered byte stream
146+
spans : [(u32 offset, u32 len, u8 case_flag); n_tokens]
147+
case_size : u32
148+
case_data : [u8; case_size] AC over Model256 (token case flags)
149+
word_size : u32
150+
word_data : [u8; word_size] AC over VModel (codebook indices)
45151
```
46152

47-
## Crate policy
48-
49-
Standalone, zero-dependency, `exclude`d from the lance-graph workspace —
50-
same convention as `bgz17`, `deepnsm`, `helix`, `bgz-tensor`. Verified via
51-
`cargo test --manifest-path`.
153+
This is **not** the upstream `.qm56` format — by design, see the
154+
"What this is NOT" section above.
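
As a reading aid, the fixed-size prefix of the layout above can be parsed as follows. This is a sketch under stated assumptions, not the crate's decoder: `Qrs1Header`, `parse_header`, and `spans_len` are hypothetical names; the listing only states little-endian for `orig_size`, so little-endian `u32` fields are an assumption; and `spans_len` assumes each span is packed as 4 + 4 + 1 = 9 bytes with no padding.

```rust
/// The fixed-size fields at the start of a QRS1 stream.
#[derive(Debug, PartialEq)]
struct Qrs1Header {
    orig_size: u64,
    n_tokens: u32,
    n_words: u32,
    n_unique: u32,
    lowered_size: u32,
}

/// magic (4) + orig_size (8) + four u32 fields (16).
const HEADER_LEN: usize = 28;

/// Reads a little-endian u32 at `offset` (assumed byte order).
fn read_u32_le(bytes: &[u8], offset: usize) -> u32 {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(&bytes[offset..offset + 4]);
    u32::from_le_bytes(buf)
}

/// Returns `None` if the input is too short or the magic is not "QRS1".
fn parse_header(bytes: &[u8]) -> Option<Qrs1Header> {
    let head = bytes.get(..HEADER_LEN)?;
    if &head[..4] != b"QRS1" {
        return None;
    }
    let mut size = [0u8; 8];
    size.copy_from_slice(&head[4..12]);
    Some(Qrs1Header {
        orig_size: u64::from_le_bytes(size),
        n_tokens: read_u32_le(head, 12),
        n_words: read_u32_le(head, 16),
        n_unique: read_u32_le(head, 20),
        lowered_size: read_u32_le(head, 24),
    })
}

impl Qrs1Header {
    /// Size of the span table, assuming 9 packed bytes per token.
    fn spans_len(&self) -> u64 {
        u64::from(self.n_tokens) * 9
    }
}
```

Under these assumptions a stream carries 28 header bytes, a full copy of the lowered text, and 9 bytes per token before any arithmetic-coded data, which is consistent with the commit message's note that headers and per-token spans dominate on very small inputs.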
52155

53156
## Relationship to workspace crates
54157

55-
- **bgz17** — uses 17φ/11 ≈ 5/2 (major tenth) as octave-stacking constant
56-
for codebook hierarchy depth; this crate verifies the **non-collapse theorem
57-
that justifies the choice of φ over rational approximations**.
58-
- **helix** — uses pure φ for golden-angle azimuth and √u for equal-area
59-
hemisphere placement; this crate verifies the **Sturmian minimality** that
60-
makes φ optimal among irrational slopes.
61-
- **jc::weyl** — proves 1-D `{k·φ⁻¹ mod 1}` star-discrepancy is minimal at
62-
N=144 and N=1000; this crate's `qc_word_tiling` exercises the same φ-stride
63-
at hierarchy scale.
158+
- **bgz17** — uses `17φ/11 ≈ 5/2` (major tenth) as octave-stacking
159+
constant for codebook hierarchy depth; this crate verifies the
160+
non-collapse theorem that justifies φ over rational stacking
161+
approximations.
162+
- **helix** — uses pure φ for golden-angle azimuth and `√u` for
163+
equal-area hemisphere placement; this crate verifies the Sturmian
164+
minimality that makes φ optimal among irrational slopes.
165+
- **jc::weyl** — proves 1-D `{k·φ⁻¹ mod 1}` star-discrepancy is
166+
minimal at N=144 and N=1000; this crate's `qc_word_tiling`
167+
exercises the same φ-stride at hierarchy scale.
64168

65169
## Upstream
66170

67-
`https://github.com/robtacconelli/quasicryth` — v5.6.0 as of the transcode
68-
date. The upstream is the canonical reference; this crate tracks its
69-
algebraic surface only and does not attempt byte-for-byte compatibility
70-
with its compressed output.
171+
`https://github.com/robtacconelli/quasicryth` — v5.6.0 as of the
172+
transcode date. The upstream is the canonical reference; this crate
173+
tracks its algebraic surface and pipeline shape only and does NOT
174+
attempt byte-for-byte compatibility with its compressed output.
175+
176+
## Crate policy
177+
178+
Standalone, zero-dependency, `exclude`d from the lance-graph
179+
workspace — same convention as `bgz17`, `deepnsm`, `helix`,
180+
`bgz-tensor`. Verified via `cargo test --manifest-path`.
src/bin/qresearch.rs: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
1+
//! `qresearch` — command-line interface for the quasicryth-research
2+
//! compressor pipeline.
3+
//!
4+
//! Subcommands:
5+
//! - `compress [-v flat|cow] <input> <output>` — read `input`, compress,
6+
//! write to `output`. Default variant: flat.
7+
//! - `decompress <input> <output>` — read compressed `input`, decompress,
8+
//! write to `output`.
9+
//! - `round-trip [-v flat|cow] <input>` — compress `input` to memory,
10+
//! decompress, verify identity, print stats.
11+
//!
12+
//! NOTE: this binary is **research-grade**. It is NOT byte-compatible
13+
//! with the upstream `quasicryth` v5.6 `.qm56` format. See the crate
14+
//! README for the format spec and the full list of simplifications.
15+
16+
use std::fs;
17+
use std::process::ExitCode;
18+
19+
use quasicryth_research::pipeline::{compress, decompress, Variant};
20+
21+
fn main() -> ExitCode {
22+
let args: Vec<String> = std::env::args().skip(1).collect();
23+
let args_ref: Vec<&str> = args.iter().map(String::as_str).collect();
24+
25+
match args_ref.as_slice() {
26+
["compress", "-v", "flat", input, output] | ["compress", input, output] => {
27+
run_compress(input, output, Variant::Flat)
28+
}
29+
["compress", "-v", "cow", input, output] => run_compress(input, output, Variant::CowRadix),
30+
["decompress", input, output] => run_decompress(input, output),
31+
["round-trip", "-v", "flat", input] | ["round-trip", input] => {
32+
run_round_trip(input, Variant::Flat)
33+
}
34+
["round-trip", "-v", "cow", input] => run_round_trip(input, Variant::CowRadix),
35+
["--help"] | ["-h"] | [] => {
36+
print_usage();
37+
ExitCode::SUCCESS
38+
}
39+
_ => {
40+
eprintln!("error: unrecognized arguments");
41+
print_usage();
42+
ExitCode::FAILURE
43+
}
44+
}
45+
}
46+
47+
fn print_usage() {
48+
eprintln!(
49+
"qresearch — quasicryth-research CLI
50+
51+
USAGE:
52+
qresearch compress [-v flat|cow] <input> <output>
53+
qresearch decompress <input> <output>
54+
qresearch round-trip [-v flat|cow] <input>
55+
56+
Default variant: flat.
57+
Research-grade only — NOT byte-compatible with the upstream qm56 format."
58+
);
59+
}
60+
61+
fn run_compress(input: &str, output: &str, variant: Variant) -> ExitCode {
62+
let data = match fs::read(input) {
63+
Ok(d) => d,
64+
Err(e) => {
65+
eprintln!("error reading {input}: {e}");
66+
return ExitCode::FAILURE;
67+
}
68+
};
69+
let compressed = match compress(&data, variant) {
70+
Ok(c) => c,
71+
Err(e) => {
72+
eprintln!("compress failed: {e}");
73+
return ExitCode::FAILURE;
74+
}
75+
};
76+
if let Err(e) = fs::write(output, &compressed) {
77+
eprintln!("error writing {output}: {e}");
78+
return ExitCode::FAILURE;
79+
}
80+
let ratio = 100.0 * compressed.len() as f64 / data.len() as f64;
81+
println!(
82+
"compressed {} → {} ({} → {} bytes, {:.2}%, variant={:?})",
83+
input,
84+
output,
85+
data.len(),
86+
compressed.len(),
87+
ratio,
88+
variant
89+
);
90+
ExitCode::SUCCESS
91+
}
92+
93+
fn run_decompress(input: &str, output: &str) -> ExitCode {
94+
let data = match fs::read(input) {
95+
Ok(d) => d,
96+
Err(e) => {
97+
eprintln!("error reading {input}: {e}");
98+
return ExitCode::FAILURE;
99+
}
100+
};
101+
let decompressed = match decompress(&data) {
102+
Ok(d) => d,
103+
Err(e) => {
104+
eprintln!("decompress failed: {e}");
105+
return ExitCode::FAILURE;
106+
}
107+
};
108+
if let Err(e) = fs::write(output, &decompressed) {
109+
eprintln!("error writing {output}: {e}");
110+
return ExitCode::FAILURE;
111+
}
112+
println!(
113+
"decompressed {} → {} ({} → {} bytes)",
114+
input,
115+
output,
116+
data.len(),
117+
decompressed.len()
118+
);
119+
ExitCode::SUCCESS
120+
}
121+
122+
fn run_round_trip(input: &str, variant: Variant) -> ExitCode {
123+
let data = match fs::read(input) {
124+
Ok(d) => d,
125+
Err(e) => {
126+
eprintln!("error reading {input}: {e}");
127+
return ExitCode::FAILURE;
128+
}
129+
};
130+
let compressed = match compress(&data, variant) {
131+
Ok(c) => c,
132+
Err(e) => {
133+
eprintln!("compress failed: {e}");
134+
return ExitCode::FAILURE;
135+
}
136+
};
137+
let decompressed = match decompress(&compressed) {
138+
Ok(d) => d,
139+
Err(e) => {
140+
eprintln!("decompress failed: {e}");
141+
return ExitCode::FAILURE;
142+
}
143+
};
144+
if decompressed == data {
145+
let ratio = 100.0 * compressed.len() as f64 / data.len() as f64;
146+
println!(
147+
"round-trip OK: {} bytes → {} compressed ({:.2}%) → identical decompressed, variant={:?}",
148+
data.len(),
149+
compressed.len(),
150+
ratio,
151+
variant
152+
);
153+
ExitCode::SUCCESS
154+
} else {
155+
eprintln!(
156+
"round-trip MISMATCH: input {} bytes, decoded {} bytes (variant={:?})",
157+
data.len(),
158+
decompressed.len(),
159+
variant
160+
);
161+
ExitCode::FAILURE
162+
}
163+
}
