Quaternary Quantization: Predictions

Related documents: DESIGN.md · TESTING.md

Section references of the form §D-x.y refer to DESIGN.md. Section references of the form §T-x refer to TESTING.md.



P1 — The canonical CGAT mapping

The Gray map $\phi$ of §D-2.7 assigns 2-bit encodings to the four Z₄ symbols. The unique assignment of DNA bases $\{G, A, C, T\}$ to $\{0, 1, 2, 3\}$ that makes Watson–Crick complementarity coincide with the complement involution $\theta$ (§D-2.8) is:

| Z₄ | $\phi$ | DNA base | Ring structure | Functional group |
|----|--------|----------|----------------|------------------|
| 0  | 00     | G        | Purine (two-ring) | Keto |
| 1  | 01     | A        | Purine (two-ring) | Amino |
| 2  | 11     | C        | Pyrimidine (one-ring) | Amino |
| 3  | 10     | T        | Pyrimidine (one-ring) | Keto |

The mapping is unique up to relabelling. Watson–Crick pairs are $\{G, C\}$ and $\{A, T\}$; the involution $\theta$ pairs $\{0,2\}$ and $\{1,3\}$. Mapping $G \mapsto 0$, $C \mapsto 2$, $A \mapsto 1$, $T \mapsto 3$ (or its complement relabelling) is the assignment satisfying $\theta(G) = C$ and $\theta(A) = T$.

The Gray bits encode the two chemical classification axes.

  • Bit 0 (MSB of $\phi$): $0$ for $\{G, A\}$, $1$ for $\{C, T\}$. This is the purine/pyrimidine distinction — whether the base has two fused rings or one.
  • Bit 1 (LSB of $\phi$): $0$ for $\{G, T\}$, $1$ for $\{A, C\}$. This is the keto/amino distinction — whether the base carries a carbonyl or an amino group at the relevant position.

These two axes are the standard chemical classification of nucleobases used independently of the Watson–Crick model. The Gray code encodes them simultaneously, with no design choices remaining.

The Lee metric has a chemical interpretation.

  • Lee distance 1 pairs — one chemical property changes:
    • $G$–$A$ (both purines, differ in keto/amino): transition mutation
    • $C$–$T$ (both pyrimidines, differ in keto/amino): transition mutation
    • $G$–$T$ (both keto, differ in purine/pyrimidine): transversion type 1
    • $A$–$C$ (both amino, differ in purine/pyrimidine): transversion type 1
  • Lee distance 2 pairs — both chemical properties change:
    • $G$–$C$: Watson–Crick complement pair
    • $A$–$T$: Watson–Crick complement pair

Lee distance 2 is exactly Watson–Crick complementarity. In molecular evolution, the Lee-distance-2 substitutions (complement transversions) are the most disruptive because they cross both chemical axes simultaneously.

Consequence for retrieval. The Lee metric on Z₄ is equivalent to Hamming distance in 2D chemical property space $\{\text{purine/pyrimidine}\} \times \{\text{keto/amino}\}$. The mapping from embedding space to Z₄ is therefore a projection onto a chemically natural 2D coordinate system. This is not an analogy; it is the same algebraic structure.
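The isometry can be verified exhaustively in a few lines. A minimal sketch, assuming Z₄ symbols are represented as the integers 0–3 (function and variable names are illustrative):

```python
# Gray map phi from P1: Z4 symbol -> 2-bit code.
GRAY = {0: 0b00, 1: 0b01, 2: 0b11, 3: 0b10}

def lee4(a: int, b: int) -> int:
    """Lee distance on Z4: cyclic distance min(|a-b|, 4-|a-b|)."""
    d = abs(a - b)
    return min(d, 4 - d)

def hamming2(a: int, b: int) -> int:
    """Hamming distance between the 2-bit Gray encodings of a and b."""
    return bin(GRAY[a] ^ GRAY[b]).count("1")

def theta(x: int) -> int:
    """Complement involution: pairs {0,2} and {1,3}."""
    return (x + 2) % 4

# The Gray map is an isometry: Lee distance equals Hamming distance on Gray bits.
assert all(lee4(a, b) == hamming2(a, b) for a in range(4) for b in range(4))
# Watson-Crick complement pairs sit at Lee distance 2.
assert all(lee4(x, theta(x)) == 2 for x in range(4))
```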

The two chemical axes and their intersection, showing all four bases and their Lee distances:

```mermaid
quadrantChart
    title CGAT mapping: purine/pyrimidine × keto/amino axes
    x-axis Keto --> Amino
    y-axis Pyrimidine --> Purine
    quadrant-1 Purine + Amino
    quadrant-2 Purine + Keto
    quadrant-3 Pyrimidine + Keto
    quadrant-4 Pyrimidine + Amino
    G (phi=00, Z4=0): [0.15, 0.75]
    A (phi=01, Z4=1): [0.85, 0.75]
    T (phi=10, Z4=3): [0.15, 0.25]
    C (phi=11, Z4=2): [0.85, 0.25]
```

Watson–Crick pairs (G–C and A–T) are diagonally opposite: Lee distance 2. Adjacent pairs (G–A, C–T, G–T, A–C) differ in exactly one axis: Lee distance 1.


P2 — Codon palindromes and hairpin density

A codon is any triplet of consecutive symbols in a transition sequence $R = (r_0, r_1, \ldots, r_{k-1})$. Adjacent symbols in $R$ are always distinct ($r_i \neq r_{i+1}$ for all $i$, by construction of run-reduction).

Definition. A codon $(r_i,\ r_{i+1},\ r_i)$ is a palindrome codon. It represents a trajectory that departs from state $r_i$, visits $r_{i+1}$, and returns.

The three palindrome types differ by the Lee distance of the intermediate step:

| Form | Example | Lee dist. | Character |
|------|---------|-----------|-----------|
| Adjacent palindrome | $(A, B, A)$ | 1 | Narrow excursion; returns from nearest neighbour |
| Cyclic-wrap palindrome | $(A, D, A)$ | 1 | Returns from cyclic-adjacent extremum |
| Complement palindrome | $(A, C, A)$, $(B, D, B)$ | 2 | Visits semantic antipode and returns |

A complement palindrome is exactly $(x,\ \theta(x),\ x)$: the trajectory crosses to the complement state and returns. In the Gray encoding, $\phi(\theta(x)) = \overline{\phi(x)}$, so the intermediate symbol has all bits flipped. The biological correspondence is direct: an RNA hairpin loop forms where a strand folds back on itself via Watson–Crick complementarity; here the fold is in the transition sequence, and the complementarity is $\theta$.

A complement palindrome codon $(A, C, A)$ in a transition sequence — visiting the semantic antipode and returning:

```mermaid
graph LR
    A1("A") --"step out\nLee 2"--> C("C = θ(A)") --"return\nLee 2"--> A2("A")
    style C fill:#f96,stroke:#c63
```

All three palindrome types on the Z₄ cycle:

```mermaid
graph TD
    subgraph Adjacent["Adjacent palindrome (A,B,A) — Lee 1"]
        direction LR
        aa("A") --"Lee 1"--> ab("B") --"Lee 1"--> aa2("A")
    end
    subgraph Cyclic["Cyclic-wrap palindrome (A,D,A) — Lee 1"]
        direction LR
        ca("A") --"Lee 1 (wrap)"--> cd("D") --"Lee 1 (wrap)"--> ca2("A")
    end
    subgraph Complement["Complement palindrome (A,C,A) — Lee 2 ★"]
        direction LR
        cpa("A") --"Lee 2"--> cpc("C = θ(A)") --"Lee 2"--> cpa2("A")
        style cpc fill:#f96,stroke:#c63
    end
```

Hairpin density. Define:

$$\rho_{\text{hp}}(R) = \frac{\bigl|\{\, i : r_{i+1} = \theta(r_i)\ \text{and}\ r_{i+2} = r_i \,\}\bigr|}{|R| - 2}$$

for $|R| \geq 3$, and $0$ otherwise.

Null baseline. For a uniformly random transition sequence of length $|R|$, conditioning on $r_i = x$ gives $r_{i+1}$ uniform over the three values $\{A,B,C,D\} \setminus \{x\}$, so $P(r_{i+1} = \theta(r_i)) = 1/3$. Given $r_{i+1} = \theta(r_i)$, $r_{i+2}$ is uniform over the three values excluding $r_{i+1}$, exactly one of which equals $r_i$, so $P(r_{i+2} = r_i) = 1/3$:

$$\mathbb{E}[\rho_{\text{hp}}]_{\text{null}} = \frac{1}{3} \cdot \frac{1}{3} = \frac{1}{9} \approx 0.111$$
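Both the density and its null baseline can be estimated directly. A sketch, assuming Z₄ symbols 0–3 with $\theta(x) = x + 2 \bmod 4$; the Monte Carlo check below simply reproduces the $1/9$ expectation:

```python
import random

def theta(x: int) -> int:
    return (x + 2) % 4  # complement involution on Z4

def rho_hp(R) -> float:
    """Hairpin density: fraction of positions i where (r_i, r_{i+1}, r_{i+2})
    is a complement palindrome codon (r_i, theta(r_i), r_i)."""
    if len(R) < 3:
        return 0.0
    hits = sum(1 for i in range(len(R) - 2)
               if R[i + 1] == theta(R[i]) and R[i + 2] == R[i])
    return hits / (len(R) - 2)

def random_transition_sequence(k: int, rng: random.Random):
    """Uniform null: each successor drawn from the three symbols != predecessor."""
    seq = [rng.randrange(4)]
    while len(seq) < k:
        seq.append(rng.choice([s for s in range(4) if s != seq[-1]]))
    return seq

# Monte Carlo estimate of the null baseline: should be close to 1/9.
rng = random.Random(0)
est = rho_hp(random_transition_sequence(30_000, rng))
assert abs(est - 1 / 9) < 0.015
```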

Prediction. A complement palindrome codon in a document's transition sequence indicates a dimension that was visited in opposition and returned from: the embedding crossed to the semantic complement of some direction before settling back. Documents containing dialectical or contrastive language should produce transition sequences with $\rho_{\text{hp}} > 1/9$; documents that commit to a semantic complement without returning should produce $\rho_{\text{hp}} < 1/9$.

Three probe classes make this precise:

| Class | Example | Predicted $\rho_{\text{hp}}$ |
|-------|---------|------------------------------|
| Direct | "Optimism is warranted." | $\approx 1/9$ |
| Dialectical | "Optimism is warranted; one could argue for pessimism, yet on balance optimism prevails." | $> 1/9$ |
| Negated | "Optimism is not warranted." | $< 1/9$ |

The ordering prediction is:

$$\rho_{\text{hp}}(\text{Dialectical}) > \rho_{\text{hp}}(\text{Direct}) \approx \frac{1}{9} > \rho_{\text{hp}}(\text{Negated})$$

Antonym retrieval corollary. Two documents about semantically opposed concepts (e.g. optimism and pessimism) both traverse the same semantic axis — one as $(x,\ \theta(x),\ x)$ and the other as $(\theta(x),\ x,\ \theta(x))$. The set of complement palindrome codons in their respective transition sequences should overlap more than it does for two unrelated documents. Complement-palindrome overlap between query and candidate is therefore a signal of semantic opposition, orthogonal to Lee distance.

Test protocol: §T-2 (code corpus), §T-3 (embedding models).


P3 — Complement bigram suppression (CpG analog)

In vertebrate genomes, the dinucleotide CG is severely underrepresented — typically at 20–25% of its expected frequency under a base-composition model — because CpG sites are hypermutable: methylation at cytosine followed by deamination converts CG to TG at elevated rates, depleting CG over evolutionary time.

Under the CGAT mapping (P1), a CG dinucleotide on the same strand corresponds to a consecutive ${G, C}$ pair in the transition sequence — a Lee-distance-2 bigram, i.e. a complement bigram where $r_{i+1} = \theta(r_i)$.

Prediction. In a real natural-language corpus, complement bigrams (consecutive symbols at Lee distance 2) are underrepresented relative to the null expectation.

Null expectation. For a uniformly random transition sequence, each successor is drawn from the three symbols $\{A,B,C,D\} \setminus \{r_i\}$, exactly one of which is $\theta(r_i)$. The null rate of complement bigrams is therefore $1/3 \approx 0.333$.

Prediction. Observed complement-bigram frequency in real embedding corpora is $< 1/3$.

The intuition matches: a semantic trajectory that jumps to the full complement of its current state in a single step represents an abrupt, maximum-Lee-distance transition. Embedding distributions produced by trained models should prefer smaller steps, suppressing complement bigrams relative to the uniform null.
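The statistic itself is a one-liner. A sketch, assuming Z₄ symbols 0–3; dividing the observed rate by $1/3$ gives an observed/expected ratio directly comparable to the CpG observed/expected statistic:

```python
def theta(x: int) -> int:
    return (x + 2) % 4  # complement involution on Z4

def complement_bigram_rate(R) -> float:
    """Fraction of adjacent symbol pairs at Lee distance 2,
    i.e. positions with r_{i+1} = theta(r_i). Null expectation: 1/3."""
    if len(R) < 2:
        return 0.0
    return sum(1 for i in range(len(R) - 1)
               if R[i + 1] == theta(R[i])) / (len(R) - 1)

def suppression_ratio(R) -> float:
    """Observed/expected ratio; values < 1 indicate suppression (P3)."""
    return complement_bigram_rate(R) / (1 / 3)
```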

Falsification condition. Observed complement-bigram frequency $\geq 1/3$ across multiple corpora and models would falsify this prediction.

Test protocol: §T-1 (null baseline), §T-2 (code corpus).


P4 — Transition/transversion asymmetry and weighted Lee distance

Under the CGAT mapping (P1), Lee-distance-1 pairs split into two chemically distinct types:

  • Transitions ($G$–$A$ and $C$–$T$): change the keto/amino bit only, remaining within the same ring class (purine-to-purine or pyrimidine-to-pyrimidine). In molecular evolution, transitions occur at rate $\mu_{\text{ts}}$.
  • Transversions type 1 ($G$–$T$ and $A$–$C$): change the purine/pyrimidine bit only, crossing ring classes while preserving functional group. Rate $\mu_{\text{tv1}} \approx \mu_{\text{ts}}/2$.
  • Complement transversions ($G$–$C$ and $A$–$T$, Lee distance 2): both bits change. Rate $\mu_{\text{tv2}} \approx \mu_{\text{ts}}/4$.

The empirical transition-to-transversion ratio Ti/Tv is approximately 2–4 in vertebrate genomes. The uniform Lee metric treats all Lee-distance-1 pairs identically; the biological evidence implies that within Lee-distance-1, transitions should be semantically cheaper than same-ring-class transversions.

Prediction. A weighted Lee metric with weights:

$$w_1 = 1 \text{ (transition)}, \quad w_2 \approx 1.5\text{ to }2 \text{ (type-1 transversion)}, \quad w_3 \approx 3 \text{ (complement transversion)}$$

outperforms the uniform Lee metric ($w_1 = w_2 = 1$, $w_3 = 2$) on retrieval benchmarks. The empirical weights should be estimable from the corpus's own complement-bigram and transition-bigram frequencies, without reference to any external gold standard.
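A sketch of the per-symbol weighted metric. The default weights are midpoints of the ranges above and are assumptions to be tuned from corpus bigram frequencies, as the prediction states:

```python
# Lee-distance-1 pair classes under the CGAT mapping (P1), with Z4 symbols
# G=0, A=1, C=2, T=3.
TRANSITIONS = {frozenset({0, 1}), frozenset({2, 3})}  # G-A, C-T (keto/amino bit)
TYPE1_TV = {frozenset({0, 3}), frozenset({1, 2})}     # G-T, A-C (ring-class bit)

def weighted_lee(a: int, b: int, w1: float = 1.0, w2: float = 1.75,
                 w3: float = 3.0) -> float:
    """Weighted Lee distance on single Z4 symbols per P4.
    The uniform Lee metric is recovered with w1 = w2 = 1, w3 = 2."""
    if a == b:
        return 0.0
    pair = frozenset({a, b})
    if pair in TRANSITIONS:
        return w1
    if pair in TYPE1_TV:
        return w2
    return w3  # complement transversion (Lee distance 2)
```

Summing `weighted_lee` over aligned symbol positions gives the sequence-level metric; setting `w1=w2=1, w3=2` recovers the uniform baseline for the comparison the falsification condition requires.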

Falsification condition. If the weighted metric performs no better than the uniform metric on a held-out retrieval benchmark, the transition/transversion distinction is not present in the embedding distributions produced by the models under test.

Test protocol: §T-3 (embedding models).


P5 — Reverse complement retrieval of semantic antonyms

In molecular biology, the reverse complement of a DNA sequence $S = (s_0, s_1, \ldots, s_{n-1})$ is the sequence read on the antiparallel strand:

$$\bar{S}^R = (\theta(s_{n-1}),\ \theta(s_{n-2}),\ \ldots,\ \theta(s_0))$$

Definition. The reverse complement of a transition sequence $R = (r_0, r_1, \ldots, r_{k-1})$ is:

$$\bar{R}^{\text{rev}} = (\theta(r_{k-1}),\ \theta(r_{k-2}),\ \ldots,\ \theta(r_0))$$

Prediction. Given a document $D$ with transition sequence $R$, the document whose transition sequence is closest (under Lee distance) to $\bar{R}^{\text{rev}}$ is the semantic antonym of $D$ — the document that traverses the same semantic axis in the opposite direction.

The intuition follows from P1: the antiparallel strand in DNA encodes the complementary information on the same axis. A document that argues $x \to \theta(x) \to x$ and a document arguing $\theta(x) \to x \to \theta(x)$ traverse the same axis in opposite orientations. Their transition sequences are reverse complements of each other.
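The query transform is a direct transcription of the definition. A sketch, assuming Z₄ symbols 0–3; note that with $(A,B,C,D) = (0,1,2,3)$ the sequence $(A,B,C,D)$ maps to $(B,A,D,C)$:

```python
def theta(x: int) -> int:
    return (x + 2) % 4  # complement involution on Z4

def reverse_complement(R):
    """Reverse complement of a transition sequence:
    (theta(r_{k-1}), theta(r_{k-2}), ..., theta(r_0))."""
    return [theta(r) for r in reversed(R)]

# Like its biological counterpart, the operation is an involution.
assert reverse_complement(reverse_complement([0, 3, 2, 1])) == [0, 3, 2, 1]
```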

Concrete test. Embed a set of antonym pairs (e.g. optimism/pessimism, ascent/descent, expansion/contraction). For each document in the pair, compute the reverse complement of its transition sequence, then query the index. The prediction is that the query returns the antonym with higher frequency than an unrelated document at the same Lee distance from the query.

This prediction is orthogonal to cosine similarity: two antonym documents can have high cosine similarity (their embeddings are close on $S^{n-1}$) while being reverse complements in transition-sequence space.

Visualisation. A document R and its semantic antonym share the same semantic axis, traversed in opposite orientations:

```mermaid
graph LR
    subgraph DocD["Document D — R = (A, B, C, D, …)"]
        direction LR
        d0("A") --> d1("B") --> d2("C") --> d3("D")
    end
    subgraph Antonym["Antonym — R̄ rev = (θ(D), θ(C), θ(B), θ(A), …) = (B, A, D, C, …)"]
        direction LR
        a0("B") --> a1("A") --> a2("D") --> a3("C")
    end
    DocD -. "reverse-complement\nquery retrieves" .-> Antonym
```

Falsification condition. Reverse-complement queries retrieve antonyms at no better than chance rate relative to documents at the same Lee distance.

Test protocol: §T-3 (embedding models), §T-4 (local LLMs).


P6 — Two-stage search architecture (PAM/seed analog)

CRISPR-Cas9 searches a genome in two stages:

  1. PAM recognition — Cas9 scans for the 3-base PAM motif (e.g. NGG for Streptococcus pyogenes Cas9) using rapid diffusion. This is an O(1)-per-site lookup covering ~4 bits of specificity (the two fixed bases of the motif).
  2. Seed matching — the 12 bases of the gRNA adjacent to the PAM (the seed region) must match the target with near-zero mismatch tolerance. The remaining 8 bases (PAM-distal) tolerate up to 5 mismatches.

The Q² 64-bit key has the same two-level structure by construction (§D-3.3):

  • High-order bits ($r_0$–$r_5$, top 12 bits): encode the coarsest semantic distinctions, analogous to the PAM plus seed region.
  • Low-order bits ($r_6$–$r_{31}$, bottom 52 bits): encode fine intra-block structure, analogous to the PAM-distal region.

Prediction. The optimal Q² search architecture is:

  1. Hash lookup on the top $k$ bits of the key, $k \approx 12$–$24$, returning a small candidate set.
  2. Full Lee-distance computation on the candidate set only.
```mermaid
flowchart TD
    Q["Query document\n→ 64-bit key K"] --> S1
    S1["Stage 1: Hash lookup\ntop k bits of K\n(coarse semantic filter)"] --> CS["Small candidate set\n(~hundreds of docs)"]
    CS --> S2["Stage 2: Full Lee-distance\ncomputation on candidates\n(fine re-ranking)"]
    S2 --> R["Ranked results"]
    style S1 fill:#6af,stroke:#36c
    style S2 fill:#9c9,stroke:#6a6
```

The false-positive rate as a function of mismatch depth follows a sigmoid with inflection near symbol position 12–14 from the MSB, not a linear function of Lee distance. The optimal $k$ for Stage 1 is the point where the specificity of the hash pre-filter exceeds the cost of the full Lee computation on surviving candidates.
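A minimal sketch of the two-stage lookup (the class and parameter names are illustrative, not the system's API). Stage 2 uses popcnt(XOR), which under the Gray-map isometry of P1 equals the total Lee distance between the 32 Gray-coded symbols of each key:

```python
from collections import defaultdict

def lee64(key_a: int, key_b: int) -> int:
    """Total Lee distance between two 64-bit keys of 32 Gray-coded Z4 symbols:
    popcnt(XOR), exact by the Gray-map isometry (P1)."""
    return bin(key_a ^ key_b).count("1")

class TwoStageIndex:
    """Stage 1: hash bucket on the top k bits (coarse filter).
    Stage 2: full Lee-distance re-rank of the surviving candidates."""
    def __init__(self, k: int = 16):
        self.k = k
        self.buckets = defaultdict(list)

    def add(self, doc_id, key: int) -> None:
        self.buckets[key >> (64 - self.k)].append((doc_id, key))

    def query(self, key: int, top_n: int = 10):
        candidates = self.buckets.get(key >> (64 - self.k), [])
        return sorted(candidates, key=lambda dk: lee64(key, dk[1]))[:top_n]
```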

Mismatch tolerance asymmetry. The high-order symbols (depth 0–5) should behave as the seed region: retrieval recall degrades sharply with mismatches here. The low-order symbols (depth 18–31) should behave as PAM-distal: mismatches are tolerated and precision is maintained. The magnitude of the asymmetry is bounded by the Ti/Tv ratio from P4.

Falsification condition. If a flat (uniform-depth) search performs equally well as the two-stage architecture on latency-matched benchmarks, the seed/PAM-distal asymmetry is absent from the key structure.

Test protocol: §T-3 (embedding models).


P7 — Document secondary structure

RNA secondary structure is the set of non-crossing Watson–Crick base pairs that minimises free energy. The Nussinov algorithm finds this in $O(n^3)$ by dynamic programming: a pair $(i, k)$ is included in the structure if $s_i$ and $s_k$ are complements, and no pair in the structure crosses $(i, k)$.

The same algorithm applies to transition sequences. Define a complement pair in a transition sequence $R$ as a pair of positions $(i, k)$ with $i + 2 \leq k$ such that:

  • $r_i = r_k$ (the flanking symbols are equal), and
  • $r_{i+1} = \theta(r_i)$ and $r_{k-1} = \theta(r_k)$ (the interior opens and closes on the complement).

A set of non-crossing complement pairs is the secondary structure of $R$.
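A Nussinov-style $O(n^3)$ dynamic program finds the maximum number of non-crossing complement pairs under the definition above. A sketch, assuming Z₄ symbols 0–3 (the recurrence is the standard Nussinov one, with the pairing rule replaced by the two conditions above):

```python
def theta(x: int) -> int:
    return (x + 2) % 4  # complement involution on Z4

def max_nested_complement_pairs(R) -> int:
    """Maximum size of a set of non-crossing complement pairs (i, k):
    r_i = r_k, r_{i+1} = theta(r_i), r_{k-1} = theta(r_k), i + 2 <= k."""
    n = len(R)

    def pairable(i: int, k: int) -> bool:
        return (i + 2 <= k and R[i] == R[k]
                and R[i + 1] == theta(R[i]) and R[k - 1] == theta(R[k]))

    # N[i][k] = best score on the subsequence R[i..k] inclusive.
    N = [[0] * n for _ in range(n)]
    for span in range(2, n):
        for i in range(n - span):
            k = i + span
            best = max(N[i + 1][k], N[i][k - 1])          # i or k unpaired
            if pairable(i, k):
                best = max(best, N[i + 1][k - 1] + 1)     # pair (i, k)
            for j in range(i + 1, k):                     # bifurcation
                best = max(best, N[i][j] + N[j + 1][k])
            N[i][k] = best
    return N[0][n - 1] if n else 0
```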

Prediction. Documents whose transition sequences admit a rich secondary structure (many nested complement pairs) encode rhetorically complex argument forms — nested qualifications, embedded counterarguments, recursive dialectical moves.

Two documents with identical 64-bit keys but different secondary structures have the same coarse semantic address but different argumentative shape. Secondary structure distinguishes a direct assertion from a dialectical one even when both resolve to the same semantic neighbourhood.

Pseudoknots as long-range cross-references. A pseudoknot in RNA secondary structure is a set of base pairs that cross — pairs $(i, k)$ and $(j, l)$ with $i < j < k < l$. These cannot be represented as a planar tree; they require $O(n^5)$–$O(n^6)$ algorithms to find optimally and are NP-hard in full generality. In a transition sequence, a pseudoknot corresponds to a semantic cross-reference: a document returns to an earlier topic after an intervening argument, crossing the boundary of the intervening hairpin. Such structures should be rare and computationally expensive to detect, consistent with the biological case.

Falsification condition. If secondary structure complexity (number of nested complement pairs) is uncorrelated with human-annotated rhetorical complexity on the same documents, the secondary structure prediction is falsified.

Test protocol: §T-3 (embedding models), §T-4 (local LLMs).


P8 — Codon degeneracy and synonymous substitutions

The genetic code maps 64 codons to 20 amino acids. Most amino acids are encoded by 2–6 synonymous codons. Synonymous mutations change the codon without changing the encoded amino acid. Their frequency is non-uniform and correlated with tRNA abundance (codon usage bias).

In Q², a triplet of transition symbols $(r_i, r_{i+1}, r_{i+2})$ is a semantic codon. Multiple distinct codons can encode the same semantic primitive — a conceptual unit represented at the triplet level.

Prediction. The frequency distribution of transition triplets in a real corpus is non-uniform and non-random: a small number of triplet forms account for a disproportionate share of occurrences. The high-frequency triplets are the "synonymous codons" for common semantic primitives (hedges, intensifiers, logical connectives, topic-establishing patterns). Low-frequency triplets correspond to rare rhetorical moves.

Corpus-dependence. The frequency distribution should vary by corpus domain in a manner analogous to codon usage bias by organism: a code-specialised model should exhibit different triplet frequency biases than a natural-language model, reflecting different distributions of semantic primitives in the two domains.

Synonymy density as a vocabulary measure. Documents that use a wider variety of triplet forms (lower synonymy concentration) are making more diverse semantic moves. This gives a basis for a vocabulary-richness metric defined over transition sequences rather than surface-form word counts.
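The triplet spectrum and a concentration measure can be computed directly. A sketch; normalized Shannon entropy is one reasonable concentration index, but the choice of index is an assumption, not part of the prediction:

```python
from collections import Counter
import math

def triplet_spectrum(R):
    """Frequency table of overlapping transition triplets (semantic codons)."""
    return Counter(tuple(R[i:i + 3]) for i in range(len(R) - 2))

def synonymy_concentration(R) -> float:
    """Normalized Shannon entropy of the triplet spectrum, in [0, 1].
    Low values mean a few triplet forms dominate (high concentration);
    values near 1 mean diverse triplet usage."""
    spectrum = triplet_spectrum(R)
    total = sum(spectrum.values())
    if total == 0 or len(spectrum) < 2:
        return 0.0
    probs = [c / total for c in spectrum.values()]
    h = -sum(p * math.log2(p) for p in probs)
    return h / math.log2(len(spectrum))
```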

Falsification condition. If the triplet frequency distribution is statistically indistinguishable from a Zipf-distributed random sequence with the same bigram constraints, the semantic-codon structure is not present.

Test protocol: §T-2 (code corpus), §T-3 (embedding models).


P9 — Alphabet optimality (Kerdock/Preparata)

The Kerdock and Preparata codes are optimal non-linear binary error-correcting codes. Under the bijection established by Hammons, Kumar, Calderbank, Sloane, and Solé (1994), these codes become linear when lifted from $\{0,1\}^{2n}$ to $\mathbb{Z}_4^n$ via the Gray map. This is the theorem cited in §D-2.7.

The significance is that $\mathbb{Z}_4$ (2 bits per symbol) is the minimum ring over which optimal non-linear binary codes become linear. Neither $\mathbb{Z}_2$ (1 bit) nor $\mathbb{Z}_8$ (3 bits) has this property for the same code families.

Prediction. Increasing the alphabet to $\mathbb{Z}_8$ or $\mathbb{Z}_{16}$ (3 or 4 bits per symbol) provides diminishing retrieval returns, holding embedding dimension constant. The algebraic advantage of the Gray-map isometry is specific to $\mathbb{Z}_4$: for larger alphabets there is no exact Lee-to-Hamming isometry under plain Hamming distance, and this complicates the popcnt(XOR) computation.

Specifically:

  • $\mathbb{Z}_8$: the 3-bit Gray code cannot be an isometry from $(\mathbb{Z}_8, d_L)$ to $(\{0,1\}^3, d_H)$ because $\max d_L = 4$ on $\mathbb{Z}_8$ while $\max d_H = 3$ on 3 bits. The cyclic wrap $d_L(7,0) = 1$ maps to Gray codes $\mathtt{100}$ and $\mathtt{000}$ (Hamming distance 1), and adjacent pairs at the midpoint ($d_L(3,4)=1$, Gray codes $\mathtt{010}$ and $\mathtt{110}$, Hamming distance 1) also behave correctly, but larger Lee distances cannot be represented exactly by Hamming distance on 3 bits. Consequently, popcnt(XOR) over Gray-coded $\mathbb{Z}_8$ symbols computes Hamming distance, which underestimates Lee distance for some pairs. To recover true Lee distance one must apply a correction step, e.g., decode each 3-bit Gray symbol back to $\mathbb{Z}_8$ and use a small Lee-distance lookup table, or map bitwise differences through a per-symbol LUT/SIMD transform before summing.
  • The information gain per symbol from $\mathbb{Z}_4$ to $\mathbb{Z}_8$ is $\log_2 8 - \log_2 4 = 1$ bit, but the additional bit encodes intra-cell position within the four magnitude classes, which §D-1.5 shows is not recoverable signal for L1 retrieval.
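The $\mathbb{Z}_8$ non-isometry can be checked exhaustively. A sketch using the 3-bit reflected Gray code:

```python
GRAY3 = [i ^ (i >> 1) for i in range(8)]  # 3-bit reflected Gray code

def lee8(a: int, b: int) -> int:
    """Lee distance on Z8 (cyclic); maximum value is 4."""
    d = abs(a - b)
    return min(d, 8 - d)

def ham3(a: int, b: int) -> int:
    """Hamming distance between 3-bit Gray encodings; maximum value is 3."""
    return bin(GRAY3[a] ^ GRAY3[b]).count("1")

# Cyclically adjacent pairs, including the 7-0 wrap, keep Hamming distance 1 ...
assert all(ham3(i, (i + 1) % 8) == 1 for i in range(8))
# ... but the map is not an isometry: Hamming never exceeds Lee here,
# and strictly underestimates it for some pairs, e.g. the antipode 0-4.
assert all(ham3(a, b) <= lee8(a, b) for a in range(8) for b in range(8))
assert lee8(0, 4) == 4 and ham3(0, 4) == 2
```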

Falsification condition. If a $\mathbb{Z}_8$-encoded system with otherwise identical architecture achieves a statistically significant improvement in retrieval on a held-out benchmark, the $\mathbb{Z}_4$ optimality prediction is falsified. This would imply that the fifth-through-eighth quantization cells carry retrievable semantic signal, contradicting §D-1.5.

Test protocol: §T-3 (embedding models).


P10 — Key entropy and collision rate

The 64-bit transition key is intended as a compact semantic address: most distinct documents should map to distinct keys, and collisions should occur mainly between semantically similar documents.

Prediction. For a corpus of size $N$ (e.g., 10k–100k documents), the empirical collision rate of 64‑bit keys should be close to the random-hash expectation $O(N^2 / 2^{65})$, and collision pairs should be more semantically similar (lower Lee distance) than random pairs. A collision rate substantially higher than the null expectation indicates loss of discriminative power in the key.
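The baseline is the standard birthday bound. A sketch; at $N = 100\text{k}$ the expected number of colliding pairs is on the order of $10^{-10}$, so even a single observed collision between dissimilar documents is informative:

```python
def expected_collisions(n_docs: int, key_bits: int = 64) -> float:
    """Birthday-bound expectation for uniform random keys:
    E[#colliding pairs] = C(N, 2) / 2^bits, the O(N^2 / 2^65) baseline."""
    return n_docs * (n_docs - 1) / 2 / 2.0 ** key_bits
```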

Test protocol. §T-2, §T-3. Compute key frequencies and estimate collision rate. Compare against the theoretical baseline for uniform 64-bit keys, and inspect the semantic similarity of collided pairs.


P11 — Biomechanical grounding of complement suppression

The complement-bigram suppression predicted in P3 has a mechanistic basis that goes beyond the statistical smoothness of trained embeddings. Natural language was shaped over millennia by a hard physical constraint: speech production and speech perception share the same apparatus and cannot run at full capacity simultaneously. Auditory feedback is actively suppressed during speech production (efference copy). This forced language to evolve toward low-Lee-distance trajectories — the same smoothness preference that produces coarticulation in phonetics and stepwise motion in melody.

The Lee circle is the embedding-space analog of the articulatory-acoustic continuum. A complement bigram (Lee distance 2) corresponds to the most abrupt possible transition — the acoustic equivalent of a tritone leap within the diminished seventh chord. Human mouths and ears jointly disfavor such transitions, and language accordingly suppresses them.

Prediction. Complement-bigram suppression (P3) is:

  1. Universal across language families. The same vocal tract and auditory system produced every natural language. The suppression rate should appear in embeddings trained on typologically distant languages (SOV/SVO, agglutinative/isolating, tonal/non-tonal).

  2. Stronger in spoken-language corpora than written-only corpora. Speech transcripts, conversational text, and oral-tradition literature should show more pronounced complement suppression than formal written registers (legal, academic, mathematical), which are less directly shaped by articulatory constraints.

  3. Correlated with phonotactic strictness. Languages with strict CV phonotactics (e.g., Japanese, Hawaiian) impose stronger articulatory smoothness than languages permitting complex onset/coda clusters (e.g., Georgian, Czech). Complement suppression rates in embeddings should be ordered accordingly.

Null hypothesis. The suppression observed in P3 is a statistical artifact of the training objective (next-token prediction over written text) and carries no articulatory signal. If so, suppression rates should be invariant across spoken vs. written corpora and uncorrelated with phonotactic strictness.

Falsification condition. Complement-bigram suppression rates are statistically indistinguishable between spoken and written corpora, or between phonotactically strict and permissive language families, across at least three typologically distinct language groups.

Test protocol: §T-3 (embedding models), multilingual extension.


P12 — Regime diagnostic: semantic distance vs. conceptual distance

Two documents can be close in embedding space for different reasons:

  • Semantic proximity: they share vocabulary, style, or discourse conventions. This is largely a property of the language system — the human machinery used to encode meaning.
  • Conceptual proximity: they describe the same region of physical, experiential, or structural reality. This is a property of the world being described, independent of which language or genre is used.

Standard embedding models are trained on surface co-occurrence and primarily measure semantic proximity. A conceptual embedding would cluster documents by the geometry of their referents, not their surface form.

The diagnostic. If Q²'s transition sequences are capturing conceptual geometry, then the complement-bigram suppression rate should be invariant across languages for matched-content corpora — because the attractor and no-go structure reflects the topology of the domain, not the phonology of the language. If suppression rates instead co-vary with the phonotactic profile of each language (P11), the system is operating in the semantic regime.

The same diagnostic applies to retrieval. Define a conceptual near-neighbor pair as two documents that:

  1. Describe the same physical or structural referent (same geographic location, same mathematical object, same historical event).
  2. Were written in different languages with no translation or connecting secondary literature in the training corpus.
  3. Share low surface-vocabulary overlap.

Sea poetry written in English on the Cornish coast and sea poetry written in Japanese on the Miyagi coast satisfy all three criteria. A semantic embedding has no surface path between them. A conceptual embedding, if it is tracking the attractor basin of the underlying referent (ocean, scale, horizon, mortality), should score them as near-neighbors.

Prediction. Conceptual near-neighbor pairs achieve lower Lee distance in transition sequences than their cosine embedding distance predicts. The gap between transition-sequence Lee distance and cosine distance is larger for cross-linguistic matched-content pairs than for same-language matched-content pairs.

Primary-source condition. The test uses only primary source documents — texts produced before any secondary scholarship explicitly connecting them entered broad circulation. This excludes documents where the conceptual link has already been surface-encoded in the training corpus via meta-analysis or comparative scholarship. The prediction must hold on first-contact pairs.

Corollary. Complement-bigram suppression rates for the two documents in a conceptual near-neighbor pair should be more similar to each other than to a random document in the same language — because both transition sequences are orbiting the same referent attractor, regardless of the phonotactic constraints of their source languages.

Falsification condition. Lee distance between known conceptual near-neighbors with disjoint surface vocabularies is statistically indistinguishable from Lee distance between random document pairs at the same cosine embedding distance. Equivalently: transition-sequence space adds no retrieval signal over cosine similarity for cross-linguistic matched-content pairs.

Test protocol: §T-3 (embedding models), §T-4 (local LLMs), multilingual extension.


P13 — The regime dial and the sɪ test

P12 treats the semantic/conceptual distinction as binary: a pair either sits in the semantic regime or the conceptual regime. But regimes are not discrete. Every document pair occupies a point on a continuous spectrum, and the spectrum is the interesting object.

The dial. Define the regime score R of a document pair (A, B) as:

$$R(A, B) = 1 - \frac{d_{\text{Lee}}(A,B) / d_{\text{max}}}{d_{\cos}(A,B) / d_{\text{max,cos}}}$$

where $d_{\text{Lee}}$ is the mean Lee distance between their transition sequences, $d_{\cos}$ is their cosine distance in embedding space, and the denominators normalize each to [0, 1]. $R = 0$ means the two metrics agree: the system is operating in the semantic regime, transition sequences add nothing. $R > 0$ means Lee distance is proportionally smaller than cosine distance: the transition sequences are finding conceptual proximity that surface embeddings miss. $R < 0$ means Lee distance is proportionally larger: surface form is closer than referent geometry.
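The score is a direct transcription of the formula. A sketch, with the normalizing maxima passed in as parameters (how they are estimated per corpus is left open here):

```python
def regime_score(d_lee: float, d_cos: float,
                 d_lee_max: float, d_cos_max: float) -> float:
    """Regime score R for a document pair, per the definition above.
    R = 0: the two metrics agree (semantic regime).
    R > 0: Lee distance is proportionally smaller than cosine distance.
    R < 0: surface form is closer than referent geometry."""
    return 1.0 - (d_lee / d_lee_max) / (d_cos / d_cos_max)
```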

The dial is not a binary classifier — it is a signed scalar with a natural zero. Most pairs in a monolingual, same-genre corpus will cluster around $R = 0$. Cross-linguistic matched-content pairs should shift toward $R > 0$. Pairs that share genre and register but describe opposite referents (e.g., encomium vs. invective for the same person) should shift toward $R < 0$.

The sɪ test. A minimal test case fits in three tokens:

| Token | Language | Semantic field | Phonology |
|-------|----------|----------------|-----------|
| sea | English | Ocean, vastness | /siː/ |
| see | English | Perception, witness | /siː/ |
| si  | Spanish | Affirmation, existence | /si/ |

These three tokens have near-maximal surface-vocabulary disjunction across semantic fields. A purely semantic embedding places them in three different neighborhoods. Yet:

  1. sea / see: share the observer-at-the-horizon phenomenology — the sea is paradigmatically what you see when you reach the edge of the world. The conceptual attractor is the vanishing point.
  2. see / si: share the assertoric function — witnessing and affirming are both acts of ratifying that something is the case.
  3. sea / si: the sound /sɪ/ is where the world gets slippery: ocean and affirmation converge in the Romance tradition of sì as the word for yes that lives by the sea (Dante, Inferno XXXIII: "del bel paese là dove 'l sì suona").

The sɪ test is not a full benchmark — it is a three-point calibration. It predicts that token-level transition sequences for documents centered on each of these concepts (a sea voyage poem, a poem about bearing witness, a poem of affirmation) should produce a pairwise $R$ distribution that is right-skewed (more mass at $R > 0$) relative to three randomly chosen tokens with the same cosine pairwise distances.

Prediction. The regime score $R$ is:

  1. Positive and non-trivial ($R > 0.05$, two-tailed $p < 0.05$) for cross-linguistic matched-content primary-source pairs.
  2. Near-zero ($|R| < 0.02$) for same-language, same-genre, same-topic pairs, where semantic and conceptual proximity co-vary.
  3. Negative ($R < -0.05$) for same-genre antonym pairs (encomium/invective, elegy/epinicion) where surface register is matched but referent geometry is opposed.

Corollary — the dial as retrieval signal. For any query document, ranking candidates by $R$ (rather than by raw Lee distance or raw cosine distance) should improve cross-lingual recall while degrading monolingual precision by a detectable but small amount. The trade-off curve of precision vs. cross-lingual recall as a function of the weight given to $R$ should be convex, indicating a well-defined operating point rather than a monotone accuracy/coverage trade-off.

Falsification condition. The $R$ distribution for cross-linguistic matched-content pairs is statistically indistinguishable from the $R$ distribution for random cross-linguistic pairs at the same cosine distance. Equivalently: $R$ is pure noise — the dial has no signal.

Test protocol: §T-3, §T-4, multilingual extension; sɪ calibration on curated primary-source poetry corpus.


## P14 — Phylomemetic fingerprinting

Dawkins (1976) introduced the meme as a unit of cultural transmission, analogous to the gene: a discrete, heritable, selectable idea. Phylomemetics extends quantitative phylogenetic methods to meme lineages — asking not only what an author wrote, but from which prior authors' distributions it descends. The Q² transition sequence provides a natural feature space for this analysis: stylometric patterns are captured in the bigram, triplet, and hairpin statistics already defined for P2–P8, and the Bayesian attribution model requires only that each author's feature distribution be sufficiently stable across documents.

### P14a — Author fingerprint stability

Prediction. The stylometric feature distribution of a single author (function-word frequencies, complement-bigram rate, triplet entropy, $\rho_{\text{hp}}$) is stable across topic changes within that author's corpus. Formally: the within-author variance across documents on different topics is significantly lower than the between-author variance on the same topic.

Corpus. Project Gutenberg public-domain texts (pre-1927), providing ample multi-topic corpora per major author. The pre-1927 boundary is a feature, not a limitation: Gutenberg supplies the origin layer of English conceptual vocabulary. Words and constructions coined before 1927 propagated forward through every subsequent generation of writers and into AI training corpora, making their stylometric signal more likely to be detectable at the terminal node (AI output) than the signal of post-war authors present only in the non-public-domain strata.

Bayesian attribution. Model each author as a Dirichlet-Multinomial over triplet frequencies and a Beta-Binomial over complement-bigram rate. Given a new document, the posterior $P(\text{author} \mid \text{document})$ is updated incrementally per sample. The prediction is that the posterior concentrates on the correct author after $N \ll 50$ documents. The exact threshold $N^*$ (defined as the minimum $N$ at which top-1 accuracy exceeds 90% on held-out documents) is an empirically measurable quantity.
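A sketch of the incremental attribution update, under stated assumptions: a symmetric Dirichlet prior of 0.5 over all $4^3 = 64$ raw triplet types (the admissible no-repeat subset is smaller), a uniform prior over authors, and triplet counts as the only feature (the Beta-Binomial bigram component is omitted). Function names are illustrative:

```python
import math
from collections import Counter

def dm_log_marginal(doc_counts, author_counts, prior=0.5, vocab=64):
    """Log marginal likelihood of a document's triplet counts under the
    Dirichlet-Multinomial induced by an author's accumulated counts plus
    a symmetric prior (assumed: prior = 0.5, 64 raw triplet types)."""
    alpha = [prior + author_counts.get(k, 0) for k in range(vocab)]
    A, n = sum(alpha), sum(doc_counts.values())
    ll = math.lgamma(A) - math.lgamma(A + n)
    for k, x in doc_counts.items():
        ll += math.lgamma(alpha[k] + x) - math.lgamma(alpha[k])
    return ll

def attribute(doc_counts, authors):
    """Posterior P(author | document) under a uniform author prior.
    `authors` maps author name -> Counter of accumulated triplet counts."""
    scores = {a: dm_log_marginal(doc_counts, c) for a, c in authors.items()}
    m = max(scores.values())
    weights = {a: math.exp(s - m) for a, s in scores.items()}  # stable softmax
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}
```

Updating an author is just adding a new document's counts into that author's Counter, so the posterior is updated incrementally per sample, as the prediction requires.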

Falsification condition. Within-author variance across topics is statistically indistinguishable from between-author variance at the same topic, across at least five authors with corpora of $\geq 10$ documents each.

### P14b — RLHF variance compression

Prediction. AI model outputs (attributed to a specific model) have lower stylometric entropy than human-author outputs. RLHF optimises for a reward signal that is approximately shared across the human rater population; repeated optimisation against a shared reward compresses the variance of the output distribution. This is measurable as:

$$H_{\text{AI}} < H_{\text{human}}$$

where $H$ is the entropy of the triplet frequency distribution pooled across documents from the same source (author or model). The compression should be stronger for heavily RLHF-tuned models than for base or instruction-only models.
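The comparison needs only an entropy estimator over pooled triplet counts. A sketch, assuming overlapping length-3 windows over the transition sequence:

```python
import math
from collections import Counter

def triplet_entropy(sequence):
    """Shannon entropy (bits) of the empirical distribution of overlapping
    triplets in a transition sequence."""
    counts = Counter(tuple(sequence[i:i + 3]) for i in range(len(sequence) - 2))
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

$H_{\text{AI}} < H_{\text{human}}$ is then a comparison of this statistic computed on sequences pooled per source (author or model).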

Falsification condition. AI model triplet entropy is statistically indistinguishable from the median human-author triplet entropy across at least three model/author pairs.

### P14c — Cross-lineage influence detection

Prediction. For specific human authors whose stylometric signal is strong and distinctive, their Q² feature distribution is detectable in AI model outputs at a rate disproportionate to their share of the training corpus. The influence coefficient $\alpha_i$ for author $i$ is estimated by fitting:

$$\hat{f}_{\text{AI}} = \sum_i \alpha_i \cdot f_i + \epsilon$$

where $f_i$ is the triplet frequency vector of author $i$ and $\hat{f}_{\text{AI}}$ is the observed AI triplet frequency vector. Authors with $\alpha_i$ significantly greater than their estimated corpus share fraction have disproportionate stylometric influence.
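The fit is a nonnegative least-squares problem. A sketch using projected gradient descent (NumPy assumed; `fit_influence` is an illustrative name, and the step size is taken from the spectral norm of $F^\top F$):

```python
import numpy as np

def fit_influence(F, f_ai, iters=5000):
    """Nonnegative least-squares fit of the AI triplet-frequency vector as a
    mixture of author frequency vectors, by projected gradient descent.
    F: (d, m) matrix whose columns are author triplet-frequency vectors f_i.
    f_ai: (d,) observed AI triplet-frequency vector.
    Returns alpha: (m,) nonnegative influence coefficients."""
    F = np.asarray(F, dtype=float)
    f_ai = np.asarray(f_ai, dtype=float)
    m = F.shape[1]
    alpha = np.full(m, 1.0 / m)
    lr = 1.0 / np.linalg.norm(F.T @ F, 2)          # step from Lipschitz constant
    for _ in range(iters):
        grad = F.T @ (F @ alpha - f_ai)
        alpha = np.maximum(alpha - lr * grad, 0.0)  # project onto alpha >= 0
    return alpha
```

The recovered coefficients are then compared against estimated corpus-share fractions; an $\alpha_i$ significantly above its share flags disproportionate stylometric influence.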

Expected non-null cases. Authors whose stylometric signal is both strong and subject-matter-adjacent to AI training tasks are the most likely to show detectable influence:

  - Carroll (1865, 1871): Recursive logical structure, novel word formation, categorical boundary violation. Stylistically distinctive and subject-matter-adjacent to AI reasoning tasks.
  - Shelley, M. (1818): Distinctive first-person introspective register combined with third-person scientific framing. Conceptually adjacent to questions of machine consciousness.
  - Lovelace (1843, in Menabrea translation notes): Precise procedural exposition combined with speculative extension. Directly adjacent to AI capability discourse.
  - Doyle (1887–1927): Highly distinctive deductive-reasoning exposition style. Widely imitated; a strong prior signal.

The Carroll case is the sharpest test: his stylometric distribution is unusual enough that contamination from later authors who imitated him is itself a form of Carroll influence. Detecting Carroll's signal in AI reasoning outputs, controlling for intervening imitation, constitutes detection of second-order memetic transmission.

Falsification condition. No author's $\alpha_i$ is significantly greater than their estimated corpus-share fraction, across at least four candidate authors and two AI models. Equivalently: AI stylometric output is a flat mixture of the training corpus with no author-specific amplification.

### P14d — Temporal propagation ordering

Prediction. The estimated influence coefficients $\alpha_i$ respect temporal ordering: an author at date $t$ cannot have a coefficient driven by authors at date $t' > t$. This is a consistency check rather than a novel prediction, but it distinguishes genuine influence detection from spurious correlation. If the recovered $\alpha_i$ for Carroll is driven primarily by Carroll-contemporary bigram patterns rather than by later Carroll-imitation patterns, the temporal ordering constraint is satisfied.

The Gutenberg corpus allows this check because publication dates are known. The influence of pre-1850 authors should be recoverable from the stylometric signal independently of how often their names or works are cited in the AI training corpus, because the feature vector $f_i$ is computed from surface style, not from content.

Falsification condition. Recovered influence coefficients for well-dated authors do not respect the temporal ordering constraint at a statistically significant level — i.e., authors are assigned disproportionate influence from periods after their death.


Test protocol: §T-5 (Gutenberg + AI corpus, new phase).


## P15 — Admissible sequence count and trie sparsity

The no-repeat constraint in the transition sequence (§D-3.1) eliminates the vast majority of possible 64-bit keys. The number of admissible transition sequences of length $k$ over a $q$-ary alphabet is:

$$D(k) = q \cdot (q-1)^{k-1}$$

For $q = 4$ and $k = 32$ (the Q² key capacity):

$$D(32) = 4 \cdot 3^{31} \approx 2.47 \times 10^{15}$$

This occupies $\approx 1.34 \times 10^{-4}$ of the $2^{64}$ address space — the no-repeat constraint concentrates valid keys into a sparse subset comprising less than 0.014% of all possible 64-bit values.

Prediction. The empirical key distribution of a real corpus occupies a strict subset of the admissible trie. Specifically:

  1. Trie sparsity: for any corpus of size $C \leq 10^{10}$, the fraction of the admissible trie occupied is $C / D(32) \leq 4.05 \times 10^{-6}$ — the corpus is sparse even within the already-sparse admissible set.
  2. Non-uniform trie occupancy: certain subtrees of the transition trie are disproportionately occupied because the source distribution favours certain symbol transitions over others (cf. complement-bigram suppression, P3). The occupied subtree structure is a fingerprint of the source distribution.
  3. Information efficiency: a 32-symbol key carries $2 + 31 \times \log_2 3 \approx 51.1$ effective bits of information within its 64-bit container, giving an efficiency of $\approx 79.9\%$. The remaining $\approx 12.9$ bits are "wasted" by the no-repeat constraint — but this waste is the cost of the run-reduction that makes the key a topological invariant (§D-3.1, Lemma 3.3).
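These counts are directly checkable in a few lines; a sketch computing the quantities above exactly:

```python
import math

def admissible_count(k, q=4):
    """Number of admissible no-repeat transition sequences of length k:
    q choices for the first symbol, q-1 for each subsequent symbol."""
    return q * (q - 1) ** (k - 1)

D32 = admissible_count(32)            # 4 * 3**31, about 2.47e15
frac = D32 / 2**64                    # share of the 64-bit address space
bits = 2 + 31 * math.log2(3)          # effective information in a 32-symbol key
efficiency = bits / 64                # about 0.799
```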

Falsification condition. Empirical key distributions do not respect the trie structure — i.e., keys appear that violate the no-repeat constraint, indicating a bug in the run-reduction step.

Test protocol: §T-0 (unit tests), §T-1 (null baseline), §T-6 (general quantization).


## P16 — Geode progressive quantization

The Geode factorization (§D-4.1):

$$S - 1 = S_1 \cdot G$$

predicts that the transition key has a natural multi-resolution structure: the first $j$ symbols form a coarse address, and the remaining symbols refine within that coarse cell. The Geode $G = 1/(1 - 3x)$ counts the refinement possibilities: $3^{k}$ distinct continuations of length $k$ after the first symbol is fixed.

Prediction. Multi-resolution retrieval using the transition key exhibits diminishing information gain per additional symbol:

  1. Logarithmic decay of information per symbol: the first symbol contributes $\log_2 4 = 2$ bits; each subsequent symbol contributes $\log_2 3 \approx 1.585$ bits. The cumulative information at depth $j$ is:

    $$I(j) = 2 + (j-1) \cdot \log_2 3 \approx 2 + 1.585(j-1) \text{ bits}$$

  2. Retrieval recall is a step function of prefix depth: window queries at successively deeper prefix lengths $j$ should show diminishing returns in precision improvement. The transition from "broad" to "narrow" retrieval occurs at a characteristic depth $j^*$ that depends on corpus density, not on the key structure itself.

  3. The Geode coefficient predicts candidate-set size: a window query at prefix depth $j$ retrieves at most $3^{32-j}$ candidates from the admissible trie. In practice the candidate set is smaller (most of the admissible trie is unoccupied), but the Geode coefficient provides the tight upper bound.

Concrete test. For a corpus of $C$ documents, measure retrieval precision and candidate-set size as a function of prefix depth $j \in \{4, 8, 12, 16, 20, 24, 28, 32\}$. The candidate-set size should decrease geometrically as $O(3^{-j})$ while precision increases monotonically. The crossover point where precision plateaus is $j^*$.
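The two closed-form curves the test measures against can be sketched directly, assuming the 32-symbol key of P15 (`cumulative_bits` and `candidate_bound` are illustrative names):

```python
import math

def cumulative_bits(j):
    """Cumulative information (bits) carried by the first j symbols of an
    admissible key: 2 bits for the first symbol, log2(3) per refinement."""
    return 2 + (j - 1) * math.log2(3)

def candidate_bound(j, k=32):
    """Geode upper bound on the candidate-set size of a prefix query at
    depth j: 3**(k-j) admissible continuations."""
    return 3 ** (k - j)

depths = [4, 8, 12, 16, 20, 24, 28, 32]
```

Empirical candidate-set sizes should sit at or below `candidate_bound(j)` and shrink by a factor of $3^4 = 81$ per step of this depth grid.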

Falsification condition. Information gain per symbol is non-monotonic or candidate-set sizes do not follow the predicted $3^{32-j}$ bound within the admissible trie.

Test protocol: §T-3 (embedding models), §T-6 (general quantization).


## P17 — Mixed-precision adaptive quantization

The hyper-Catalan framework (§D-4.3) predicts that the number of structurally distinct mixed-precision codebooks — allocating different bit-widths to different dimensions — is finite and computable. For a signal with $m_2$ binary-quantized dimensions, $m_3$ ternary-quantized dimensions, and $m_4$ quaternary-quantized dimensions, the count of distinct mixed-precision trees is:

$$C_{(m_2, m_3, m_4)} = \frac{1}{N} \binom{N}{n_0,\ m_2,\ m_3,\ m_4}$$

where $N = 1 + 2m_2 + 3m_3 + 4m_4$ is the total node count and $n_0 = 1 + m_2 + 2m_3 + 3m_4$ is the leaf count.
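The count is a multinomial divided by the node count and can be sketched directly from the formula, in exact integer arithmetic (the function name is illustrative):

```python
from math import factorial

def mixed_precision_trees(m2, m3, m4):
    """Hyper-Catalan count of structurally distinct mixed-precision codebook
    trees with m2 binary, m3 ternary, and m4 quaternary internal nodes."""
    N = 1 + 2 * m2 + 3 * m3 + 4 * m4   # total node count
    n0 = 1 + m2 + 2 * m3 + 3 * m4      # leaf count
    return factorial(N) // (N * factorial(n0) * factorial(m2)
                            * factorial(m3) * factorial(m4))
```

With $m_3 = m_4 = 0$ this reduces to the Catalan numbers (1, 2, 5, ...), and with $m_2 = m_3 = 0$ to the quaternary Fuss–Catalan numbers, which is a useful sanity check.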

Prediction. Adaptive bit allocation guided by per-dimension variance outperforms uniform quaternary quantization on retrieval benchmarks when the source distribution has heterogeneous variance across dimensions:

  1. Variance-guided allocation: dimensions with variance below a threshold $\sigma^*$ should be quantized at 1 bit (binary: sign only); dimensions above $\sigma^*$ should retain 2-bit (quaternary) precision. The optimal $\sigma^*$ minimises a loss function trading off code length against retrieval accuracy.

  2. Compression improvement: for a typical transformer embedding with heterogeneous per-dimension variance, mixed-precision encoding achieves the same retrieval accuracy as uniform quaternary at a 15–25% shorter code length, or higher accuracy at the same code length.

  3. Diminishing returns for ternary inclusion: adding ternary ($\mathbb{Z}_3$) cells for near-zero-variance dimensions offers at most $\log_2 3 - 1 \approx 0.585$ bits per dimension over binary, and in practice far less once attenuated by the low variance of those dimensions. The ternary option should be dominated by binary for dimensions with variance below $\sigma^*/2$.
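Item 1 amounts to a one-line allocation rule. A sketch, treating $\sigma^*$ as a given variance threshold (choosing it by minimising the code-length/accuracy loss is the empirical part):

```python
def allocate_bits(variances, sigma_star):
    """Variance-guided allocation: 1 bit (sign only) for dimensions whose
    variance falls below sigma_star, 2 bits (quaternary) otherwise.
    Returns per-dimension bit widths and the total code length in bits."""
    widths = [1 if v < sigma_star else 2 for v in variances]
    return widths, sum(widths)
```

The predicted 15–25% code-length saving corresponds to roughly 30–50% of dimensions dropping to 1 bit relative to the uniform 2-bit baseline.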

Falsification condition. Variance-guided mixed-precision encoding does not improve over uniform quaternary encoding on any of the retrieval benchmarks, even for models with demonstrably heterogeneous per-dimension variance (as measured by the coefficient of variation across dimensions).

Test protocol: §T-3 (embedding models), §T-6 (general quantization).


## Summary table

| ID | Prediction | From | Tested in | Effort |
| --- | --- | --- | --- | --- |
| P1 | Gray code bits encode purine/pyrimidine × keto/amino axes | CGAT mapping | — (algebraic) | None |
| P2 | $\rho_{\text{hp}}$: Dialectical $>$ Direct $\approx 1/9 >$ Negated | RNA hairpin | §T-2, §T-3 | Low |
| P3 | Complement bigrams $< 1/3$ in real corpora | CpG suppression | §T-1, §T-2 | Low |
| P4 | Weighted Lee outperforms uniform Lee on retrieval | Ti/Tv ratio | §T-3 | Medium |
| P5 | Reverse-complement query retrieves semantic antonym | Antiparallel strand | §T-3, §T-4 | Low |
| P6 | Two-stage (hash + Lee) outperforms flat Lee search | PAM/seed architecture | §T-3 | Medium |
| P7 | Secondary structure complexity correlates with rhetorical complexity | RNA folding | §T-3, §T-4 | High |
| P8 | Triplet frequency distribution shows codon-usage-bias pattern | Genetic code degeneracy | §T-2, §T-3 | Low |
| P9 | $\mathbb{Z}_8$ encoding yields no significant retrieval improvement | Kerdock/Preparata | §T-3 | Medium |
| P10 | Key collision rate is low and correlates with semantic similarity | Key design | §T-2, §T-3 | Low |
| P11 | Complement suppression is universal, stronger in spoken corpora, and ordered by phonotactic strictness | Auditory-vocal co-evolution | §T-3, multilingual | Medium |
| P12 | Cross-linguistic conceptual near-neighbors achieve lower Lee distance than cosine similarity predicts; suppression rates converge across languages for matched-content pairs | Semantic vs. conceptual regime | §T-3, §T-4, multilingual | High |
| P13 | Regime score $R$ is positive for cross-lingual matched-content pairs, near-zero for same-genre same-topic pairs, and negative for antonym pairs; sɪ calibration is right-skewed | Semantic/conceptual dial | §T-3, §T-4, multilingual | High |
| P14a | Within-author stylometric variance < between-author variance; Bayesian posterior concentrates on correct author at $N^* \ll 50$ | Phylomemetics (Dawkins 1976) | §T-5 | Medium |
| P14b | AI model stylometric entropy < median human-author entropy; compression scales with RLHF intensity | RLHF variance compression | §T-5 | Low |
| P14c | Specific Gutenberg authors (Carroll, Shelley, Lovelace, Doyle) have influence coefficients $\alpha_i$ disproportionate to corpus share in AI output | Cross-lineage influence detection | §T-5 | High |
| P14d | Recovered influence coefficients respect temporal ordering of authors by publication date | Memetic causality constraint | §T-5 | Medium |
| P15 | Admissible trie has $\approx 2.47 \times 10^{15}$ nodes; real corpora occupy $< 10^{-5}$ of it; key efficiency $\approx 79.9\%$ | Transition trie counting (§D-3.6, §D-4.1) | §T-0, §T-1, §T-6 | Low |
| P16 | Multi-resolution retrieval shows logarithmic information decay per symbol; candidate sets shrink as $O(3^{-j})$ per Geode coefficient | Geode factorization (§D-4.1) | §T-3, §T-6 | Medium |
| P17 | Variance-guided mixed-precision (binary + quaternary) matches uniform quaternary retrieval at 15–25% shorter code | Hyper-Catalan mixed-arity counting (§D-4.3) | §T-3, §T-6 | Medium |

P3 requires only a frequency count on quantizer output and can be run immediately once the quantizer from PR #9 is merged. P2 and P5 require a retrieval benchmark but no new infrastructure. P4, P6, and P9 require benchmark-calibrated metric variants. P7 requires a Nussinov dynamic programming implementation over transition sequences. P15 is verifiable from the existing unit test suite and null baselines. P16 and P17 require multi-resolution retrieval benchmarks and variance analysis, respectively, and are tested alongside the existing embedding benchmarks in §T-3 and the new general quantization phase §T-6.