|
| 1 | +# Data-Shape Etymology & the Mechanics of Magic — a savant mind-opener |
| 2 | + |
| 3 | +> READ BY: workspace-primer, convergence-architect, creative-explorer-savant, |
| 4 | +> truth-architect, family-codec-smith, dto-soa-savant, prior-art-savant, |
| 5 | +> any fresh session about to propose a new type, a new layer, or a new trick. |
| 6 | +> |
| 7 | +> Written 2026-07-02, at the close of the onebrc t0–t7 arc + the OGAR |
| 8 | +> provenance date-check. Every epiphany below is tagged FINDING (shipped, |
| 9 | +> dated, cite-able) or CONJECTURE (labeled honestly, with the probe that |
| 10 | +> would promote it). Companion capstone: `EPIPHANIES.md` |
| 11 | +> E-SEMANTIC-OS-CONVERGENCE-1 (the membrane law). This doc is the |
| 12 | +> *shape-and-trick* companion: where our shapes come from, and why our |
| 13 | +> magic works. |
| 14 | +
|
| 15 | +**Thesis in one line:** every data shape in this workspace is older than |
| 16 | +us, every name is a fossil record of a decision, and every trick that |
| 17 | +looks like magic is a mechanism that survived an audit — the savant |
| 18 | +discipline is to read the etymology before proposing the type, and to |
| 19 | +name the mechanism before trusting the trick. |
| 20 | + |
| 21 | +--- |
| 22 | + |
| 23 | +## 1. The name is the fossil record (FINDING) |
| 24 | + |
| 25 | +**OGAR = "Open Graph of Active Record."** A session recently asked, |
| 26 | +delighted: *"our V3 GUID looks almost like ActiveRecord folded into an |
| 27 | +ORM schema?"* — and the answer was already in the acronym. The date |
| 28 | +check proves it is provenance, not analogy: `ruff_ruby_spo` (the Rails |
| 29 | +ActiveRecord harvest frontend) is dated **2026-05-29**, a month before |
| 30 | +`ruff_python_spo` (Odoo, 2026-06-28); its test fixtures are literally |
| 31 | +Redmine models (`Project has_many :issues`, `acts_as_watchable`). The |
| 32 | +GUID *grew from* AR's polymorphic `(type, id)` — folding the type INTO |
| 33 | +the identifier (`classid | HEEL|HIP|TWIG | family | identity`) instead |
| 34 | +of storing it in a column beside it. Every later frontend (C++ 06-16, |
| 35 | +C# 06-26, Python 06-28) was fitted to the mold AR established. |
| 36 | + |
| 37 | +**The trick it bought:** *the key prerenders nodes with zero value |
| 38 | +decode* (OGAR P0). AR pays a column read to learn a row's type; the |
| 39 | +GUID's dash-groups are self-describing at sight. And AR's two classic |
| 40 | +wounds — polymorphic `(type, id)` breaking referential integrity, |
| 41 | +type-string renames corrupting data — are fixed structurally: the |
| 42 | +classid is an opaque u32 through a codebook, bit-math banned. |
| 43 | + |
| 44 | +**The discipline:** when a name puzzles you, check `git log --date`. |
| 45 | +Etymology answered an architecture question here in one grep. A name |
| 46 | +you can't trace is a name about to be reinvented under a second |
| 47 | +spelling — and duplicated meaning is the third membrane failure mode. |
| 48 | + |
| 49 | +## 2. Old shapes, new clothes — the winning shapes all predate us (FINDING) |
| 50 | + |
| 51 | +SoA is a Fortran-era shape. Morton order is 1966. The 64×64 tile is a |
| 52 | +GPU texture swizzle wearing an L1-cache costume. The kanban WAL is |
| 53 | +`acts_as_journalized` is double-entry bookkeeping. **"gridlake"** was |
| 54 | +coined in PR-X3's design doc before anything shipped; the carrier |
| 55 | +(`MultiLaneColumn`, PR-X1/#174) shipped first and waited for its name — |
| 56 | +ndarray's onebrc probe then called it verbatim "the gridlake carrier, |
| 57 | +not a hashmap." |
| 58 | + |
| 59 | +**The measured trick:** the onebrc sweet spot (E-1BRC-GRIDLAKE- |
| 60 | +SWEETSPOT-1) was not an algorithm. J(gridlake 4096, 1 lane, no |
| 61 | +registry) = 46.3 Mrows/s — equal to the best streamed topology while |
| 62 | +carrying a double-WAL — because 4096 cells ≈ 80 KB integer (16 KB as |
| 63 | +BF16, ndarray #227's proven VDPBF16PS tier) *fits the cache tier*. The |
| 64 | +same pipeline at 65536 cells ran at ~20. **The magic was the SIZE.** |
| 65 | +Architecture taxes are usually working-set mismatches wearing an |
| 66 | +architecture costume; measure the size before redesigning the design. |
| 67 | + |
| 68 | +## 3. A mask is a face over the data (FINDING) |
| 69 | + |
| 70 | +Etymology: *masque* — a face you put OVER something. A mask never |
| 71 | +mutates the data; it changes what you attend to. The workspace's mask |
| 72 | +family is one idea at four scales: |
| 73 | + |
| 74 | +- `cmpeq_mask` (ndarray SIMD): a compare becomes a `u32`/`u64` bitmap. |
| 75 | + Add Kernighan's `mask & (mask - 1)` walk + `trailing_zeros` and the |
| 76 | + bitmap becomes an **ordered event stream** — lane B turns a SIMD |
| 77 | + compare into a *parser*, no per-byte branch. |
| 78 | +- `FieldMask` (contract): one mask = RBAC = UI = render convergence |
| 79 | + (the semantic-OS grounding row) = the wikidata facet presence-bitmask. |
| 80 | +- `StepMask` (compiled templates): **vocabulary arrived before code** — |
| 81 | + it exists only in doctrine docs today. Watch this: etymology running |
| 82 | + ahead of implementation is how phantom types get "re-used" before |
| 83 | + they exist. |
| 84 | +- The Drain-side uniqueness assert (lane H): a `HashSet` over activated |
| 85 | + owner_idxs — a mask over *decisions*, catching the router-straddle |
| 86 | + bug class permanently. |
| 87 | + |
| 88 | +**The discipline:** attention is cheaper than mutation. If a proposal |
| 89 | +mutates shared state to express "which parts matter," ask whether a |
| 90 | +mask over unchanged state does it (cf. borrow-strategy: readonly store, |
| 91 | +owned microcopies, gated write-back). |
| 92 | + |
| 93 | +## 4. Phase is convention, not data — the deepest hat-trick (FINDING core, CONJECTURE edges) |
| 94 | + |
| 95 | +OGAR's perturbation canon decomposes a signal as *(exponent, location, |
| 96 | +phase, magnitude)* and stores **only magnitude** — exponent is the tier |
| 97 | +nibble, location the implied mantissa, phase a deterministic recurrence |
| 98 | +from the ADDRESS. Same address ⟹ same phase forever; roundtrip |
| 99 | +bit-exact; nothing transmitted. |
| 100 | + |
| 101 | +This is one instance of the workspace's deepest rule, which shows up in |
| 102 | +five costumes: |
| 103 | + |
| 104 | +| Costume | The derivable thing never stored/sent | |
| 105 | +|---|---| |
| 106 | +| deterministic phase | phase, from the address walk | |
| 107 | +| clear-by-undo (#227, lanes F–J) | table reset, from the dirty list | |
| 108 | +| codebook mint-once + `SlotMemo` | identity, after first sight — direct CAM writes | |
| 109 | +| `row_owner[i] == i` (lane I) | ownership, from index alignment — no message path | |
| 110 | +| zero-copy-to-tombstone (PR #477) | *everything* — no inter-mailbox handoff type exists | |
| 111 | + |
| 112 | +**The generalization: whatever is derivable from an address already in |
| 113 | +hand must be neither stored nor transmitted.** The GUID is the |
| 114 | +function's argument; storage exists only for what the function cannot |
| 115 | +compute. (CONJECTURE edge, per the substrate-is-ValueSchema probe: |
| 116 | +the *write path* itself — private-merge vs owned/witnessed — may be |
| 117 | +derivable from which tenants a classid's ValueSchema makes live. If |
| 118 | +that holds, even "which substrate" is phase, not data.) |
| 119 | + |
| 120 | +## 5. The witness is free; the boundary is not (FINDING) |
| 121 | + |
| 122 | +Measured across the whole onebrc arc: the kanban witness costs ~66 µs |
| 123 | +per card — **within noise**. Every real tax was a boundary: the Arc |
| 124 | +corpus copy at the actor membrane, blocking/async oversubscription, |
| 125 | +messages (which scale with *batches*, never with data or address-space |
| 126 | +size). The double-cast trick — one frozen `Arc` table cast whole to |
| 127 | +BOTH the ownership sink and the Lance sink — buys two WALs for one |
| 128 | +allocation: testimony at both ends, 312 messages total. |
| 129 | + |
| 130 | +Etymology: witness, from *testis* — the journal is **testimony**, not |
| 131 | +logging. And the dated harvest lesson: `op-journals` mirrors |
| 132 | +`journal.rb`'s *structure* perfectly and contains zero hits for |
| 133 | +aggregation/window/compaction — OpenProject's 15 years of operational |
| 134 | +journal wisdom (time-window coalescing = their independently-evolved |
| 135 | +ahead-firing batch writer; journal-table bloat = the failure mode we |
| 136 | +have not yet paid for) is not in the class graph. |
| 137 | +**The class graph transfers; the pain doesn't.** Structural harvests |
| 138 | +carry declarations; operational doctrine must be distilled by hand. |
| 139 | +(Open gap, still: a WAL retention/compaction doctrine note.) |
| 140 | + |
| 141 | +## 6. Resolve, don't carry — why ValueSchema beat ClassRoutingDTO (FINDING) |
| 142 | + |
| 143 | +DTO etymology: Fowler's *Data Transfer Object*, invented for expensive |
| 144 | +**remote** boundaries. The V3 substrate deleted its internal remote |
| 145 | +boundaries (nothing crosses mailboxes; envelopes are zero-copy to |
| 146 | +tombstone) — so inside the substrate there is nothing left for a DTO |
| 147 | +to do. When the dual-substrate question ("keep fast V2 for huge data, |
| 148 | +switched by classid") arrived, the answer was not a `ClassRoutingDTO` |
| 149 | +but the door that already existed: `ClassView::value_schema(classid)`, |
| 150 | +whose variants already ladder Bootstrap/Compressed (lean, no lifecycle |
| 151 | +tenants) → Cognitive/Full (witnessed). A **resolved** enum costs no |
| 152 | +`ENVELOPE_LAYOUT_VERSION`; a carried struct costs a membrane forever |
| 153 | +(E-V3-SUBSTRATE-IS-VALUESCHEMA-1). |
| 154 | + |
| 155 | +**The litmus:** *does this type travel, or is it re-derivable at the |
| 156 | +reader from an address already in hand?* Re-derivable → resolve it, |
| 157 | +never ship it. DTOs belong only at true membranes (the BBB, the wire, |
| 158 | +the lab REST surface) — and the classid's own iron rule is the same |
| 159 | +sentence from the other side: *pure address; the magic is what it |
| 160 | +resolves to.* |
| 161 | + |
| 162 | +## 7. Homonyms are leaky membranes; the compiler is the etymologist (FINDING) |
| 163 | + |
| 164 | +Two dated incidents, one mechanism: |
| 165 | + |
| 166 | +- The **"app" homonym** (canonical appid *byte*, hi half vs APP render |
| 167 | + *prefix*, lo half) generated an entire phantom cross-session conflict |
| 168 | + — R-1, three sessions, a RULING-NEEDED escalation — resolved by one |
| 169 | + line of existing canon nobody re-read. A word meaning two adjacent |
| 170 | + things is a membrane with a hole in it. |
| 171 | +- The **hardcoded 32** in lane B: `U8x32`'s *name* leaked into a stride |
| 172 | + literal (`array_chunks::<u8, 32>`), silently pinning an AVX-512 build |
| 173 | + to ymm half-width. The fix was to make the name resolve again: |
| 174 | + `array_chunks::<u8, { SimdByte::LANES }>` — the width is now a claim |
| 175 | + the compiler re-checks every build, on every target. |
| 176 | + |
| 177 | +**The discipline:** an inline number or name is a claim that rots; a |
| 178 | +dispatched symbol is a claim under permanent audit. When two ledgers |
| 179 | +seem to disagree, grep for the homonym before escalating — and when a |
| 180 | +constant appears twice, make one of them derive from the other. |
| 181 | + |
| 182 | +## 8. The hat-trick test — magic must name its mechanism (FINDING) |
| 183 | + |
| 184 | +Every real trick above is mechanical and auditable: deterministic phase |
| 185 | +names its recurrence, mint-once names its memo, the mask walk names |
| 186 | +Kernighan, the double-cast names its `Arc`. The anti-pattern is the |
| 187 | +trick with **hidden state**: v1 setters silently writing bits that v2 |
| 188 | +reclaimed — caught FIVE times in one sprint (I-LEGACY-API-FEATURE-GATED) |
| 189 | +— the same function name performing *different magic* depending on a |
| 190 | +feature flag the caller can't see. That is not a trick; that is a bug |
| 191 | +wearing a cape. |
| 192 | + |
| 193 | +The capstone's sharpening states the same law for membranes: *"a |
| 194 | +membrane without a build-failing tripwire is prose."* The unified |
| 195 | +savant test, applicable to every proposal in this workspace: |
| 196 | + |
| 197 | +> **Name the mechanism, or name the fuse. A trick that can't name its |
| 198 | +> mechanism is a bug; a boundary that can't name its fuse is a wish.** |
| 199 | +
|
| 200 | +--- |
| 201 | + |
| 202 | +## The litmus battery (carry these) |
| 203 | + |
| 204 | +1. Puzzled by a name? `git log --date` before you theorize. (§1) |
| 205 | +2. Architecture tax? Measure the working-set size first. (§2) |
| 206 | +3. Mutating to express relevance? Try a mask. (§3) |
| 207 | +4. Derivable from an address in hand? Never store, never send. (§4) |
| 208 | +5. Adding a witness? It's ~free. Adding a boundary? That's the bill. (§5) |
| 209 | +6. New type that travels? Prove it can't be resolved instead. (§6) |
| 210 | +7. Two ledgers disagree? Hunt the homonym. Inline literal? Dispatch it. (§7) |
| 211 | +8. Impressed by a trick? Make it name its mechanism. (§8) |
| 212 | + |
| 213 | +*Cross-refs:* E-SEMANTIC-OS-CONVERGENCE-1 (membrane law), |
| 214 | +E-1BRC-* arc (all measurements), E-V3-SUBSTRATE-IS-VALUESCHEMA-1, |
| 215 | +`crates/onebrc-probe/{FINDINGS,COMMENTARY}.md`, OGAR `CLAUDE.md` P0 + |
| 216 | +perturbation canon, ndarray `.claude/knowledge/pr-x1-design.md` + |
| 217 | +`guid-prefix-shape-routing.md`, `docs/architecture/soa-three-tier-model.md`. |
0 commit comments