| 1 | +# DeepNSM → Crystal Encoder Pipeline |
| 2 | + |
| 3 | +## The Five Pieces |
| 4 | + |
| 5 | +| Piece | Location | What It Is | Status | |
| 6 | +|-------|----------|-----------|--------| |
| 7 | +| **Paper** | arXiv:2505.11764 | "Towards Universal Semantics with LLMs" — 65 semantic primes, explications beat dictionary defs, 1B/8B models beat GPT-4o | Published May 2025 | |
| 8 | +| **DeepNSM repo** | AdaWorldAPI/DeepNSM | Fork of paper's code: eval pipeline, prompts, train wrappers. Models on HF: `baartmar/DeepNSM-1B`, `baartmar/DeepNSM-8B`, `baartmar/nsm_dataset` | Working (Python, needs GPU) | |
| 9 | +| **deepmsm repo** | AdaWorldAPI/deepmsm | Unrelated — "deep Markov State Modeling" for molecular dynamics. Wrong repo. Not ours. | **IGNORE** | |
| 10 | +| **nsm.rs** | ladybug-rs `src/grammar/nsm.rs` (448 lines) | 65 NSM primitives as Rust constants. `NSMField` = 65-dim float vector. `from_text()` = keyword match → prime weights. `to_fingerprint_contribution()` = golden-ratio hash projection to fingerprint bits. | Working but primitive — keyword matching, not LLM explications | |
| 11 | +| **deepnsm_integration.rs** | ladybug-rs `src/spo/deepnsm_integration.rs` (26KB) | Integration spec: explication → prime weights → role-bind → fingerprint. Training: use DeepNSM model. Inference: pure SIMD, no LLM. | Architecture defined, partially implemented | |
| 12 | + |
| 13 | +## The Pipeline (how they connect) |
| 14 | + |
| 15 | +``` |
| 16 | + TRAINING (one-time, needs GPU) |
| 17 | + ═══════════════════════════════ |
| 18 | + |
| 19 | +Text corpus ──► DeepNSM-8B model (HF: baartmar/DeepNSM-8B) |
| 20 | + │ |
| 21 | + ▼ |
| 22 | + NSM Explications |
| 23 | + "HAPPY = a person feels something good, |
| 24 | + this person thinks: something good happened to me" |
| 25 | + │ |
| 26 | + ▼ |
| 27 | + Parse into Prime Weight Vectors |
| 28 | + HAPPY → [I:0.3, FEEL:0.9, GOOD:0.8, THINK:0.6, |
| 29 | + HAPPEN:0.5, SOMETHING:0.7, ...] |
| 30 | + │ |
| 31 | + ▼ |
| 32 | + Build Codebook: 1024 σ₃-distinct prime-weight clusters |
| 33 | + Each centroid = a "semantic kernel locus" in prime space |
| 34 | + │ |
| 35 | + ▼ |
| 36 | + Project each centroid → 16K-bit fingerprint via nsm.rs |
| 37 | + (currently golden-ratio hash, should be learned projection) |
| 38 | + |
| 39 | + |
| 40 | + INFERENCE (forever, no GPU, no LLM) |
| 41 | + ════════════════════════════════════ |
| 42 | + |
| 43 | +New text ──► Keyword → prime weights (nsm.rs `from_text()`) |
| 44 | + │ |
| 45 | + ├──► Nearest codebook centroid (1024-way lookup) |
| 46 | + │ Cost: 1024 Hamming distances × ~13 cycles each ≈ 13K cycles |
| 47 | + │ |
| 48 | + ├──► SPO role binding: S⊗subject_primes + P⊗predicate_primes + O⊗object_primes |
| 49 | + │ Cost: 3 XOR operations |
| 50 | + │ |
| 51 | + └──► Store as SPO fingerprint with typed halo + NARS truth |
| 52 | + Cost: 13 cycles per comparison |
| 53 | +``` |
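The nearest-centroid step in the inference path can be sketched as a linear scan over the codebook. This is a minimal sketch, not the ladybug-rs implementation: it assumes "16K-bit" means 16,384 bits stored as 256 `u64` words, and the names `Fingerprint`, `hamming`, and `nearest_centroid` are hypothetical.

```rust
/// Number of 64-bit words in a fingerprint (assumes 16K = 16,384 bits).
pub const FP_WORDS: usize = 256;

/// A fingerprint stored as 256 u64 words (assumed layout).
pub type Fingerprint = [u64; FP_WORDS];

/// Hamming distance: XOR corresponding words and count the differing bits.
pub fn hamming(a: &Fingerprint, b: &Fingerprint) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}

/// Linear scan over the codebook. Returns (index, distance) of the closest
/// centroid, or None if the codebook is empty. Ties go to the first entry.
pub fn nearest_centroid(
    query: &Fingerprint,
    codebook: &[Fingerprint],
) -> Option<(usize, u32)> {
    codebook
        .iter()
        .enumerate()
        .map(|(i, centroid)| (i, hamming(query, centroid)))
        .min_by_key(|&(_, distance)| distance)
}
```

With a 1024-entry codebook this performs 1024 XOR-and-popcount passes, which is the step the diagram estimates at roughly 13K cycles. The sketch itself makes no claim about reaching that figure.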
| 54 | + |
| 55 | +## What's Missing (the gap between paper and production) |
| 56 | + |
| 57 | +### Gap 1: `from_text()` is keyword matching, not explication |
| 58 | + |
| 59 | +Current `nsm.rs`: |
| 60 | +```rust |
| 61 | +pub fn from_text(text: &str) -> NSMField { |
| 62 | + // Looks for "I", "WANT", "KNOW" etc. as literal words in text |
| 63 | + // This misses: "desire" → WANT, "understand" → KNOW, "beautiful" → GOOD+SEE |
| 64 | +} |
| 65 | +``` |
| 66 | + |
| 67 | +The paper's insight: DeepNSM-1B generates proper explications that decompose ANY word into primes. "Schadenfreude" → "this person feels something good because something bad happened to another person" → [FEEL:0.9, GOOD:0.7, BAD:0.6, HAPPEN:0.8, SOMEONE:0.5, BECAUSE:0.9]. |
| 68 | + |
| 69 | +**Fix:** Build a lookup table from DeepNSM-8B explications. Run DeepNSM on vocabulary (e.g., 50K words from WordNet or domain corpus). Parse each explication into prime weights. Store as the codebook. At inference time, tokenize → lookup → sum prime weights. No LLM needed. |
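The tokenize → lookup → sum step could look roughly like the following. It is a sketch under stated assumptions: the codebook is taken to be a map from lowercase word to a sparse list of `(prime index, weight)` pairs, and `PrimeCodebook` and `field_from_text` are hypothetical names, not the existing `nsm.rs` API.

```rust
use std::collections::HashMap;

/// One weight per NSM prime.
pub const NUM_PRIMES: usize = 65;

/// Hypothetical codebook: lowercase word -> sparse (prime index, weight) list,
/// as it might be produced offline from DeepNSM explications.
pub type PrimeCodebook = HashMap<String, Vec<(usize, f32)>>;

/// Split on non-alphanumeric characters, look each token up in the codebook,
/// and sum the prime weights. Out-of-vocabulary tokens are skipped here; a
/// keyword fallback could be slotted in at that point.
pub fn field_from_text(text: &str, codebook: &PrimeCodebook) -> [f32; NUM_PRIMES] {
    let mut field = [0.0f32; NUM_PRIMES];
    let tokens = text
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty());
    for token in tokens {
        if let Some(weights) = codebook.get(&token.to_lowercase()) {
            for &(prime, weight) in weights {
                // Ignore malformed entries that point outside the 65 primes.
                if prime < NUM_PRIMES {
                    field[prime] += weight;
                }
            }
        }
    }
    field
}
```

Summation means repeated words add up, so a caller may want to normalize the field before projecting it to a fingerprint.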
| 70 | + |
| 71 | +### Gap 2: `to_fingerprint_contribution()` uses golden-ratio hash |
| 72 | + |
| 73 | +Current projection: |
| 74 | +```rust |
| 75 | +let seed = (i as u64).wrapping_mul(0x9E3779B97F4A7C15); // golden ratio |
| 76 | +let bit_pos = (seed.wrapping_mul((j + 1) as u64) % FINGERPRINT_BITS as u64) as usize; |
| 77 | +``` |
| 78 | + |
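| 79 | +This is a hash-based random projection. It works in the sense that random projections approximately preserve relative distances with high probability (the intuition behind the Johnson-Lindenstrauss lemma, which is stated for random linear maps rather than bit hashes), but it is not learned. The paper shows that DeepNSM models learn better-than-random representations. |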
| 80 | + |
| 81 | +**Fix:** Train the projection matrix. Use DeepNSM explication pairs with known similarity as training data. Optimize the 65→16K projection to maximize σ₃ separation in Hamming space. This is the codebook training pipeline from `codebook_training.rs` (Phase 2: distillation). |
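To make the baseline concrete, the hash projection and the Hamming distance a learned projection would try to improve can be sketched as below. Only the two hashing lines come from the snippet above; the rest is assumed for illustration: `FINGERPRINT_BITS` is taken to be 16,384, `i` to be the prime index, `j` to range over `BITS_PER_PRIME` scaled by the weight, and `project` and `hamming` are hypothetical names.

```rust
pub const NUM_PRIMES: usize = 65;
/// Assumed value of "16K".
pub const FINGERPRINT_BITS: usize = 16_384;
pub const FP_WORDS: usize = FINGERPRINT_BITS / 64;
/// Assumed: number of bit positions set by a prime with weight 1.0.
pub const BITS_PER_PRIME: usize = 32;

/// Project a 65-dim weight vector to a bit fingerprint. Prime `i` sets up to
/// BITS_PER_PRIME positions, scaled by its weight clamped to [0, 1]
/// (the scaling rule is an assumption of this sketch).
pub fn project(field: &[f32; NUM_PRIMES]) -> [u64; FP_WORDS] {
    let mut fp = [0u64; FP_WORDS];
    for (i, &w) in field.iter().enumerate() {
        let n_bits = (w.clamp(0.0, 1.0) * BITS_PER_PRIME as f32).round() as usize;
        // Note: prime index 0 yields seed 0, so all of its bits land on position 0.
        let seed = (i as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15); // golden ratio
        for j in 0..n_bits {
            let bit_pos =
                (seed.wrapping_mul((j + 1) as u64) % FINGERPRINT_BITS as u64) as usize;
            fp[bit_pos / 64] |= 1u64 << (bit_pos % 64);
        }
    }
    fp
}

/// Hamming distance between two fingerprints.
pub fn hamming(a: &[u64; FP_WORDS], b: &[u64; FP_WORDS]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}
```

Under these assumptions the modulus is a power of two, so only the low bits of the product survive and any two `(i, j)` pairs with the same `i * (j + 1)` map to the same bit. That is one reason a learned or better-mixed projection could separate primes more cleanly.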
| 82 | + |
| 83 | +### Gap 3: No SPO role binding for primes |
| 84 | + |
| 85 | +Current: NSMField treats all 65 primes as a flat vector. But the paper's explications have structure: "a PERSON FEELS something GOOD" has Subject (PERSON), Predicate (FEEL), Object (GOOD). |
| 86 | + |
| 87 | +**Fix:** Parse explications into SPO structure, then bind primes to roles: |
| 88 | +```rust |
| 89 | +pub fn explication_to_spo(explication: &str) -> SpoFingerprint { |
| 90 | + let parsed = parse_nsm_explication(explication); |
| 91 | + let s_primes = NSMField::from_primes(&parsed.agent_primes); |
| 92 | + let p_primes = NSMField::from_primes(&parsed.action_primes); |
| 93 | + let o_primes = NSMField::from_primes(&parsed.patient_primes); |
| 94 | + |
| 95 | + SpoFingerprint { |
| 96 | + s_plane: s_primes.to_fingerprint(), |
| 97 | + p_plane: p_primes.to_fingerprint(), |
| 98 | + o_plane: o_primes.to_fingerprint(), |
| 99 | + } |
| 100 | +} |
| 101 | +``` |
| 102 | + |
| 103 | +Now each plane carries NSM-grounded meaning. The Faktorzerlegung decomposes in terms of universal semantic primes, not arbitrary embedding dimensions. |
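The `S⊗`, `P⊗`, `O⊗` binding from the pipeline diagram can be illustrated with XOR, which is its own inverse. The sketch below is an assumption-laden illustration rather than the ladybug-rs code: role vectors are generated with SplitMix64 from a seed, a plane is taken to be 16,384 bits, and `Plane`, `role_vector`, and `bind` are hypothetical names.

```rust
/// 16,384 bits as 256 u64 words (assumed plane size).
pub const FP_WORDS: usize = 256;
pub type Plane = [u64; FP_WORDS];

/// One SplitMix64 step: advances the state and returns a mixed 64-bit value.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Deterministic pseudo-random role vector (e.g. one seed each for S, P, O).
pub fn role_vector(seed: u64) -> Plane {
    let mut state = seed;
    let mut plane = [0u64; FP_WORDS];
    for word in plane.iter_mut() {
        *word = splitmix64(&mut state);
    }
    plane
}

/// XOR binding. Applying it twice with the same role returns the filler,
/// so the same function also unbinds.
pub fn bind(filler: &Plane, role: &Plane) -> Plane {
    let mut out = *filler;
    for (o, r) in out.iter_mut().zip(role.iter()) {
        *o ^= *r;
    }
    out
}
```

Binding each of the three planes with its own role vector is one XOR pass per plane, matching the "3 XOR operations" in the diagram. Unbinding with the wrong role yields an unrelated bit pattern, not the original filler.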
| 104 | + |
| 105 | +### Gap 4: Evaluation metrics not wired |
| 106 | + |
| 107 | +The paper defines three metrics: |
| 108 | +1. **Legality Score**: (primes - molecules) / total_words — how pure is the explication? |
| 109 | +2. **Substitutability Score**: log-probability tests — does replacing the word with its explication preserve meaning? |
| 110 | +3. **Cross-Translatability**: round-trip BLEU through low-resource languages |
| 111 | + |
| 112 | +These should be wired into the codebook training pipeline as quality gates. An explication with legality < 0.7 doesn't enter the codebook. |
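A legality gate along these lines could be a few lines of code. The sketch below uses the formula exactly as summarized in the list above and the 0.7 threshold from the text; how primes and molecules are counted is left to the caller, and `legality_score`, `passes_legality_gate`, and `MIN_LEGALITY` are hypothetical names.

```rust
/// Gate threshold taken from the text above.
pub const MIN_LEGALITY: f32 = 0.7;

/// Legality as summarized above: (primes - molecules) / total_words.
/// Returns None for an empty explication. The score can be negative when
/// molecules outnumber primes.
pub fn legality_score(primes: usize, molecules: usize, total_words: usize) -> Option<f32> {
    if total_words == 0 {
        return None;
    }
    Some((primes as f32 - molecules as f32) / total_words as f32)
}

/// True if the explication may enter the codebook.
/// Empty explications are rejected.
pub fn passes_legality_gate(primes: usize, molecules: usize, total_words: usize) -> bool {
    legality_score(primes, molecules, total_words)
        .map_or(false, |score| score >= MIN_LEGALITY)
}
```

For example, an explication of 10 words with 9 primes and 1 molecule scores 0.8 and passes, while one with 6 primes and 2 molecules scores 0.4 and is rejected.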
| 113 | + |
| 114 | +## How This Connects to the Crystal Encoder Strategy (prompt 05) |
| 115 | + |
| 116 | +The crystal encoder strategy has three phases: |
| 117 | +``` |
| 118 | +Phase 1: Jina parallel (external API, 1024D dense, ~100ms) |
| 119 | +Phase 2: Distillation (Burn/Candle, SPO structural loss) |
| 120 | +Phase 3: Pure crystal (no external, codebook only, ~5μs) |
| 121 | +``` |
| 122 | + |
| 123 | +NSM/DeepNSM provides a **fourth path that may be better than all three**: |
| 124 | + |
| 125 | +``` |
| 126 | +Phase 0: NSM bootstrap (one-time) |
| 127 | + Run DeepNSM-8B on vocabulary → explications → prime weights → codebook |
| 128 | + No continuous distillation needed. The codebook IS the model. |
| 129 | + |
| 130 | +Phase 3b: Pure NSM inference |
| 131 | + Text → tokenize → prime weight lookup → SPO role bind → fingerprint |
| 132 | + Cost: dictionary lookup + 3 XOR operations |
| 133 | + No Jina. No Burn/Candle. No transformer. No API. |
| 134 | + The 65 primes ARE the embedding dimensions. |
| 135 | +``` |
| 136 | + |
| 137 | +This is even cheaper than the σ₃ codebook lookup because you skip the nearest-centroid step entirely. The prime weights directly encode meaning. The projection to fingerprint space is deterministic. |
| 138 | + |
| 139 | +**The trade-off**: NSM keyword matching (`from_text()`) is cruder than transformer encoding. But: |
| 140 | +- The paper shows DeepNSM-1B (1 billion params) beats GPT-4o on explication quality |
| 141 | +- If you build the lookup table from DeepNSM explications, you get that quality at dictionary-lookup cost |
| 142 | +- 65 primes is a compact representation (65 `f32` values = 260 bytes vs a 1024D `f32` embedding = 4 KB) |
| 143 | +- The primes are **universal** (attested in 90+ languages) — no retraining for new languages |
| 144 | + |
| 145 | +## Connection to Edge Validation (from fork plan prompt 10, Action 2) |
| 146 | + |
| 147 | +The edge validation in the lance-graph fork uses three paths: |
| 148 | +- Path A: Grammar verbs (144 in ladybug) |
| 149 | +- Path B: NSM primes (65 from this paper) |
| 150 | +- Path C: Semantic kernel loci (1024 σ₃ codebook) |
| 151 | + |
| 152 | +With DeepNSM integration: |
| 153 | +``` |
| 154 | +Query: MATCH (a)-[:LOVES]->(b) |
| 155 | + |
| 156 | +Path B resolution: |
| 157 | + "LOVES" → DeepNSM explication → "SOMEONE FEELS something very GOOD |
| 158 | + about another SOMEONE, this SOMEONE WANTS to be NEAR this other SOMEONE" |
| 159 | + → NSMField: [SOMEONE:1.0, FEEL:0.9, GOOD:0.8, WANT:0.7, NEAR:0.6] |
| 160 | + → fingerprint contribution |
| 161 | + → Hamming search in predicate plane for any edge with similar prime activation |
| 162 | + → Returns: LOVES, ADORES, CHERISHES, IS_FOND_OF (all with confidence scores) |
| 163 | +``` |
| 164 | + |
| 165 | +This gives **semantic edge validation**, not string matching. A query for `:LOVES` finds `:ADORES` because they decompose into similar primes. This is what makes the Cypher bridge intelligent. |
| 166 | + |
| 167 | +## Action Items |
| 168 | + |
| 169 | +### Immediate |
| 170 | +1. **Verify DeepNSM-8B model access**: Download from `baartmar/DeepNSM-8B` on HF. Test locally. |
| 171 | +2. **Generate lookup table**: Run DeepNSM on WordNet core vocabulary (~5000 words). Parse explications into prime weights. Store as JSON/bincode codebook. |
| 172 | +3. **Upgrade `from_text()`**: Replace keyword matching with codebook lookup. Fall back to keywords for OOV words. |
| 173 | + |
| 174 | +### Short-term |
| 175 | +4. **SPO role parsing**: Implement explication → Subject/Predicate/Object prime extraction. Use grammar module's `CausalityFlow` for agent/action/patient parsing. |
| 176 | +5. **Quality gates**: Wire legality/substitutability/cross-translatability scores into codebook training. |
| 177 | +6. **Learned projection**: Replace golden-ratio hash with trained 65→16K projection matrix optimized for σ₃ separation. |
| 178 | + |
| 179 | +### Medium-term |
| 180 | +7. **Full crystal encoder Phase 0**: NSM bootstrap path as alternative to Jina/Burn/Candle. |
| 181 | +8. **Edge validation**: Wire NSM resolution into lance-graph fork's `resolve_edge_type()`. |
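| 182 | +9. **Multi-language**: Test with non-English input. NSM primes are language-universal, so the 65 prime dimensions and the projection should carry over without retraining; the word-to-prime lookup table, however, is keyed by vocabulary and would need entries for each new language. |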
| 183 | + |
| 184 | +## Note on deepmsm |
| 185 | + |
| 186 | +`AdaWorldAPI/deepmsm` is a **molecular dynamics** repo ("Progress in deep Markov State Modeling: Coarse graining and experimental data restraints"). It has nothing to do with Natural Semantic Metalanguage. The naming collision is unfortunate. This repo should be ignored for the current work. |