Commit d03c16a

DeepNSM pipeline: 65 universal primes → SPO fingerprint → semantic edge validation
1 parent b127c14 commit d03c16a

1 file changed

Lines changed: 186 additions & 0 deletions

@@ -0,0 +1,186 @@
# DeepNSM → Crystal Encoder Pipeline

## The Five Pieces

| Piece | Location | What It Is | Status |
|-------|----------|-----------|--------|
| **Paper** | arXiv:2505.11764 | "Towards Universal Semantics with LLMs": 65 semantic primes, explications beat dictionary defs, 1B/8B models beat GPT-4o | Published May 2025 |
| **DeepNSM repo** | AdaWorldAPI/DeepNSM | Fork of paper's code: eval pipeline, prompts, train wrappers. Models on HF: `baartmar/DeepNSM-1B`, `baartmar/DeepNSM-8B`, `baartmar/nsm_dataset` | Working (Python, needs GPU) |
| **deepmsm repo** | AdaWorldAPI/deepmsm | Unrelated: "deep Markov State Modeling" for molecular dynamics. Wrong repo. Not ours. | **IGNORE** |
| **nsm.rs** | ladybug-rs `src/grammar/nsm.rs` (448 lines) | 65 NSM primitives as Rust constants. `NSMField` = 65-dim float vector. `from_text()` = keyword match → prime weights. `to_fingerprint_contribution()` = golden-ratio hash projection to fingerprint bits. | Working but primitive: keyword matching, not LLM explications |
| **deepnsm_integration.rs** | ladybug-rs `src/spo/deepnsm_integration.rs` (26KB) | Integration spec: explication → prime weights → role-bind → fingerprint. Training: use DeepNSM model. Inference: pure SIMD, no LLM. | Architecture defined, partially implemented |

## The Pipeline (how they connect)

```
TRAINING (one-time, needs GPU)
═══════════════════════════════

Text corpus ──► DeepNSM-8B model (HF: baartmar/DeepNSM-8B)
        │
        ▼
NSM Explications
  "HAPPY = a person feels something good,
   this person thinks: something good happened to me"
        │
        ▼
Parse into Prime Weight Vectors
  HAPPY → [I:0.3, FEEL:0.9, GOOD:0.8, THINK:0.6,
           HAPPEN:0.5, SOMETHING:0.7, ...]
        │
        ▼
Build Codebook: 1024 σ₃-distinct prime-weight clusters
  Each centroid = a "semantic kernel locus" in prime space
        │
        ▼
Project each centroid → 16K-bit fingerprint via nsm.rs
  (currently golden-ratio hash, should be learned projection)


INFERENCE (forever, no GPU, no LLM)
════════════════════════════════════

New text ──► Keyword → prime weights (nsm.rs `from_text()`)
    │
    ├──► Nearest codebook centroid (1024-way lookup)
    │      Cost: 1024 Hamming distances = ~13K cycles
    │
    ├──► SPO role binding: S⊗subject_primes + P⊗predicate_primes + O⊗object_primes
    │      Cost: 3 XOR operations
    │
    └──► Store as SPO fingerprint with typed halo + NARS truth
           Cost: 13 cycles per comparison
```
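
The nearest-centroid step in the inference path can be sketched as a linear scan over packed fingerprints. This is a minimal illustration, not the ladybug-rs code: the `Fingerprint` alias, `hamming`, and `nearest_centroid` are hypothetical names, and the 256-word layout simply assumes 16,384 bits packed into `u64`s.

```rust
/// Bits per fingerprint (the "16K-bit" figure from the diagram).
pub const FINGERPRINT_BITS: usize = 16_384;
/// Assumed storage layout: 16,384 bits packed into 256 u64 words.
pub const WORDS: usize = FINGERPRINT_BITS / 64;

/// Hypothetical fingerprint type for this sketch.
pub type Fingerprint = [u64; WORDS];

/// Hamming distance: XOR corresponding words, then count the set bits.
pub fn hamming(a: &Fingerprint, b: &Fingerprint) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}

/// Linear scan over the codebook. Returns the index and distance of the
/// closest centroid, or `None` if the codebook is empty. On ties the
/// earliest centroid wins.
pub fn nearest_centroid(
    query: &Fingerprint,
    codebook: &[Fingerprint],
) -> Option<(usize, u32)> {
    codebook
        .iter()
        .enumerate()
        .map(|(i, centroid)| (i, hamming(query, centroid)))
        .min_by_key(|&(_, distance)| distance)
}
```

With 1024 centroids this is 1024 calls to `hamming`. The cycle counts quoted in the diagram depend on SIMD popcount support and are not reproduced by this scalar sketch.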

## What's Missing (the gap between paper and production)

### Gap 1: `from_text()` is keyword matching, not explication

Current `nsm.rs`:
```rust
pub fn from_text(text: &str) -> NSMField {
    // Looks for "I", "WANT", "KNOW" etc. as literal words in text
    // This misses: "desire" → WANT, "understand" → KNOW, "beautiful" → GOOD+SEE
    // ... (body elided in this summary)
}
```

The paper's insight: DeepNSM-1B generates proper explications that decompose any word into primes. "Schadenfreude" → "this person feels something good because something bad happened to another person" → [FEEL:0.9, GOOD:0.7, BAD:0.6, HAPPEN:0.8, SOMEONE:0.5, BECAUSE:0.9].

**Fix:** Build a lookup table from DeepNSM-8B explications. Run DeepNSM on a vocabulary (e.g., 50K words from WordNet or a domain corpus). Parse each explication into prime weights. Store the result as the codebook. At inference time: tokenize → lookup → sum prime weights. No LLM needed.
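
A minimal sketch of the inference-time half of this fix (tokenize → lookup → sum), assuming the offline step has already produced one 65-dimensional weight vector per word. `Codebook`, `PrimeWeights`, and `encode` are hypothetical names for illustration, and the whitespace/punctuation tokenizer is a placeholder for whatever tokenizer the project actually uses.

```rust
use std::collections::HashMap;

/// Number of NSM primes, as stated in the paper summary above.
pub const NUM_PRIMES: usize = 65;

/// One weight per prime (hypothetical representation).
pub type PrimeWeights = [f32; NUM_PRIMES];

/// Hypothetical codebook: lowercase word -> prime weights parsed offline
/// from DeepNSM explications.
pub struct Codebook {
    entries: HashMap<String, PrimeWeights>,
}

impl Codebook {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Store the weights for a word, keyed case-insensitively.
    pub fn insert(&mut self, word: &str, weights: PrimeWeights) {
        self.entries.insert(word.to_lowercase(), weights);
    }

    /// Split on non-alphanumeric characters, look up each token, and sum
    /// the prime weights. Also returns how many tokens were not found, so
    /// the caller can fall back to keyword matching for those.
    pub fn encode(&self, text: &str) -> (PrimeWeights, usize) {
        let mut sum = [0.0f32; NUM_PRIMES];
        let mut out_of_vocabulary = 0;
        let tokens = text
            .split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty());
        for token in tokens {
            match self.entries.get(&token.to_lowercase()) {
                Some(weights) => {
                    for (acc, w) in sum.iter_mut().zip(weights.iter()) {
                        *acc += *w;
                    }
                }
                None => out_of_vocabulary += 1,
            }
        }
        (sum, out_of_vocabulary)
    }
}
```

Returning the out-of-vocabulary count alongside the vector is a design choice of this sketch; it keeps the keyword fallback decision outside the lookup itself.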

### Gap 2: `to_fingerprint_contribution()` uses golden-ratio hash

Current projection:
```rust
let seed = (i as u64).wrapping_mul(0x9E3779B97F4A7C15); // golden ratio
let bit_pos = (seed.wrapping_mul((j + 1) as u64) % FINGERPRINT_BITS as u64) as usize;
```

This is a pseudo-random, hash-based projection. It works in the same spirit as random projection (the Johnson–Lindenstrauss lemma bounds distance distortion for random linear maps with high probability), but it is not learned. The paper shows that DeepNSM models learn better-than-random representations.

**Fix:** Train the projection matrix. Use DeepNSM explication pairs with known similarity as training data. Optimize the 65→16K projection to maximize σ₃ separation in Hamming space. This is the codebook training pipeline from `codebook_training.rs` (Phase 2: distillation).
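
For reference, the two-line fragment above can be wrapped into a complete function. This is a sketch under stated assumptions, not the actual `to_fingerprint_contribution()`: the loop bounds, the `bits_per_prime` parameter, and the rule that a weight scales how many bits a prime sets are all hypothetical. It also uses `i + 1` for the seed, because with a zero-based `i` the original expression gives prime 0 a seed of 0, so every one of its bits lands on position 0.

```rust
pub const NUM_PRIMES: usize = 65;
pub const FINGERPRINT_BITS: usize = 16_384;
pub const WORDS: usize = FINGERPRINT_BITS / 64;

/// 2^64 divided by the golden ratio, the constant from the fragment above.
const GOLDEN: u64 = 0x9E37_79B9_7F4A_7C15;

/// Hash-based projection of prime weights onto fingerprint bits.
/// Assumption: a prime with weight `w` in [0, 1] sets about
/// `w * bits_per_prime` bits; weights outside that range are clamped.
pub fn project(weights: &[f32; NUM_PRIMES], bits_per_prime: usize) -> [u64; WORDS] {
    let mut fingerprint = [0u64; WORDS];
    for (i, &w) in weights.iter().enumerate() {
        let n_bits = (w.clamp(0.0, 1.0) * bits_per_prime as f32).round() as usize;
        // `i + 1` (a deviation from the fragment) avoids a zero seed for prime 0.
        let seed = ((i + 1) as u64).wrapping_mul(GOLDEN);
        for j in 0..n_bits {
            let bit_pos =
                (seed.wrapping_mul((j + 1) as u64) % FINGERPRINT_BITS as u64) as usize;
            fingerprint[bit_pos / 64] |= 1u64 << (bit_pos % 64);
        }
    }
    fingerprint
}
```

The function is deterministic, which is what makes it usable without training. A learned projection would replace the hash with trained parameters while keeping the same input and output shapes.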

### Gap 3: No SPO role binding for primes

Current: `NSMField` treats all 65 primes as a flat vector. But the paper's explications have structure: "a PERSON FEELS something GOOD" has Subject (PERSON), Predicate (FEEL), Object (GOOD).

**Fix:** Parse explications into SPO structure, then bind primes to roles:
```rust
pub fn explication_to_spo(explication: &str) -> SpoFingerprint {
    // Proposed: split the explication into agent / action / patient primes.
    let parsed = parse_nsm_explication(explication);
    let s_primes = NSMField::from_primes(&parsed.agent_primes);
    let p_primes = NSMField::from_primes(&parsed.action_primes);
    let o_primes = NSMField::from_primes(&parsed.patient_primes);

    // One fingerprint plane per role.
    SpoFingerprint {
        s_plane: s_primes.to_fingerprint(),
        p_plane: p_primes.to_fingerprint(),
        o_plane: o_primes.to_fingerprint(),
    }
}
```

Now each plane carries NSM-grounded meaning. The factorization (Faktorzerlegung) decomposes in terms of universal semantic primes, not arbitrary embedding dimensions.
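
The pipeline diagram describes role binding as three XOR operations. A minimal sketch of that idea, assuming each role has a fixed key fingerprint (the `Plane` alias, the `bind` function, and the notion of a role key are hypothetical here, not taken from ladybug-rs):

```rust
/// Assumed layout: 16,384 bits packed into 256 u64 words.
pub const WORDS: usize = 256;

/// Hypothetical type for one fingerprint plane.
pub type Plane = [u64; WORDS];

/// XOR binding: combine a role key with a filler fingerprint, word by word.
/// XOR is its own inverse, so applying `bind` again with the same role key
/// recovers the original filler.
pub fn bind(role: &Plane, filler: &Plane) -> Plane {
    let mut out = [0u64; WORDS];
    for (o, (r, f)) in out.iter_mut().zip(role.iter().zip(filler.iter())) {
        *o = r ^ f;
    }
    out
}
```

Binding the subject, predicate, and object planes with their respective keys is then three calls to `bind`, which matches the "3 XOR operations" cost in the diagram if each call is counted as one vector-wide XOR.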

### Gap 4: Evaluation metrics not wired

The paper defines three metrics:
1. **Legality Score**: (primes - molecules) / total_words. How pure is the explication?
2. **Substitutability Score**: log-probability tests. Does replacing the word with its explication preserve meaning?
3. **Cross-Translatability**: round-trip BLEU through low-resource languages

These should be wired into the codebook training pipeline as quality gates. An explication with legality < 0.7 doesn't enter the codebook.
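
The legality gate can be sketched directly from the formula as stated in this note. The function names and the handling of an empty explication are assumptions of the sketch; how words are classified as primes or molecules is out of scope here and would come from the DeepNSM evaluation code.

```rust
/// Threshold from this note: explications below it do not enter the codebook.
pub const LEGALITY_GATE: f32 = 0.7;

/// Legality as written above: (primes - molecules) / total_words.
/// Assumption: an empty explication scores 0.0 and is therefore rejected.
pub fn legality(primes: usize, molecules: usize, total_words: usize) -> f32 {
    if total_words == 0 {
        return 0.0;
    }
    (primes as f32 - molecules as f32) / total_words as f32
}

/// Quality gate: true if the explication may enter the codebook.
pub fn passes_legality_gate(primes: usize, molecules: usize, total_words: usize) -> bool {
    legality(primes, molecules, total_words) >= LEGALITY_GATE
}
```

For example, an explication of 10 words with 9 primes and 1 molecule scores (9 - 1) / 10 = 0.8 and passes, while 7 primes and 1 molecule scores 0.6 and is rejected.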

## How This Connects to the Crystal Encoder Strategy (prompt 05)

The crystal encoder strategy has three phases:
```
Phase 1: Jina parallel    (external API, 1024D dense, ~100ms)
Phase 2: Distillation     (Burn/Candle, SPO structural loss)
Phase 3: Pure crystal     (no external, codebook only, ~5μs)
```

NSM/DeepNSM provides a **fourth path that may be better than all three**:

```
Phase 0: NSM bootstrap (one-time)
  Run DeepNSM-8B on vocabulary → explications → prime weights → codebook
  No continuous distillation needed. The codebook IS the model.

Phase 3b: Pure NSM inference
  Text → tokenize → prime weight lookup → SPO role bind → fingerprint
  Cost: dictionary lookup + 3 XOR operations
  No Jina. No Burn/Candle. No transformer. No API.
  The 65 primes ARE the embedding dimensions.
```

This is even cheaper than the σ₃ codebook lookup because you skip the nearest-centroid step entirely. The prime weights directly encode meaning. The projection to fingerprint space is deterministic.

**The trade-off**: NSM keyword matching (`from_text()`) is cruder than transformer encoding. But:
- The paper shows DeepNSM-1B (1 billion params) beats GPT-4o on explication quality
- If you build the lookup table from DeepNSM explications, you get that quality at dictionary-lookup cost
- 65 primes is a very compact representation (65 floats = 260 bytes vs 1024D = 4KB, assuming 4-byte floats)
- The primes are **universal** (attested in 90+ languages), so no retraining is needed for new languages
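
The size comparison in the list above is simple arithmetic, shown here assuming 4-byte `f32` components for both vectors (the element type of the 1024D embedding is not stated in this note, so that part is an assumption). The constant names are hypothetical.

```rust
use std::mem::size_of;

/// 65 prime weights stored as f32.
pub const NSM_VECTOR_BYTES: usize = 65 * size_of::<f32>();

/// A 1024-dimensional dense embedding stored as f32 (assumed element type).
pub const DENSE_VECTOR_BYTES: usize = 1024 * size_of::<f32>();

/// A 16K-bit fingerprint, for comparison with the two float vectors.
pub const FINGERPRINT_BYTES: usize = 16_384 / 8;
```

Under these assumptions the prime vector is 260 bytes, the dense embedding is 4096 bytes, and a single fingerprint is 2048 bytes.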

## Connection to Edge Validation (from fork plan prompt 10, Action 2)

The edge validation in the lance-graph fork uses three paths:
- Path A: Grammar verbs (144 in ladybug)
- Path B: NSM primes (65 from this paper)
- Path C: Semantic kernel loci (1024 σ₃ codebook)

With DeepNSM integration:
```
Query: MATCH (a)-[:LOVES]->(b)

Path B resolution:
  "LOVES" → DeepNSM explication → "SOMEONE FEELS something very GOOD
            about another SOMEONE, this SOMEONE WANTS to be NEAR this other SOMEONE"
          → NSMField: [SOMEONE:1.0, FEEL:0.9, GOOD:0.8, WANT:0.7, NEAR:0.6]
          → fingerprint contribution
          → Hamming search in predicate plane for any edge with similar prime activation
          → Returns: LOVES, ADORES, CHERISHES, IS_FOND_OF (all with confidence scores)
```

This gives **semantic edge validation**, not string matching. A query for `:LOVES` finds `:ADORES` because they decompose into similar primes. This is what makes the Cypher bridge intelligent.
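
The final Hamming-search step of Path B could look roughly like the following. This is a sketch under assumptions: `similar_edges` is a hypothetical name, edge fingerprints are plain `Vec<u64>` bit vectors of the same length as the query, and the "confidence score" is taken to be one minus the normalized Hamming distance, which this note does not specify.

```rust
/// Hamming distance between two equal-length bit vectors packed as u64 words.
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Return the edge types whose predicate fingerprint is similar to the query,
/// most similar first. Similarity is 1.0 for identical fingerprints and 0.0
/// when every bit differs (an assumed scoring rule for this sketch).
pub fn similar_edges<'a>(
    query: &[u64],
    edges: &'a [(&'a str, Vec<u64>)],
    min_similarity: f32,
) -> Vec<(&'a str, f32)> {
    let total_bits = (query.len() * 64) as f32;
    let mut hits: Vec<(&str, f32)> = edges
        .iter()
        .map(|(name, fp)| (*name, 1.0 - hamming(query, fp) as f32 / total_bits))
        .filter(|&(_, similarity)| similarity >= min_similarity)
        .collect();
    // Sort descending by similarity.
    hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    hits
}
```

A query fingerprint for `:LOVES` would then return `:LOVES` itself plus any edge type whose fingerprint clears the threshold, which is the behaviour the `:ADORES` example describes.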

## Action Items

### Immediate
1. **Verify DeepNSM-8B model access**: Download from `baartmar/DeepNSM-8B` on HF. Test locally.
2. **Generate lookup table**: Run DeepNSM on WordNet core vocabulary (~5000 words). Parse explications into prime weights. Store as JSON/bincode codebook.
3. **Upgrade `from_text()`**: Replace keyword matching with codebook lookup. Fall back to keywords for OOV words.

### Short-term
4. **SPO role parsing**: Implement explication → Subject/Predicate/Object prime extraction. Use grammar module's `CausalityFlow` for agent/action/patient parsing.
5. **Quality gates**: Wire legality/substitutability/cross-translatability scores into codebook training.
6. **Learned projection**: Replace golden-ratio hash with trained 65→16K projection matrix optimized for σ₃ separation.

### Medium-term
7. **Full crystal encoder Phase 0**: NSM bootstrap path as alternative to Jina/Burn/Candle.
8. **Edge validation**: Wire NSM resolution into lance-graph fork's `resolve_edge_type()`.
9. **Multi-language**: Test with non-English input. NSM primes are language-universal, so the codebook should work across languages without retraining.

## Note on deepmsm

`AdaWorldAPI/deepmsm` is a **molecular dynamics** repo ("Progress in deep Markov State Modeling: Coarse graining and experimental data restraints"). It has nothing to do with Natural Semantic Metalanguage. The naming collision is unfortunate. This repo should be ignored for the current work.
