Skip to content

Commit 3612c13

Browse files
feat(research): web-native addressed LM — language executes over the knowledge web
A transformerless LM built entirely on the addressed knowledge web — no token-prediction model anywhere. In experiments/transformerless_lm/: - langexec.py — the EXECUTION oracle: a sentence "executes" by traversing its concept- addresses over hub-damped (PMI) weighted edges; resolves iff coherent (AUC 0.91–0.98 vs word-salad; survives a common-word steelman via PMI not raw edge count) - fluency.py — the HOW oracle: fluency learned from the web's own transitions (trigram, AUC 0.86 held-out); thinkloop.py — heal a faulting thought up the resolve gradient - realize.py — concepts → fluent grounded sentence (template / compose / hybrid) - engine.py / agent.py — recall|relate|decline router + tool-use (exact charcount, compute, cross-source bridge); agnostic frame-word detection (interrogative-context) - create.py — recombine DISTANT concepts across sources, 3-gate (coherence+support+meaning) - selfimprove.py — write-don't-train of self-verified thoughts; webmind.py — unified mind - compress_web.py + finalize_compressed.py + kdb shim — LOSSLESS 45% web compression (node→int interning + zlib passages), presented via SQLite views so readers are unchanged - ghost_fold.py — reconnect 17.7k ghost nodes; se_fold_i.py — interned science fold - extract_chat.py — fold the project's own dialogue so the web holds its genesis Method discipline throughout: ground in real code, gate every claim, report honestly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent a7796fa commit 3612c13

76 files changed

Lines changed: 8237 additions & 21 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

experiments/transformerless_lm/GENERATOR_PLAN.md

Lines changed: 99 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -513,3 +513,102 @@ of status), history:RECEIVED ≈ romance:MARRIED (entering a new state/union). H
513513
leakage (steps/clear inflect too); the marquee selection≈deduction did NOT cleanly surface (signature
514514
verbs are domain-characteristic, but cross-domain NEAREST pairs skew to generic process verbs). Process
515515
granularity is real but coarse. DISCIPLINE LESSON: don't rationalize a hand-list as "grammar" — derive it.
516+
517+
---
518+
519+
## BUILD-22 — mind.py: ONE agent you chat with + a CORPUS-DERIVED voice (2026-05-30)
520+
521+
Integrated the organs into one conversational agent (mind.py): char-skills + ConceptSpace
522+
(WHERE/GROUND/MEANING/REASON) + entity resolution + corpus-derived voice. Routes a message to: exact
523+
char-answer / connect-two-concepts (grounded path) / explore-X (discover hidden links) / what-is-X-like
524+
(meaning neighbors) / honest "I don't know". Keeps hubs (drop_hubs=False added to ConceptSpace) so you can
525+
ask about the protagonist. Transparent resolution (says when it maps your word to a near concept). Refuses
526+
out-of-corpus concepts honestly ("I don't know Rome").
527+
528+
AGNOSTIC-VOICE FIX (user: "derive the voice from the corpus" + flagged my interpretive return-templates):
529+
the output path injected MY interpretation ("Likely the same kind of thing", "Honestly, I think they're
530+
unrelated"). Same law as the word-lists, reaching the output. FIX = cvoice.py CorpusNarrator: the
531+
connective language between concepts is EXTRACTED from the evidence passage (link_span: the text spanning
532+
the two entities = the corpus's own words for their relationship), never templated. Authored tokens reduced
533+
to structural scaffolding (arrows, "[meaning=N]" label, section headers) — metadata, not asserted content;
534+
confidence = the derived number (dropped "I think"/"I'm reaching" register words). Also removed voice.py's
535+
hand-coded _PRON/_PLACEPREP type-inference lists (another lurking violation). Result: Darcy~Wickham,
536+
Pemberley→London→Longbourn etc. now rendered in the CORPUS'S words + derived scores. The voice is the text
537+
speaking. HONEST: some spans noisy when entities are far apart in a passage (head+tail truncation); edge's
538+
representative passage is whatever the graph stored (could pick richest span — refinement). Files: mind.py,
539+
cvoice.py; voice.py (BUILD-19 templated version) kept for the record.
540+
541+
---
542+
543+
## BUILD-23 — mind.py refinements: richest-span, multi-domain, persistence + a meaning-arbiter honesty fix (2026-05-30)
544+
545+
All three "addressing makes it possible" upgrades to ConceptSpace/mind:
546+
1. RICHEST-SPAN edges: adj[a][b] now stores the passage where a,b are CLOSEST (min char distance over all
547+
co-occurrences), so quotes are tight relational spans ("…Darcy nor Wickham…") not rambling lists.
548+
2. MULTI-DOMAIN (mind --multi): ConceptSpace.from_texts splits EACH book separately (passages never span a
549+
boundary), unions per-book entities, trains ONE shared meaning-space. 5 books, 131 concepts.
550+
3. PERSISTENCE: ConceptSpace.save/load (E.pt + space.json); mind caches to .mindcache/<label> →
551+
reload 0.01s vs ~15s build. (train_embedding gained return_matrix=True so E is serializable.)
552+
553+
HONESTY FIX (surfaced by multi-domain): cross-book generic shared tokens (e.g. "Sir") created spurious
554+
grounded paths — "Darcy → Sir → Holmes" with learned meaning = -0.04 (UNRELATED). The agent was trusting a
555+
token-path over its own meaning-judge. FIX: the MEANING-JUDGE is the arbiter — connect() requires a
556+
grounded path AND relatedness >= 0.2; below that it reports the path as a generic-token bridge with the
557+
score, not a connection. (Darcy~Holmes -0.04 → honestly rejected; Holmes~Watson 0.47 → real, quoted.)
558+
This is the integer-substrate corollary in action: addressing finds candidates (where), the learned float
559+
decides what's real (meaning). Files: connect.py (from_texts/save/load/richest-span/vec-method), mind.py.
560+
561+
---
562+
563+
## BUILD-24 — the DICTIONARY concept-web (every word addressed) + persistence (2026-05-30)
564+
565+
User reframed the vision: feed the system a LITERAL dictionary (and ultimately ALL knowledge fields) —
566+
not hardcoded, fed as DATA (the agnostic law forbids hardcoding lists in CODE, not feeding corpora).
567+
Correct: "meaning is use", so the dictionary is the agnostic way to address the whole language WITH
568+
meaning (each word's definition is its context; definitions cross-reference → a concept graph).
569+
570+
dictweb.py (Webster's 1913, 27.6MB, public domain, fetched to corpora/dict_webster.txt): structural parse
571+
(caps headword + Defn text; NO hardcoded vocabulary) → 88,519 single-word headwords → top 6,000 most-
572+
REFERENCED as concept nodes → 301,054 definitional edges (A→B iff B in A's definition; evidence = A's
573+
own entry) → embedding over ALL definitions (meaning cross-verified across the whole vocabulary).
574+
Connect any two concepts through definitional chains, grounded in the dictionary's own words:
575+
love → most → pride (+0.69) · fear → companion → lose → courage (+0.60) · water → fire (+0.56) ·
576+
light → mind (+0.21, "light which illumines... makes clear to the mind").
577+
Meaning-neighbors learned from definitions alone: force ~ energy/tension/friction/electricity/heat
578+
(physics cluster!); mind ~ understanding/intellect/faculty/perception/brain. The vision working: a general
579+
dot-connector over the whole language, agnostic (dictionary = data), grounded in definitions.
580+
581+
PERSISTENCE (user: "it doesn't save so it has this info on hand later, right?"): added DictWeb.save/load
582+
(E.pt + web.json; node-entries only). Build+save once (~3min, embedding 166s), reload 0.1s from a 17MB
583+
.dictcache (gitignored, regenerable). Mirrors ConceptSpace.save/load. Honest scope: 6,000-node subset of
584+
88k headwords for tractability; multi-word/abbrev headwords dropped; the cross-FIELD layer (textbooks on
585+
top of the dictionary backbone) is the next densification — the dictionary is the connective tissue the
586+
narrow field corpora (web.py, weak) were missing.
587+
588+
---
589+
590+
## BUILD-25 — the UNIFIED knowledge web: dictionary backbone + all fields, cross-verified, SAVED (2026-05-30)
591+
592+
The full vision realized at this scale (user: "implement the others and have them saved" + "entirety of
593+
human knowledge piece by piece, agnosticism as the ability to do so"). kweb.py / KnowledgeWeb fuses:
594+
* dictionary (Webster 88,519 headwords → 6,000 concept nodes + definitional edges + broad meaning), and
595+
* 8 fields (astronomy/detective/history/language/philosophy/physics/romance/science) layered ON the
596+
backbone: domain co-occurrence edges (within ~14 tokens) + domain meaning,
597+
into ONE shared addressed space → 1,660,806 edges, one cross-verified embedding (170s). Every concept is
598+
DEFINED (dict) AND USED (fields) — its address triangulated by every field that touches it. connect()
599+
runs grounded paths through definitions OR domain text, each hop tagged ⟨def⟩/⟨field⟩, and flags which
600+
fields it crosses. Real cross-field grounded results: star→period→time (+0.47, sci+def), war→justice
601+
(+0.39, history), motion→anything→matter (+0.33, philosophy+romance), fear→courage (+0.59), water→fire.
602+
Far stronger than web.py (fields-only, weak) — the dictionary IS the connective tissue. PERSISTED:
603+
KnowledgeWeb.save/load, .kwebcache (gitignored), reload ~1s vs ~3min build. Adding a new field = append a
604+
corpus + rebuild (or incrementally extend). The agnostic substrate scales to "all knowledge piece by
605+
piece" — each field a data layer, none hardcoded. HONEST scope: 6,000-node subset; field corpora are
606+
single public-domain books (not full textbooks); meaning is distributional (definitional+domain), not
607+
understanding. It's the structure of knowledge made navigable — a different cognition than a human mind
608+
(unbiased, exhaustive, grounded, but no leap-beyond-data, no qualia) — the complement to a reasoner.
609+
610+
- [growth] rebuild#1 over 75 texts/13 fields: 6,000 nodes, 6.95M edges (vs 1.66M at 8 books), saved .kwebcache 301MB. Multi-hop cross-field chains added (deep_connect): war→law→justice (religion+science), light→meaning→truth (science+religion). Honest: denser web = shallower paths + some generic bridges; broader not always sharper.
611+
612+
- [growth] NO-CAP accumulation (user): ingest --seq (unlimited sequential Gutenberg, subject auto-labeled from metadata); soft cap removed. HARD limit = DISK (3.7GB free; ~2.5GB held by old *.pt indexes NOT deleted). Disk guards baked into ingest (stop <1.2GB) + kweb (skip rebuild <1.5GB) so growth never crashes the box. Generic connections embraced as valid (human-like association). Loop continues seq-ingest + periodic rebuild until disk-guard or user stop.
613+
614+
- [growth] INCREMENTAL "stack then integrate" built (user insight): kweb.add_field appends a field's passages+edges to the saved web in O(new text) — no retrain (6,000 dict nodes are fixed, vectors stay valid). stack.py = incremental driver (tracks .kwebcache/stacked.json, adds only new library texts, re-saves). Full kweb --rebuild becomes RARE (only to refresh cross-verification/embedding). Growth cost: O(new) not O(total). Disk freed to 42G (user removed old *.pt indexes).
Lines changed: 69 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,69 @@
1+
# Good morning — the web-native MIND speaks, thinks, uses tools, and improved itself overnight
2+
3+
You asked me to use the new way-of-speaking to make the LM **improve itself without human intervention**,
4+
and have it **speaking, thinking, and using tools** by the time you wake up. It does. No token-prediction
5+
model anywhere — everything runs over the addressed knowledge web.
6+
7+
## Run these two things first
8+
9+
```bash
10+
cd ~/OMC/experiments/transformerless_lm
11+
python3 webmind.py --report # what it learned overnight (instant, reads the ledger + store)
12+
python3 webmind.py --ab # COLD vs WARM proof it improved itself (~3 min)
13+
python3 webmind.py --demo # ~90s: showcases all four capabilities in one run
14+
python3 webmind.py # talk to it yourself (REPL)
15+
python3 webmind.py --think "how do war and disease relate" # one multi-step reasoning chain
16+
```
17+
18+
## The proof it improved itself (cold vs warm A/B, measured)
19+
20+
Same 20 relate-questions, answered with **no memory** (cold — must re-derive each multi-hop bridge) vs
21+
with the **overnight-accumulated verified memory** (warm — instant recall of what it reasoned out):
22+
23+
```
24+
mean confidence : COLD 0.48 -> WARM 0.79
25+
instant recalls : COLD 0/20 -> WARM 16/20
26+
```
27+
28+
That delta *is* the self-improvement: bridges it once derived slowly, it now answers instantly and with
29+
higher confidence. (`webmind.py --ab` reproduces it.) The four questions that didn't change routed to
30+
single-topic recall both ways — shown honestly, not hidden.
31+
32+
## What got built (all new tonight, all tested)
33+
34+
| file | capability | what it does |
35+
|---|---|---|
36+
| `agent.py` | **TOOLS** | addresses each query to the right tool: `charcount` (exact letter-counting — what token-LLMs get wrong), `compute` (arithmetic), `relate` (cross-source bridge), `recall` (single topic), `memory` (recall a self-verified thought) |
37+
| `selfimprove.py` | **SELF-IMPROVEMENT** | the engine self-probes, reasons out multi-hop connections, **gates** them, and records the verified ones — then recalls them instantly. Ran all night. |
38+
| `webmind.py` | **THINK + unified** | multi-step reasoning chains (reasoning-as-navigation), plus `--demo` / `--report` / REPL |
39+
40+
The earlier pieces it builds on (also this project): `langexec.py` (resolve oracle), `fluency.py`
41+
(how-to-speak oracle), `thinkloop.py` (heal-to-coherence), `realize.py` (concepts→fluent sentence),
42+
`create.py` (bridge distant concepts), `engine.py` (recall/relate/decline router).
43+
44+
## How the self-improvement actually works (and its honest limits)
45+
46+
It's **write-don't-train of self-verified thoughts**. The loop:
47+
1. **probes** itself with concept pairs sampled from the web,
48+
2. **reasons** out a grounded multi-hop bridge between them,
49+
3. **gates** the result on THREE independent tests — *coherence* (the path resolves), *support* (the
50+
weakest hop is a real above-chance association, not a hub-walk), and *meaning* (the endpoints are
51+
semantically related, not a co-occurrence artifact like a translation pair),
52+
4. **records** the survivors in `derived.db` (a separate store — your 9 GB `knowledge.db` is only ever read),
53+
5. **recalls** them instantly next time instead of re-deriving.
54+
55+
**It does not invent facts.** Every stored thought is a *gated recombination of real, sourced passages*
56+
a connection no single source states, but every hop of which is grounded. It **closes connection-gaps**
57+
(verified recombinations) and **maps knowledge-gaps** (topics it's sparse on, logged honestly, never
58+
fabricated). Measured improvement on its 30-pair curriculum (avg 5.78-hop bridges): instant-recall hits
59+
**0 → 21** after one round; the store grows by autonomous exploration through the night.
60+
61+
Honest nits I'd fix next: the multi-step chain sometimes drifts into morphological variants (rays→ray);
62+
fluency is a trigram model (~0.86 separation — a small neural model lifts it); recall answers can be a
63+
full passage span (trimming to the key sentence is easy polish).
64+
65+
## Files & state
66+
- verified thoughts: `derived.db` (sqlite) · run log: `selfimprove_overnight.log` · metrics: `selfimprove_ledger.jsonl`
67+
- the overnight learner had a 7-hour budget; if it's still running you'll see it in `ps`. It's safe to
68+
stop (`pkill -f selfimprove.py`) — the store persists and `--report` reflects whatever it reached.
69+
- git untouched; nothing pushed.
Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
import sys, json; sys.path.insert(0,'.')
2+
from kdb import KnowledgeDB, load_embedding
3+
from navigator import coherent_path
4+
s,E,n,e=load_embedding(); k=KnowledgeDB("knowledge.db",s,E,n,e)
5+
d=k.db; purged=0
6+
for mid,cj in d.execute("SELECT id,concepts FROM memory WHERE kind='derived'").fetchall():
7+
cs=json.loads(cj)
8+
if len(cs)>=3 and not coherent_path(k,cs):
9+
d.execute("DELETE FROM memory WHERE id=?",(mid,)); purged+=1
10+
d.commit()
11+
print("purged",purged,"incoherent | remaining derived:",d.execute("SELECT COUNT(*) FROM memory WHERE kind='derived'").fetchone()[0])

0 commit comments

Comments
 (0)