RandomCoder-lab
diff --git a/‎tools/substrate_tokenizer/README.md‎
Lines changed: 105 additions & 0 deletions b/‎tools/substrate_tokenizer/README.md‎
Lines changed: 105 additions & 0 deletions
diff --git a/‎tools/substrate_tokenizer/build_vocab.py‎
Lines changed: 91 additions & 0 deletions b/‎tools/substrate_tokenizer/build_vocab.py‎
Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,105 @@
+# Substrate-aware tokenizer infrastructure
+
+Pipeline to train an LLM where the top-N most-common OMC canonical
+hashes get reserved single-token IDs in the vocabulary. The LLM
+then writes `<omc:tok_42>` (one token) instead of repeating the
+full source body across context.
+
+This is goal 4 of the OMC-as-content-addressed-AI plan. The
+infrastructure ships today; the actual fine-tune on a meaningful
+base model needs a GPU.
+
+## Pipeline
+
+```
+            corpus_collect.py        build_vocab.py           train_fine_tune.py
+                  │                        │                          │
+   .omc files ───>│                        │                          │
+                  ▼                        ▼                          ▼
+        canonical_hash_index.jsonl  hash_token_table.json   fine_tuned_model.pt
+                  │                        │                          │
+                  └────────────────────────┴──────────┐               │
+                                                      ▼               │
+                                          tokenizer_eval.py ◀─────────┘
+```
+
+| Stage | Script | Input | Output |
+|---|---|---|---|
+| 1 | `corpus_collect.py DIR` | Directory of `.omc` files | `canonical_hash_index.jsonl` — `{canonical_hash, fn_name, source, count}` |
+| 2 | `build_vocab.py --top N` | The index | `hash_token_table.json` — `{token_id: canonical_hash}` for the top N |
+| 3 | `train_fine_tune.py [args]` | The table + a base model | `fine_tuned_model.pt` |
+| 4 | `tokenizer_eval.py model.pt` | Trained model + test corpus | Token-compression metrics + completion quality |
+
+Stages 1–2 are fast (CPU, minutes). Stage 3 is multi-day on a GPU
+for a meaningful base model. Stage 4 measures the actual context-
+compression win.
+
+## What ships today
+
+**1. Corpus collector (CPU, fast)** — walks a directory, extracts
+every OMC fn, computes canonical hash, counts occurrences. Produces
+the JSONL index that downstream stages consume.
+
+**2. Vocabulary builder (CPU, fast)** — reads the index, picks the
+top-N canonical hashes by count, assigns them reserved token IDs
+in a `[unused_0..unused_N]` range that most tokenizers reserve for
+fine-tune extensions.
+
+**3. CPU sanity fine-tune** — a tiny GPT-2-shaped model (~10M
+params) trained on a synthetic corpus where the top-N hashes are
+overrepresented. Demonstrates the training loop works end-to-end
+in ~5 min on CPU. Not a useful model; just proves the pipeline.
+
+**4. Tokenizer evaluator (CPU)** — measures, for a given input
+text:
+  - Naive BPE token count
+  - Substrate-aware token count (hash-refs → 1 token each)
+  - Compression ratio
+
+Run on real workloads to project the win before committing to GPU.
+
+## What needs GPU
+
+The actual fine-tune on a real base model (Llama-3 8B, Mistral 7B,
+or even a smaller code-focused base like StarCoder2-3B) requires
+GPU time. Launch instructions for a single-node 1×A100 setup are in
+`gpu_fine_tune.md`. Cost estimate: ~$50–200 depending on base
+model size + dataset.
+
+## Honest expected wins
+
+For an agentic workload that heavily reuses standard library fns:
+- Naive BPE: each fn reference costs ~10–100 tokens
+- Substrate tokens: each fn reference costs 1 token
+- Realistic context-compression: 3–10× on code-heavy workloads
+- Worst case (no fn reuse): ~1× (no harm)
+
+The fine-tune teaches the model to EMIT `<omc:hash>` tokens when
+appropriate. Without that training, the LLM treats them as
+unfamiliar special tokens.
+
+## Why this is the long-term unlock
+
+If a major code-LLM is fine-tuned with substrate-aware tokens:
+- Every agentic system using that LLM gets cost/context savings
+  for free
+- The kernel becomes the universal back-end for canonical-hash
+  resolution
+- The transformerless-LM thesis gains its third validated
+  substrate component beyond CRT-PE + HBit-OOD + geodesic-attention
+
+This is the infrastructure that makes that fine-tune cheap to
+attempt. The hardest engineering (canonicalization, kernel, codec,
+geodesic) is done. The remaining work is dataset curation +
+hyperparameter sweeps — bounded compute, bounded time.
+
+## Files
+
+| File | Purpose |
+|---|---|
+| `corpus_collect.py` | Stage 1: walk OMC files, build canonical-hash index |
+| `build_vocab.py` | Stage 2: select top-N hashes, emit token table |
+| `train_fine_tune.py` | Stage 3: CPU sanity fine-tune (proves pipeline) |
+| `tokenizer_eval.py` | Stage 4: measure compression on real text |
+| `gpu_fine_tune.md` | Launch instructions for a meaningful GPU run |
+| `README.md` | This file |
@@ -0,0 +1,91 @@
+"""Stage 2 of the substrate-tokenizer pipeline.
+
+Read the canonical-hash index from stage 1; pick the top-N
+hashes (most-frequently occurring); emit a token table that
+maps reserved token IDs to canonical hashes.
+
+The output table assigns token IDs in a range that most BPE
+tokenizers reserve for fine-tune extensions:
+  - Llama / Mistral: [128000..128255] (256 reserved special tokens)
+  - GPT-2: [50257..50337] (similar range)
+  - StarCoder: configurable
+
+The mapping is:
+  token_id = base_token_id + index_in_top_N
+
+so the first popular canonical hash gets `base + 0`, the second gets
+`base + 1`, etc.
+
+Usage:
+    python3 build_vocab.py --top N [--base 128000] < canonical_hash_index.jsonl > hash_token_table.json
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--top", type=int, default=64,
+                        help="Number of canonical hashes to assign reserved tokens (default 64)")
+    parser.add_argument("--base", type=int, default=128000,
+                        help="First reserved token ID (default 128000 for Llama-style)")
+    parser.add_argument("--min-count", type=int, default=2,
+                        help="Skip hashes with fewer than this many occurrences (default 2)")
+    args = parser.parse_args()
+
+    entries: list[dict] = []
+    for line in sys.stdin:
+        line = line.strip()
+        if not line:
+            continue
+        try:
+            rec = json.loads(line)
+        except json.JSONDecodeError:
+            continue
+        if rec.get("count", 0) < args.min_count:
+            continue
+        entries.append(rec)
+    # Index is already sorted by count desc in corpus_collect.py output,
+    # but defensively re-sort.
+    entries.sort(key=lambda r: (-r["count"], r["canonical_hash"]))
+    top = entries[: args.top]
+
+    table = {
+        "base_token_id": args.base,
+        "vocab_size": len(top),
+        "source": "substrate_canonical_hashes",
+        "tokens": [
+            {
+                "token_id": args.base + i,
+                "canonical_hash": rec["canonical_hash"],
+                "fn_name": rec.get("fn_name", ""),
+                "count": rec.get("count", 0),
+                "size_bytes": rec.get("size_bytes", 0),
+                "origin_file": rec.get("first_origin_file", ""),
+            }
+            for i, rec in enumerate(top)
+        ],
+    }
+
+    json.dump(table, sys.stdout, indent=2)
+    sys.stdout.write("\n")
+
+    total_count_covered = sum(rec["count"] for rec in top)
+    total_count_all = sum(rec["count"] for rec in entries)
+    total_bytes_covered = sum(rec["size_bytes"] * rec["count"] for rec in top)
+    print(
+        f"build_vocab: assigned {len(top)} tokens "
+        f"[{args.base}..{args.base + len(top) - 1}] "
+        f"covering {total_count_covered}/{total_count_all} fn occurrences "
+        f"({100 * total_count_covered / max(total_count_all, 1):.1f}%, "
+        f"{total_bytes_covered:,} bytes of repeated source)",
+        file=sys.stderr,
+    )
+
+
+if __name__ == "__main__":
+    main()