|
| 1 | +# Entity Wiki Pages |
| 2 | + |
| 3 | +> ⚠️ **Requires the entity-extraction companion PRs — not yet merged into OB1 `main`.** This recipe reads `public.entities`, `public.thought_entities`, and `public.edges`. Those tables are introduced by the in-flight entity-extraction schema + worker PRs (tracking: [#197](https://github.com/NateBJones-Projects/OB1/pull/197) schema, [#199](https://github.com/NateBJones-Projects/OB1/pull/199) worker). On the current `main` branch those tables do not exist and every query in `generate-wiki.mjs` will fail with `relation "public.entities" does not exist`. Do not try to install this recipe until both companion PRs are merged. See [Prerequisites](#prerequisites) for details. |
| 4 | +
|
| 5 | +> Auto-generate per-entity markdown wiki pages by aggregating every thought linked to a person, project, topic, organization, tool, or place — then synthesizing a structured narrative with an LLM. |
| 6 | +
|
| 7 | +## What It Does |
| 8 | + |
| 9 | +Turns your scattered atomic thoughts about "Alan" or "ExoCortex" or "PostgreSQL" into a coherent wiki page. For each entity in your knowledge graph, this recipe: |
| 10 | + |
| 11 | +1. Gathers every linked thought (via `thought_entities`) plus typed edges to other entities (via `edges`, skipping raw co-mention noise). |
| 12 | +2. Optionally expands with semantic search if you enable embeddings. |
| 13 | +3. Calls any OpenAI-compatible Chat Completions endpoint (OpenRouter by default) to synthesize a Summary / Key Facts / Timeline / Relationships / Open Questions page. |
| 14 | +4. Emits the result to disk, to `entities.metadata.wiki_page`, or back into the thought store as a `dossier`-typed thought — your choice. |
| 15 | + |
| 16 | +The wiki is an **emergent, regenerable view** of atomic state. `public.thoughts` remains the source of truth; wikis are cached snapshots you can rebuild anytime. |
| 17 | + |
| 18 | +Inspired by [Andrej Karpathy's LLM Wiki concept](https://github.com/karpathy/llm-wiki) and the ExoCortex dossier pattern. |
| 19 | + |
| 20 | +## How It Works |
| 21 | + |
| 22 | +``` |
| 23 | ++--------------------+ +-------------------+ +------------------+ |
| 24 | +| entities |--->| thought_entities |--->| thoughts | |
| 25 | +| (id, canonical_... | | (thought_id, | | (id, content, | |
| 26 | +| aliases, type) | | entity_id, role) | | metadata) | |
| 27 | ++--------------------+ +-------------------+ +------------------+ |
| 28 | + | | |
| 29 | + | typed edges (excl. co_occurs_with) | |
| 30 | + v v |
| 31 | + +-----------+ +----------------+ |
| 32 | + | edges | +--------+ | LLM synthesis | |
| 33 | + | (from, to,|------->| Script |----------------->| (Chat | |
| 34 | + | relation)| | | | Completions) | |
| 35 | + +-----------+ +--------+ +----------------+ |
| 36 | + | | |
| 37 | + v v |
| 38 | + +-------+-----------------------------+--+ |
| 39 | + v v v |
| 40 | + wikis/{slug}.md entities.metadata dossier thought |
| 41 | + (default) .wiki_page (trade-off!) |
| 42 | +``` |
| 43 | + |
| 44 | +The script groups typed edges by relation, truncates thought content to 300 chars per snippet, caps the prompt at 25 linked + 15 semantic items by default (configurable via `--max-linked` and `--max-semantic`), and asks the model to cite thought ids inline. Sections with no material are skipped rather than filled with boilerplate. |
| 45 | + |
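The grouping, truncation, and capping described above can be sketched as follows. This is a minimal illustration, not the code in `generate-wiki.mjs`: the helper names (`truncate`, `groupEdgesByRelation`, `buildContext`) and the row shapes (`{ id, content }` for thoughts, `{ relation, to }` for edges) are assumptions made for the example.

```javascript
// Hypothetical helpers; the real script may be organized differently.
const SNIPPET_CHARS = 300;

// Cut a thought down to a prompt-sized snippet.
function truncate(text, max = SNIPPET_CHARS) {
  return text.length <= max ? text : text.slice(0, max);
}

// Group typed edges by relation, dropping co_occurs_with noise.
function groupEdgesByRelation(edges) {
  const groups = {};
  for (const edge of edges) {
    if (edge.relation === "co_occurs_with") continue;
    (groups[edge.relation] ??= []).push(edge.to);
  }
  return groups;
}

// Cap linked and semantic thoughts, and prefix each snippet with its
// thought id so the model has something to cite inline.
function buildContext(
  { linked, semantic, edges },
  { maxLinked = 25, maxSemantic = 15 } = {}
) {
  const render = (t) => `[${t.id}] ${truncate(t.content)}`;
  return {
    linked: linked.slice(0, maxLinked).map(render),
    semantic: semantic.slice(0, maxSemantic).map(render),
    relations: groupEdgesByRelation(edges),
  };
}
```

Keeping the id next to each snippet is what makes inline citation possible: the model only sees ids that were actually in the prompt.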
| 46 | +## Prerequisites |
| 47 | + |
| 48 | +> [!WARNING] |
| 49 | +> **Schema prereq not yet in OB1 `main`.** The `schemas/entity-extraction/` schema and the `integrations/entity-extraction-worker/` edge function referenced below are in-flight PRs, not merged code. Paths like `../../schemas/entity-extraction/` will 404 on GitHub today. This recipe will not run until both companion PRs land. Track: schema PR [#197](https://github.com/NateBJones-Projects/OB1/pull/197), worker PR [#199](https://github.com/NateBJones-Projects/OB1/pull/199). |
| 50 | +
|
| 51 | +- A working Open Brain setup ([guide](../../docs/01-getting-started.md)). |
| 52 | +- The `schemas/entity-extraction/` schema deployed, and the companion `integrations/entity-extraction-worker/` edge function processing the queue. This recipe reads `public.entities`, `public.edges`, and `public.thought_entities` — if those tables are empty, there is nothing to synthesize. Let the worker ingest your thoughts for at least one run before you try this. |
| 53 | +- An API key for any OpenAI-compatible Chat Completions provider (OpenRouter, OpenAI, Groq, Together, Anthropic via OpenRouter, a local Ollama/LM Studio server — anything that accepts `POST /chat/completions`). |
| 54 | +- Node.js 18+ (uses built-in `fetch`). |
| 55 | + |
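For reference, "OpenAI-compatible" here means a provider that accepts a request shaped like the one below. This sketch only builds the request and does not send it; `buildChatRequest` and its parameter names are hypothetical, and the actual prompt text and parameters used by `generate-wiki.mjs` may differ.

```javascript
// Build (but do not send) a Chat Completions request.
// Function and parameter names are illustrative, not taken from the script.
function buildChatRequest({ baseUrl, apiKey, model, system, user, maxTokens = 2048 }) {
  return {
    // Tolerate a trailing slash on the base URL.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model,
        max_tokens: maxTokens,
        messages: [
          { role: "system", content: system },
          { role: "user", content: user },
        ],
      }),
    },
  };
}

// Sending it with Node 18+ built-in fetch would look like:
//   const req = buildChatRequest({ ... });
//   const res = await fetch(req.url, req.options);
//   const markdown = (await res.json()).choices[0].message.content;
```

If your provider accepts this shape at `<base URL>/chat/completions`, it should work with the recipe.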
| 56 | +> [!NOTE] |
| 57 | +> This recipe does **not** require the `recipes/ob-graph/` manual graph layer. It uses the automatic extraction tables from the (pending) `schemas/entity-extraction/` PR. The two are independent. |
| 58 | +
|
| 59 | +## Credential Tracker |
| 60 | + |
| 61 | +```text |
| 62 | +ENTITY-WIKI -- CREDENTIAL TRACKER |
| 63 | +-------------------------------------- |
| 64 | +
|
| 65 | +FROM YOUR OPEN BRAIN SETUP |
| 66 | + Project URL (OPEN_BRAIN_URL): ____________ |
| 67 | + Service role key (OPEN_BRAIN_SERVICE_KEY): ____________ |
| 68 | +
|
| 69 | +LLM PROVIDER |
| 70 | + LLM_BASE_URL (default: openrouter.ai): ____________ |
| 71 | + LLM_API_KEY: ____________ |
| 72 | + LLM_MODEL (default: claude-haiku-4-5): ____________ |
| 73 | +
|
| 74 | +REQUIRED WHEN --output-mode=thought OR --semantic-expand |
| 75 | + EMBEDDING_BASE_URL (default: openai): ____________ |
| 76 | + EMBEDDING_API_KEY: ____________ |
| 77 | + EMBEDDING_MODEL (default: text-embedding-3-small): ____________ |
| 78 | +
|
| 79 | +-------------------------------------- |
| 80 | +``` |
| 81 | + |
| 82 | +> [!CAUTION] |
| 83 | +> `OPEN_BRAIN_SERVICE_KEY` is the Supabase **service role** key. It bypasses RLS. Keep it server-side only. Never ship it to a browser, a mobile client, or any environment an end user can inspect. This recipe is intended to run on your own machine or a trusted server. |
| 84 | +
|
| 85 | +## Installation |
| 86 | + |
| 87 | +**Step 1: Copy the recipe** |
| 88 | + |
| 89 | +No npm install needed — the script uses only Node.js built-ins. Just copy the recipe: |
| 90 | + |
| 91 | +```bash |
| 92 | +# From your Open Brain project root: |
| 93 | +cp -r recipes/entity-wiki ./entity-wiki |
| 94 | +cd entity-wiki |
| 95 | +``` |
| 96 | + |
| 97 | +Done when: `generate-wiki.mjs` is in your current directory, ready for the `.env.local` you will create in Step 2. |
| 98 | + |
| 99 | +--- |
| 100 | + |
| 101 | +**Step 2: Configure environment variables** |
| 102 | + |
| 103 | +Create `.env.local` next to `generate-wiki.mjs` (or export the variables in your shell): |
| 104 | + |
| 105 | +```bash |
| 106 | +OPEN_BRAIN_URL=https://<your-project-ref>.supabase.co |
| 107 | +OPEN_BRAIN_SERVICE_KEY=<service-role-key> |
| 108 | +LLM_API_KEY=<your-openrouter-or-openai-key> |
| 109 | +# Optional overrides: |
| 110 | +# LLM_BASE_URL=https://api.openai.com/v1 |
| 111 | +# LLM_MODEL=gpt-4o-mini |
| 112 | +# OB_WIKI_OUT_DIR=./wikis |
| 113 | +``` |
| 114 | + |
| 115 | +Done when: `node generate-wiki.mjs --help` prints the usage block without errors. |
| 116 | + |
| 117 | +--- |
| 118 | + |
| 119 | +**Step 3: Verify the entity tables have data** |
| 120 | + |
| 121 | +<details> |
| 122 | +<summary><strong>SQL: Sanity-check entity + link counts</strong> (click to expand)</summary> |
| 123 | + |
| 124 | +```sql |
| 125 | +-- Run in Supabase SQL Editor |
| 126 | +SELECT |
| 127 | + (SELECT count(*) FROM public.entities) AS entities, |
| 128 | + (SELECT count(*) FROM public.thought_entities) AS thought_links, |
| 129 | + (SELECT count(*) FROM public.edges WHERE relation <> 'co_occurs_with') AS typed_edges; |
| 130 | +``` |
| 131 | + |
| 132 | +</details> |
| 133 | + |
| 134 | +If `entities` or `thought_links` is 0, wait for the entity-extraction worker to process your queue before running the recipe. See the pending `schemas/entity-extraction/` PR for worker setup (not yet on `main` — see warning at the top of this README). |
| 135 | + |
| 136 | +Done when: all three counts are non-zero and at least one entity has 3+ linked thoughts. |
| 137 | + |
| 138 | +## Usage Examples |
| 139 | + |
| 140 | +**Single entity by name:** |
| 141 | + |
| 142 | +```bash |
| 143 | +node generate-wiki.mjs --entity "Alan Shurafa" |
| 144 | +# Writes ./wikis/person-alan-shurafa.md |
| 145 | +``` |
| 146 | + |
| 147 | +**Disambiguate by type** (useful when "Python" is both a tool and a topic): |
| 148 | + |
| 149 | +```bash |
| 150 | +node generate-wiki.mjs --entity "Python" --type tool |
| 151 | +``` |
| 152 | + |
| 153 | +**Single entity by id** (BIGINT — not the UUID thought id): |
| 154 | + |
| 155 | +```bash |
| 156 | +node generate-wiki.mjs --id 42 |
| 157 | +``` |
| 158 | + |
| 159 | +**Dry-run** — print to stdout without writing anything: |
| 160 | + |
| 161 | +```bash |
| 162 | +node generate-wiki.mjs --entity "ExoCortex" --dry-run |
| 163 | +``` |
| 164 | + |
| 165 | +**Batch mode** — generate pages for every entity with 3+ linked thoughts, capped at 25 entities per run: |
| 166 | + |
| 167 | +```bash |
| 168 | +node generate-wiki.mjs --batch --batch-min-linked 3 --batch-limit 25 |
| 169 | +``` |
| 170 | + |
| 171 | +**Choose an output mode:** |
| 172 | + |
| 173 | +```bash |
| 174 | +# Default: write to ./wikis/<slug>.md |
| 175 | +node generate-wiki.mjs --entity "PostgreSQL" --output-mode file |
| 176 | + |
| 177 | +# Cache under entities.metadata.wiki_page — no filesystem, queryable via SQL |
| 178 | +node generate-wiki.mjs --entity "PostgreSQL" --output-mode entity-metadata |
| 179 | + |
| 180 | +# Store as a dossier thought — REQUIRES EMBEDDING_API_KEY so the dossier is |
| 181 | +# retrievable via match_thoughts / MCP search. Without an embedding the row |
| 182 | +# is unreachable, so the CLI refuses to run if the key is missing. READ THE |
| 183 | +# TRADE-OFFS SECTION BELOW before using this mode. |
| 184 | +node generate-wiki.mjs --entity "PostgreSQL" --output-mode thought |
| 185 | +``` |
| 186 | + |
| 187 | +**Enable semantic expansion** (requires `EMBEDDING_API_KEY`): |
| 188 | + |
| 189 | +```bash |
| 190 | +node generate-wiki.mjs --entity "ExoCortex" --semantic-expand |
| 191 | +``` |
| 192 | + |
| 193 | +**Override the model per run:** |
| 194 | + |
| 195 | +```bash |
| 196 | +node generate-wiki.mjs --entity "Alan" --model "openai/gpt-4o-mini" |
| 197 | +``` |
| 198 | + |
| 199 | +Run `node generate-wiki.mjs --help` for the full flag list. |
| 200 | + |
| 201 | +## Output Mode Trade-offs |
| 202 | + |
| 203 | +Pick the mode that matches how you plan to consume the wikis. Each has its own cost. |
| 204 | + |
| 205 | +| Mode | Where it lives | Pros | Cons | |
| 206 | +|------|----------------|------|------| |
| 207 | +| `file` (default) | `./wikis/<slug>.md` | Human-readable, git-versionable, Obsidian-compatible, zero DB writes | Not queryable from SQL or MCP tools; lives outside the brain. Slug is derived from `canonical_name` with non-alphanumerics stripped, so distinct entities like `C`, `C#`, and `C++` share a base slug — the writer appends `-1`, `-2`, ... to avoid overwrites (and logs a warning). Re-running for the same entity id still overwrites its own file. | |
| 208 | +| `entity-metadata` | `entities.metadata.wiki_page` JSONB | Queryable via SQL, travels with the entity, no new rows | Not searchable via embeddings, not picked up by `search_thoughts` | |
| 209 | +| `thought` | A new row in `public.thoughts` with `metadata.type = 'dossier'` | Retrievable via normal search / MCP tools, full provenance back to the atoms it summarizes | **Requires `EMBEDDING_API_KEY`** (the dossier is embedded at write time so `match_thoughts` can find it). **Can pollute semantic search**: a long dossier that restates 20 atoms will match many queries and can rank above the atoms themselves |
| 210 | + |
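The slug collision behavior in `file` mode can be illustrated with a small sketch. It assumes the slug is `<entity_type>-<name>` (consistent with the `person-alan-shurafa.md` example above) and that the writer tracks which entity id owns each slug; both the helper names and the ownership map are hypothetical, and the real writer may detect collisions differently.

```javascript
// Lowercase the name and collapse every run of non-alphanumerics to "-".
function baseSlug(entityType, canonicalName) {
  const name = canonicalName
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return `${entityType}-${name}`;
}

// `taken` maps slug -> owning entity id (an assumption for this sketch).
// A different entity with the same base slug gets -1, -2, ...;
// the same entity id always gets its own slug back.
function assignSlug(entity, taken) {
  const base = baseSlug(entity.entity_type, entity.canonical_name);
  let slug = base;
  for (let n = 1; taken.has(slug) && taken.get(slug) !== entity.id; n++) {
    slug = `${base}-${n}`;
  }
  taken.set(slug, entity.id);
  return slug;
}
```

With this scheme `C`, `C#`, and `C++` (all type `tool`) map to `tool-c`, `tool-c-1`, and `tool-c-2` in processing order, so which entity gets which suffix depends on the order in which pages are generated.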
| 211 | +> [!WARNING] |
| 212 | +> **Thought-mode pollution trade-off.** Storing the wiki back as a thought makes it show up in every search that touches the entity. Karpathy's original design argument against this is valid: a compressed summary that repeats 20 atomic facts will match any query that would have matched any of them, and because it's longer and more "on-topic" it often ranks above the atoms. That's good for "tell me about X" queries but bad for "what did I say on 2026-03-02 about X" queries. |
| 213 | +> |
| 214 | +> This recipe mitigates this by tagging thought-mode output with `metadata.type = 'dossier'`, `metadata.generated_by`, and `metadata.exclude_from_default_search = true`. To keep your search clean, add a filter like `metadata->>'type' IS DISTINCT FROM 'dossier'` in your default search view (a plain `<>` comparison would also drop thoughts that have no `type` set, because `NULL <> 'dossier'` evaluates to NULL) and only include dossiers when the user explicitly asks for them. The mitigation is a convention, not an enforcement: you have to wire the filter on the read side. |
| 215 | +> |
| 216 | +> If you are unsure, start with `file` or `entity-metadata` mode. You can always regenerate. |
| 217 | +
|
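If your read path post-processes search results in application code rather than SQL, the same convention can be applied there. The sketch below is one possible read-side filter; `filterSearchResults` and the `includeDossiers` option are hypothetical names, and only the `metadata.type` and `metadata.exclude_from_default_search` fields come from this recipe.

```javascript
// Drop dossier rows from search results unless the caller opts in.
// Rows with no metadata, or no metadata.type, are kept.
function filterSearchResults(rows, { includeDossiers = false } = {}) {
  if (includeDossiers) return rows;
  return rows.filter((row) => {
    const meta = row.metadata ?? {};
    return meta.type !== "dossier" && meta.exclude_from_default_search !== true;
  });
}
```

Calling `filterSearchResults(rows)` gives the default "atoms only" view, while `filterSearchResults(rows, { includeDossiers: true })` suits explicit "tell me about X" requests.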
| 218 | +## Cost Notes |
| 219 | + |
| 220 | +Each wiki is **one** LLM call. Input size scales with the number of linked + semantic snippets sent (capped at `--max-linked` + `--max-semantic`, default 25 + 15, each truncated to 300 chars). A typical page uses roughly 2–6k input tokens and produces up to 2048 output tokens. |
| 221 | + |
| 222 | +At OpenRouter pricing for `anthropic/claude-haiku-4-5` (about $1 per million input tokens and $5 per million output tokens at the time of writing; check the current rate card), a single wiki costs roughly **$0.01–$0.02**. A batch of 25 entities runs around **$0.25–$0.50**. Substitute `openai/gpt-4o-mini` or a local Ollama model to cut that severalfold or more. |
| 223 | + |
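To redo this arithmetic for your own model, a small estimator is enough. The function below is a sketch with hypothetical names; the rates in the usage comment are placeholders to replace with your provider's current per-million-token prices.

```javascript
// Estimate the USD cost of one LLM call from token counts and
// per-million-token rates.
function estimateCostUsd({ inputTokens, outputTokens }, { inputPerMillion, outputPerMillion }) {
  return (inputTokens * inputPerMillion + outputTokens * outputPerMillion) / 1_000_000;
}

// Example with placeholder rates (USD per million tokens):
//   const perPage = estimateCostUsd(
//     { inputTokens: 6000, outputTokens: 2048 },
//     { inputPerMillion: 1, outputPerMillion: 5 }
//   );
//   const perBatch = perPage * 25; // one call per entity in a 25-entity batch
```

Because each wiki is a single call, batch cost is just the per-page estimate multiplied by the number of entities that actually reach the LLM.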
| 224 | +Bounding behavior: |
| 225 | + |
| 226 | +- `--batch-limit` caps the number of entities processed per batch run (default 25). The script stops after this many candidates, regardless of how many eligible entities exist. |
| 227 | +- `--batch-min-linked` skips entities with fewer than N linked thoughts — prevents burning LLM calls on entities that will produce thin pages. |
| 228 | +- `--max-linked` and `--max-semantic` bound per-call token usage. |
| 229 | +- Entities with zero linked thoughts, zero typed edges, and zero semantic matches are skipped without an LLM call. |
| 230 | + |
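The bounding rules above amount to two small checks, sketched here under assumptions: `selectBatch` and `hasMaterial` are hypothetical names, `linkCount` is an illustrative field, and ordering candidates by link count is a choice made for this example rather than documented behavior.

```javascript
// Keep entities with at least `minLinked` linked thoughts, then cap the
// batch at `limit`. Sorting by link count is an assumption of this sketch.
function selectBatch(entities, { minLinked, limit }) {
  return entities
    .filter((e) => e.linkCount >= minLinked)
    .sort((a, b) => b.linkCount - a.linkCount)
    .slice(0, limit);
}

// An entity with nothing to synthesize is skipped before any LLM call.
function hasMaterial({ linked, typedEdges, semantic }) {
  return linked.length + typedEdges.length + semantic.length > 0;
}
```

Both checks run before any tokens are spent, which is why raising `--batch-min-linked` is the cheapest way to reduce spend on thin pages.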
| 231 | +If you run this on a cron, start with `--batch-limit 10` for a week, measure your actual spend, then raise. |
| 232 | + |
| 233 | +## Troubleshooting |
| 234 | + |
| 235 | +**Issue: `Missing required env var: OPEN_BRAIN_URL`** |
| 236 | +The script looks for `.env.local` or `.env` in the current working directory, then falls back to the process environment. Either `cd` into the recipe folder before running, or export the vars in your shell. |
| 237 | + |
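The lookup order described above can be pictured with the sketch below. It is not the script's loader: `parseEnvFile` and `requireVar` are hypothetical names, and the precedence shown (file values first, then the process environment) is one reading of "falls back to the process environment".

```javascript
// Parse KEY=VALUE lines, ignoring blank lines and # comments.
function parseEnvFile(text) {
  const vars = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (!line || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq === -1) continue;
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    // Strip one pair of matching surrounding quotes, if present.
    if (/^(["']).*\1$/.test(value)) value = value.slice(1, -1);
    vars[key] = value;
  }
  return vars;
}

// Assumed precedence: value from the env file, else the process environment.
function requireVar(name, fileVars, processEnv) {
  const value = fileVars[name] ?? processEnv[name];
  if (!value) throw new Error(`Missing required env var: ${name}`);
  return value;
}
```

Under this reading, running the script from a directory without `.env.local` or `.env` leaves only the process environment, which is the usual cause of the error above.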
| 238 | +**Issue: `no entity found for name="..."`** |
| 239 | +`--entity` resolves against `canonical_name` (case-insensitive) and `normalized_name` (exact) only. **Alias matching is not implemented** (encoding JSONB `cs` inside a PostgREST `or=(...)` clause is brittle across versions). If your entity is reachable only by alias, look up its id in SQL: |
| 240 | + |
| 241 | +```sql |
| 242 | +SELECT id, entity_type, canonical_name, aliases |
| 243 | +FROM public.entities |
| 244 | +WHERE normalized_name ILIKE '%yourname%' OR aliases::text ILIKE '%yourname%' |
| 245 | +ORDER BY last_seen_at DESC |
| 246 | +LIMIT 10; |
| 247 | +``` |
| 248 | + |
| 249 | +Then rerun with `--id <N>` against the exact id. |
| 250 | + |
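The matching rule can be mimicked in memory to see why an alias-only name misses. This is an illustrative sketch: `resolveEntity` is a hypothetical name, and comparing `normalized_name` against the input exactly as typed is an assumption about what "exact" means here.

```javascript
// Match on canonical_name (case-insensitive) or normalized_name (exact),
// optionally narrowed by entity type. Aliases are deliberately not consulted.
function resolveEntity(entities, name, type) {
  const wanted = name.trim();
  return entities.filter((e) => {
    const nameMatches =
      e.canonical_name.toLowerCase() === wanted.toLowerCase() ||
      e.normalized_name === wanted;
    return nameMatches && (!type || e.entity_type === type);
  });
}
```

An entity whose `aliases` contain "Postgres" but whose canonical name is "PostgreSQL" returns no match for `--entity "Postgres"` under this rule, which is when `--id` is the way in.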
| 251 | +**Issue: Wiki only has a Summary — Timeline and Relationships are empty** |
| 252 | +The entity has few linked thoughts or all its edges are `co_occurs_with` (which this recipe filters out as noise). Give the entity-extraction worker more content to process, or add `--semantic-expand` to pull in semantically related thoughts. |
| 253 | + |
| 254 | +**Issue: Batch mode is slow on a large brain** |
| 255 | + |
| 256 | +> [!WARNING] |
| 257 | +> `listBatchCandidates` runs a serial `thought_entities` count per candidate entity (up to `max(batch_limit * 4, 100)` requests) before the first LLM call. On brains with a few thousand entities this adds tens of seconds of startup latency; on 10k+ brains it dominates the run and scales linearly with `--batch-limit`. A drop-in RPC workaround is below. A follow-up recipe PR will ship this RPC by default once the entity-extraction schema lands. |
| 258 | +
|
| 259 | +`listBatchCandidates` does a best-effort per-entity count because PostgREST does not expose `GROUP BY` directly. For brains with 10k+ entities, add an RPC like the following and swap it in: |
| 260 | + |
| 261 | +<details> |
| 262 | +<summary><strong>SQL: Optional batch-candidates RPC</strong> (click to expand)</summary> |
| 263 | + |
| 264 | +```sql |
| 265 | +CREATE OR REPLACE FUNCTION public.entities_with_min_links(min_links int, lim int) |
| 266 | +RETURNS TABLE (id bigint, entity_type text, canonical_name text, link_count bigint) |
| 267 | +LANGUAGE sql STABLE AS $$ |
| 268 | + SELECT e.id, e.entity_type, e.canonical_name, count(te.thought_id) AS link_count |
| 269 | + FROM public.entities e |
| 270 | + JOIN public.thought_entities te ON te.entity_id = e.id |
| 271 | + GROUP BY e.id |
| 272 | + HAVING count(te.thought_id) >= min_links |
| 273 | + ORDER BY link_count DESC |
| 274 | + LIMIT lim; |
| 275 | +$$; |
| 276 | + |
| 277 | +GRANT EXECUTE ON FUNCTION public.entities_with_min_links(int, int) TO service_role; |
| 278 | +``` |
| 279 | + |
| 280 | +</details> |
| 281 | + |
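Once the function exists, it can be called through Supabase's PostgREST RPC endpoint in a single request. The sketch below only builds that request; `buildRpcRequest` is a hypothetical helper, and how you swap the result into `listBatchCandidates` depends on the script's internals, which are not shown here.

```javascript
// Build (but do not send) a PostgREST RPC call for entities_with_min_links.
// The JSON body keys must match the SQL function's argument names.
function buildRpcRequest({ projectUrl, serviceKey, minLinks, limit }) {
  return {
    url: `${projectUrl.replace(/\/+$/, "")}/rest/v1/rpc/entities_with_min_links`,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        apikey: serviceKey,
        Authorization: `Bearer ${serviceKey}`,
      },
      body: JSON.stringify({ min_links: minLinks, lim: limit }),
    },
  };
}

// Sending it with Node 18+ built-in fetch would look like:
//   const req = buildRpcRequest({ projectUrl, serviceKey, minLinks: 3, limit: 25 });
//   const rows = await (await fetch(req.url, req.options)).json();
//   // rows: [{ id, entity_type, canonical_name, link_count }, ...]
```

This replaces the per-entity count requests with one round trip, so startup latency no longer grows with `--batch-limit`.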
| 282 | +**Issue: LLM returns empty or malformed markdown** |
| 283 | +Some smaller models ignore structural instructions. Try a more capable model (`--model "anthropic/claude-haiku-4-5"` or `--model "openai/gpt-4o-mini"`). If you are running a local Ollama model, pick one with strong instruction-following (`llama3.1:70b`, `qwen2.5:32b`). |
| 284 | + |
| 285 | +**Issue: `LLM call failed: 401`** |
| 286 | +`LLM_API_KEY` is missing or wrong. For OpenRouter, the key starts with `sk-or-...`. For OpenAI, `sk-...`. For a local Ollama server, any non-empty string works and you should set `LLM_BASE_URL=http://localhost:11434/v1`. |
| 287 | + |
| 288 | +**Issue: `permission denied for table entities` (or similar)** |
| 289 | +`OPEN_BRAIN_SERVICE_KEY` must be the **service role** key, not the anon key. Regenerate it from Supabase Dashboard → Settings → API if in doubt. |