Commit d0b0dd5

Merge pull request #213 from alanshurafa/contrib/alanshurafa/entity-wiki
[recipes] Entity wiki pages from knowledge graph
2 parents 1321674 + 53db43a commit d0b0dd5

3 files changed

Lines changed: 1248 additions & 0 deletions

recipes/entity-wiki/README.md

Lines changed: 289 additions & 0 deletions
@@ -0,0 +1,289 @@
# Entity Wiki Pages

> ⚠️ **Requires the entity-extraction companion PRs, which are not yet merged into OB1 `main`.** This recipe reads `public.entities`, `public.thought_entities`, and `public.edges`. Those tables are introduced by the in-flight entity-extraction schema + worker PRs (tracking: [#197](https://github.com/NateBJones-Projects/OB1/pull/197) schema, [#199](https://github.com/NateBJones-Projects/OB1/pull/199) worker). On the current `main` branch those tables do not exist and every query in `generate-wiki.mjs` will fail with `relation "public.entities" does not exist`. Do not try to install this recipe until both companion PRs are merged. See [Prerequisites](#prerequisites) for details.

> Auto-generate per-entity markdown wiki pages by aggregating every thought linked to a person, project, topic, organization, tool, or place, then synthesizing a structured narrative with an LLM.

## What It Does

Turns your scattered atomic thoughts about "Alan" or "ExoCortex" or "PostgreSQL" into a coherent wiki page. For each entity in your knowledge graph, this recipe:

1. Gathers every linked thought (via `thought_entities`) plus typed edges to other entities (via `edges`, skipping raw co-mention noise).
2. Optionally expands with semantic search if you enable embeddings.
3. Calls any OpenAI-compatible Chat Completions endpoint (OpenRouter by default) to synthesize a Summary / Key Facts / Timeline / Relationships / Open Questions page.
4. Emits the result to disk, to `entities.metadata.wiki_page`, or back into the thought store as a `dossier`-typed thought (your choice).

The wiki is an **emergent, regenerable view** of atomic state. `public.thoughts` remains the source of truth; wikis are cached snapshots you can rebuild anytime.

Inspired by [Andrej Karpathy's LLM Wiki concept](https://github.com/karpathy/llm-wiki) and the ExoCortex dossier pattern.
20+
## How It Works
21+
22+
```
23+
+--------------------+ +-------------------+ +------------------+
24+
| entities |--->| thought_entities |--->| thoughts |
25+
| (id, canonical_... | | (thought_id, | | (id, content, |
26+
| aliases, type) | | entity_id, role) | | metadata) |
27+
+--------------------+ +-------------------+ +------------------+
28+
| |
29+
| typed edges (excl. co_occurs_with) |
30+
v v
31+
+-----------+ +----------------+
32+
| edges | +--------+ | LLM synthesis |
33+
| (from, to,|------->| Script |----------------->| (Chat |
34+
| relation)| | | | Completions) |
35+
+-----------+ +--------+ +----------------+
36+
| |
37+
v v
38+
+-------+-----------------------------+--+
39+
v v v
40+
wikis/{slug}.md entities.metadata dossier thought
41+
(default) .wiki_page (trade-off!)
42+
```
43+
44+
The script groups typed edges by relation, truncates thought content to 300 chars per snippet, caps the prompt at ~25 linked + ~15 semantic items (configurable), and asks the model to cite thought ids inline. Sections with no material are skipped rather than filled with boilerplate.
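
The context-building step described above could look roughly like the following. This is a minimal sketch, not the code in `generate-wiki.mjs`: the function names (`truncate`, `groupEdgesByRelation`, `buildContext`) and the input shape are assumptions made for illustration. Only the 300-character truncation, the 25/15 caps, and the `co_occurs_with` exclusion come from the description.

```javascript
// Hypothetical sketch of the prompt-context step; names and shapes are assumptions.
const SNIPPET_CHARS = 300; // per-snippet truncation described above

// Cut a thought's content down to the snippet limit.
function truncate(text, limit = SNIPPET_CHARS) {
  return text.length <= limit ? text : text.slice(0, limit);
}

// Group typed edges by relation, skipping raw co-mention edges.
function groupEdgesByRelation(edges) {
  const groups = new Map();
  for (const edge of edges) {
    if (edge.relation === "co_occurs_with") continue;
    if (!groups.has(edge.relation)) groups.set(edge.relation, []);
    groups.get(edge.relation).push(edge);
  }
  return groups;
}

// Apply the caps and truncation, keeping thought ids so the model can cite them.
function buildContext({ linked, semantic, edges }, { maxLinked = 25, maxSemantic = 15 } = {}) {
  const toSnippet = (thought) => ({ id: thought.id, text: truncate(thought.content) });
  return {
    linked: linked.slice(0, maxLinked).map(toSnippet),
    semantic: semantic.slice(0, maxSemantic).map(toSnippet),
    relations: groupEdgesByRelation(edges),
  };
}
```

Keeping the thought id next to each snippet is what makes inline citation possible; how the real script formats the prompt is not shown here.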

## Prerequisites

> [!WARNING]
> **Schema prereq not yet in OB1 `main`.** The `schemas/entity-extraction/` schema and the `integrations/entity-extraction-worker/` edge function referenced below are in-flight PRs, not merged code. Paths like `../../schemas/entity-extraction/` will 404 on GitHub today. This recipe will not run until both companion PRs land. Track: schema PR [#197](https://github.com/NateBJones-Projects/OB1/pull/197), worker PR [#199](https://github.com/NateBJones-Projects/OB1/pull/199).

- A working Open Brain setup ([guide](../../docs/01-getting-started.md)).
- The `schemas/entity-extraction/` schema deployed, and the companion `integrations/entity-extraction-worker/` edge function processing the queue. This recipe reads `public.entities`, `public.edges`, and `public.thought_entities`; if those tables are empty, there is nothing to synthesize. Let the worker ingest your thoughts for at least one run before you try this.
- An API key for any OpenAI-compatible Chat Completions provider (OpenRouter, OpenAI, Groq, Together, Anthropic via OpenRouter, a local Ollama/LM Studio server, or anything else that accepts `POST /chat/completions`).
- Node.js 18+ (uses built-in `fetch`).
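
To make the "OpenAI-compatible" requirement concrete, here is a hedged sketch of the request such a provider has to accept. `buildChatRequest` and its argument names are hypothetical and are not taken from `generate-wiki.mjs`. The endpoint path, the `Authorization: Bearer` header, and the `model` / `messages` / `max_tokens` body fields follow the common Chat Completions convention.

```javascript
// Sketch only: buildChatRequest and its argument names are assumptions.
function buildChatRequest({ baseUrl, apiKey, model, systemPrompt, userPrompt, maxTokens = 2048 }) {
  return {
    // Tolerate a trailing slash on the configured base URL.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model,
        max_tokens: maxTokens,
        messages: [
          { role: "system", content: systemPrompt },
          { role: "user", content: userPrompt },
        ],
      }),
    },
  };
}

// Usage with the built-in fetch of Node.js 18+ (this performs a real network call):
//   const { url, init } = buildChatRequest({ baseUrl, apiKey, model, systemPrompt, userPrompt });
//   const res = await fetch(url, init);
//   const markdown = (await res.json()).choices[0].message.content;
```

If a provider rejects this shape, it is probably not a drop-in fit for the recipe.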

> [!NOTE]
> This recipe does **not** require the `recipes/ob-graph/` manual graph layer. It uses the automatic extraction tables from the (pending) `schemas/entity-extraction/` PR. The two are independent.

## Credential Tracker

```text
ENTITY-WIKI -- CREDENTIAL TRACKER
--------------------------------------

FROM YOUR OPEN BRAIN SETUP
  Project URL (OPEN_BRAIN_URL): ____________
  Service role key (OPEN_BRAIN_SERVICE_KEY): ____________

LLM PROVIDER
  LLM_BASE_URL (default: openrouter.ai): ____________
  LLM_API_KEY: ____________
  LLM_MODEL (default: claude-haiku-4-5): ____________

REQUIRED WHEN --output-mode=thought OR --semantic-expand
  EMBEDDING_BASE_URL (default: openai): ____________
  EMBEDDING_API_KEY: ____________
  EMBEDDING_MODEL (default: text-embedding-3-small): ____________

--------------------------------------
```

> [!CAUTION]
> `OPEN_BRAIN_SERVICE_KEY` is the Supabase **service role** key. It bypasses RLS. Keep it server-side only. Never ship it to a browser, a mobile client, or any environment an end user can inspect. This recipe is intended to run on your own machine or a trusted server.
85+
## Installation
86+
87+
![Step 1](https://img.shields.io/badge/Step_1-Install_Files-1E88E5?style=for-the-badge)
88+
89+
No npm install needed — the script uses only Node.js built-ins. Just copy the recipe:
90+
91+
```bash
92+
# From your Open Brain project root:
93+
cp -r recipes/entity-wiki ./entity-wiki
94+
cd entity-wiki
95+
```
96+
97+
Done when: `generate-wiki.mjs` is sitting next to a `.env.local` you will create in Step 2.
98+
99+
---
100+
101+
![Step 2](https://img.shields.io/badge/Step_2-Configure_Env-1E88E5?style=for-the-badge)
102+
103+
Create `.env.local` next to `generate-wiki.mjs` (or export the variables in your shell):
104+
105+
```bash
106+
OPEN_BRAIN_URL=https://<your-project-ref>.supabase.co
107+
OPEN_BRAIN_SERVICE_KEY=<service-role-key>
108+
LLM_API_KEY=<your-openrouter-or-openai-key>
109+
# Optional overrides:
110+
# LLM_BASE_URL=https://api.openai.com/v1
111+
# LLM_MODEL=gpt-4o-mini
112+
# OB_WIKI_OUT_DIR=./wikis
113+
```
114+
115+
Done when: `node generate-wiki.mjs --help` prints the usage block without errors.
116+
117+
---
118+
119+
![Step 3](https://img.shields.io/badge/Step_3-Verify_Graph_Has_Data-1E88E5?style=for-the-badge)
120+
121+
<details>
122+
<summary><strong>SQL: Sanity-check entity + link counts</strong> (click to expand)</summary>
123+
124+
```sql
125+
-- Run in Supabase SQL Editor
126+
SELECT
127+
(SELECT count(*) FROM public.entities) AS entities,
128+
(SELECT count(*) FROM public.thought_entities) AS thought_links,
129+
(SELECT count(*) FROM public.edges WHERE relation <> 'co_occurs_with') AS typed_edges;
130+
```
131+
132+
</details>
133+
134+
If `entities` or `thought_links` is 0, wait for the entity-extraction worker to process your queue before running the recipe. See the pending `schemas/entity-extraction/` PR for worker setup (not yet on `main` — see warning at the top of this README).
135+
136+
Done when: all three counts are non-zero and at least one entity has 3+ linked thoughts.
137+
138+
## Usage Examples
139+
140+
**Single entity by name:**
141+
142+
```bash
143+
node generate-wiki.mjs --entity "Alan Shurafa"
144+
# Writes ./wikis/person-alan-shurafa.md
145+
```
146+
147+
**Disambiguate by type** (useful when "Python" is both a tool and a topic):
148+
149+
```bash
150+
node generate-wiki.mjs --entity "Python" --type tool
151+
```
152+
153+
**Single entity by id** (BIGINT — not the UUID thought id):
154+
155+
```bash
156+
node generate-wiki.mjs --id 42
157+
```
158+
159+
**Dry-run** — print to stdout without writing anything:
160+
161+
```bash
162+
node generate-wiki.mjs --entity "ExoCortex" --dry-run
163+
```
164+
165+
**Batch mode** — generate pages for every entity with 3+ linked thoughts, capped at 25 entities per run:
166+
167+
```bash
168+
node generate-wiki.mjs --batch --batch-min-linked 3 --batch-limit 25
169+
```
170+
171+
**Choose an output mode:**
172+
173+
```bash
174+
# Default: write to ./wikis/<slug>.md
175+
node generate-wiki.mjs --entity "PostgreSQL" --output-mode file
176+
177+
# Cache under entities.metadata.wiki_page — no filesystem, queryable via SQL
178+
node generate-wiki.mjs --entity "PostgreSQL" --output-mode entity-metadata
179+
180+
# Store as a dossier thought — REQUIRES EMBEDDING_API_KEY so the dossier is
181+
# retrievable via match_thoughts / MCP search. Without an embedding the row
182+
# is unreachable, so the CLI refuses to run if the key is missing. READ THE
183+
# TRADE-OFFS SECTION BELOW before using this mode.
184+
node generate-wiki.mjs --entity "PostgreSQL" --output-mode thought
185+
```
186+
187+
**Enable semantic expansion** (requires `EMBEDDING_API_KEY`):
188+
189+
```bash
190+
node generate-wiki.mjs --entity "ExoCortex" --semantic-expand
191+
```
192+
193+
**Override the model per run:**
194+
195+
```bash
196+
node generate-wiki.mjs --entity "Alan" --model "openai/gpt-4o-mini"
197+
```
198+
199+
Run `node generate-wiki.mjs --help` for the full flag list.
200+
201+
## Output Mode Trade-offs
202+
203+
Pick the mode that matches how you plan to consume the wikis. Each has its own cost.
204+
205+
| Mode | Where it lives | Pros | Cons |
206+
|------|----------------|------|------|
207+
| `file` (default) | `./wikis/<slug>.md` | Human-readable, git-versionable, Obsidian-compatible, zero DB writes | Not queryable from SQL or MCP tools; lives outside the brain. Slug is derived from `canonical_name` with non-alphanumerics stripped, so distinct entities like `C`, `C#`, and `C++` share a base slug — the writer appends `-1`, `-2`, ... to avoid overwrites (and logs a warning). Re-running for the same entity id still overwrites its own file. |
208+
| `entity-metadata` | `entities.metadata.wiki_page` JSONB | Queryable via SQL, travels with the entity, no new rows | Not searchable via embeddings, not picked up by `search_thoughts` |
209+
| `thought` | A new row in `public.thoughts` with `metadata.type = 'dossier'` | Retrievable via normal search / MCP tools, full provenance back to the atoms it summarizes | **Requires `EMBEDDING_API_KEY`** (the dossier is embedded at write time so match_thoughts can find it). **Can pollute semantic search** — a long dossier that restates 20 atoms will match many queries and rank above the atoms themselves |
210+
211+
> [!WARNING]
212+
> **Thought-mode pollution trade-off.** Storing the wiki back as a thought makes it show up in every search that touches the entity. Karpathy's original design argument against this is valid: a compressed summary that repeats 20 atomic facts will match any query that would have matched any of them, and because it's longer and more "on-topic" it often ranks above the atoms. That's good for "tell me about X" queries but bad for "what did I say on 2026-03-02 about X" queries.
213+
>
214+
> This recipe mitigates by tagging thought-mode output with `metadata.type = 'dossier'`, `metadata.generated_by`, and `metadata.exclude_from_default_search = true`. To keep your search clean, add a filter like `metadata->>'type' <> 'dossier'` in your default search view and only include dossiers when the user explicitly asks for them. The mitigation is a convention, not an enforcement — you have to wire the filter on the read side.
215+
>
216+
> If you are unsure, start with `file` or `entity-metadata` mode. You can always regenerate.
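
If your read path goes through application code rather than a SQL view, the same convention can be applied after the query returns. The sketch below is one possible read-side filter. `isDossier` and `filterSearchResults` are hypothetical helpers, and the row shape (a `metadata` object on each result) is an assumption based on the tags listed above.

```javascript
// Hypothetical read-side filter; not part of the recipe.
function isDossier(row) {
  const meta = row.metadata ?? {};
  return meta.type === "dossier" || meta.exclude_from_default_search === true;
}

// Drop dossiers by default; include them only when the caller opts in.
function filterSearchResults(rows, { includeDossiers = false } = {}) {
  return includeDossiers ? rows : rows.filter((row) => !isDossier(row));
}
```

Filtering in the database is usually preferable because it keeps dossiers from consuming result slots, so treat this as a fallback.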

## Cost Notes

Each wiki is **one** LLM call. Input size scales with the number of linked + semantic snippets sent (capped at `--max-linked` + `--max-semantic`, default 25 + 15, each truncated to 300 chars). A typical page uses roughly 2–6k input tokens and produces up to 2048 output tokens.

At OpenRouter pricing for `anthropic/claude-haiku-4-5` (~$0.80 per million input, ~$4 per million output), a single wiki costs roughly **$0.01–$0.02**. A batch of 25 entities runs around **$0.25–$0.50**. Substitute `openai/gpt-4o-mini` or a local Ollama model to drop that by 10x or more.
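
The arithmetic behind those figures can be checked with a small helper. This is only a sketch: `estimateCostUsd` is a hypothetical function, and the prices are the approximate per-million rates quoted above, which may not match current provider pricing.

```javascript
// Approximate USD prices per million tokens, as quoted in this section.
const PRICE_PER_MILLION = { input: 0.8, output: 4 };

// Estimated cost of one call for the given token counts.
function estimateCostUsd(inputTokens, outputTokens, price = PRICE_PER_MILLION) {
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Upper end of the "typical page" range: 6k input tokens, 2048 output tokens.
const perPage = estimateCostUsd(6000, 2048);
const perBatch = perPage * 25;
```

With these inputs, `perPage` works out to about $0.013 (4,800 + 8,192 = 12,992 millionths of a dollar) and `perBatch` to about $0.32, which sits inside the ranges stated above.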

Bounding behavior:

- `--batch-limit` caps the number of entities processed per batch run (default 25). The script stops after this many candidates, regardless of how many eligible entities exist.
- `--batch-min-linked` skips entities with fewer than N linked thoughts, which prevents burning LLM calls on entities that will produce thin pages.
- `--max-linked` and `--max-semantic` bound per-call token usage.
- Entities with zero linked thoughts, zero typed edges, and zero semantic matches are skipped without an LLM call.

If you run this on a cron, start with `--batch-limit 10` for a week, measure your actual spend, then raise.

## Troubleshooting

**Issue: `Missing required env var: OPEN_BRAIN_URL`**
The script looks for `.env.local` or `.env` in the current working directory, then falls back to the process environment. Either `cd` into the recipe folder before running, or export the vars in your shell.
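
The lookup order described above (env file first, then the process environment) can be pictured with this sketch. It is not the loader in `generate-wiki.mjs`: `parseEnvText` and `requireEnv` are hypothetical names, and the parsing rules (skip comments, split on the first `=`, strip matching quotes) are assumptions about a typical `.env` format.

```javascript
// Hypothetical sketch of env resolution; the parsing rules are assumptions.
function parseEnvText(text) {
  const vars = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // skip blanks and comments
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // ignore lines without KEY=value
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    const quoted =
      value.length >= 2 &&
      (value[0] === '"' || value[0] === "'") &&
      value[value.length - 1] === value[0];
    if (quoted) value = value.slice(1, -1);
    vars[key] = value;
  }
  return vars;
}

// File values win; the process environment is the fallback.
function requireEnv(name, fileVars, processEnv) {
  const value = fileVars[name] ?? processEnv[name];
  if (!value) throw new Error(`Missing required env var: ${name}`);
  return value;
}
```

Running from another directory means no env file is found there, so only exported variables are visible, which is the usual cause of this error.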

**Issue: `no entity found for name="..."`**
`--entity` resolves against `canonical_name` (case-insensitive) and `normalized_name` (exact) only; **alias matching is not implemented** (encoding JSONB `cs` inside a PostgREST `or=(...)` clause is brittle across versions). If your entity is reachable only by alias, find its id in SQL and rerun with `--id`. Try:

```sql
SELECT id, entity_type, canonical_name, aliases
FROM public.entities
WHERE normalized_name ILIKE '%yourname%'
ORDER BY last_seen_at DESC
LIMIT 10;
```

Then rerun with `--id <N>` against the exact id.

**Issue: Wiki only has a Summary; Timeline and Relationships are empty**
The entity has few linked thoughts or all its edges are `co_occurs_with` (which this recipe filters out as noise). Give the entity-extraction worker more content to process, or lower the `--max-linked` cap to force the model to use what little it has.
253+
254+
**Issue: Batch mode is slow on a large brain**
255+
256+
> [!WARNING]
257+
> `listBatchCandidates` runs a serial `thought_entities` count per candidate entity (up to `max(batch_limit * 4, 100)` requests) before the first LLM call. On brains with a few thousand entities this adds tens of seconds of startup latency; on 10k+ brains it dominates the run and scales linearly with `--batch-limit`. A drop-in RPC workaround is below. A follow-up recipe PR will ship this RPC by default once the entity-extraction schema lands.
258+
259+
`listBatchCandidates` does a best-effort per-entity count because PostgREST does not expose `GROUP BY` directly. For brains with 10k+ entities, add an RPC like the following and swap it in:
260+
261+
<details>
262+
<summary><strong>SQL: Optional batch-candidates RPC</strong> (click to expand)</summary>
263+
264+
```sql
265+
CREATE OR REPLACE FUNCTION public.entities_with_min_links(min_links int, lim int)
266+
RETURNS TABLE (id bigint, entity_type text, canonical_name text, link_count bigint)
267+
LANGUAGE sql STABLE AS $$
268+
SELECT e.id, e.entity_type, e.canonical_name, count(te.thought_id) AS link_count
269+
FROM public.entities e
270+
JOIN public.thought_entities te ON te.entity_id = e.id
271+
GROUP BY e.id
272+
HAVING count(te.thought_id) >= min_links
273+
ORDER BY link_count DESC
274+
LIMIT lim;
275+
$$;
276+
277+
GRANT EXECUTE ON FUNCTION public.entities_with_min_links(int, int) TO service_role;
278+
```
279+
280+
</details>
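
Once the function exists, one way to call it from Node.js is through the PostgREST RPC route that Supabase exposes at `/rest/v1/rpc/<function>`. The sketch below only builds the request. `buildRpcRequest` is a hypothetical helper, and how you swap the result into `listBatchCandidates` depends on the script's internals, which are not shown here.

```javascript
// Hypothetical helper that builds a PostgREST RPC request for the function above.
function buildRpcRequest({ projectUrl, serviceKey, minLinks, limit }) {
  return {
    url: `${projectUrl.replace(/\/+$/, "")}/rest/v1/rpc/entities_with_min_links`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        apikey: serviceKey,
        Authorization: `Bearer ${serviceKey}`,
      },
      // Body keys must match the SQL parameter names.
      body: JSON.stringify({ min_links: minLinks, lim: limit }),
    },
  };
}

// Usage (performs a real network call):
//   const { url, init } = buildRpcRequest({ projectUrl, serviceKey, minLinks: 3, limit: 25 });
//   const candidates = await (await fetch(url, init)).json();
//   // candidates: [{ id, entity_type, canonical_name, link_count }, ...]
```

This replaces the per-entity count requests with a single round trip, since the grouping happens inside Postgres.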

**Issue: LLM returns empty or malformed markdown**
Some smaller models ignore structural instructions. Try a more capable model (`--model "anthropic/claude-haiku-4-5"` or `--model "openai/gpt-4o-mini"`). If you are running a local Ollama model, pick one with strong instruction-following (`llama3.1:70b`, `qwen2.5:32b`).

**Issue: `LLM call failed: 401`**
`LLM_API_KEY` is missing or wrong. For OpenRouter, the key starts with `sk-or-...`. For OpenAI, `sk-...`. For a local Ollama server, any non-empty string works and you should set `LLM_BASE_URL=http://localhost:11434/v1`.

**Issue: `permission denied for table entities` (or similar)**
`OPEN_BRAIN_SERVICE_KEY` must be the **service role** key, not the anon key. Regenerate it from Supabase Dashboard → Settings → API if in doubt.
