Commit 2943323

Merge pull request #199 from alanshurafa/contrib/alanshurafa/entity-extraction-worker
[integrations] Entity extraction worker
2 parents e4165d1 + 322ba57 commit 2943323

6 files changed

Lines changed: 1965 additions & 0 deletions


Lines changed: 202 additions & 0 deletions
@@ -0,0 +1,202 @@
# Entity Extraction Worker

> Async worker that drains the entity extraction queue, extracting people, projects, topics, tools, organizations, and places from thoughts via LLM and building a knowledge graph.

## What It Does

Processes the `entity_extraction_queue` table in batches. For each queued thought, the worker calls an LLM to extract named entities and their relationships, then upserts them into the `entities`, `edges`, and `thought_entities` tables.

The knowledge graph enables queries like "what projects does Sarah work on?" or "which tools are related to this topic?", turning unstructured thoughts into a navigable graph of people, projects, and concepts.

**Entity types:** person, project, topic, tool, organization, place

**Relationship types:** works_on, uses, related_to, member_of, located_in, co_occurs_with

**Worker behavior:**
- Claims pending queue items atomically (prevents duplicate processing)
- Retries failed items up to 5 times before marking as permanently failed
- Skips system-generated thoughts (those with `metadata.generated_by`)
- Supports dry-run mode for previewing extractions without writing
- Enforces canonical ordering for symmetric relations to avoid duplicate edges
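
The canonical-ordering rule in the last bullet can be illustrated with a minimal sketch. It is not taken from the worker's source: the set of symmetric relations (`related_to`, `co_occurs_with`), the numeric entity IDs, and the `canonicalizeEdge` name are all assumptions made for illustration.

```typescript
type Relation =
  | "works_on"
  | "uses"
  | "related_to"
  | "member_of"
  | "located_in"
  | "co_occurs_with";

// Hypothetical edge shape; the real schema may use UUIDs or other column names.
interface Edge {
  fromEntityId: number;
  toEntityId: number;
  relation: Relation;
}

// Assumption: these two relations mean the same thing in either direction.
const SYMMETRIC_RELATIONS: ReadonlySet<Relation> = new Set<Relation>([
  "related_to",
  "co_occurs_with",
]);

// For symmetric relations, always store the smaller ID first so that
// (A, B) and (B, A) collapse into a single row on upsert.
function canonicalizeEdge(edge: Edge): Edge {
  if (
    SYMMETRIC_RELATIONS.has(edge.relation) &&
    edge.fromEntityId > edge.toEntityId
  ) {
    return {
      relation: edge.relation,
      fromEntityId: edge.toEntityId,
      toEntityId: edge.fromEntityId,
    };
  }
  // Directed relations (e.g. works_on) keep their original direction.
  return edge;
}
```

With this ordering, a unique constraint on `(from_entity_id, to_entity_id, relation)` would be enough to keep symmetric edges from being stored twice, assuming such a constraint exists in the schema.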

## Prerequisites

- Working Open Brain setup ([guide](../../docs/01-getting-started.md))
- **Enhanced thoughts schema** applied: install `schemas/enhanced-thoughts`
- **Knowledge graph schema** applied: install `schemas/knowledge-graph` to create the `entities`, `edges`, `thought_entities`, and `entity_extraction_queue` tables
- At least one LLM API key: OpenRouter (recommended), OpenAI, or Anthropic
- Supabase CLI installed for deployment

## Steps

### 1. Deploy the Edge Function

Copy the `integrations/entity-extraction-worker/` folder into your Supabase project's `supabase/functions/` directory, then deploy:

```bash
supabase functions deploy entity-extraction-worker --no-verify-jwt
```

### 2. Set Environment Variables

```bash
supabase secrets set \
  MCP_ACCESS_KEY="your-access-key" \
  OPENROUTER_API_KEY="your-openrouter-key"
```

Optional multi-provider fallback:

```bash
supabase secrets set \
  OPENAI_API_KEY="your-openai-key" \
  ANTHROPIC_API_KEY="your-anthropic-key"
```

Optional safety knobs:

```bash
supabase secrets set \
  ENTITY_EXTRACTION_MAX_CALLS="10000" \
  FETCH_TIMEOUT_MS="60000"
```

- `ENTITY_EXTRACTION_MAX_CALLS`: cap on LLM extraction calls per container
  lifetime (default `10000`; set to `0` to disable). When the cap trips,
  the worker releases remaining claimed rows back to `pending` and returns
  `{ truncated: true, truncated_reason: "call_cap_reached", ... }` so the
  next invocation resumes cleanly.
- `FETCH_TIMEOUT_MS`: hard timeout on every LLM fetch (default `60000`).
  Protects against stalled upstreams consuming the 150s Edge Function
  wall-clock.
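
To make the call-cap behavior concrete, here is a rough sketch of how a per-container budget could truncate a batch. `createCallBudget` and `drainBatch` are hypothetical names, and the bookkeeping is an assumption; only the cap semantics (`0` disables it, remaining rows go back to `pending`) come from the description above.

```typescript
interface BatchSummary {
  processedIds: string[];
  releasedIds: string[]; // claimed rows that would be returned to `pending`
  truncated: boolean;
  truncated_reason: "call_cap_reached" | null;
  llm_calls: number;
}

// maxCalls mirrors ENTITY_EXTRACTION_MAX_CALLS; 0 means "no cap".
// The counter lives for as long as the container does (hypothetical helper).
function createCallBudget(maxCalls: number) {
  let used = 0;
  return {
    tryConsume(): boolean {
      if (maxCalls > 0 && used >= maxCalls) return false;
      used += 1;
      return true;
    },
    used: (): number => used,
  };
}

function drainBatch(
  claimedIds: string[],
  budget: ReturnType<typeof createCallBudget>,
): BatchSummary {
  const processedIds: string[] = [];
  const releasedIds: string[] = [];
  let truncated = false;
  for (const id of claimedIds) {
    // Once the cap trips, every remaining row is released untouched.
    if (truncated || !budget.tryConsume()) {
      truncated = true;
      releasedIds.push(id);
      continue;
    }
    // A real worker would call the LLM for this thought here.
    processedIds.push(id);
  }
  return {
    processedIds,
    releasedIds,
    truncated,
    truncated_reason: truncated ? "call_cap_reached" : null,
    llm_calls: budget.used(),
  };
}
```

Because the budget is shared across invocations served by the same container, a second batch against an exhausted budget would be released in full under this sketch.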

### 3. Backfill the Extraction Queue

If you have existing thoughts that need entity extraction, enqueue them:

```sql
INSERT INTO entity_extraction_queue (thought_id, status)
SELECT id, 'pending'
FROM thoughts
WHERE id NOT IN (SELECT thought_id FROM entity_extraction_queue)
ORDER BY created_at DESC
LIMIT 100;
```

New thoughts are automatically enqueued by the `queue_entity_extraction` trigger from the knowledge graph schema.

### 4. Run the Worker

Trigger the worker to process the queue:

```bash
curl -X POST "https://<your-project-ref>.supabase.co/functions/v1/entity-extraction-worker?limit=10" \
  -H "x-brain-key: your-access-key"
```

For a dry run (preview without writing):

```bash
curl -X POST "https://<your-project-ref>.supabase.co/functions/v1/entity-extraction-worker?limit=5&dry_run=true" \
  -H "x-brain-key: your-access-key"
```

### 5. Verify Results

Check that entities and edges were created:

```sql
SELECT entity_type, canonical_name, last_seen_at
FROM entities
ORDER BY last_seen_at DESC
LIMIT 20;

SELECT e1.canonical_name AS from_entity, e2.canonical_name AS to_entity, ed.relation, ed.support_count
FROM edges ed
JOIN entities e1 ON ed.from_entity_id = e1.id
JOIN entities e2 ON ed.to_entity_id = e2.id
ORDER BY ed.updated_at DESC
LIMIT 20;
```

## API Reference

### `POST /entity-extraction-worker`

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `limit` | query param | 10 | Number of queue items to process (max 50) |
| `dry_run` | query param | false | Preview extractions without writing to DB |
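
As a small illustration of the two query parameters, the sketch below builds a request URL for the worker. `buildWorkerUrl` is a hypothetical client-side helper, and clamping `limit` on the client is an assumption; the table only states that the maximum is 50, not how the server treats larger values.

```typescript
const DEFAULT_LIMIT = 10; // default from the parameter table
const MAX_LIMIT = 50; // documented maximum

interface WorkerOptions {
  limit?: number;
  dryRun?: boolean;
}

// Hypothetical helper: returns the URL to POST to, with `limit` kept
// inside the documented 1..50 range.
function buildWorkerUrl(projectRef: string, options: WorkerOptions = {}): string {
  const requested = options.limit ?? DEFAULT_LIMIT;
  const safe = Number.isFinite(requested) ? Math.floor(requested) : DEFAULT_LIMIT;
  const limit = Math.min(Math.max(1, safe), MAX_LIMIT);

  const params = [`limit=${limit}`];
  if (options.dryRun) params.push("dry_run=true");

  const host = `${encodeURIComponent(projectRef)}.supabase.co`;
  return `https://${host}/functions/v1/entity-extraction-worker?${params.join("&")}`;
}
```

The resulting URL can be used with `curl -X POST` and the `x-brain-key` header exactly as in step 4.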

**Response:**

```json
{
  "processed": 10,
  "succeeded": 8,
  "failed": 2,
  "entities_created": 15,
  "edges_created": 7,
  "dry_run": false,
  "truncated": false,
  "truncated_reason": null,
  "llm_calls": 10,
  "elapsed_ms": 8421
}
```

- `truncated`: `true` when the worker aborted early because a safety cap
  was hit. Remaining claimed rows are returned to `pending`.
- `truncated_reason`: `"call_cap_reached"` (ENTITY_EXTRACTION_MAX_CALLS)
  or `"wall_clock_budget"` (approaching the 150s platform timeout).
- `llm_calls`: cumulative LLM calls across this container's lifetime.
- `elapsed_ms`: wall-clock duration of this invocation.
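
A caller that drains the queue in a loop might type the response and decide what to do next along these lines. The `WorkerResponse` fields follow the JSON above; the `nextStep` policy (back off on truncation, keep going while full batches come back) is only an assumed client strategy, not something the worker prescribes.

```typescript
interface WorkerResponse {
  processed: number;
  succeeded: number;
  failed: number;
  entities_created: number;
  edges_created: number;
  dry_run: boolean;
  truncated: boolean;
  truncated_reason: "call_cap_reached" | "wall_clock_budget" | null;
  llm_calls: number;
  elapsed_ms: number;
}

type NextStep = "continue" | "retry_later" | "done";

// Hypothetical client policy:
// - truncated: a safety cap tripped, so back off before calling again
// - full batch: the queue may still hold more rows, so call again
// - short batch: the queue is most likely drained
function nextStep(res: WorkerResponse, requestedLimit: number): NextStep {
  if (res.truncated) return "retry_later";
  if (res.processed >= requestedLimit) return "continue";
  return "done";
}

// The sample payload from the documentation, parsed as a typed response.
const sample = JSON.parse(`{
  "processed": 10, "succeeded": 8, "failed": 2,
  "entities_created": 15, "edges_created": 7,
  "dry_run": false, "truncated": false, "truncated_reason": null,
  "llm_calls": 10, "elapsed_ms": 8421
}`) as WorkerResponse;
```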

**Queue statuses:**

- `pending`: awaiting extraction
- `processing`: currently being worked on
- `complete`: entities extracted and written
- `skipped`: intentionally not extracted (e.g. `metadata.generated_by`
  is set, indicating a system-synthesized artifact)
- `failed`: exceeded `MAX_ATTEMPTS` (5) retries; check `last_error`

Dry-run preview leaves the queue untouched: rows stay in `pending` until
the worker runs without `dry_run=true`.
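
The status transitions above can be summarized as a small state function. This is a sketch under stated assumptions: the `attempts` counter, the `Outcome` type, and the `settle` and `isSystemGenerated` helpers are hypothetical, and it assumes a row returns to `pending` after a failure until the fifth failed attempt.

```typescript
type QueueStatus = "pending" | "processing" | "complete" | "skipped" | "failed";

const MAX_ATTEMPTS = 5; // value stated in the documentation

// Hypothetical row shape; `attempts` is an assumed column name.
interface QueueRow {
  status: QueueStatus;
  attempts: number;
  last_error: string | null;
}

type Outcome =
  | { kind: "extracted" }
  | { kind: "skipped" }
  | { kind: "error"; message: string };

// A thought counts as system-generated when metadata.generated_by is set.
function isSystemGenerated(metadata: Record<string, unknown> | null): boolean {
  return metadata !== null && metadata["generated_by"] != null;
}

// Compute the row state after one processing attempt.
function settle(row: QueueRow, outcome: Outcome): QueueRow {
  if (outcome.kind === "skipped") {
    return { status: "skipped", attempts: row.attempts, last_error: row.last_error };
  }
  if (outcome.kind === "extracted") {
    return { status: "complete", attempts: row.attempts, last_error: null };
  }
  // On error: count the attempt, then retry later or give up for good.
  const attempts = row.attempts + 1;
  return {
    status: attempts >= MAX_ATTEMPTS ? "failed" : "pending",
    attempts,
    last_error: outcome.message,
  };
}
```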

## How It Connects to Other Components

The Smart Ingest Edge Function (`integrations/smart-ingest`) automatically triggers this worker after writing new thoughts. The Enhanced MCP Server (`integrations/enhanced-mcp`) exposes `graph_search` and `entity_detail` tools that query the graph this worker builds.

For guidance on managing tool count and token overhead as you add more integrations, see the [tool audit guide](../../docs/05-tool-audit.md).

## Expected Outcome

After completing setup and running the worker, you should be able to:

1. See entities extracted from your thoughts in the `entities` table
2. See relationships between entities in the `edges` table
3. Query `thought_entities` to find which thoughts mention which entities
4. Use the `graph_search` and `entity_detail` MCP tools (if the enhanced MCP server is deployed)
5. Observe the queue draining: items move from `pending` → `processing` → `complete`

## Troubleshooting

**"No LLM API key configured"**
Set at least one of `OPENROUTER_API_KEY`, `OPENAI_API_KEY`, or `ANTHROPIC_API_KEY` as a Supabase secret.

**Queue items stuck in "processing"**
If the worker crashes mid-batch, items remain in "processing" status. Reset them:

```sql
UPDATE entity_extraction_queue
SET status = 'pending', started_at = NULL
WHERE status = 'processing'
  AND started_at < now() - interval '10 minutes';
```

**Items repeatedly failing**
Check the `last_error` column in `entity_extraction_queue`. After 5 failed attempts, items are marked as permanently `failed`. Common causes: LLM rate limiting, empty thought content, malformed responses.

**No entities extracted from a thought**
The LLM only extracts entities with confidence >= 0.5. Vague or very short thoughts may not yield any entities. This is expected behavior; check `dry_run` output to see what the LLM returns.
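
When inspecting `dry_run` output, it can help to apply the same threshold yourself. The sketch below is an assumption-laden illustration: the `ExtractedEntity` shape and the `keepConfident` helper are hypothetical, and whether the 0.5 cutoff is applied in the prompt or in worker code is not stated here.

```typescript
// Hypothetical shape of one entity in the LLM's output.
interface ExtractedEntity {
  name: string;
  type: "person" | "project" | "topic" | "tool" | "organization" | "place";
  confidence: number; // assumed to be in the range 0..1
}

const MIN_CONFIDENCE = 0.5; // threshold stated in the documentation

// Keep only entities at or above the threshold.
function keepConfident(
  entities: ExtractedEntity[],
  threshold: number = MIN_CONFIDENCE,
): ExtractedEntity[] {
  return entities.filter((entity) => entity.confidence >= threshold);
}

// Example: a short, vague thought might yield only low-confidence guesses.
const guesses: ExtractedEntity[] = [
  { name: "Sarah", type: "person", confidence: 0.9 },
  { name: "the thing", type: "topic", confidence: 0.2 },
  { name: "Supabase", type: "tool", confidence: 0.5 },
];
const kept = keepConfident(guesses);
```

If `kept` comes back empty for a given thought, the extraction produced nothing above the cutoff, which matches the "no entities" symptom described above.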
