| 1 | +# Entity Extraction Worker |
| 2 | + |
| 3 | +> Async worker that drains the entity extraction queue, extracting people, projects, topics, tools, organizations, and places from thoughts via LLM and building a knowledge graph. |
| 4 | + |
| 5 | +## What It Does |
| 6 | + |
| 7 | +Processes the `entity_extraction_queue` table in batches. For each queued thought, the worker calls an LLM to extract named entities and their relationships, then upserts them into the `entities`, `edges`, and `thought_entities` tables. |
| 8 | + |
| 9 | +The knowledge graph enables queries like "what projects does Sarah work on?" or "which tools are related to this topic?" — turning unstructured thoughts into a navigable graph of people, projects, and concepts. |
| 10 | + |
| 11 | +**Entity types:** person, project, topic, tool, organization, place |
| 12 | + |
| 13 | +**Relationship types:** works_on, uses, related_to, member_of, located_in, co_occurs_with |
| 14 | + |
| 15 | +**Worker behavior:** |
| 16 | +- Claims pending queue items atomically (prevents duplicate processing) |
| 17 | +- Retries failed items up to 5 times before marking as permanently failed |
| 18 | +- Skips system-generated thoughts (those with `metadata.generated_by`) |
| 19 | +- Supports dry-run mode for previewing extractions without writing |
| 20 | +- Enforces canonical ordering for symmetric relations to avoid duplicate edges |
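The canonical-ordering rule for symmetric relations can be sketched as follows. This is a minimal illustration in Python, not the worker's actual code: treating `related_to` and `co_occurs_with` as the symmetric relations, the `canonical_edge` helper, and ordering endpoints by comparing IDs as strings are all assumptions.

```python
# Hypothetical sketch: which relations count as symmetric is an assumption,
# not taken from the worker's source.
SYMMETRIC_RELATIONS = {"related_to", "co_occurs_with"}


def canonical_edge(from_id: str, to_id: str, relation: str) -> tuple[str, str, str]:
    """Return an edge key whose endpoints are ordered for symmetric relations.

    For a symmetric relation, (A, B) and (B, A) describe the same edge, so the
    endpoints are sorted to make both spellings map to one row. Directed
    relations such as works_on keep their original direction.
    """
    if relation in SYMMETRIC_RELATIONS and to_id < from_id:
        from_id, to_id = to_id, from_id
    return (from_id, to_id, relation)


# Both directions of a symmetric relation collapse to the same key.
assert canonical_edge("b", "a", "related_to") == canonical_edge("a", "b", "related_to")
# A directed relation is left untouched.
assert canonical_edge("b", "a", "works_on") == ("b", "a", "works_on")
```

With a key like this, an upsert on `(from_entity_id, to_entity_id, relation)` would increment one edge's `support_count` instead of creating a mirrored duplicate.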
| 21 | + |
| 22 | +## Prerequisites |
| 23 | + |
| 24 | +- Working Open Brain setup ([guide](../../docs/01-getting-started.md)) |
| 25 | +- **Enhanced thoughts schema** applied — install `schemas/enhanced-thoughts` |
| 26 | +- **Knowledge graph schema** applied — install `schemas/knowledge-graph` to create the `entities`, `edges`, `thought_entities`, and `entity_extraction_queue` tables |
| 27 | +- At least one LLM API key: OpenRouter (recommended), OpenAI, or Anthropic |
| 28 | +- Supabase CLI installed for deployment |
| 29 | + |
| 30 | +## Steps |
| 31 | + |
| 32 | +### 1. Deploy the Edge Function |
| 33 | + |
| 34 | +Copy the `integrations/entity-extraction-worker/` folder into your Supabase project's `supabase/functions/` directory, then deploy: |
| 35 | + |
| 36 | +```bash |
| 37 | +supabase functions deploy entity-extraction-worker --no-verify-jwt |
| 38 | +``` |
| 39 | + |
| 40 | +### 2. Set Environment Variables |
| 41 | + |
| 42 | +```bash |
| 43 | +supabase secrets set \ |
| 44 | + MCP_ACCESS_KEY="your-access-key" \ |
| 45 | + OPENROUTER_API_KEY="your-openrouter-key" |
| 46 | +``` |
| 47 | + |
| 48 | +Optional multi-provider fallback: |
| 49 | + |
| 50 | +```bash |
| 51 | +supabase secrets set \ |
| 52 | + OPENAI_API_KEY="your-openai-key" \ |
| 53 | + ANTHROPIC_API_KEY="your-anthropic-key" |
| 54 | +``` |
| 55 | + |
| 56 | +Optional safety knobs: |
| 57 | + |
| 58 | +```bash |
| 59 | +supabase secrets set \ |
| 60 | + ENTITY_EXTRACTION_MAX_CALLS="10000" \ |
| 61 | + FETCH_TIMEOUT_MS="60000" |
| 62 | +``` |
| 63 | + |
| 64 | +- `ENTITY_EXTRACTION_MAX_CALLS` — cap on LLM extraction calls per container |
| 65 | + lifetime (default `10000`; set to `0` to disable). When the cap trips, |
| 66 | + the worker releases remaining claimed rows back to `pending` and returns |
| 67 | + `{ truncated: true, truncated_reason: "call_cap_reached", ... }` so the |
| 68 | + next invocation resumes cleanly. |
| 69 | +- `FETCH_TIMEOUT_MS` — hard timeout on every LLM fetch (default `60000`). |
| 70 | + Protects against stalled upstream requests consuming the 150s Edge Function |
| 71 | + wall-clock limit. |
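The call-cap behavior described above can be sketched roughly as below. This is an illustrative Python model, not the worker's implementation: the `process_batch` function, its parameters, the `released` field, and the assumption of one LLM call per thought are all hypothetical.

```python
from typing import Callable, Iterable


def process_batch(
    claimed_ids: Iterable[str],
    extract: Callable[[str], None],
    calls_so_far: int,
    max_calls: int = 10000,
) -> dict:
    """Process claimed queue items until the LLM call cap is reached.

    max_calls == 0 disables the cap. Items that were claimed but not
    processed are listed in "released" so the caller can set them back
    to pending.
    """
    processed: list[str] = []
    released: list[str] = []
    for thought_id in claimed_ids:
        if max_calls > 0 and calls_so_far >= max_calls:
            released.append(thought_id)  # cap tripped: hand the row back
            continue
        extract(thought_id)  # assumed: exactly one LLM call per thought
        calls_so_far += 1
        processed.append(thought_id)
    truncated = bool(released)
    return {
        "processed": len(processed),
        "released": released,
        "truncated": truncated,
        "truncated_reason": "call_cap_reached" if truncated else None,
        "llm_calls": calls_so_far,
    }


seen: list[str] = []
result = process_batch(["t1", "t2", "t3"], seen.append, calls_so_far=0, max_calls=2)
print(result["truncated"], result["released"])  # True ['t3']
```

Because the counter is passed in rather than reset per batch, the sketch mirrors the documented "per container lifetime" semantics: a warm container that has already used its budget releases every row it claims.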
| 72 | + |
| 73 | +### 3. Backfill the Extraction Queue |
| 74 | + |
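| 75 | +If you have existing thoughts that need entity extraction, enqueue them. The query below enqueues the 100 most recent thoughts that are not yet in the queue; raise or remove the `LIMIT` to backfill more: |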
| 76 | + |
| 77 | +```sql |
| 78 | +INSERT INTO entity_extraction_queue (thought_id, status) |
| 79 | +SELECT id, 'pending' |
| 80 | +FROM thoughts |
| 81 | +WHERE id NOT IN (SELECT thought_id FROM entity_extraction_queue) |
| 82 | +ORDER BY created_at DESC |
| 83 | +LIMIT 100; |
| 84 | +``` |
| 85 | + |
| 86 | +New thoughts are automatically enqueued by the `queue_entity_extraction` trigger from the knowledge graph schema. |
| 87 | + |
| 88 | +### 4. Run the Worker |
| 89 | + |
| 90 | +Trigger the worker to process the queue: |
| 91 | + |
| 92 | +```bash |
| 93 | +curl -X POST "https://<your-project-ref>.supabase.co/functions/v1/entity-extraction-worker?limit=10" \ |
| 94 | + -H "x-brain-key: your-access-key" |
| 95 | +``` |
| 96 | + |
| 97 | +For a dry run (preview without writing): |
| 98 | + |
| 99 | +```bash |
| 100 | +curl -X POST "https://<your-project-ref>.supabase.co/functions/v1/entity-extraction-worker?limit=5&dry_run=true" \ |
| 101 | + -H "x-brain-key: your-access-key" |
| 102 | +``` |
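If you would rather trigger the worker from a script than from `curl`, the same request can be built with Python's standard library. This is a sketch under assumptions: `build_worker_request` is a hypothetical helper, `abcd1234` is a placeholder project ref, and clamping `limit` to 50 on the client simply mirrors the maximum documented in the API reference.

```python
import urllib.parse
import urllib.request


def build_worker_request(
    project_ref: str, access_key: str, limit: int = 10, dry_run: bool = False
) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl examples."""
    # Client-side convenience only: keep limit within the documented 1..50 range.
    params = {"limit": str(max(1, min(limit, 50)))}
    if dry_run:
        params["dry_run"] = "true"
    url = (
        f"https://{project_ref}.supabase.co/functions/v1/entity-extraction-worker"
        f"?{urllib.parse.urlencode(params)}"
    )
    return urllib.request.Request(
        url, method="POST", headers={"x-brain-key": access_key}
    )


request = build_worker_request("abcd1234", "your-access-key", limit=5, dry_run=True)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=160)
```

Building the request separately from sending it makes it easy to log or inspect the URL before spending any LLM calls.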
| 103 | + |
| 104 | +### 5. Verify Results |
| 105 | + |
| 106 | +Check that entities and edges were created: |
| 107 | + |
| 108 | +```sql |
| 109 | +SELECT entity_type, canonical_name, last_seen_at |
| 110 | +FROM entities |
| 111 | +ORDER BY last_seen_at DESC |
| 112 | +LIMIT 20; |
| 113 | + |
| 114 | +SELECT e1.canonical_name AS from_entity, e2.canonical_name AS to_entity, ed.relation, ed.support_count |
| 115 | +FROM edges ed |
| 116 | +JOIN entities e1 ON ed.from_entity_id = e1.id |
| 117 | +JOIN entities e2 ON ed.to_entity_id = e2.id |
| 118 | +ORDER BY ed.updated_at DESC |
| 119 | +LIMIT 20; |
| 120 | +``` |
| 121 | + |
| 122 | +## API Reference |
| 123 | + |
| 124 | +### `POST /entity-extraction-worker` |
| 125 | + |
| 126 | +| Parameter | Type | Default | Description | |
| 127 | +|-----------|------|---------|-------------| |
| 128 | +| `limit` | query param | 10 | Number of queue items to process (max 50) | |
| 129 | +| `dry_run` | query param | false | Preview extractions without writing to DB | |
| 130 | + |
| 131 | +**Response:** |
| 132 | + |
| 133 | +```json |
| 134 | +{ |
| 135 | + "processed": 10, |
| 136 | + "succeeded": 8, |
| 137 | + "failed": 2, |
| 138 | + "entities_created": 15, |
| 139 | + "edges_created": 7, |
| 140 | + "dry_run": false, |
| 141 | + "truncated": false, |
| 142 | + "truncated_reason": null, |
| 143 | + "llm_calls": 10, |
| 144 | + "elapsed_ms": 8421 |
| 145 | +} |
| 146 | +``` |
| 147 | + |
| 148 | +- `truncated` — `true` when the worker aborted early because a safety cap |
| 149 | + was hit. Remaining claimed rows are returned to `pending`. |
| 150 | +- `truncated_reason` — `"call_cap_reached"` (ENTITY_EXTRACTION_MAX_CALLS) |
| 151 | + or `"wall_clock_budget"` (approaching the 150s platform timeout). |
| 152 | +- `llm_calls` — cumulative LLM calls across this container's lifetime. |
| 153 | +- `elapsed_ms` — wall-clock duration of this invocation. |
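A caller that drains the queue in a loop might use these fields to decide what to do next. The policy below is only one possible interpretation, written as a hypothetical `next_action` helper; the action names and the choice to back off after `call_cap_reached` are assumptions, not behavior the worker prescribes.

```python
def next_action(response: dict) -> str:
    """Suggest a follow-up after one worker invocation (illustrative policy)."""
    if response.get("dry_run"):
        # Previews leave the queue untouched, so looping would repeat the same rows.
        return "stop"
    if response.get("truncated"):
        if response.get("truncated_reason") == "wall_clock_budget":
            # The batch ran out of time; remaining rows are back in pending.
            return "invoke_again"
        # call_cap_reached: assumed to warrant a pause or a cap adjustment.
        return "wait_then_invoke"
    if response.get("processed", 0) == 0:
        return "done"  # nothing was claimed, so the queue is presumably empty
    return "invoke_again"


example = {"processed": 10, "succeeded": 8, "failed": 2, "dry_run": False,
           "truncated": False, "truncated_reason": None}
print(next_action(example))  # invoke_again
```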
| 154 | + |
| 155 | +**Queue statuses:** |
| 156 | + |
| 157 | +- `pending` — awaiting extraction |
| 158 | +- `processing` — currently being worked on |
| 159 | +- `complete` — entities extracted and written |
| 160 | +- `skipped` — intentionally not extracted (e.g. `metadata.generated_by` |
| 161 | + is set, indicating a system-synthesized artifact) |
| 162 | +- `failed` — exceeded `MAX_ATTEMPTS` (5) retries; check `last_error` |
| 163 | + |
| 164 | +Dry-run preview leaves the queue untouched — rows stay in `pending` until |
| 165 | +the worker runs without `dry_run=true`. |
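The status rules above can be summarized as a small decision function. This is a hedged sketch rather than the worker's code: `status_after_attempt` is a hypothetical name, and it assumes a failed item with attempts remaining goes back to `pending` for the next run.

```python
MAX_ATTEMPTS = 5  # matches the retry limit documented above


def status_after_attempt(metadata: dict, succeeded: bool, attempts: int) -> str:
    """Return the queue status after one processing attempt.

    `attempts` counts attempts made so far, including the one that just
    finished.
    """
    if metadata.get("generated_by"):
        return "skipped"  # system-generated thoughts are never extracted
    if succeeded:
        return "complete"
    # Assumption: retryable failures return to pending until the limit is hit.
    return "failed" if attempts >= MAX_ATTEMPTS else "pending"


print(status_after_attempt({}, succeeded=False, attempts=5))  # failed
```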
| 166 | + |
| 167 | +## How It Connects to Other Components |
| 168 | + |
| 169 | +The Smart Ingest Edge Function (`integrations/smart-ingest`) automatically triggers this worker after writing new thoughts. The Enhanced MCP Server (`integrations/enhanced-mcp`) exposes `graph_search` and `entity_detail` tools that query the graph this worker builds. |
| 170 | + |
| 171 | +For guidance on managing tool count and token overhead as you add more integrations, see the [tool audit guide](../../docs/05-tool-audit.md). |
| 172 | + |
| 173 | +## Expected Outcome |
| 174 | + |
| 175 | +After completing setup and running the worker, you should be able to: |
| 176 | + |
| 177 | +1. See entities extracted from your thoughts in the `entities` table |
| 178 | +2. See relationships between entities in the `edges` table |
| 179 | +3. Query `thought_entities` to find which thoughts mention which entities |
| 180 | +4. Use the `graph_search` and `entity_detail` MCP tools (if the enhanced MCP server is deployed) |
| 181 | +5. Observe the queue draining — items move from `pending` → `processing` → `complete` |
| 182 | + |
| 183 | +## Troubleshooting |
| 184 | + |
| 185 | +**"No LLM API key configured"** |
| 186 | +Set at least one of `OPENROUTER_API_KEY`, `OPENAI_API_KEY`, or `ANTHROPIC_API_KEY` as a Supabase secret. |
| 187 | + |
| 188 | +**Queue items stuck in "processing"** |
| 189 | +If the worker crashes mid-batch, items remain in `processing` status. Reset any that have been stuck for more than 10 minutes: |
| 190 | + |
| 191 | +```sql |
| 192 | +UPDATE entity_extraction_queue |
| 193 | +SET status = 'pending', started_at = NULL |
| 194 | +WHERE status = 'processing' |
| 195 | + AND started_at < now() - interval '10 minutes'; |
| 196 | +``` |
| 197 | + |
| 198 | +**Items repeatedly failing** |
| 199 | +Check the `last_error` column in `entity_extraction_queue`. After 5 failed attempts, items are marked as permanently `failed`. Common causes: LLM rate limiting, empty thought content, malformed responses. |
| 200 | + |
| 201 | +**No entities extracted from a thought** |
| 202 | +The LLM only extracts entities with confidence >= 0.5. Vague or very short thoughts may not yield any entities. This is expected behavior — check `dry_run` output to see what the LLM returns. |
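To make the 0.5 threshold concrete, here is a minimal sketch of such a filter applied to a dry-run style result. The entity shape (`name`, `type`, `confidence`), the `keep_confident` helper, and the sample values are hypothetical; the worker's real response format may differ.

```python
MIN_CONFIDENCE = 0.5  # threshold documented above


def keep_confident(entities: list[dict], threshold: float = MIN_CONFIDENCE) -> list[dict]:
    """Keep only entities whose confidence meets the threshold.

    Entities with no confidence value are treated as 0.0 and dropped.
    """
    return [e for e in entities if e.get("confidence", 0.0) >= threshold]


# Made-up sample data for illustration only.
sample = [
    {"name": "Sarah", "type": "person", "confidence": 0.92},
    {"name": "the thing", "type": "topic", "confidence": 0.31},
    {"name": "Supabase", "type": "tool", "confidence": 0.5},
]
print([e["name"] for e in keep_confident(sample)])  # ['Sarah', 'Supabase']
```

A thought whose candidates all fall below the threshold yields an empty list, which matches the "no entities extracted" symptom without indicating an error.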