Status: ✓ complete (with 20.7b classifier inference parked as immediate follow-up). Superseded by the per-sub-task plans in
.claude/plans/phase-20*.mdand the "Phase 20" entry inROADMAP.md.Scope cut: the candle forward pass for the bundled NER classifier is parked as phase 20.7b. The framework path is production-ready; operators provision
BRAIN_NER_MODEL_PATHpercrates/brain-extractors/docs/bundled-ner.md.
Implement the extractor framework, run pattern extractors synchronously on ENCODE, run classifier extractors near-foreground, write extraction audit logs, ship one built-in pattern extractor (brain.entity_mentions) and one built-in classifier (NER).
- Phase 19 complete.
22_extractors/00_purpose.md27_knowledge_workers/00_purpose.md(pattern + classifier sections)25_provenance_versioning/00_purpose.md(audit log)
- Extractor trait and registry.
- Pattern extractor implementation (regex-based).
- Classifier extractor implementation (call out to a model).
- Built-in
brain.entity_mentionsextractor. - Built-in
brain.basic_nerclassifier (small bundled model). - Audit log writes for every extraction.
- ENCODE handler invokes extractors synchronously for pattern + classifier kinds.
Reads: 22_extractors/00_purpose.md.
Writes: crates/brain-core/src/extractor/mod.rs.
Done when: Extractor trait, ExtractorRegistry (maps ExtractorId to instance), ExtractedItem output type.
Pitfalls: Trait must be object-safe (dyn Extractor). Outputs are an enum: Entity, Statement, Relation.
Reads: 22_extractors/00_purpose.md (pattern section).
Writes: crates/brain-extractors/src/pattern.rs.
Done when: loads patterns from declared extractor, applies regex to memory text, resolves entities or produces statements per the target.
Pitfalls: Use regex crate; precompile patterns at schema load. Cap pattern compile time and runtime per memory.
Reads: 22_extractors/00_purpose.md (classifier section).
Writes: crates/brain-extractors/src/classifier.rs.
Done when: loads a pinned model, runs inference per memory, produces extracted items. Initially CPU-only (no GPU dependency).
Pitfalls: Pin model framework version. Use ONNX runtime or pure-Rust inference (e.g., candle).
Reads: 22_extractors/00_purpose.md (built-ins).
Writes: crates/brain-extractors/src/builtin/entity_mentions.rs.
Done when: ships with patterns for common person names (English, capital initials), email addresses, ticket IDs ([A-Z]+-\d+).
Pitfalls: Patterns are heuristic; precision/recall trade-offs documented.
Reads: 22_extractors/00_purpose.md (built-ins).
Writes: crates/brain-extractors/src/builtin/basic_ner.rs.
Done when: small bundled CONLL-trained NER model classifies person, org, place mentions.
Pitfalls: Model size matters; aim for ~50 MB or less. Document accuracy.
Reads: 25_provenance_versioning/00_purpose.md (audit).
Writes: crates/brain-metadata/src/audit_ops.rs.
Done when: every extraction writes an audit entry; admin can query audits by memory, extractor, time.
Reads: 22_extractors/00_purpose.md; spec/05_operations/.
Writes: crates/brain-server/src/handlers/substrate/encode.rs (extended).
Done when: on ENCODE, after memory write succeeds: for each active pattern + classifier extractor whose trigger matches, run synchronously, write outputs.
Pitfalls: Don't extend ENCODE's contract with extraction failures — if an extractor fails, log it and continue. Memory write succeeds independently.
Writes: tests/knowledge_extractors_pattern_classifier.rs.
Done when: end-to-end: schema with pattern+classifier extractor → ENCODE memory → entity + statement created → query returns them. Audit log shows the extraction.
- Pattern extractors run on ENCODE without breaking ENCODE latency budget significantly (P99 ≤ 20 ms).
- Classifier extractor runs and produces entities.
- Built-ins ship and work on a sample dataset.
- Audit logs queryable.
- Don't enable LLM extractors here (phase 21).
- Don't trigger backfill (phase 24).
- Classifier model bundling: consider distribution size; users may prefer to download separately.