
Commit 08986e9

feat: add Presidio PII detection and redaction (RFC-013) (#118)
* feat: add Presidio PII detection and redaction (RFC-013)

Introduces optional PII detection via Microsoft Presidio for content before it is stored by remember(). Disabled by default; no new core dependencies. Requires pip install zettelforge[pii].

Key features:
- Three actions: log (warn only), redact (replace with placeholder), block (raise exception before storage)
- CTI allowlist for IP_ADDRESS, URL, DOMAIN_NAME (legitimate indicators)
- Lazy SDK loading -- ImportError deferred to first detect() call
- Graceful degradation when presidio-analyzer is not installed
- Config via governance.pii section in config.yaml
- Env overrides: ZETTELFORGE_PII_ENABLED, ZETTELFORGE_PII_ACTION

Files:
- src/zettelforge/pii_validator.py -- PIIValidator, PIIDetection, PIIBlockedError
- src/zettelforge/config.py -- PIIConfig dataclass, pii field on GovernanceConfig
- src/zettelforge/governance_validator.py -- PIIValidator integration, validate_remember()
- src/zettelforge/memory_manager.py -- PII config wired to GovernanceValidator
- pyproject.toml -- pii optional extra (presidio-analyzer, spacy)
- config.default.yaml -- PII documentation with examples
- tests/test_pii_validator.py -- 26 unit tests
- docs/how-to/configure-pii.md -- setup guide
- docs/rfcs/RFC-013-presidio-pii-detection.md -- RFC document

RFC: docs/rfcs/RFC-013-presidio-pii-detection.md

* style: fix UP006 type annotations in pii_validator and governance_validator

Replace typing.List/Dict/Tuple/Optional with built-in list/dict/tuple and | None syntax. All files are under from __future__ import annotations so these are PEP 604 compatible at runtime.

* style: replace `List[str]` with `list[str]` to satisfy UP006/F821

The PR was authored before #109 (UP rule batch) landed, which removed `from typing import List` across the codebase as part of the PEP 585 modernization. After rebase, `List` was undefined here. Drop-in replacement with the built-in `list` matches the rest of the post-#109 codebase.

* style: fix UP006/UP037 type annotations in config.py

Add from __future__ import annotations, replace List[str] with list[str], remove unnecessary quotes from type annotation (re.Match[str]). Also removes the unused Dict and Optional imports that were only needed for PIIConfig.

* fix: handle nested PIIConfig dataclass in _apply_yaml

When config.default.yaml contains governance.pii section, _apply_yaml was setting cfg.governance.pii = raw_dict instead of populating the existing PIIConfig dataclass. Add nested handling for the 'pii' key to properly merge dict values into the dataclass fields. This caused AttributeError: 'dict' object has no attribute 'enabled' in all downstream tests that load the default config.

* fix(rfc-013): governance enforce() ordering + nlp_engine stub shape

Two test failures remained on top of e05cf67's nested-PIIConfig fix:

## 1. enforce("remember", None) silently returned instead of raising

The new `enforce()` short-circuited on `operation == "remember"` and returned data unchanged when it wasn't a string, bypassing the `validate_operation()` GOV-011 check that `test_governance_violation_raises` depends on. Reordered so the structural validation runs first; the PII path runs only after data passes the str/has-content check.

## 2. tests/test_pii_validator.py stub patched the wrong target

`from presidio_analyzer.nlp_engine import NlpEngineProvider` resolves `NlpEngineProvider` as a module attribute on `presidio_analyzer.nlp_engine`, which is a *module*, not a class. The stub installed a class (`_StubNlpEngine`) at that sys.modules slot, so the import raised `ImportError: cannot import name 'NlpEngineProvider' from '_StubNlpEngine'`. Built a real `types.ModuleType` for the submodule with `NlpEngineProvider = _StubNlpProvider` attached.

Also renamed `test_validate_remember_rejects_empty_string` to `test_validate_remember_rejects_invalid_data` and switched the operation from `"synthesize"` to `"remember"`: the test name said one thing, the body asserted something only the remember path validates.

25/25 tests pass in test_governance.py + test_pii_validator.py.

* fix: address all Copilot/Codex review comments on PR #118

Fixed 6 issues identified in the automated review:

1. PII text leakage in validate() log: removed raw PII text from structured log entities; now logs only entity_type and score.
2. CTI allowlist filter location: moved from constructor to detect() time so that entities=None (detect-all mode) filters out IP_ADDRESS, URL, and DOMAIN_NAME at runtime instead of at construction.
3. Overlapping span resolution: greedy algorithm now resolves containment (longest span wins) instead of only exact (start,end) deduplication. Prevents sub-spans from being redacted separately.
4. PIIConfig.entities=[] → PIIValidator entities=None: empty list in config means "detect all supported types". Convert [] to None in constructor so Presidio receives entities=None (semantic for 'detect all') instead of entities=[] (semantic for 'detect none').
5. PIIBlockedError handling: caught in validate_remember() and converted to GovernanceViolationError so memory_manager's existing except GovernanceViolationError handler covers the block action.
6. Unused imports: removed unused field import from pii_validator, unused importlib/patch imports from test file, unused _make_mock_analyzer helper.

Also fixed merge conflicts from rebase in governance_validator.py and test_pii_validator.py.

* style: remove unused MagicMock import from test_pii_validator.py

* test: fix CTI allowlist test for detect-time filtering

Allowlist filtering moved from constructor to detect() in the review fix. Updated test_cti_allowlist_filters_entities to match the new behavior: constructor preserves user-specified entities as-is, filtering happens at runtime in detect() instead. Also added test for empty-list-to-None conversion.
1 parent 46bc414 commit 08986e9

9 files changed

Lines changed: 1510 additions & 13 deletions


config.default.yaml

Lines changed: 41 additions & 0 deletions
@@ -329,6 +329,15 @@ synthesis:
329329
# Validation rules applied before remember() and recall() operations.
330330
# Enforces data classification, retention, access control, and audit logging.
331331
#
332+
# PII detection (RFC-013, optional) uses Microsoft Presidio to scan content
333+
# for personally identifiable information before storage. Disabled by default.
334+
# Requires: pip install zettelforge[pii]
335+
#
336+
# PII action options:
337+
# log — detect and warn, pass content through unchanged
338+
# redact — replace PII with [REDACTED] before storage
339+
# block — raise exception if any PII is detected
340+
#
332341
# Examples:
333342
# # Production (default)
334343
# enabled: true
@@ -341,9 +350,41 @@ synthesis:
341350
# enabled: true
342351
# min_content_length: 20
343352
#
353+
# # PII log-only (see what PII flows through your pipeline)
354+
# enabled: true
355+
# min_content_length: 1
356+
# pii:
357+
# enabled: true
358+
# action: log
359+
#
360+
# # PII redact (automatically remove PII before storage)
361+
# enabled: true
362+
# min_content_length: 1
363+
# pii:
364+
# enabled: true
365+
# action: redact
366+
#
367+
# # PII block (strict — reject content with detected PII)
368+
# enabled: true
369+
# min_content_length: 1
370+
# pii:
371+
# enabled: true
372+
# action: block
373+
#
374+
# Env overrides:
375+
# ZETTELFORGE_PII_ENABLED=true
376+
# ZETTELFORGE_PII_ACTION=redact
377+
#
 governance:
   enabled: true
   min_content_length: 1
+  pii:
+    enabled: false
+    action: log
+    redact_placeholder: "[REDACTED]"
+    entities: []
+    language: en
+    nlp_model: en_core_web_sm
347388

348389

349390
# ── Cache ───────────────────────────────────────────────────────────────────

docs/how-to/configure-pii.md

Lines changed: 217 additions & 0 deletions
@@ -0,0 +1,217 @@
1+
---
2+
title: "Configure PII Detection and Redaction"
3+
description: "Set up Microsoft Presidio for PII detection in ZettelForge. Scan content for personally identifiable information before storage, with configurable log/redact/block policies."
4+
diataxis_type: "how-to"
5+
audience: "Compliance Officer, SOC Manager, FedRAMP Engineer"
6+
tags: [pii, governance, compliance, presidio, privacy, fedramp, gdpr]
7+
last_updated: "2026-04-25"
8+
version: "2.5.0"
9+
---
10+
11+
# Configure PII Detection and Redaction
12+
13+
ZettelForge uses [Microsoft Presidio](https://github.com/microsoft/presidio) (open-source, MIT license) to detect and optionally redact PII (Personally Identifiable Information) from content before it is stored in the vector database and knowledge graph.
14+
15+
PII detection is **disabled by default** and adds no new core dependencies. It is fully optional: `pip install zettelforge[pii]` installs the extra, and setting `governance.pii.enabled: true` turns detection on.
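
The optional-dependency behavior can be pictured with a minimal sketch. It assumes the Presidio import is deferred until first use and that a missing package degrades to "PII checks unavailable" rather than a crash at startup. `load_optional` is a hypothetical helper name for illustration, not part of ZettelForge.

```python
from __future__ import annotations

import importlib
from types import ModuleType


def load_optional(module_name: str) -> ModuleType | None:
    """Import an optional dependency on demand; return None if it is absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# presidio_analyzer is only importable after `pip install zettelforge[pii]`.
analyzer_module = load_optional("presidio_analyzer")
pii_available = analyzer_module is not None
```

Deferring the import this way keeps the base install free of the Presidio and spaCy packages until the feature is actually used.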
16+
17+
## Prerequisites
18+
19+
- ZettelForge installed
20+
- `pip install zettelforge[pii]` to install presidio-analyzer, presidio-anonymizer, and spaCy
21+
- Roughly 12 MB of disk space for the default `en_core_web_sm` spaCy model (auto-downloads on first use); larger models need more, as listed under "spaCy Model Download"
22+
23+
## How It Works
24+
25+
Presidio runs **in-process** as a validation step inside `GovernanceValidator`, invoked before every `remember()` operation:
26+
27+
```
remember(content)
  -> GovernanceValidator.validate_remember(content)
       -> PIIValidator.validate(content)
            -> presidio-analyzer scans for 20+ PII types
            -> Returns (passed, processed_content, detections)
       -> Returns processed_content (possibly redacted)
  -> MemoryStore.save(processed_content)
35+
```
36+
37+
Three actions control what happens when PII is detected:
38+
39+
| Action | Behavior | Use Case |
40+
|:-------|:---------|:---------|
41+
| `log` | Detect, log a warning, pass content through unchanged | Discovery -- see what PII is in your pipeline |
42+
| `redact` | Replace PII with `[REDACTED]` before storage | Compliance -- prevent PII persistence |
43+
| `block` | Raise an exception, storage is cancelled | Strict environments -- no PII allowed through |
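
The three actions in the table can be sketched as a small dispatch over detection spans. This is an illustration under stated assumptions, not the project's implementation: `Span`, `BlockedError`, and `apply_action` are hypothetical names, and the spans are assumed to be non-overlapping character offsets.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical detection result: entity type plus character offsets."""
    entity_type: str
    start: int
    end: int


class BlockedError(Exception):
    """Stand-in for the exception raised by the block action."""


def apply_action(text: str, spans: list[Span], action: str,
                 placeholder: str = "[REDACTED]") -> str:
    """Return the text to store, or raise if the policy forbids storage."""
    if not spans or action == "log":
        # log: a real implementation would emit a warning here.
        return text
    if action == "block":
        raise BlockedError(f"{len(spans)} PII entities detected")
    if action == "redact":
        # Replace from the right so earlier offsets stay valid.
        for span in sorted(spans, key=lambda s: s.start, reverse=True):
            text = text[:span.start] + placeholder + text[span.end:]
        return text
    raise ValueError(f"unknown action: {action!r}")


text = "Mail john@example.com now"
spans = [Span("EMAIL_ADDRESS", 5, 21)]
print(apply_action(text, spans, "redact"))  # Mail [REDACTED] now
```

Replacing from the rightmost span first means each substitution leaves the offsets of the remaining spans untouched.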
44+
45+
## Configuration
46+
47+
Add a `pii:` section under `governance:` in your `config.yaml`:
48+
49+
```yaml
governance:
  enabled: true
  pii:
    enabled: true              # enable PII detection
    action: log                # log | redact | block
    redact_placeholder: "[REDACTED]"
    entities: []               # empty = all PII types
    language: en
    nlp_model: en_core_web_sm
59+
```
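
As a rough picture of how these keys might land in a typed config object, the sketch below merges a parsed `pii:` mapping into a dataclass field by field and treats an empty `entities` list as "detect all". The field names mirror the YAML keys above; the class body and the helpers `merge_into` and `entities_for_presidio` are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields


@dataclass
class PIIConfig:
    """Fields mirror the YAML keys; the class body itself is an assumption."""
    enabled: bool = False
    action: str = "log"
    redact_placeholder: str = "[REDACTED]"
    entities: list[str] = field(default_factory=list)
    language: str = "en"
    nlp_model: str = "en_core_web_sm"


def merge_into(cfg: PIIConfig, raw: dict) -> PIIConfig:
    """Copy known keys from a parsed YAML mapping onto the dataclass."""
    known = {f.name for f in fields(cfg)}
    for key, value in raw.items():
        if key in known:
            setattr(cfg, key, value)
    return cfg


def entities_for_presidio(cfg: PIIConfig) -> list[str] | None:
    """An empty list means "detect all", which is passed on as None."""
    return cfg.entities or None


cfg = merge_into(PIIConfig(), {"enabled": True, "action": "redact"})
```

Merging into the existing dataclass, rather than assigning the raw dict, keeps attribute access such as `cfg.enabled` working for downstream code.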
60+
61+
### Entity Filtering
62+
63+
The `entities` list lets you scope detection to specific PII types. When empty (default), all supported types are detected.
64+
65+
Common entity types:
66+
67+
| Entity | Example | Notes |
68+
|:-------|:--------|:------|
69+
| `EMAIL_ADDRESS` | `user@example.com` | Common in spam and phishing reports |
70+
| `PHONE_NUMBER` | `(555) 123-4567` | |
71+
| `PERSON` | `John Smith` | |
72+
| `CREDIT_CARD` | `4111-1111-1111-1111` | |
73+
| `US_SSN` | `123-45-6789` | Presidio's identifier for US Social Security numbers |
74+
| `CRYPTO` | `1A1zP1eP5QGefi2DMP` | Bitcoin addresses |
75+
| `LOCATION` | `New York City` | |
76+
| `ORGANIZATION` | `Microsoft Corp` | |
77+
78+
IP addresses, URLs, and domain names are **exempt from detection by default** -- these are legitimate CTI indicators (IOCs), not PII in the threat intelligence context. To include them, set `entities` explicitly:
79+
80+
```yaml
pii:
  enabled: true
  entities: ["IP_ADDRESS", "EMAIL_ADDRESS"]  # IPs will now be detected
84+
```
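
The exemption rule can be sketched as a filter applied to detected entity types. This assumes the behavior described above: in detect-all mode (`entities` unset) the CTI indicator types are dropped, while an explicit list is honored as written. `CTI_ALLOWLIST` and `filter_detections` are hypothetical names for illustration.

```python
from __future__ import annotations

CTI_ALLOWLIST = {"IP_ADDRESS", "URL", "DOMAIN_NAME"}


def filter_detections(detected_types: list[str],
                      entities: list[str] | None) -> list[str]:
    """Drop CTI indicator types unless the caller asked for them explicitly."""
    if entities is None:
        # Detect-all mode: indicators are exempt.
        return [t for t in detected_types if t not in CTI_ALLOWLIST]
    # Explicit mode: keep exactly what was requested.
    return [t for t in detected_types if t in entities]


found = ["IP_ADDRESS", "EMAIL_ADDRESS", "URL"]
print(filter_detections(found, None))  # ['EMAIL_ADDRESS']
print(filter_detections(found, ["IP_ADDRESS", "EMAIL_ADDRESS"]))  # ['IP_ADDRESS', 'EMAIL_ADDRESS']
```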
85+
86+
## Example Configurations
87+
88+
### 1. Log-Only (Discovery Mode)
89+
90+
Use this first to understand what PII flows through your pipeline without changing any data:
91+
92+
```yaml
governance:
  pii:
    enabled: true
    action: log
97+
```
98+
99+
Every PII detection is logged as a structured `pii_detected` log event with count, entity types, and scores. Content is stored unchanged.
100+
101+
### 2. Redact (Compliance Mode)
102+
103+
Automatically replace PII with placeholders before storage:
104+
105+
```yaml
governance:
  pii:
    enabled: true
    action: redact
    redact_placeholder: "[PII REMOVED]"
111+
```
112+
113+
The redacted content is what gets stored and indexed. The original content with PII is never persisted.
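
The commit message for this change notes that overlapping detections are resolved so the longest span wins, which keeps a sub-span such as a first name from being redacted separately inside a full name. A minimal sketch of that idea follows. The function names are hypothetical, and dropping every span that overlaps an already kept one (not only strictly contained spans) is a simplifying assumption.

```python
from __future__ import annotations


def resolve_overlaps(spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Greedy pass: prefer longer spans, skip any span overlapping a kept one."""
    kept: list[tuple[int, int]] = []
    # Longest first; ties broken by start offset for determinism.
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)


def redact(text: str, spans: list[tuple[int, int]],
           placeholder: str = "[REDACTED]") -> str:
    """Replace each resolved span, working right to left to keep offsets valid."""
    for start, end in reversed(resolve_overlaps(spans)):
        text = text[:start] + placeholder + text[end:]
    return text


# "John" (5, 9) is contained in "John Smith" (5, 15); only the longer span is kept.
print(redact("Call John Smith today", [(5, 9), (5, 15)]))  # Call [REDACTED] today
```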
114+
115+
### 3. Block (Strict Mode)
116+
117+
Reject any content containing PII entirely:
118+
119+
```yaml
120+
governance:
121+
pii:
122+
enabled: true
123+
action: block
124+
```
125+
126+
If PII is detected, the validator raises `PIIBlockedError`, which `validate_remember()` converts to `GovernanceViolationError`, and the operation is cancelled before anything is stored. Calling code can handle the rejection (e.g., ask the user to retry without PII).
127+
128+
### 4. Targeted Detection (Only Emails and Phones)
129+
130+
Scope detection to specific entity types to reduce noise:
131+
132+
```yaml
133+
governance:
134+
pii:
135+
enabled: true
136+
action: redact
137+
entities: ["EMAIL_ADDRESS", "PHONE_NUMBER"]
138+
```
139+
140+
### 5. Complete Compliance Setup (FedRAMP-aligned)
141+
142+
```yaml
143+
governance:
144+
enabled: true
145+
min_content_length: 1
146+
pii:
147+
enabled: true
148+
action: redact
149+
redact_placeholder: "[REDACTED]"
150+
entities: []
151+
language: en
152+
nlp_model: en_core_web_sm
153+
```
154+
155+
## Environment Variables
156+
157+
| Variable | Maps To | Default |
158+
|:---------|:--------|:--------|
159+
| `ZETTELFORGE_PII_ENABLED` | `governance.pii.enabled` | `false` |
160+
| `ZETTELFORGE_PII_ACTION` | `governance.pii.action` | `log` |
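
A sketch of how such overrides could be layered on top of the YAML values is shown below. The variable names come from the table; the helper `apply_env_overrides`, the accepted truthy spellings, and the validation of the action value are assumptions made for illustration.

```python
from __future__ import annotations

from collections.abc import Mapping

TRUTHY = {"1", "true", "yes", "on"}  # assumed accepted spellings


def apply_env_overrides(pii: dict, env: Mapping[str, str]) -> dict:
    """Return a copy of the pii settings with environment overrides applied."""
    result = dict(pii)
    if "ZETTELFORGE_PII_ENABLED" in env:
        result["enabled"] = env["ZETTELFORGE_PII_ENABLED"].strip().lower() in TRUTHY
    if "ZETTELFORGE_PII_ACTION" in env:
        action = env["ZETTELFORGE_PII_ACTION"].strip().lower()
        if action not in {"log", "redact", "block"}:
            raise ValueError(f"invalid PII action: {action!r}")
        result["action"] = action
    return result


settings = apply_env_overrides(
    {"enabled": False, "action": "log"},
    {"ZETTELFORGE_PII_ENABLED": "true", "ZETTELFORGE_PII_ACTION": "redact"},
)
print(settings)  # {'enabled': True, 'action': 'redact'}
```

In real use the second argument would be `os.environ`; taking it as a parameter keeps the function easy to test.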
161+
162+
## spaCy Model Download
163+
164+
The spaCy NLP model downloads automatically on the first `remember()` call after PII is enabled. The download is a one-time cost:
165+
166+
| Model | Size | Speed | Notes |
167+
|:------|:-----|:------|:------|
168+
| `en_core_web_sm` | ~12 MB | Fast | Default. Good accuracy for standard PII |
169+
| `en_core_web_md` | ~40 MB | Medium | Better person/location disambiguation |
170+
| `en_core_web_lg` | ~560 MB | Slow | Best accuracy, word vectors |
171+
| `en_core_web_trf` | ~400 MB | Slowest | Transformer-based, best for context |
172+
173+
To pre-download (recommended for air-gapped deployments):
174+
175+
```bash
176+
python -m spacy download en_core_web_sm
177+
```
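
For air-gapped deployments it can help to confirm the model is present before enabling PII detection. spaCy models install as regular Python packages, so a standard-library lookup is enough for a quick check. The helper name `spacy_model_installed` is hypothetical.

```python
import importlib.util


def spacy_model_installed(model_name: str = "en_core_web_sm") -> bool:
    """Return True if the named model package is importable in this environment."""
    return importlib.util.find_spec(model_name) is not None


if not spacy_model_installed():
    print("Model missing: run `python -m spacy download en_core_web_sm`")
```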
178+
179+
## Verification
180+
181+
After configuration, test that PII detection is working:
182+
183+
```python
from zettelforge import MemoryManager

mm = MemoryManager()

# This should trigger a PII warning if action=log
note, status = mm.remember(
    "Contact analyst John Smith at john@example.com or 555-1234 for details."
)
192+
```
193+
194+
With `action=log`, you will see a `pii_detected` structured log event.
195+
196+
With `action=redact`, the stored content will have PII replaced:
197+
198+
```python
199+
print(note.content.raw)
200+
# "Contact analyst [REDACTED] at [REDACTED] or [REDACTED] for details."
201+
```
202+
203+
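With `action=block`, the content is rejected: `PIIBlockedError` is raised internally and converted to `GovernanceViolationError` by `validate_remember()`, so nothing is stored.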
204+
205+
## Performance Impact
206+
207+
- First call: ~2-3 seconds (spaCy model loading). Subsequent calls are fast.
208+
- Detection latency: ~50-200ms per `remember()` depending on content length and model size.
209+
- No network calls (all detection is local).
210+
- No impact when `governance.pii.enabled: false` (disabled by default).
211+
212+
## Related
213+
214+
- [Configuration Reference](../reference/configuration.md) -- all `config.yaml` keys
215+
- [Governance Controls](../reference/governance-controls.md) -- GOV-013 PII enforcement
216+
- [Microsoft Presidio](https://github.com/microsoft/presidio) -- upstream project
217+
- RFC-013: PII Detection and Redaction via Microsoft Presidio
