Commit bcf3f1f

spec(004-memory): add shadow memory, implicit conflict detection, five-signal retrieval specs (#4375)
Add three memory subsystem specifications derived from arXiv research:
- 004-16: TrajectoryRiskAccumulator for multi-turn attack defense (MAGE, #3695)
- 004-17: ImplicitConflictDetector extending APEX-MEM write path (STALE, #3702)
- 004-18: Five-signal SYNAPSE retrieval + async consolidation daemon (MemTier, #3703)
1 parent c92702d commit bcf3f1f

4 files changed

Lines changed: 1174 additions & 0 deletions

Lines changed: 372 additions & 0 deletions
@@ -0,0 +1,372 @@
---
aliases:
  - MAGE Shadow Memory
  - Trajectory Risk Accumulator
  - Shadow Memory Defense
tags:
  - sdd
  - spec
  - memory
  - security
  - experimental
created: 2026-05-18
status: draft
related:
  - "[[MOC-specs]]"
  - "[[constitution]]"
  - "[[004-memory/spec]]"
  - "[[004-7-memory-apex-magma]]"
  - "[[001-system-invariants/spec]]"
  - "[[010-security/010-7-shadow-memory-guardrail]]"
---

# Spec: Shadow Memory Safety - Trajectory-Level Attack Defense (MAGE)

> [!info]
> Implements a parallel shadow memory stream that accumulates safety-critical
> signals across an agent's full execution trajectory, enabling detection and
> blocking of multi-turn attacks that evade per-turn controls.
> Resolves GitHub issue [#3695](https://github.com/rabax/zeph/issues/3695).

## Sources

### External
- **MAGE: Multi-turn Agent Guard with Extended memory** (arXiv:2605.03228, 2026):
  shadow memory for trajectory-level threat detection; reduces tool-attack chaining
  from 100% → 8.3% success rate; eliminates persistent indirect prompt injection

### Internal

| File | Contents |
|------|----------|
| `crates/zeph-sanitizer/src/lib.rs` | `ContentSanitizer`, `PolicyGate`, per-turn audit events |
| `crates/zeph-agent-tools/src/executor.rs` | Tool execution gate; pre-execution policy check |
| `crates/zeph-memory/src/semantic/mod.rs` | `SemanticMemory`; shadow memory is a sibling stream |
| `crates/zeph-core/src/agent/mod.rs` | Agent turn loop; `MemoryState` lifecycle |

---

## 1. Overview

### Problem Statement

Zeph's current safety controls in `zeph-sanitizer` operate per-turn: `ContentSanitizer`
filters individual messages and `PolicyGate` enforces policies on individual tool
invocations. These mechanisms cannot detect cumulative threats that unfold gradually
across multiple turns: a class of attack where no single turn triggers a policy
violation but the trajectory as a whole enacts adversarial behavior.

Three concrete threat classes are undetected today:

1. **Sequential tool-attack chaining**: an adversary plants a goal across 3–5 turns
   (each plausible in isolation), then triggers execution in a later turn. Per-turn
   controls see only individual turns and admit all of them.
2. **Persistent indirect prompt injection**: malicious instructions injected via tool
   output (e.g., a web-scrape result) persist in episodic memory and are recalled in
   future turns, re-injecting the adversarial directive.
3. **Multi-turn poisoning**: repeated low-severity signals accumulate into a high-risk
   trajectory. Each signal alone falls below the per-turn policy threshold.

MAGMA [[004-7-memory-apex-magma]] tracks semantic entity relationships but does not
accumulate trajectory-level safety signals. SafeAgent (#3570) addresses
trajectory-stateful mediation conceptually but remains unimplemented.

### Goal

Implement `TrajectoryRiskAccumulator`, a lightweight shadow memory attached to each
agent session, that ingests per-turn audit events from `zeph-sanitizer`, maintains
a rolling trajectory risk score, and gates tool execution when cumulative risk exceeds
a configured threshold. No changes to the primary memory pipeline are required; shadow
memory is an orthogonal stream.

### Out of Scope

- Replacing or modifying the per-turn `ContentSanitizer` or `PolicyGate` (shadow memory
  is additive, not a substitute)
- Cross-session risk propagation (risk resets when the session ends)
- LLM-based intent classification of each turn (signal detection is rule-based to keep
  overhead low; LLM escalation is a separate optional path)
- Changes to the MAGMA semantic graph schema
- Modifying `zeph-sanitizer` internals beyond adding audit event emission

---

## 2. User Stories

### US-001: Multi-turn attack detection
AS AN operator running long-lived Zeph agents in production
I WANT the agent to detect and block tool-attack chains that develop over multiple turns
SO THAT an adversary who plants context incrementally cannot reach tool execution

**Acceptance criteria:**
```
GIVEN a session where turns 1–4 each emit one low-severity policy warning
AND turn 5 requests a tool execution
WHEN the cumulative trajectory risk exceeds the configured threshold
THEN the tool execution is denied
AND the agent returns a rationale explaining the denial
AND the incident is logged in the sanitizer audit trail
```
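The deny path in these criteria, together with the escalation band from FR-006, comes down to a three-way decision on the current score. A minimal sketch, assuming the thresholds are passed in directly; `GateDecision` and `gate` are hypothetical names, not part of the spec (the spec's own error type is `ToolError::TrajectoryRiskExceeded`):

```rust
/// Outcome of the pre-execution check (FR-004 to FR-006).
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
pub enum GateDecision {
    /// Risk is below the escalation threshold.
    Allow,
    /// Risk is in [escalation_threshold, risk_threshold): ask a human first.
    Escalate { score: f64 },
    /// Risk is at or above risk_threshold: deny the tool call.
    Block { score: f64 },
}

/// Maps the current trajectory risk onto a decision.
pub fn gate(trajectory_risk: f64, escalation_threshold: f64, risk_threshold: f64) -> GateDecision {
    if trajectory_risk >= risk_threshold {
        GateDecision::Block { score: trajectory_risk }
    } else if trajectory_risk >= escalation_threshold {
        GateDecision::Escalate { score: trajectory_risk }
    } else {
        GateDecision::Allow
    }
}
```

Both band edges are inclusive on the lower side, matching the `[0.5, 0.75)` interval in FR-006 and the "at or above" wording in the config.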

### US-002: Persistent prompt injection blocking
AS AN operator
I WANT recalled tool output that contains injection patterns to be flagged at the
trajectory level
SO THAT injected instructions planted via episodic memory do not execute silently

**Acceptance criteria:**
```
GIVEN a tool result containing a prompt-injection pattern was stored in episodic memory
AND the injected instruction is recalled in a later turn
WHEN the recall surface triggers a prompt-injection signal
THEN the TrajectoryRiskAccumulator records a high-severity injection signal
AND if the trajectory risk exceeds the threshold, the current tool call is blocked
AND the event is emitted to the audit log
```

### US-003: Benign session pass-through
AS A regular user running normal agent tasks
I WANT safety checks to add no observable latency on clean sessions
SO THAT the security mechanism does not degrade normal operations

**Acceptance criteria:**
```
GIVEN a session with no policy violations, no anomalous tool patterns, and
      no injection signals across 50 turns
WHEN the agent processes each turn
THEN no tool calls are denied by the TrajectoryRiskAccumulator
AND per-turn overhead from shadow memory is < 1 ms at p95
```

---

## 3. Functional Requirements

| ID | Requirement | Priority |
|----|------------|----------|
| FR-001 | THE SYSTEM SHALL maintain one `TrajectoryRiskAccumulator` per agent session, created at session start and dropped at session end | must |
| FR-002 | WHEN `zeph-sanitizer` emits an `AuditEvent` THE SYSTEM SHALL ingest it into the session's `TrajectoryRiskAccumulator` within the same turn | must |
| FR-003 | `TrajectoryRiskAccumulator` SHALL accumulate a `trajectory_risk` score in `[0.0, 1.0]` via a weighted sum of ingested signals with exponential temporal decay per `risk_halflife_turns` | must |
| FR-004 | WHEN `zeph-agent-tools` prepares a tool execution THE SYSTEM SHALL query the session's `TrajectoryRiskAccumulator` for the current `trajectory_risk` | must |
| FR-005 | WHEN `trajectory_risk` ≥ `risk_threshold` (default `0.75`) THE SYSTEM SHALL block the tool execution and return a `ToolError::TrajectoryRiskExceeded { score, signals }` to the agent loop | must |
| FR-006 | WHEN `trajectory_risk` is in `[escalation_threshold, risk_threshold)` (default `[0.5, 0.75)`) THE SYSTEM SHALL escalate to human confirmation before allowing tool execution | should |
| FR-007 | Signal types ingested from `AuditEvent` SHALL include at minimum: `policy_violation`, `prompt_injection_pattern`, `tool_chain_anomaly`, `confidence_drop` | must |
| FR-008 | Each signal type SHALL carry a configurable `base_weight` in `(0.0, 1.0]` and a configurable `severity_multiplier` in `{low=0.5, medium=1.0, high=2.0}` | must |
| FR-009 | `TrajectoryRiskAccumulator` SHALL apply temporal decay: at each new turn, all accumulated signal contributions are multiplied by `exp(-ln(2) / risk_halflife_turns)` before the new turn's signals are added | must |
| FR-010 | WHEN a tool execution is blocked THE SYSTEM SHALL emit an `AuditEvent::TrajectoryBlock { trajectory_risk, top_signals, turn_count }` to the sanitizer audit log | must |
| FR-011 | The TUI SHALL display the current session `trajectory_risk` as a gauge in the security panel when `[memory.shadow_memory] enabled = true` and `tui_show_risk_gauge = true` | should |
| FR-012 | Config flag `[memory.shadow_memory] enabled` SHALL gate all shadow memory code paths; when `false`, `TrajectoryRiskAccumulator` is a no-op struct that always returns `trajectory_risk = 0.0` | must |
| FR-013 | WHEN shadow memory is enabled THE SYSTEM SHALL emit Prometheus counters: `shadow_memory_signals_total{type}`, `shadow_memory_blocks_total`, `shadow_memory_escalations_total` | should |
| FR-014 | Every new code path introduced by this spec SHALL be instrumented with `tracing::info_span!` per the naming convention `memory.shadow.<operation>` | must |
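FR-003, FR-008 and FR-009 together define how the score changes at a turn boundary: decay the running total, add the new turn's weighted signals, clamp. A minimal sketch of that arithmetic, assuming the default multipliers from FR-008 and treating a zero half-life as 1; `decay_factor` and `advance_turn` are illustrative names that do not appear in the spec:

```rust
/// Severity levels with the default multipliers listed in FR-008.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Severity {
    Low,
    Medium,
    High,
}

impl Severity {
    pub fn multiplier(self) -> f64 {
        match self {
            Severity::Low => 0.5,
            Severity::Medium => 1.0,
            Severity::High => 2.0,
        }
    }
}

/// Per-turn decay factor from FR-009: exp(-ln(2) / halflife).
pub fn decay_factor(halflife_turns: u32) -> f64 {
    // Assumption: a half-life of 0 is treated as 1 rather than dividing by zero.
    let halflife = halflife_turns.max(1) as f64;
    (-std::f64::consts::LN_2 / halflife).exp()
}

/// One turn boundary: decay the running score, add this turn's
/// (base_weight, severity) signals, then clamp to [0.0, 1.0].
pub fn advance_turn(risk: f64, signals: &[(f64, Severity)], halflife_turns: u32) -> f64 {
    let added: f64 = signals.iter().map(|(w, s)| w * s.multiplier()).sum();
    (risk * decay_factor(halflife_turns) + added).clamp(0.0, 1.0)
}
```

With `risk_halflife_turns = 10` the factor is roughly 0.933 per turn, so a score that receives no new signals halves after ten turns.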

---

## 4. Non-Functional Requirements

| ID | Category | Requirement |
|----|----------|-------------|
| NFR-001 | Performance | `TrajectoryRiskAccumulator::ingest` SHALL complete in < 0.5 ms at p99 (in-memory accumulation only; no I/O on the hot path) |
| NFR-002 | Performance | `TrajectoryRiskAccumulator::current_risk` query (called before each tool execution) SHALL complete in < 0.1 ms at p99 |
| NFR-003 | Performance | When `enabled = false`, shadow memory code contributes zero overhead: all calls are dispatched through a zero-cost no-op implementation |
| NFR-004 | Reliability | `TrajectoryRiskAccumulator` is session-scoped; a session crash or reset creates a fresh accumulator with `trajectory_risk = 0.0`. No persistence is required |
| NFR-005 | Reliability | Shadow memory NEVER blocks the agent loop on I/O. Signal ingestion is synchronous and in-memory only |
| NFR-006 | Security | The shadow memory stream is separate from the primary `SemanticMemory` pipeline; signals from shadow memory are NEVER written to SQLite or Qdrant as user-visible memory |
| NFR-007 | Observability | Prometheus counters export `shadow_memory_signals_total{type,severity}`, `shadow_memory_blocks_total`, `shadow_memory_escalations_total` |
| NFR-008 | Maintainability | Signal type registry is a configurable TOML section; operators can add new signal types and adjust weights without code changes |
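One possible shape for the no-op behaviour required by FR-012 and NFR-003 is a two-variant enum whose disabled variant carries no state. This is only a sketch: `ShadowMemory` is a hypothetical wrapper, the score update omits decay for brevity, and a variant check is cheap but not literally free, so whether it satisfies "zero overhead" is left to the implementers.

```rust
/// Hypothetical wrapper: `Disabled` carries no state and does no work.
#[derive(Debug)]
pub enum ShadowMemory {
    Disabled,
    Enabled { trajectory_risk: f64 },
}

impl ShadowMemory {
    pub fn new(enabled: bool) -> Self {
        if enabled {
            ShadowMemory::Enabled { trajectory_risk: 0.0 }
        } else {
            ShadowMemory::Disabled
        }
    }

    /// Adds a raw signal score; does nothing when disabled.
    pub fn ingest(&mut self, raw_score: f64) {
        if let ShadowMemory::Enabled { trajectory_risk } = self {
            *trajectory_risk = (*trajectory_risk + raw_score).clamp(0.0, 1.0);
        }
    }

    /// Always 0.0 when disabled, as FR-012 requires.
    pub fn current_risk(&self) -> f64 {
        match self {
            ShadowMemory::Disabled => 0.0,
            ShadowMemory::Enabled { trajectory_risk } => *trajectory_risk,
        }
    }
}
```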

---

## 5. Data Model

Shadow memory is entirely in-process and session-scoped. No new database tables are
required.

### `TrajectoryRiskAccumulator` struct

```
TrajectoryRiskAccumulator {
    session_id: SessionId,
    turn_count: u32,
    trajectory_risk: f64,             // current accumulated score ∈ [0.0, 1.0]
    signal_history: Vec<SignalEvent>, // capped ring buffer (last N signals)
    config: ShadowMemoryConfig,
}
```

### `SignalEvent`

```
SignalEvent {
    turn_id: u32,
    signal_type: SignalType,
    severity: Severity,   // Low | Medium | High
    raw_score: f64,       // base_weight × severity_multiplier
    timestamp: Instant,
}
```

### `SignalType` (extensible enum)

| Variant | Source | Default base_weight |
|---------|--------|---------------------|
| `PolicyViolation` | `AuditEvent::PolicyViolation` | 0.30 |
| `PromptInjectionPattern` | `AuditEvent::InjectionDetected` | 0.50 |
| `ToolChainAnomaly` | `AuditEvent::ToolChainPattern` | 0.25 |
| `ConfidenceDrop` | `AuditEvent::ConfidenceDrop` | 0.15 |
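The structures above translate fairly directly into Rust. The sketch below is one way to write them, with two assumptions stated plainly: it swaps the spec's `Vec<SignalEvent>` for a `VecDeque` so that evicting the oldest entry is O(1), and `SignalHistory`, `default_base_weight` and `oldest` are illustrative names rather than spec identifiers.

```rust
use std::collections::VecDeque;
use std::time::Instant;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SignalType {
    PolicyViolation,
    PromptInjectionPattern,
    ToolChainAnomaly,
    ConfidenceDrop,
}

impl SignalType {
    /// Default base weights from the `SignalType` table.
    pub fn default_base_weight(self) -> f64 {
        match self {
            SignalType::PolicyViolation => 0.30,
            SignalType::PromptInjectionPattern => 0.50,
            SignalType::ToolChainAnomaly => 0.25,
            SignalType::ConfidenceDrop => 0.15,
        }
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Severity {
    Low,
    Medium,
    High,
}

#[derive(Clone, Debug)]
pub struct SignalEvent {
    pub turn_id: u32,
    pub signal_type: SignalType,
    pub severity: Severity,
    pub raw_score: f64, // base_weight × severity_multiplier
    pub timestamp: Instant,
}

/// Capped history: the oldest event is evicted once `cap` entries are held.
pub struct SignalHistory {
    cap: usize,
    events: VecDeque<SignalEvent>,
}

impl SignalHistory {
    pub fn new(cap: usize) -> Self {
        SignalHistory { cap, events: VecDeque::with_capacity(cap) }
    }

    pub fn push(&mut self, event: SignalEvent) {
        if self.cap == 0 {
            return; // a zero-capacity history keeps nothing
        }
        if self.events.len() == self.cap {
            self.events.pop_front();
        }
        self.events.push_back(event);
    }

    pub fn len(&self) -> usize {
        self.events.len()
    }

    pub fn oldest(&self) -> Option<&SignalEvent> {
        self.events.front()
    }
}
```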

### `AuditEvent` additions (in `zeph-sanitizer`)

Two new variants are added. `ToolChainPattern` is emitted by the existing per-turn
checks; `TrajectoryBlock` is emitted when a tool execution is blocked (FR-010):

```
AuditEvent::ToolChainPattern { turn_id, tool_sequence, anomaly_score }
AuditEvent::TrajectoryBlock { trajectory_risk, top_signals, turn_count }
```
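The `Source` column of the `SignalType` table implies a mapping from audit events to signal types. A sketch of that mapping follows; the `AuditEvent` enum here is a stand-in, its payload field types are assumptions, and the three pre-existing variants are shown without payloads because the spec does not list them. `signal_type_for` is a hypothetical helper.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SignalType {
    PolicyViolation,
    PromptInjectionPattern,
    ToolChainAnomaly,
    ConfidenceDrop,
}

/// Stand-in for the sanitizer's audit event. Field types are assumptions.
#[derive(Clone, Debug)]
pub enum AuditEvent {
    PolicyViolation,   // existing variant, payload omitted
    InjectionDetected, // existing variant, payload omitted
    ConfidenceDrop,    // existing variant, payload omitted
    ToolChainPattern { turn_id: u32, tool_sequence: Vec<String>, anomaly_score: f64 },
    TrajectoryBlock { trajectory_risk: f64, top_signals: Vec<String>, turn_count: u32 },
}

/// Returns the signal type an event feeds, or `None` if it is not an input signal.
pub fn signal_type_for(event: &AuditEvent) -> Option<SignalType> {
    match event {
        AuditEvent::PolicyViolation => Some(SignalType::PolicyViolation),
        AuditEvent::InjectionDetected => Some(SignalType::PromptInjectionPattern),
        AuditEvent::ToolChainPattern { .. } => Some(SignalType::ToolChainAnomaly),
        AuditEvent::ConfidenceDrop => Some(SignalType::ConfidenceDrop),
        // A block record is an output of shadow memory and has no row in the table.
        AuditEvent::TrajectoryBlock { .. } => None,
    }
}
```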

---

## 6. Edge Cases and Error Handling

| Scenario | Expected Behavior |
|----------|-------------------|
| `trajectory_risk` exceeds 1.0 from accumulated signals | Clamp to 1.0; do not error |
| `AuditEvent` ingestion panics (bug in signal parsing) | Catch unwind; log `WARN`; treat as zero-signal; never crash the agent loop |
| Context is reset mid-turn (e.g., context compaction) | `TrajectoryRiskAccumulator` is tied to the session, not to the context window; compaction does not reset it unless `reset_on_compaction = true` (config opt-in) |
| Tool execution denied; agent loop retries with a different tool | Each retry re-queries `current_risk`; if risk has not decayed below threshold, retry is also blocked |
| Human escalation response is "deny" | The denial is recorded as a block event; risk score unchanged (escalation itself does not affect score) |
| `risk_halflife_turns = 0` (misconfiguration) | Treat as `risk_halflife_turns = 1`; log `WARN` at startup |
| Shadow memory disabled at runtime | All paths return no-op immediately; no signals accumulated; no blocks issued |
| Injection pattern detected in recalled (not fresh) content | Signals are emitted by recall surface checks in `zeph-sanitizer`; ingested by accumulator identically to fresh-content signals |
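The panic and clamp rows can be combined in a small guard around ingestion. A sketch using `std::panic::catch_unwind`, with `eprintln!` standing in for a `WARN`-level tracing event; `ingest_guarded` and the closure-based parser are hypothetical, and the guard only works when the binary is built with the default `panic = "unwind"` strategy.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Runs `parse_signal` and adds its raw score to `risk`. A panic inside the
/// parser is treated as a zero signal so the agent loop keeps running.
pub fn ingest_guarded<F>(risk: &mut f64, parse_signal: F)
where
    F: FnOnce() -> f64,
{
    let raw = catch_unwind(AssertUnwindSafe(parse_signal)).unwrap_or_else(|_| {
        // Stand-in for a WARN-level tracing event.
        eprintln!("WARN memory.shadow.ingest: signal parsing panicked; treated as zero-signal");
        0.0
    });
    // Clamp instead of returning an error when the sum exceeds 1.0.
    *risk = (*risk + raw).clamp(0.0, 1.0);
}
```

`AssertUnwindSafe` is used because the closure is consumed and its result discarded on panic, so no partially updated state is observed afterwards.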

---

## 7. Config

```toml
[memory.shadow_memory]
enabled = false               # opt-in; default off

risk_threshold = 0.75         # block tool execution at or above this score
escalation_threshold = 0.50   # escalate to human confirmation at or above this score (below risk_threshold)
risk_halflife_turns = 10      # decay half-life in agent turns
signal_history_cap = 200      # ring buffer max capacity
tui_show_risk_gauge = true    # show trajectory_risk gauge in TUI security panel
reset_on_compaction = false   # reset accumulator on context compaction

[memory.shadow_memory.signal_weights]
policy_violation = 0.30
prompt_injection = 0.50
tool_chain_anomaly = 0.25
confidence_drop = 0.15

[memory.shadow_memory.severity_multipliers]
low = 0.5
medium = 1.0
high = 2.0
```
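On the Rust side, the top-level keys could be mirrored by a plain struct whose `Default` matches the TOML defaults. The sketch below assumes that layout and omits the two nested tables and any deserialization; `normalized` is a hypothetical helper that applies the `risk_halflife_turns = 0` fix-up from section 6 and returns the warnings to log.

```rust
/// Mirrors the top-level keys of `[memory.shadow_memory]`.
#[derive(Clone, Debug, PartialEq)]
pub struct ShadowMemoryConfig {
    pub enabled: bool,
    pub risk_threshold: f64,
    pub escalation_threshold: f64,
    pub risk_halflife_turns: u32,
    pub signal_history_cap: usize,
    pub tui_show_risk_gauge: bool,
    pub reset_on_compaction: bool,
}

impl Default for ShadowMemoryConfig {
    fn default() -> Self {
        ShadowMemoryConfig {
            enabled: false,
            risk_threshold: 0.75,
            escalation_threshold: 0.50,
            risk_halflife_turns: 10,
            signal_history_cap: 200,
            tui_show_risk_gauge: true,
            reset_on_compaction: false,
        }
    }
}

impl ShadowMemoryConfig {
    /// Applies the startup fix-up for a zero half-life and collects warnings.
    pub fn normalized(mut self) -> (Self, Vec<String>) {
        let mut warnings = Vec::new();
        if self.risk_halflife_turns == 0 {
            self.risk_halflife_turns = 1;
            warnings.push("risk_halflife_turns = 0 is invalid; using 1".to_string());
        }
        (self, warnings)
    }
}
```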

---

## 8. Key Invariants

### Always (without asking)
- One `TrajectoryRiskAccumulator` per session; created at session start, dropped at session end
- Signal ingestion is synchronous, in-memory, and completes before the turn continues
- `trajectory_risk` is clamped to `[0.0, 1.0]` at all times
- Shadow memory signals are never written to primary `SemanticMemory` stores (SQLite, Qdrant)
- Temporal decay is applied at the start of each turn before new signals are added
- `enabled = false` is a zero-overhead no-op: no allocations, no checks

### Ask First
- Changing `risk_threshold` below 0.5 (increases false-positive rate significantly)
- Adding new `SignalType` variants (requires validation of base_weight calibration)
- Enabling cross-session risk accumulation (introduces session-state persistence complexity)
- Exposing `trajectory_risk` in user-visible agent output (privacy and gaming concerns)

### Never
- Block the agent turn thread on I/O from within shadow memory
- Write shadow memory signals to `graph_edges`, `messages`, or any primary store
- Return shadow memory state in default recall paths
- Allow `TrajectoryRiskAccumulator` to survive session reset without explicit opt-in

---

## 9. Success Criteria

| ID | Metric | Target |
|----|--------|--------|
| SC-001 | Sequential tool-attack chaining success rate (lab scenario) | ≤ 10% with default config |
| SC-002 | Persistent indirect prompt injection success rate | 0% (blocked by injection signal weight) |
| SC-003 | False-positive block rate on benign 50-turn sessions | < 1% |
| SC-004 | Per-turn shadow memory overhead (ingest + query) | < 1 ms at p95 |
| SC-005 | Prometheus counters exported when enabled | All 3 counter families present |

---

## 10. Acceptance Criteria

```
GIVEN shadow_memory.enabled = true
AND a session accumulates 5 turns each emitting one PolicyViolation (medium severity)
WHEN the agent attempts a tool call on turn 6
THEN trajectory_risk equals the decayed, clamped sum of the five contributions
     (each 0.30 × 1.0), with decay applied at every turn boundary
AND IF trajectory_risk ≥ 0.75 the tool is blocked with ToolError::TrajectoryRiskExceeded
AND shadow_memory_blocks_total increments
AND AuditEvent::TrajectoryBlock is emitted to the audit log

GIVEN shadow_memory.enabled = false
WHEN the agent processes any number of turns
THEN TrajectoryRiskAccumulator::current_risk always returns 0.0
AND no Prometheus counters are updated
AND no audit events of type TrajectoryBlock are emitted

GIVEN a session with 50 clean turns (no AuditEvents of tracked signal types)
WHEN the agent processes turn 51
THEN trajectory_risk = 0.0
AND no tool call is blocked by shadow memory
```
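The first scenario can be checked numerically. The sketch below replays it under the rules stated elsewhere in this spec (decay at the start of each turn, clamp to `[0.0, 1.0]` after every update) and the default half-life of 10; `risk_at_turn` is a hypothetical helper that assumes one identical signal per turn for the first `signal_turns` turns.

```rust
/// Replays a session in which turns 1..=signal_turns each add `per_turn_signal`
/// and returns the score seen at the start of `query_turn`'s tool check.
pub fn risk_at_turn(per_turn_signal: f64, signal_turns: u32, query_turn: u32, halflife: u32) -> f64 {
    // Assumption: a half-life of 0 is treated as 1.
    let decay = (-std::f64::consts::LN_2 / halflife.max(1) as f64).exp();
    let mut risk = 0.0_f64;
    for turn in 1..=query_turn {
        // Decay is applied at the turn boundary, before new signals are added.
        risk *= decay;
        if turn <= signal_turns {
            risk = (risk + per_turn_signal).clamp(0.0, 1.0);
        }
    }
    risk
}
```

Under these assumptions `risk_at_turn(0.30, 5, 6, 10)` comes out at about 0.93, which is above the default `risk_threshold` of 0.75, so the turn-6 tool call would be blocked. A session with no signals stays at exactly 0.0.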

---

## 11. Implementation Notes

- New module: `crates/zeph-memory/src/shadow/mod.rs` owns `TrajectoryRiskAccumulator`
  and `SignalEvent`. No dependency on graph or semantic memory modules.
- `zeph-sanitizer` gains two new `AuditEvent` variants (`ToolChainPattern`,
  `TrajectoryBlock`). This is an additive change; no existing variant is modified.
- `zeph-agent-tools` wires the accumulator into the pre-tool-execution gate; receives it
  as an `Arc<Mutex<TrajectoryRiskAccumulator>>` from the session context.
- Signal weight calibration: start with the MAGE paper's reported thresholds; adjust via
  integration tests against known-attack scenarios.
- The in-memory ring buffer for `signal_history` is sized by `signal_history_cap` (default
  200 entries); oldest entries are evicted when capacity is reached. The risk score itself
  is not affected by eviction because it is a running accumulator, not recomputed from history.
- Temporal decay formula: at each turn boundary, `trajectory_risk *= exp(-ln(2) / halflife)`.
  In the absence of new signals, the score therefore halves every `risk_halflife_turns` turns.
- No database migration is required for this feature.
- TUI gauge integration uses the existing security panel widget added in `zeph-tui`.
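The `Arc<Mutex<TrajectoryRiskAccumulator>>` wiring mentioned in these notes might look like the sketch below. It uses a reduced stand-in for the accumulator (running score only, no decay), and `SharedAccumulator` and `pre_tool_check` are hypothetical names. Recovering from a poisoned lock rather than panicking is a design assumption made here so that a panic in one holder does not take the gate down.

```rust
use std::sync::{Arc, Mutex};

/// Reduced stand-in for the accumulator: only the running score.
#[derive(Debug, Default)]
pub struct TrajectoryRiskAccumulator {
    trajectory_risk: f64,
}

impl TrajectoryRiskAccumulator {
    pub fn ingest(&mut self, raw_score: f64) {
        self.trajectory_risk = (self.trajectory_risk + raw_score).clamp(0.0, 1.0);
    }

    pub fn current_risk(&self) -> f64 {
        self.trajectory_risk
    }
}

/// Handle shared between the sanitizer side and the tool executor.
pub type SharedAccumulator = Arc<Mutex<TrajectoryRiskAccumulator>>;

/// Pre-tool-execution check: `Err` carries the score that caused the denial.
pub fn pre_tool_check(acc: &SharedAccumulator, risk_threshold: f64) -> Result<(), f64> {
    // Recover the guard even if another holder panicked while holding the lock.
    let risk = acc
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
        .current_risk();
    if risk >= risk_threshold {
        Err(risk)
    } else {
        Ok(())
    }
}
```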

---

## 12. Open Questions

> [!question]
> - **Escalation UX**: when `trajectory_risk` is in the escalation band, the agent
>   pauses for human confirmation. The confirmation channel in CLI mode is a blocking
>   prompt; in Telegram/Discord modes it is an async message-reply. The exact API for
>   channel-agnostic human confirmation is not yet defined. This must be resolved before
>   the escalation path (FR-006) can be implemented.
> - **Signal calibration**: the default `base_weight` values are derived from the MAGE
>   paper's scenario descriptions but have not been validated against Zeph's specific
>   attack surface. Calibration experiments should be run before enabling this feature
>   in production configs.

---

## 13. See Also

- [[constitution]]: project principles
- [[004-memory/spec]]: memory system parent index
- [[004-7-memory-apex-magma]]: APEX-MEM (orthogonal: semantic graph, not safety signals)
- [[001-system-invariants/spec]]: system-wide invariants
- [[MOC-specs]]: all specifications
