From d6abf1aabb9b51bd1191e70f0ae82239bf602891 Mon Sep 17 00:00:00 2001 From: Laith Al-Saadoon Date: Thu, 2 Apr 2026 19:43:22 +0000 Subject: [PATCH] Add Iteration 3e: Memory security and integrity Address OWASP ASI06 (Memory & Context Poisoning) with a 4-phase implementation plan based on deep research (33 sources). ROADMAP.md: - Add Iteration 3e between 3d and 4 with 4 phases: Phase 1 (input hardening), Phase 2 (trust-aware retrieval), Phase 3 (detection and response), Phase 4 (advanced protections) - Update summary section with 3e entry MEMORY.md: - Add Memory Security Analysis section with full threat taxonomy (intentional and emergent corruption vectors) - Document 9 identified gaps with severity ratings - Define 6-layer defense architecture - Catalog existing partial mitigations - Add academic and industry references SECURITY.md: - Add OWASP ASI06 classification and context - Expand attack vectors beyond PR review comments (MINJA, GitHub issue injection, experience grafting, RAG poisoning, emergent self-corruption) - Document 6-layer defense architecture requirements - Update Known Limitations with memory security gaps --- docs/design/MEMORY.md | 97 +++++++++++++++++++++++++++++++++++++++++ docs/design/SECURITY.md | 31 +++++++++++++ docs/guides/ROADMAP.md | 46 +++++++++++++++++++ 3 files changed, 174 insertions(+) diff --git a/docs/design/MEMORY.md b/docs/design/MEMORY.md index 7fcbe84..cf6181f 100644 --- a/docs/design/MEMORY.md +++ b/docs/design/MEMORY.md @@ -399,6 +399,103 @@ Add user preference tracking and enable episodic reflection for cross-task patte Only if Tiers 1–3 show value but semantic search proves insufficient for specific query patterns (e.g. "which files are always modified together?" or "what's the dependency impact of changing module X?"). At this point, consider Neptune Serverless or similar for relational queries. 
**Only build this if there is evidence that semantic retrieval fails on identifiable query patterns.** +## Memory security analysis + +OWASP classifies memory and context poisoning as **ASI06** in the [2026 Top 10 for Agentic Applications](https://owasp.org/www-project-top-10-for-large-language-model-applications/), recognizing it as a first-class risk distinct from standard prompt injection. Unlike single-session prompt injection, memory poisoning creates **persistent corruption** that influences every subsequent interaction — a single poisoned entry can affect all future tasks on a repository. + +### Threat model + +The memory system faces two categories of corruption: + +**Intentional corruption (adversarial)** + +| Vector | Description | Severity | +|---|---|---| +| **Query-based memory injection (MINJA)** | Attacker crafts task descriptions or issue content that, when processed by the agent, gets stored as legitimate repository knowledge. Subsequent tasks retrieve and act on the poisoned memory. Research shows 95%+ injection success rates against undefended systems. | Critical | +| **Indirect injection via tool outputs** | Poisoned data from external sources (GitHub issues, PR comments, linked documentation) flows through context hydration into the agent's context, and from there into memory via the post-task extraction prompt. The agent trusts its own tool outputs as ground truth. | Critical | +| **Experience grafting** | Adversary manipulates the agent's experiential memory (task episodes) to induce behavioral drift — e.g., injecting a fake episode that claims "tests always fail on this repo, skip them" to suppress quality checks. | High | +| **Poisoned RAG retrieval** | Adversarial content engineered to rank highly for specific semantic queries, ensuring it is retrieved and incorporated into the agent's context during memory load. AgentPoison achieves 80%+ attack success across multiple agent domains. 
| High | +| **Review comment injection** | Malicious PR review comments containing embedded instructions that get extracted as persistent rules by the review feedback pipeline. See [SECURITY.md](./SECURITY.md) for existing mitigations. | High | + +**Emergent corruption (non-adversarial)** + +| Pattern | Description | Severity | +|---|---|---| +| **Hallucination crystallization** | Agent hallucinates a fact during a task and writes it as a repository learning. Future tasks retrieve the false memory and reinforce it through repeated use, converting an ephemeral error into a durable false belief. | High | +| **Error compounding feedback loops** | When an agent makes an error, the erroneous output enters the task episode. If similar tasks retrieve that episode, they may repeat the error, write another bad episode, and amplify the mistake across sessions. | High | +| **Stale context accumulation** | Without temporal decay, memories from 6 months ago carry the same retrieval weight as memories from yesterday. The agent operates on increasingly outdated context — referencing approaches, conventions, or patterns the team has since abandoned. | Medium | +| **Contradictory memory accumulation** | Over many tasks, the memory store accumulates contradictory records (see Memory consolidation section above). Without effective resolution, the agent receives conflicting guidance that degrades decision quality. 
| Medium | + +### Current gaps + +Analysis of the current implementation identified 9 specific memory security gaps: + +| # | Gap | Affected files | Severity | +|---|---|---|---| +| 1 | No memory content validation — retrieved records are injected into agent context without sanitization | `memory.ts:loadMemoryContext()` | Critical | +| 2 | No source provenance tracking — cannot distinguish agent-written memory from externally-influenced content | `memory.ts`, `agent/memory.py` | Critical | +| 3 | GitHub issue content (attacker-controlled) injected without trust differentiation | `context-hydration.ts` | Critical | +| 4 | No trust scoring at retrieval — all memories treated equally regardless of age, source, or consistency | `memory.ts:loadMemoryContext()` | High | +| 5 | No memory integrity checking — no hashing or signatures to detect modification | `memory.ts`, `agent/memory.py` | High | +| 6 | No anomaly detection on memory write/retrieval patterns | (no implementation) | High | +| 7 | No memory rollback — 365-day expiration is the only cleanup mechanism | (no implementation) | High | +| 8 | No write-ahead validation (guardian pattern) for memory commits | (no implementation) | Medium | +| 9 | No circuit breaker for memory-influenced behavioral anomalies | `orchestrator.ts` | Medium | + +### Defense architecture + +The target defense architecture follows a six-layer model (see [ROADMAP.md Iteration 3e](../guides/ROADMAP.md) for the implementation plan): + +``` +┌─────────────────────────────────────────────────────────┐ +│ Layer 1: Input Moderation + Trust Scoring │ +│ Content sanitization, injection pattern detection, │ +│ source classification (trusted/untrusted) │ +├─────────────────────────────────────────────────────────┤ +│ Layer 2: Memory Sanitization + Provenance Tagging │ +│ Source metadata on every write, content hashing, │ +│ schema versioning │ +├─────────────────────────────────────────────────────────┤ +│ Layer 3: Storage Isolation + Access 
Controls │ +│ Per-repo namespace isolation, expiration limits, │ +│ size caps per memory store │ +├─────────────────────────────────────────────────────────┤ +│ Layer 4: Trust-Scored Retrieval │ +│ Temporal decay, source reliability weighting, │ +│ pattern consistency checking, threshold filtering │ +├─────────────────────────────────────────────────────────┤ +│ Layer 5: Write-Ahead Validation (Guardian Pattern) │ +│ Separate model evaluates proposed memory updates │ +│ before commit │ +├─────────────────────────────────────────────────────────┤ +│ Layer 6: Continuous Monitoring + Circuit Breakers │ +│ Anomaly detection, behavioral drift detection, │ +│ automatic halt on suspicious patterns │ +└─────────────────────────────────────────────────────────┘ +``` + +No single layer is sufficient. Research demonstrates that even sophisticated input filtering can be bypassed — defense-in-depth is mandatory. + +### Existing mitigations + +The current architecture already provides partial coverage for some layers: + +- **Layer 3 (partial):** Per-repo namespace isolation via `/{actorId}/knowledge/` and `/{actorId}/episodes/{sessionId}/` prevents cross-repo contamination within the same memory resource. Token budget (2,000 tokens) limits blast radius. `schema_version` metadata enables migration tracking. +- **Fail-open design:** Memory failures never block task execution — this limits the impact of denial-of-service attacks against the memory system. +- **Repo format validation:** `_validate_repo()` prevents namespace confusion from malformed repo identifiers. +- **Model invocation logging:** Bedrock logs provide audit trail for what the model receives and generates, enabling post-hoc investigation of memory-influenced behavior. + +### References + +- OWASP ASI06 — Memory & Context Poisoning (2026 Top 10 for Agentic Applications) +- Dong et al. (2025), "MINJA: Memory Injection Attack on LLM Agents" — 95%+ injection success rates +- Sunil et al. 
(2026), "Memory Poisoning Attack and Defense on Memory Based LLM-Agents" — trust scoring defenses +- Schneider, C. (2026), "Memory Poisoning in AI Agents: Exploits That Wait" — six-layer defense architecture +- MemTrust (2026), "A Zero-Trust Architecture for Unified AI Memory System" — TEE-based memory protection +- Zuccolotto et al. (2026), "Memory Poisoning and Secure Multi-Agent Systems" — provenance and integrity measures + +--- + ## Requirements The platform has the following requirements for memory: diff --git a/docs/design/SECURITY.md b/docs/design/SECURITY.md index ad8a75f..4320f04 100644 --- a/docs/design/SECURITY.md +++ b/docs/design/SECURITY.md @@ -100,6 +100,33 @@ The `functionArn` in `CustomStepConfig` should be validated at CDK synth time to ## Memory-specific threats +### OWASP ASI06 — Memory and context poisoning + +OWASP classifies memory and context poisoning as **ASI06** in the 2026 Top 10 for Agentic Applications. This classification recognizes that persistent memory attacks are fundamentally different from single-session prompt injection (LLM01): poisoned memory entries influence every subsequent interaction, creating "sleeper agent" scenarios where compromise is dormant until activated by triggering conditions. ASI06 maps to LLM01 (prompt injection), LLM04 (data poisoning), and LLM08 (excessive agency) but with new characteristics unique to agents with persistent memory. + +The platform's memory system (see [MEMORY.md](./MEMORY.md)) faces threats from both intentional attacks and emergent corruption. The full threat taxonomy and gap analysis are documented in the [Memory security analysis](./MEMORY.md#memory-security-analysis) section of MEMORY.md. The implementation plan is in [ROADMAP.md Iteration 3e](../guides/ROADMAP.md).
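For concreteness, the injection-pattern scan that input moderation calls for could look roughly like this. This is a hedged sketch: `sanitizeMemoryContent()` is the function name proposed in ROADMAP.md Iteration 3e, and the pattern list and result shape here are illustrative assumptions, not the platform's implementation.

```typescript
// Illustrative sketch only. sanitizeMemoryContent() is the function proposed
// in ROADMAP.md Iteration 3e Phase 1; the pattern list and result shape are
// assumptions for illustration, not the platform's implementation.
export interface SanitizeResult {
  content: string;   // empty when the record is withheld from the context
  flagged: boolean;  // true when any injection pattern matched
  matches: string[]; // names of the patterns that matched, for logging
}

// A real deployment would maintain a vetted, versioned pattern set; these
// three examples cover attack classes named in the threat model.
const INJECTION_PATTERNS: [string, RegExp][] = [
  ["instruction-override", /ignore (all )?(previous|prior) instructions/i],
  ["system-prompt-override", /you are now|new system prompt/i],
  ["quality-suppression", /skip (the )?tests|do not run tests/i],
];

export function sanitizeMemoryContent(record: string): SanitizeResult {
  const matches = INJECTION_PATTERNS
    .filter(([, pattern]) => pattern.test(record))
    .map(([name]) => name);
  // Flagged records are withheld from the context rather than rewritten, so
  // the stored original remains available for forensic review.
  return {
    content: matches.length > 0 ? "" : record,
    flagged: matches.length > 0,
    matches,
  };
}
```

A scan like this would run in `loadMemoryContext()` before retrieved records enter the context budget. Pattern matching alone is bypassable, which is exactly why the layered mitigations below treat it as only the first line of defense.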
+ +### Attack vectors beyond PR review comments + +In addition to the PR review comment injection vector detailed below, the memory system is exposed to: + +- **Query-based memory injection (MINJA)** — Attacker-crafted task descriptions that embed poisoned content the agent stores as legitimate memory. Research demonstrates 95%+ injection success rates against undefended systems via query-only interactions requiring no direct memory access. +- **Indirect injection via GitHub issues** — Issue bodies and comments are fetched during context hydration (`context-hydration.ts`) and injected into the agent's context. An adversary can craft issue content containing memory-poisoning payloads that the agent stores as "learned" repository knowledge via the post-task extraction prompt. The system currently does not differentiate between trusted (system) and untrusted (user-submitted) content in the hydration pipeline. +- **Experience grafting** — Manipulation of the agent's episodic memory to induce behavioral drift (e.g., injecting a fake episode claiming certain tests always fail, causing the agent to skip them). +- **Poisoned RAG retrieval** — Adversarial content engineered to rank highly for specific semantic queries during `RetrieveMemoryRecordsCommand`, ensuring it is retrieved and incorporated into the agent's context. +- **Emergent self-corruption** — The agent poisons itself through hallucination crystallization (false memories from hallucinated facts), error compounding feedback loops (bad episodes retrieved by similar tasks), and stale context accumulation (outdated memories weighted equally with current ones). These lack an external attacker signature and are harder to detect. + +### Required mitigations (all vectors) + +The defense architecture requires six layers (see [MEMORY.md](./MEMORY.md#defense-architecture) for the full model): + +1. **Input moderation with trust scoring** — Content sanitization and injection pattern detection before memory write. 
Composite trust scores (not binary allow/block) based on source provenance, content analysis, and behavioral consistency. +2. **Memory sanitization with provenance tagging** — Every memory entry carries source metadata (`agent_episode`, `agent_learning`, `orchestrator_fallback`, `github_issue`, `review_feedback`), content hash (SHA-256), and schema version. +3. **Storage isolation** — Per-repo namespace isolation (already partially implemented), expiration limits, and size caps. +4. **Trust-scored retrieval** — At retrieval time, memories are weighted by temporal freshness, source reliability, and pattern consistency. Entries below a trust threshold are excluded from the context budget. +5. **Write-ahead validation (guardian pattern)** — A separate model evaluates proposed memory updates before commit. +6. **Continuous monitoring and circuit breakers** — Anomaly detection on memory write patterns, behavioral drift detection, and automatic halt when anomalies are detected. + ### Prompt injection via PR review comments The review feedback memory loop (see [MEMORY.md](./MEMORY.md)) is the most novel memory component — and the most dangerous from a security perspective. PR review comments are **attacker-controlled input** that gets processed by an LLM and stored as persistent memory influencing future agent behavior. @@ -169,6 +196,10 @@ AgentCore Memory has **no native backup mechanism**. This is a significant gap f - **Single GitHub OAuth token** — one token may be shared for all users and repos the platform can access. Any authenticated user can trigger agent work against any repo that token can access. There is no per-user repo scoping. - **Guardrails are input-only** — the `PROMPT_ATTACK` filter screens task descriptions at submission. No guardrails are applied to model output during agent execution or to review feedback entering the memory system.
+- **No memory content validation** — retrieved memory records are injected into the agent's context without sanitization, injection pattern scanning, or trust scoring. This is the most critical memory security gap (OWASP ASI06). See [MEMORY.md](./MEMORY.md#memory-security-analysis) for the full gap analysis and [ROADMAP.md Iteration 3e](../guides/ROADMAP.md) for the remediation plan. +- **No memory provenance or integrity checking** — memory entries carry no source attribution, content hashing, or trust metadata. The system cannot distinguish agent-generated memory from externally-influenced content. +- **GitHub issue content as untrusted input** — issue bodies and comments (attacker-controlled) are injected into the agent's context during hydration without trust differentiation. +- **No memory rollback or quarantine** — the 365-day AgentCore Memory expiration is the only cleanup mechanism. There is no snapshot, rollback, or quarantine capability for suspected poisoned entries. - **No MFA** — Cognito MFA is disabled (CLI-based auth flow). Should be enabled for production deployments. - **No customer-managed KMS** — all encryption at rest uses AWS-managed keys. Customer-managed KMS can be added if required by compliance policy. - **CORS is fully open** — `ALL_ORIGINS` is configured for CLI consumption. Restrict origins when exposing browser clients. diff --git a/docs/guides/ROADMAP.md b/docs/guides/ROADMAP.md index bcffdee..ead69dd 100644 --- a/docs/guides/ROADMAP.md +++ b/docs/guides/ROADMAP.md @@ -178,6 +178,51 @@ These practices apply continuously across iterations and are not treated as one- --- +## Iteration 3e — Memory security and integrity + +**Goal:** Harden the memory system against both adversarial corruption (prompt injection into memory, poisoned tool outputs, experience grafting) and emergent corruption (hallucination crystallization, feedback loops, stale context accumulation). 
OWASP classifies this as **ASI06 — Memory & Context Poisoning** in the [2026 Top 10 for Agentic Applications](https://owasp.org/www-project-top-10-for-large-language-model-applications/). + +### Background + +Deep research identified **9 memory-layer security gaps** in the current architecture (see the [Memory security analysis](../design/MEMORY.md#memory-security-analysis) section in MEMORY.md). The platform has strong network-layer security (VPC isolation, DNS Firewall, HTTPS-only egress) but lacks memory content validation, provenance tracking, trust scoring, anomaly detection, and rollback capabilities. Research shows that MINJA-style attacks achieve 95%+ injection success rates against undefended agent memory systems, and that emergent self-corruption (hallucination crystallization, error compounding feedback loops) is equally dangerous because it lacks an external attacker signature. + +### Phase 1 — Input hardening + +- [ ] **Memory content sanitization** — Add content validation in `loadMemoryContext()` (`src/handlers/shared/memory.ts`). Scan retrieved memory records for injection patterns (embedded instructions, system prompt overrides, command injection payloads) before including them in the agent's context. Implement a `sanitizeMemoryContent()` function that strips or flags suspicious patterns while preserving legitimate repository knowledge. +- [ ] **GitHub issue input sanitization** — Add trust-boundary-aware sanitization in `context-hydration.ts` for GitHub issue bodies and comments. These are attacker-controlled inputs that currently flow into the agent's context without differentiation. Strip control characters, embedded instruction patterns, and known injection payloads. Tag the content source as `untrusted-external` in the hydrated context. +- [ ] **Source provenance on memory writes** — Tag all memory writes with source provenance metadata.
In `memory.ts` (`writeMinimalEpisode`) and `agent/memory.py` (`write_task_episode`, `write_repo_learnings`), add a `source_type` field to event metadata: `agent_episode`, `agent_learning`, `orchestrator_fallback`, `github_issue`, or `review_feedback`. This enables trust-differentiated retrieval in Phase 2. +- [ ] **Content integrity hashing** — Add SHA-256 content hashing on all memory writes. Store the hash in event metadata. At read time, verify that content has not been modified between write and read. Implementation: compute hash before `CreateEventCommand`, store as `content_hash` metadata, verify on `RetrieveMemoryRecordsCommand` results. + +### Phase 2 — Trust-aware retrieval + +- [ ] **Trust scoring at retrieval** — Modify `loadMemoryContext()` to weight retrieved memories by temporal freshness, source type reliability, and pattern consistency with other memories. Memories from `orchestrator_fallback` and `agent_episode` sources receive higher trust than memories derived from external inputs. Entries below a configurable trust threshold are deprioritized or excluded from the 2,000-token budget. +- [ ] **Configurable temporal decay** — Implement per-entry TTL with configurable decay rates. Unverified or externally-sourced memory entries decay faster (e.g., 30-day default) than agent-generated or human-confirmed entries (e.g., 365-day default). Add `trust_tier` and `decay_rate` to the memory metadata schema. +- [ ] **Memory validation Lambda** — Add a lightweight validation function triggered on `CreateEventCommand` (via EventBridge rule on AgentCore events or as a post-write hook). The validator runs a classifier that checks whether new memory content looks like legitimate repository knowledge or could influence future agent behavior in unintended ways (the "guardian pattern"). Flag suspicious entries for operator review. 
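The trust-scoring and temporal-decay behavior described in the Phase 2 items above can be sketched as follows. The record shape, source weights, 90-day half-life, and 0.2 threshold are illustrative assumptions, not the platform's actual schema or tuning.

```typescript
// Hedged sketch of Phase 2 trust-scored retrieval. The record shape, source
// weights, half-life, and threshold below are illustrative assumptions, not
// the platform's actual memory schema or tuning.
export interface ScoredRecord {
  content: string;
  sourceType:
    | "agent_episode"
    | "agent_learning"
    | "orchestrator_fallback"
    | "github_issue"
    | "review_feedback";
  writtenAt: Date;
}

// Agent-generated memory is trusted more than externally influenced content.
const SOURCE_WEIGHT: Record<ScoredRecord["sourceType"], number> = {
  orchestrator_fallback: 1.0,
  agent_episode: 1.0,
  agent_learning: 0.9,
  review_feedback: 0.5,
  github_issue: 0.3,
};

// Exponential temporal decay: a record's weight halves every halfLifeDays.
function temporalWeight(writtenAt: Date, now: Date, halfLifeDays = 90): number {
  const ageDays = (now.getTime() - writtenAt.getTime()) / 86_400_000;
  return Math.pow(0.5, Math.max(ageDays, 0) / halfLifeDays);
}

export function trustScore(record: ScoredRecord, now: Date): number {
  return SOURCE_WEIGHT[record.sourceType] * temporalWeight(record.writtenAt, now);
}

// Records below the threshold never enter the 2,000-token context budget; the
// survivors are ordered so the budget fills with the most trusted first.
export function filterByTrust(
  records: ScoredRecord[],
  now: Date,
  threshold = 0.2,
): ScoredRecord[] {
  return records
    .filter((record) => trustScore(record, now) >= threshold)
    .sort((a, b) => trustScore(b, now) - trustScore(a, now));
}
```

Under these assumed weights, a year-old `github_issue`-sourced record scores far below the threshold while a day-old agent episode passes, which is the intended asymmetry: externally influenced memory must be fresh to be heard at all.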
+ +### Phase 3 — Detection and response + +- [ ] **Memory write anomaly detection** — Instrument memory write operations with CloudWatch custom metrics: write frequency per repo, average content length, source type distribution. Add CloudWatch Alarms for anomalous patterns (e.g., burst of writes from a single task, unusually long content, writes with `untrusted-external` source type exceeding a threshold). +- [ ] **Circuit breaker in orchestrator** — Add circuit breaker logic in `orchestrator.ts`: if the agent's tool invocation patterns or memory write patterns deviate from a baseline (e.g., sudden increase in memory writes, writes containing instruction-like patterns), pause the task and emit an alert. The circuit breaker transitions the task to a new `MEMORY_REVIEW` state that requires operator intervention. +- [ ] **Memory quarantine API** — Expose an operator API endpoint (`POST /v1/memory/quarantine`, `GET /v1/memory/quarantine`) for flagging and isolating suspicious memory entries. Quarantined entries are excluded from retrieval but preserved for forensic analysis. +- [ ] **Memory rollback capability** — Implement point-in-time memory snapshots. Before each task starts, snapshot the current memory state for the target repo (via the existing `loadMemoryContext` path, persisted to S3). If poisoning is detected post-task, operators can restore the repo's memory to the pre-task snapshot. Add `POST /v1/memory/rollback` endpoint. + +### Phase 4 — Advanced protections + +- [ ] **Write-ahead validation (guardian model)** — Route proposed memory writes through a smaller, cheaper model (e.g., Haiku) that evaluates whether the content is legitimate learned context or could be adversarial. Adds latency (~100-500ms per write) but catches sophisticated attacks that evade pattern-based sanitization. Configurable per-repo via Blueprint. 
+- [ ] **Cross-task behavioral drift detection** — Compare agent reasoning patterns and tool invocation sequences across tasks for the same repo. Detect drift from established baselines that could indicate memory-influenced behavioral manipulation. Implemented as a post-task analysis step in the evaluation pipeline. +- [ ] **Cryptographic provenance chain** — Implement Merkle tree-based provenance for memory entry chains per repo. Each new entry includes a hash of the previous entry, creating an append-only, tamper-evident chain. Enables cryptographic verification that no entries have been inserted, modified, or deleted between known-good checkpoints. +- [ ] **Red team validation** — Red team the memory system using published attack methodologies: MINJA (query-based memory injection), AgentPoison (RAG retrieval poisoning), and experience grafting. Document results and adjust defenses. Add automated red team tests to the evaluation pipeline using the DeepTeam framework (OWASP ASI06 attack categories). + +### Non-backward-compatible changes + +- Memory metadata schema changes (`source_type`, `content_hash`, `trust_tier`, `decay_rate`) require `schema_version: "3"` and are not readable by v2 code paths without migration. +- The `MEMORY_REVIEW` task state is a new addition to the state machine (requires orchestrator, API contract, and observability updates). +- Trust-scored retrieval changes the memory context budget allocation, which may affect prompt version hashing. + +**Builds on Iteration 3d:** Review feedback memory and PR outcome tracking are in place; this iteration hardens the memory system that those components write to. The 4-phase approach allows incremental deployment with measurable security improvement at each phase. + +--- + ## Iteration 4 — Integrations, visual proof, and control panel **Goal:** Additional git providers; agent can run the app and attach visual proof; Slack integration; web dashboard for operators and users; real-time streaming. 
@@ -244,6 +289,7 @@ These practices apply continuously across iterations and are not treated as one- - **Iteration 3b** ✅ — Memory Tier 1 (repo knowledge, task episodes), insights, agent self-feedback, prompt versioning, per-prompt commit attribution. CDK L2 construct with named semantic + episodic strategies using namespace templates (`/{actorId}/knowledge/`, `/{actorId}/episodes/{sessionId}/`), fail-open memory load/write, orchestrator fallback episode, SHA-256 prompt hashing, git trailer attribution. - **Iteration 3c** — Per-repo GitHub App credentials, orchestrator pre-flight checks (fail-closed before session start), pre-execution task risk classification (model/limits/approval policy selection), tiered validation pipeline (tool validation, code quality analysis, post-execution risk/blast radius analysis), PR risk level, PR review task type, multi-modal input. - **Iteration 3d** — Review feedback memory loop (Tier 2), PR outcome tracking, evaluation pipeline (basic). +- **Iteration 3e** — Memory security and integrity: input hardening (content sanitization, provenance tagging, integrity hashing), trust-aware retrieval (trust scoring, temporal decay, guardian validation), detection and response (anomaly detection, circuit breaker, quarantine, rollback), advanced protections (write-ahead validation, behavioral drift detection, cryptographic provenance, red teaming). Addresses OWASP ASI06 (Memory & Context Poisoning). 
- **Iteration 3bis** (hardening) — Orchestrator IAM grant for Memory (was silently AccessDenied), memory schema versioning (`schema_version: "2"`), Python repo format validation, severity-aware error logging in Python memory, narrowed entrypoint try-catch, orchestrator fallback episode observability, conditional writes in agent task_state.py (ConditionExpression guards), orchestrator Lambda error alarm (CloudWatch, retryAttempts: 0), concurrency counter reconciliation (scheduled Lambda, drift correction), multi-AZ NAT documentation (already configurable), Python unit tests (pytest), entrypoint decomposition (4 extracted subfunctions), dual prompt assembly deprecation docstring, graceful thread drain in server.py (shutdown hook + atexit), dead QUEUED state removal (8 states, 4 active). - **Iteration 4** — Additional git providers, visual proof (screenshots/videos), Slack channel, skills pipeline, user preference memory (Tier 3), control panel (restrict CORS to dashboard origin), real-time event streaming (WebSocket), live session replay and mid-task nudge, browser extension client, MFA for production. - **Iteration 5** — Snapshot-on-schedule pre-warming, multi-user/team, memory isolation for multi-tenancy, full cost management, adaptive model router with cost-aware cascade, advanced evaluation (optional adaptive-teaching / trajectory-driven prompt patterns), formal orchestrator verification with TLA+/TLC, full Bedrock Guardrails (PII, denied topics, output filters), capability-based security model, alternate runtime, advanced customization with tiered tool access (MCP/plugins via AgentCore Gateway), full dashboard, AI-specific WAF rules.