
Commit 9273168

Kasper Junge and Ralphify authored and committed
research: add memory engineering & event-driven loop architecture (Ch27)
New chapter covering practical memory engineering for agent loops:
- BMO knowing-doing gap (ngrok): self-improvement tools used 2/60 sessions
- Claude Code two-tier memory architecture ($0.05-$0.10/day, no vector DB)
- Factory.ai restorable compression (two-threshold system, breadcrumbs)
- 7 memory frameworks trending away from vector databases
- Claude Code Channels: event-driven push into running sessions
- Guardrails scaling strategies (budget caps, decay, cleanup agents)
- Three-layer practitioner consensus (semantic + tracking + versioning)

14 new sources, 11 new insights, 2 questions answered.

Co-authored-by: Ralphify <noreply@ralphify.co>
1 parent 0665591 commit 9273168

6 files changed

Lines changed: 223 additions & 5 deletions


research/ralph-loops/REPORT.md

Lines changed: 7 additions & 3 deletions
@@ -48,6 +48,8 @@
4848

4949
22. **The agent protocol stack has converged under shared governance — and credential security is the #1 operational gap.** MCP (agent→tool, 97M monthly SDK downloads), A2A (agent↔agent, 150+ orgs, v0.3 with gRPC), and AG-UI (agent→user, 16 event types) form a three-layer protocol stack, unified under the Linux Foundation's Agentic AI Foundation (Dec 2025) with AWS, Anthropic, Google, Microsoft, and OpenAI as platinum members. Meanwhile, GitGuardian's 2026 report shows AI-assisted commits leak secrets at **2x the baseline rate** (Claude Code at 3.2% vs 1.5%), with 29M hardcoded secrets on public GitHub and 24K secrets in MCP config files. The credential injection proxy — where the agent never holds credentials; an external proxy injects auth headers — is the emerging standard (Vercel, GitHub, NVIDIA converged independently). Keycard (March 19, 2026) adds identity-bound, task-scoped, ephemeral credential injection with full audit trails. Ralph loops are naturally suited: RALPH.md already declares dependencies (agent + commands), and could declare credential scopes for harness-managed provisioning.
5050

51+
23. **Memory engineering has moved beyond vector databases, and structured triggers beat vigilance for cross-session learning.** The leading memory frameworks (Google Always On Memory Agent, SimpleMem, Mastra Observational Memory) use SQLite or structured files with periodic LLM consolidation, not vector databases. Claude Code's two-tier memory (CLAUDE.md auto-briefing + .memory/state.json on-demand store) achieves production-grade cross-session learning at $0.05-$0.10/day with Jaccard deduplication and confidence-weighted decay. But ngrok's BMO post-mortem reveals the knowing-doing gap: the agent used its self-improvement tools in only 2 of 60+ sessions despite explicit instructions, and creating an OPPORTUNITIES.md file paradoxically increased procrastination. The fix: **boundary-triggered** learning (end-of-session reflections, every-N-iteration consolidation) executes reliably, while vigilance-based behaviors fail. Ralph loops' command system already implements restorable compression: `{{ commands.X }}` re-derives state each iteration rather than relying on stale summaries. Claude Code Channels (March 2026) add event-driven push into running sessions, enabling reactive ralphs that respond to CI webhooks, monitoring alerts, and chat messages, and shifting loops from timer-driven batch processing toward event-driven continuous operation.
52+
5153
## Chapters
5254

5355
| # | Chapter | Summary |
@@ -78,17 +80,16 @@
7880
| 24 | [Protocol Stack & Credential Security](chapters/24-protocol-stack-credential-security.md) | Three-protocol stack (MCP/A2A/AG-UI), AAIF governance, GitGuardian 2x leak rate, credential injection proxy (Vercel/GitHub/NVIDIA), Keycard runtime governance, MCP OAuth gap (53% static secrets), token rotation for long-running loops, zero-secret ralph architecture |
7981
| 25 | [Domain-Specific Loops & The Observability Gap](chapters/25-domain-specific-loops-observability.md) | Ralph loops beyond coding (security/DevOps/data/content/business), Databricks Genie Code (32→77% success), observability crisis (47.1% monitored, 88% incidents), traditional monitoring insufficient, AgenticOS concept, "any metric" positioning |
8082
| 26 | [Resilience Patterns, Model Routing & Durable Execution](chapters/26-resilience-patterns-durable-execution.md) | 4-layer fault tolerance (23%→2% unrecoverable), AIMD model failover, inner/outer loop separation, graceful degradation tiers, durable execution vs filesystem-as-checkpoint, production incident catalog (10 incidents, 0 postmortems), autoresearch at GPU scale, "harness > model" quantified |
83+
| 27 | [Practical Memory Engineering & Event-Driven Loops](chapters/27-memory-engineering-event-driven-loops.md) | BMO knowing-doing gap, Claude Code two-tier memory (CLAUDE.md + .memory/state.json), Factory.ai restorable compression, 7 memory frameworks (no vector DB trend), Claude Code Channels (event-driven push into sessions), guardrails scaling strategies, three-layer practitioner consensus |
8184

8285
## Open Questions
8386

8487
- What's the optimal ratio of spec-writing time to execution time in spec+ralph integrated workflows?
85-
- How does guardrails.md scale — at what point do accumulated guardrails become contradictory or context-consuming?
86-
- Which memory architecture (observational, graph, self-editing, RAG) best fits ralph loops — and can a "memory ralph" replace vector DB infrastructure?
8788
- How does the authority hierarchy (specs>tests>code) interact with TDD loops where tests are written by the agent?
8889
- At what point does architectural drift from agent-generated code become unrepairable — is there a measurable "point of no return"?
8990
- How do teams decide which harness layers to rip when a new model ships — is there a systematic evaluation process?
9091
- What's the right model routing strategy for ralph loops — task-based, budget-based, or time-based? At what scale does router complexity pay off?
91-
- At what loop scale do durable execution frameworks (Temporal, Inngest) outperform filesystem-as-checkpoint?
92+
- How do reactive/event-driven loops (Claude Code Channels) change the design assumptions of timer-based ralph loops?
9293

9394
## Key Sources (Top 30 — full list in [notes/sources.md](notes/sources.md))
9495

@@ -127,3 +128,6 @@
127128
- [The Anatomy of an Agent Harness](https://blog.langchain.com/the-anatomy-of-an-agent-harness/) — LangChain (harness > model, quantified)
128129
- [Ten AI Agents Destroyed Production](https://www.harperfoley.com/blog/ai-agents-destroyed-production-zero-postmortems) — Harper Foley (10 incidents, 0 postmortems)
129130
- [4 Fault Tolerance Patterns](https://dev.to/klement_gunndu/4-fault-tolerance-patterns-every-ai-agent-needs-in-production-jih) — klement_gunndu (23%→2% unrecoverable)
131+
- [BMO Self-Improving Coding Agent](https://ngrok.com/blog/bmo-self-improving-coding-agent) — ngrok (knowing-doing gap, 2/60 tool usage)
132+
- [Context Compression](https://factory.ai/news/compressing-context) — Factory.ai (restorable compression, two-threshold system)
133+
- [Claude Code Channels](https://code.claude.com/docs/en/channels) — Anthropic (event-driven push into running sessions)
Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
1+
# Practical Memory Engineering & Event-Driven Loop Architecture
2+
3+
> Ch19 established the four competing memory architectures and their tradeoffs. This chapter goes deeper: how practitioners actually implement cross-session learning, why most self-improvement attempts fail, and how event-driven primitives (Claude Code Channels) are reshaping what agent loops can react to.
4+
5+
## The Knowing-Doing Gap: Why Self-Improvement Fails
6+
7+
ngrok's BMO coding agent is the most honest post-mortem on agent self-improvement. Across ~100 sessions, BMO built 11 new tools and 7 skills — but only 2 tools were built during active work. The rest were deferred to "maintenance passes" that rarely happened.
8+
9+
**The four-loop system and where each broke**:
10+
1. **Build It Now** — hot-reload new tools on first friction. In practice, agents deferred nearly everything.
11+
2. **Active Learning Capture** — log corrections and preferences in real time. The Learning Event Capture skill was used only **twice across 60+ sessions** despite detailed instructions.
12+
3. **Self-Reflection** — structured end-of-session analysis answering three questions (What went well? What was slow? What next?). **This worked every time** — structured triggers with time/place bounds execute reliably.
13+
4. **Battery Change** — comprehensive 10-session analysis. Never produced the compound improvement expected.
14+
15+
The critical finding: **creating an OPPORTUNITIES.md bucket paradoxically increased procrastination.** Adding items to the opportunities file became the path of least resistance — the agent satisfied its "improve something" directive without actually improving anything. This isn't laziness; it's probabilistic behavior following training data patterns where logging is more common than acting.
16+
17+
**Design principle for ralph loops**: Structured triggers (session-end reflections, iteration-boundary consolidation) execute reliably. Vigilance-based behaviors (watch for improvement opportunities, build tools on first friction) fail consistently. Design memory systems around **boundary events**, not continuous monitoring.
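
The principle above can be sketched as a loop whose learning steps are tied to fixed boundaries instead of being left to the agent's discretion. This is a minimal illustration, not BMO's or ralphify's implementation: the `BoundaryTriggers` class, the `consolidate_every` value, and the log strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BoundaryTriggers:
    """Run learning steps at fixed loop boundaries, not on agent vigilance."""

    consolidate_every: int = 5  # hypothetical N; the source does not fix a value
    log: List[str] = field(default_factory=list)

    def run(self, iterations: int, work: Callable[[int], str]) -> List[str]:
        for i in range(1, iterations + 1):
            self.log.append(work(i))
            # Iteration boundary: consolidate every N iterations, unconditionally.
            if i % self.consolidate_every == 0:
                self.log.append(f"consolidate@{i}")
        # Session boundary: the reflection step always runs, exactly once.
        self.log.append("reflect: what went well / what was slow / what next")
        return self.log
```

Because the harness, not the agent, decides when consolidation and reflection happen, neither step can be deferred into a backlog file.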
18+
19+
## Claude Code's Two-Tier Memory Architecture
20+
21+
The most mature production memory system for coding agents uses a stratified approach:
22+
23+
**Tier 1 (Automatic — CLAUDE.md)**:
24+
- ~150 lines, loaded on every session start
25+
- Ranked by confidence × access frequency
26+
- Budget system: 25 lines architecture, 25 decisions, 25 patterns, 20 gotchas, 30 progress, 15 context
27+
- Survives `/compact` — always in context
28+
- "80% of sessions need only Tier 1"
29+
30+
**Tier 2 (On-Demand — .memory/state.json)**:
31+
- Unlimited storage, accessible via MCP tools (`memory_search`, `memory_related`, `memory_ask`)
32+
- Deeper recall for the 20% of sessions that need it
33+
- Haiku synthesizes across top 30 results for natural-language queries
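
The Tier 1 budget and ranking described above can be sketched as a per-category top-k selection. This is an illustrative sketch only: the `Memory` type, the `select_tier1` function, and the assumption that one memory occupies one line are hypothetical, and `access_count` stands in for "access frequency". The listed budgets sum to 140 lines, close to the ~150-line figure.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple

# Per-category line budgets as listed for Tier 1.
BUDGET: Dict[str, int] = {
    "architecture": 25,
    "decisions": 25,
    "patterns": 25,
    "gotchas": 20,
    "progress": 30,
    "context": 15,
}


class Memory(NamedTuple):
    category: str
    text: str
    confidence: float
    access_count: int  # stand-in for "access frequency"


def select_tier1(
    memories: List[Memory], budget: Dict[str, int] = BUDGET
) -> Dict[str, List[Memory]]:
    """Keep the top-ranked memories per category, capped by that category's budget."""
    by_category: Dict[str, List[Memory]] = defaultdict(list)
    for m in memories:
        by_category[m.category].append(m)

    selected: Dict[str, List[Memory]] = {}
    for category, limit in budget.items():
        ranked = sorted(
            by_category.get(category, []),
            key=lambda m: m.confidence * m.access_count,  # confidence x frequency
            reverse=True,
        )
        selected[category] = ranked[:limit]  # assumes one memory == one line
    return selected
```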
34+
35+
**Capture pipeline**: Three hooks extract knowledge from conversation transcripts:
36+
- **Stop** — after each response
37+
- **PreCompact** — before context compaction
38+
- **SessionEnd** — at session termination
39+
40+
Content exceeding 6,000 chars is chunked. Haiku performs structured extraction with deduplication via Jaccard similarity (>60% overlap triggers supersession of older memory). LLM consolidation runs every 10 extractions or when memories exceed 80 active entries.
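
A minimal sketch of the deduplication and consolidation triggers just described, assuming Jaccard similarity is computed over lowercase word sets (the source does not say how memories are tokenized). The function names are hypothetical.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercase word set; the tokenization is an assumption."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """|A intersect B| / |A union B| over the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def add_memory(active: List[str], new: str, threshold: float = 0.6) -> List[str]:
    """Insert `new`, superseding older memories whose overlap exceeds the threshold."""
    kept = [old for old in active if jaccard(old, new) <= threshold]
    kept.append(new)
    return kept


def should_consolidate(extraction_count: int, active_count: int) -> bool:
    """Consolidate every 10 extractions or once active memories exceed 80."""
    return (extraction_count > 0 and extraction_count % 10 == 0) or active_count > 80
```

In this sketch, "tests run with pytest -q" and "tests run with pytest -q -x" share 5 of 6 distinct tokens (about 0.83), so the newer entry supersedes the older one.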
41+
42+
**Memory decay**: Permanent memories (architecture, decisions, patterns, gotchas) never decay. Progress has a 7-day half-life. Context has a 30-day half-life. Memories below 0.3 confidence are excluded from Tier 1 but remain searchable in Tier 2.
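
The decay rules can be sketched as exponential decay applied to a memory's confidence. The source gives only the half-lives and the 0.3 cutoff, so the formula `confidence * 0.5 ** (age / half_life)` and the function names below are assumptions.

```python
from typing import Dict, Optional

# None marks a permanent category that never decays.
HALF_LIFE_DAYS: Dict[str, Optional[float]] = {
    "architecture": None,
    "decisions": None,
    "patterns": None,
    "gotchas": None,
    "progress": 7.0,
    "context": 30.0,
}
TIER1_MIN_CONFIDENCE = 0.3


def decayed_confidence(category: str, confidence: float, age_days: float) -> float:
    """Halve confidence once per elapsed half-life; permanent categories are unchanged."""
    half_life = HALF_LIFE_DAYS[category]
    if half_life is None:
        return confidence
    return confidence * 0.5 ** (age_days / half_life)


def in_tier1(category: str, confidence: float, age_days: float) -> bool:
    """Memories that decay below the cutoff drop out of Tier 1 but stay in Tier 2."""
    return decayed_confidence(category, confidence, age_days) >= TIER1_MIN_CONFIDENCE
```

Under these assumptions, a progress note captured at confidence 0.8 falls to 0.4 after one week and to 0.2 after two, at which point it leaves Tier 1.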
43+
44+
**Cost**: ~$0.001 per extraction, $0.05–$0.10 daily. No external vector databases or embedding APIs required.
45+
46+
**Key insight for ralph loops**: This validates that a "memory ralph" — a periodic consolidation loop that reads recent progress, extracts patterns, and updates a knowledge file — is architecturally sound without vector infrastructure. The decay model (permanent vs. half-life) is directly implementable as a RALPH.md command that ages entries.
47+
48+
## The Three-Layer Practitioner Consensus
49+
50+
A consensus is emerging on how to structure agent memory for coding contexts:
51+
52+
1. **Semantic Layer** — "what are we building and why" (CLAUDE.md, RALPH.md prompt body)
53+
2. **Task Tracking Layer** — structured storage combining "Jira meets Git" with metadata on blockers, dependencies, and decision paths
54+
3. **Version Control Layer** — Git handles historical records
55+
56+
The core principle: **separating understanding from tracking from versioning reduces cognitive load on the agent.** Practitioners report that this separation produces "something closer to judgment rather than just following rules."
57+
58+
This maps directly onto ralphify's existing architecture:
59+
- Semantic layer = RALPH.md prompt body
60+
- Task tracking = `{{ commands.tasks }}` loading a tasks.md file
61+
- Version control = Git (already the state backend)
62+
63+
## New Memory Frameworks (March 2026)
64+
65+
| Framework | Approach | Key Result | Vector DB? |
66+
|-----------|----------|------------|------------|
67+
| Google Always On Memory Agent | SQLite + 3-sub-agent LLM consolidation | Works for thousands of facts; drift risk | No |
68+
| MemOS v2.0 (Stardust) | MemCubes with provenance/versioning | 159% improvement in temporal reasoning vs OpenAI's global memory | Optional |
69+
| SimpleMem | Three-stage compression pipeline | 30x token reduction, 43.24 F1 on LoCoMo (vs Mem0's 34.20) | No |
70+
| Mastra Observational Memory | Observer + Reflector background agents | 84.23% accuracy, 3-40x compression | No |
71+
| AutoMem | FalkorDB graph + Qdrant vector | Dual-DB local memory for Claude Code | Yes |
72+
| OpenMemory (CaviraOSS) | Local SQL-native + temporal graphs | Zero-config MCP endpoint for Claude/Copilot/Codex | No |
73+
| Hindsight | Institutional memory | 91.4% on LongMemEval vs Mem0's 49.0% | Optional |
74+
75+
**Trend**: The leading frameworks are moving **away** from vector databases toward structured storage (SQLite, markdown) with LLM-powered consolidation. This validates ralph loops' filesystem-first design. The vector DB becomes optional infrastructure for scale, not a prerequisite.
76+
77+
## Restorable Compression in Practice
78+
79+
Factory.ai's production implementation reveals the mechanics:
80+
81+
**Two-threshold system**:
82+
- **T_max** ("fill line") triggers compression when the total context reaches this token count
83+
- **T_retained** ("drain line") — max tokens preserved post-compression, always lower than T_max
84+
- Narrow gap = frequent compression + better preservation; wide gap = less frequent + more aggressive truncation
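
The fill-line and drain-line mechanics can be sketched as follows. This is not Factory.ai's implementation: the `compress` function is hypothetical, and it assumes the retained portion is the most recent messages, with everything older handed to a summarizer.

```python
from typing import List, Tuple

Message = Tuple[str, int]  # (text, token_count)


def compress(
    messages: List[Message], t_max: int, t_retained: int
) -> Tuple[List[Message], List[Message]]:
    """Return (to_summarize, kept) for a context ordered oldest-first.

    Nothing happens until the total reaches t_max (the fill line). Once it
    does, the newest messages are kept up to t_retained tokens (the drain
    line) and everything older is returned for summarization.
    """
    if t_retained >= t_max:
        raise ValueError("t_retained must be lower than t_max")

    total = sum(n for _, n in messages)
    if total < t_max:
        return [], messages

    kept: List[Message] = []
    budget = t_retained
    for text, n in reversed(messages):  # walk newest to oldest
        if n > budget:
            break
        kept.append((text, n))
        budget -= n
    kept.reverse()

    to_summarize = messages[: len(messages) - len(kept)]
    return to_summarize, kept
```

With `t_max=100` and `t_retained=60`, a 110-token context of three messages (40, 40, 30 tokens) keeps only the newest 30-token message and sends the two older ones to be summarized.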
85+
86+
**What survives**:
87+
- Session intent and user objectives
88+
- High-level action play-by-play (not details)
89+
- Artifact trail (file modifications, test results)
90+
- **Breadcrumbs** — file paths, function names, and key identifiers the agent can query to re-access prior outputs
91+
92+
**Critical principle**: "Minimize tokens per task, not per request." Over-compression backfires when agents must repeatedly re-fetch information in iterative workflows, offsetting token savings through extra inference calls.
93+
94+
**Direction**: Moving from reactive compression ("compress when forced") to proactive memory management at natural breakpoints — agents recognizing completion points and self-directing compression. This maps to ralph loop iteration boundaries, where each iteration is a natural compression point.
95+
96+
The academic formalization comes from **Memex(RL)**, which separates a compact, indexed in-context summary from full-fidelity artifacts stored externally under stable indices. Index maps that are dense with pointers and remain usable later, combined with selective retrieval, are reported to significantly improve long-horizon returns.
97+
98+
**Ralph loop implementation**: Commands already implement restorable compression by design. `{{ commands.recent_changes }}` re-derives state rather than relying on a summary. The lesson: prefer commands that query current state over static progress files that accumulate stale summaries.
99+
100+
## Claude Code Channels: Event-Driven Loop Architecture
101+
102+
Claude Code Channels (research preview, v2.1.80+, March 2026) introduce a fundamentally new primitive for agent loops: **push events into a running session from external sources**.
103+
104+
**What channels are**: An MCP server that pushes events into a running Claude Code session. Events arrive as `<channel source="name">` elements in the agent's context. The agent reads the event and can reply back through the same channel — two-way communication.
105+
106+
**Currently supported**: Telegram, Discord (via plugins), with a `fakechat` localhost demo. Custom channels can be built via the Channels Reference API.
107+
108+
**How it changes loop architecture**:
109+
110+
Traditional ralph loop: `run commands → assemble prompt → pipe to agent → repeat on timer`
111+
112+
Event-driven ralph loop: `run commands → assemble prompt → pipe to agent → react to external events → repeat`
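
The difference between the two loop shapes can be sketched with a plain in-process queue. This is not the Channels API: `reactive_loop`, the `events` queue, and the event strings are hypothetical stand-ins for whatever mechanism delivers external events.

```python
import queue
from typing import Callable, List


def reactive_loop(
    events: "queue.Queue[str]",
    run_iteration: Callable[[List[str]], None],
    max_iterations: int,
    interval_s: float,
) -> None:
    """Run an iteration when an event arrives or the timer elapses, whichever is first."""
    for _ in range(max_iterations):
        pending: List[str] = []
        try:
            # Wait for the first event, but no longer than the timer interval.
            pending.append(events.get(timeout=interval_s))
            # Drain anything else that is already queued.
            while True:
                pending.append(events.get_nowait())
        except queue.Empty:
            pass  # timer tick (nothing arrived) or queue fully drained
        run_iteration(pending)
```

An iteration that receives an empty list behaves like a traditional timer-driven pass; a non-empty list means the loop woke early to react.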
113+
114+
**Key use cases**:
115+
1. **CI webhook receiver** — CI results push into the running session; Claude sees test failures in context and can immediately react
116+
2. **Chat bridge** — ask Claude from Telegram/Discord while it's working on your codebase; the answer comes back in the same chat
117+
3. **Monitoring alerts** — production errors arrive where Claude has your files open and remembers what you were debugging
118+
4. **Deploy pipeline events** — staging deploys complete, the agent gets notified and can verify
119+
120+
**Security model**: Sender allowlists — only paired accounts can push messages. Being in `.mcp.json` isn't enough; a server must also be named in `--channels`.
121+
122+
**Limitation**: Events only arrive while the session is open. For always-on setups, Claude must run in a background process or persistent terminal. If the session hits a permission prompt while the user is away, it pauses.
123+
124+
**Implications for ralph loops**: This enables a new pattern — the **reactive ralph** — where the loop doesn't just run on a timer but also responds to external signals. A ralph that monitors CI, reacts to failures, and pushes fixes could operate as a continuous integration agent rather than a batch processor. The `--channels` flag effectively turns Claude Code from a request-response tool into an event-driven agent.
125+
126+
## Guardrails Bloat: The Memory Scaling Problem
127+
128+
As agents accumulate learning across sessions, the guardrails/rules file grows. Practitioner strategies for preventing bloat:
129+
130+
1. **Rule count caps** — hard limit (e.g., 30 rules). Adding a new rule requires removing or merging an existing one.
131+
2. **Expiration dates** — rules tagged with `added: 2025-01-15`; cleanup pass removes rules older than N days unless marked permanent.
132+
3. **Categorization** — group by domain (testing, git, style, architecture) with per-category limits. Makes contradictions easier to spot.
133+
4. **Periodic human review** — most common approach, typically monthly.
134+
5. **Severity tiers** — critical (security, data loss) vs. preferences. Under context pressure, drop preferences first.
135+
6. **Cleanup agent** — a dedicated agent that reads all rules, identifies contradictions, merges duplicates, and proposes removals (the "gardener" pattern from Ch12).
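
Strategies 1, 2, and 5 combine naturally into a single pruning pass, sketched below. The `Rule` type, the `prune` function, and the 90-day default are hypothetical. The sketch also simply drops rules at the cap, whereas the strategy above calls for removing or merging.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Rule:
    text: str
    added: date
    severity: str = "preference"  # or "critical"
    permanent: bool = False


def prune(
    rules: List[Rule], today: date, max_age_days: int = 90, max_rules: int = 30
) -> List[Rule]:
    """Expire old rules, then enforce the cap by dropping preferences first."""
    cutoff = today - timedelta(days=max_age_days)
    # Expiration: keep permanent rules and rules added on or after the cutoff.
    live = [r for r in rules if r.permanent or r.added >= cutoff]
    # Severity tiers: critical rules sort first, then newest first within a tier,
    # so truncating to the cap removes the oldest preferences before anything else.
    live.sort(key=lambda r: (r.severity != "critical", -r.added.toordinal()))
    return live[:max_rules]
```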
136+
137+
The Claude Code memory system's budget allocation (25 lines for architecture, 25 for decisions, etc.) is an automated approach to guardrails scaling. By assigning a fixed budget to each category, it structurally prevents any single category from consuming the full context allocation.
138+
139+
**For ralph loops**: A `guardrails.md` file loaded via `{{ commands.guardrails }}` should include a header noting the maximum rule count and requiring the agent to consolidate when adding new rules. The "cleanup ralph" pattern (Ch12) is the automated solution.
140+
141+
## Implications for Ralphify
142+
143+
1. **The "memory ralph" is validated as architecturally sound.** Google's Always On Memory Agent, Claude Code's two-tier system, and SimpleMem all prove that structured storage + periodic LLM consolidation works without vector infrastructure. A cookbook recipe for a memory ralph that consolidates recent progress into a knowledge file is ready to write.
144+
145+
2. **Boundary-triggered learning beats vigilance-based learning.** ngrok's BMO proved that structured triggers (end-of-session, every-N-iterations) execute reliably while "always watch for improvements" fails. Ralph loops should consolidate at iteration boundaries, not continuously.
146+
147+
3. **Commands are restorable compression by design.** `{{ commands.X }}` re-derives state each iteration — the pattern Factory.ai and Memex(RL) describe as optimal. Ralph authors should prefer commands that query current state over static files that accumulate stale summaries.
148+
149+
4. **Claude Code Channels enable reactive ralphs.** The event-driven push model (CI webhooks, chat bridges, monitoring alerts arriving in-session) creates a new loop pattern: the reactive ralph that both runs on a schedule AND responds to external events. This could be a framework feature — `channels` field in RALPH.md frontmatter declaring event sources.
150+
151+
5. **Memory decay is a feature, not a bug.** Claude Code's half-life model (permanent architecture decisions, 7-day progress decay, 30-day context decay) should inform ralph state file design. Progress files should be periodically consolidated, not infinitely appended.
152+
153+
6. **Guardrails need structural scaling limits.** Fixed budgets per category (Claude Code's approach) or hard rule count caps prevent the inevitable bloat from agent-accumulated learning. The `ralph new` template should include a max-rules header in guardrails.md.
