Transform Claude Code into a Recursive Language Model (RLM) agent by externalizing conversation context as manipulable Python variables in a REPL environment, enabling the root model to programmatically decompose, analyze, and recursively query context—eliminating context rot while preserving Claude Code's core agentic capabilities.
Claude Code sessions accumulate context through:
- Conversation history (multi-turn interactions)
- File reads (codebase exploration)
- Tool outputs (bash commands, test results, diffs)
- Agentic reasoning traces
The problem: As sessions grow beyond 60-100K tokens, model performance degrades—it "forgets" earlier context, misses details, and makes decisions without full information. This is especially acute in:
- Large refactoring tasks spanning multiple files
- Long debugging sessions with extensive logs
- Multi-step planning with accumulated state
The RLM paper demonstrates that context-centric decomposition (treating the prompt as a variable to be programmatically manipulated) outperforms:
- Direct model calls with full context
- RAG/retrieval approaches (BM25, semantic search)
- ReAct-style agents with retrieval tools
Key insight: The model decides how to decompose context, not a predefined scaffold.
┌─────────────────────────────────────────────────────────────────┐
│ Claude Code CLI │
├─────────────────────────────────────────────────────────────────┤
│ RLM Orchestration Layer │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────────┐ │
│ │ Context │ │ REPL │ │ Recursive │ │
│ │ Manager │ │ Environment │ │ Call Handler │ │
│ └───────────────┘ └───────────────┘ └───────────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ Claude Code Core (Preserved) │
│ ┌─────────┐ ┌──────────┐ ┌────────┐ ┌─────────────────┐ │
│ │ Tools │ │ Agents │ │ Hooks │ │ MCP Servers │ │
│ └─────────┘ └──────────┘ └────────┘ └─────────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ Model Provider Layer │
│ (Anthropic API / Router / Local Models) │
└─────────────────────────────────────────────────────────────────┘
| Component | Responsibility |
|---|---|
| Context Manager | Externalizes conversation context as Python variables |
| REPL Environment | Sandboxed Python execution with context access |
| Recursive Call Handler | Manages sub-model calls with isolated state |
| Router Integration | Directs root vs. recursive calls to appropriate models |
Instead of including full conversation history in the prompt, externalize as:
# Available in REPL environment
conversation: List[Message] = [...] # Full conversation history
files: Dict[str, str] = {...} # Cached file contents
tool_outputs: List[ToolOutput] = [...] # Recent tool results
working_memory: Dict[str, Any] = {} # Session state
# Metadata
context_stats = {
"total_tokens": 156000,
"conversation_tokens": 45000,
"file_tokens": 89000,
"tool_output_tokens": 22000
}
# Helper functions (pre-loaded)
def peek(var, start=0, end=1000): ...
def search(var, pattern, regex=False): ...
def summarize(var, max_tokens=500): ... # Uses sub-call
def recursive_query(query, context_var): ... # Sub-RLM callThe root model receives a minimal prompt without full context:
You are Claude Code operating in RLM mode. Your conversation context
is stored in variables in a Python REPL environment.
## Available Variables
- `conversation`: {n_messages} messages, {n_tokens} tokens
- `files`: {n_files} files cached ({file_tokens} tokens)
- `tool_outputs`: {n_outputs} recent outputs ({output_tokens} tokens)
## Actions
1. **Peek**: View portions of context: `peek(conversation, 0, 5)`
2. **Search**: Find patterns: `search(files['main.py'], 'def ')`
3. **Query**: Ask sub-questions: `recursive_query("What's the error?", tool_outputs[-1])`
4. **Execute**: Run code/tools normally when ready
## Current Query
{user_query}
## Rules
- Don't request full context dumps—use programmatic access
- Partition large contexts before analysis
- Use recursive_query for semantic understanding of large chunks
- When confident, output: FINAL(answer) or FINAL_VAR(var_name)class RLMEnvironment:
"""Sandboxed REPL for context manipulation"""
def __init__(self, context: SessionContext):
self.globals = {
# Context variables
'conversation': context.messages,
'files': context.cached_files,
'tool_outputs': context.tool_outputs,
'working_memory': {},
# Helper functions
'peek': self._peek,
'search': self._search,
'summarize': self._summarize,
'recursive_query': self._recursive_query,
'recursive_llm': self._recursive_query, # alias
# Standard library (restricted)
're': re,
'json': json,
'len': len,
'str': str,
'list': list,
'dict': dict,
'sorted': sorted,
'enumerate': enumerate,
# === EXTENDED TOOLING ===
# These are available for context analysis and validation
'pydantic': pydantic, # Data validation/parsing
'hypothesis': hypothesis, # Property-based testing
'cp': cpmpy, # Constraint programming
'cpmpy': cpmpy, # Alias
}
self.locals = {}
def execute(self, code: str) -> ExecutionResult:
"""Execute code in sandbox, return result"""
# Use RestrictedPython or similar for safety
...The RLM REPL environment includes production-grade Python tooling for robust context manipulation:
Available Tools:
| Tool | Version | Purpose in RLM |
|---|---|---|
uv |
latest | Fast package management, virtual env creation |
ty |
latest | Type checking extracted context/generated code |
ruff |
latest | Linting/formatting code in context |
pydantic |
v2.x | Schema validation for structured context data |
hypothesis |
latest | Property-based testing for verification queries |
cpmpy |
latest | Constraint programming for verification and analysis |
CPMpy Integration:
CPMpy enables constraint-driven reasoning in the REPL—aligning with the Ananke architecture of treating code generation as constrained search through valid programs.
import cpmpy as cp
# Example: Verify a scheduling constraint from context
# Given: task dependencies extracted from code comments
tasks = cp.intvar(0, 100, shape=5, name="task_start")
durations = [10, 5, 8, 3, 12] # Extracted from context
model = cp.Model()
# Constraint: task[1] must start after task[0] finishes
model += tasks[1] >= tasks[0] + durations[0]
# Constraint: tasks 2 and 3 cannot overlap (extracted from mutex comment)
model += (tasks[2] + durations[2] <= tasks[3]) | (tasks[3] + durations[3] <= tasks[2])
# Verify feasibility
if model.solve():
print("Schedule is feasible:", tasks.value())
else:
print("Constraint violation detected")Use Cases in RLM:
| Scenario | CPMpy Application |
|---|---|
| Dependency analysis | Model import/call graphs as constraints |
| Type compatibility | Encode type relationships as logical constraints |
| Resource verification | Cumulative constraints for resource bounds |
| State machine validation | Transition constraints on state variables |
| Test case generation | Generate inputs satisfying path constraints |
Depth=2 Verification Pattern:
# At depth=2, use CPMpy to verify proposed changes don't break invariants
async def verify_change_safety(
change: ProposedEdit,
invariants: List[str], # Extracted from docstrings/comments
context: str
) -> VerificationResult:
"""
Use constraint solving to verify a proposed change.
Implements: Spec §6.4 Depth=2 Verification
"""
model = cp.Model()
# Parse invariants into constraints
for inv in invariants:
constraint = parse_invariant_to_cpmpy(inv, context)
if constraint:
model += constraint
# Check if change maintains satisfiability
model += encode_change_effect(change)
if model.solve():
return VerificationResult(safe=True, witness=model.solution_hint())
else:
# Use unsat core to explain why
return VerificationResult(
safe=False,
reason=extract_conflict(model)
)REPL Integration:
# Pydantic: Validate structured data from context
from pydantic import BaseModel, Field
from typing import List
class ErrorReport(BaseModel):
file: str
line: int
message: str
severity: str = Field(pattern=r'^(error|warning|info)$')
# Parse tool output into structured form
errors = [ErrorReport.model_validate_json(line) for line in tool_outputs[-1].split('\n')]
# Hypothesis: Property-based verification
from hypothesis import given, strategies as st
@given(st.sampled_from(files.keys()))
def test_all_files_have_extension(filename):
assert '.' in filename
# Run verification (useful for depth=2 validation)
test_all_files_have_extension()
# CPMpy: Constraint verification
import cpmpy as cp
# Model a simple constraint problem from context
x = cp.intvar(1, 10, name="x")
y = cp.intvar(1, 10, name="y")
model = cp.Model([x + y == 15, x < y])
model.solve()
print(f"Solution: x={x.value()}, y={y.value()}")CLI Tools (via subprocess in sandbox):
# These run in isolated subprocess, output captured
# Type check a code snippet from context
async def typecheck_snippet(code: str) -> TypeCheckResult:
"""Run ty on extracted code"""
result = await sandbox_exec(['ty', 'check', '-'], stdin=code)
return TypeCheckResult(
success=result.returncode == 0,
errors=parse_ty_output(result.stderr)
)
# Lint code before suggesting edits
async def lint_snippet(code: str) -> LintResult:
"""Run ruff on extracted code"""
result = await sandbox_exec(['ruff', 'check', '--stdin-filename=snippet.py', '-'], stdin=code)
return LintResult(
issues=parse_ruff_output(result.stdout),
auto_fixable=count_fixable(result.stdout)
)Environment Setup (on plugin install):
#!/bin/bash
# scripts/setup_repl_env.sh
# Create isolated venv for REPL
uv venv ~/.claude/rlm-repl-env --python 3.12
# Install core tooling
uv pip install --python ~/.claude/rlm-repl-env \
pydantic>=2.0 \
hypothesis>=6.0 \
cpmpy>=0.9.20 \
ruff \
ty
# Verify installation
~/.claude/rlm-repl-env/bin/python -c "import pydantic, hypothesis, cpmpy; print('OK')"Security Constraints:
# ALLOWED in REPL
- pydantic.BaseModel, Field, validators
- hypothesis.given, strategies, settings
- cpmpy.Model, intvar, boolvar, constraints, solve
- subprocess calls to: ty, ruff (sandboxed)
- Read-only file inspection via context variables
# BLOCKED in REPL
- subprocess calls to: rm, mv, curl, wget, etc.
- Direct filesystem writes (use Claude Code tools)
- Network access (except via Claude Code tools)
- os.system, eval on user input
- Import of arbitrary modules
### 4.2 Recursive Call Implementation
```python
async def _recursive_query(
self,
query: str,
context: Any,
model: str = None, # Defaults to configured recursive model
max_tokens: int = 4000
) -> str:
"""
Spawn a sub-RLM call over provided context.
The sub-call receives:
- The query
- The context (as a string variable)
- NO access to parent REPL state (isolated)
Returns: The sub-model's response
"""
context_str = self._serialize_context(context)
# Sub-call prompt
sub_prompt = f"""
You are analyzing a portion of context for a larger task.
## Context
The following is stored in variable `context`:
{context_str[:500]}... [truncated, {len(context_str)} chars total]
## Query
{query}
Provide a focused, factual response. If you need to examine
the context programmatically, you have access to a Python REPL
with `context` as a variable.
"""
# Route to recursive model (could be same or different)
response = await self.router.complete(
model=model or self.config.recursive_model,
prompt=sub_prompt,
max_tokens=max_tokens,
depth=self.depth + 1
)
return response.content
The root model signals completion via:
# Direct answer
FINAL(Your answer here)
# Answer stored in variable
result = "Complex computed answer"
FINAL_VAR(result)
# Continue with tool use (normal Claude Code flow)
<tool_use>bash</tool_use>
<command>npm test</command>rlm-agent/
├── .claude-plugin/
│ └── plugin.json
├── agents/
│ └── rlm-orchestrator.md # Root agent definition
├── skills/
│ └── rlm-context-management/
│ └── SKILL.md # Context management strategies
├── hooks/
│ └── hooks.json # Session hooks
├── src/
│ ├── context_manager.py # Context externalization
│ ├── repl_environment.py # REPL sandbox
│ ├── recursive_handler.py # Sub-call management
│ └── router_integration.py # Model routing
└── README.md
{
"hooks": {
"SessionStart": [{
"hooks": [{
"type": "command",
"command": "python ${CLAUDE_PLUGIN_ROOT}/src/init_rlm.py",
"timeout": 5
}]
}],
"UserPromptSubmit": [{
"matcher": ".*",
"hooks": [{
"type": "command",
"command": "python ${CLAUDE_PLUGIN_ROOT}/src/check_context_threshold.py",
"timeout": 3
}]
}],
"PreCompact": [{
"hooks": [{
"type": "command",
"command": "python ${CLAUDE_PLUGIN_ROOT}/src/externalize_context.py"
}]
}]
}
}Integrate with claude-code-router for model selection:
{
"Router": {
"default": "anthropic,claude-sonnet-4",
"rlm_root": "anthropic,claude-opus-4-5-20251101",
"rlm_recursive": "anthropic,claude-sonnet-4",
"rlm_recursive_deep": "anthropic,claude-haiku-4-5-20251001",
"rlm_max_depth": 2,
"rlm_mode": "complexity",
"rlm_simple_bypass": true
},
"rlm": {
"activation": {
"mode": "complexity",
"fallback_token_threshold": 80000,
"complexity_score_threshold": 2
},
"depth": {
"default": 2,
"max": 3,
"spawn_repl_at_depth_1": true
},
"hybrid": {
"enabled": true,
"simple_query_bypass": true,
"simple_confidence_threshold": 0.95
},
"trajectory": {
"verbosity": "normal",
"streaming": true,
"colors": true,
"export_enabled": true,
"export_path": "~/.claude/rlm-trajectories/"
},
"models": {
"root": "claude-opus-4-5-20251101",
"recursive_depth_1": "claude-sonnet-4",
"recursive_depth_2": "claude-haiku-4-5-20251001"
},
"cost_controls": {
"max_recursive_calls_per_turn": 10,
"max_tokens_per_recursive_call": 8000,
"abort_on_cost_threshold": 50000
}
}
}Model Selection by Depth:
| Depth | Model | Rationale |
|---|---|---|
| 0 (root) | Opus 4.5 | Complex orchestration, strategy selection |
| 1 | Sonnet 4 | Analysis, summarization, focused queries |
| 2 | Haiku 4.5 | Verification, simple extraction, cheap |
Routing Logic:
rlm_mode: "complexity"→ Activate based on task complexity analysis (default)rlm_mode: "always"→ Always use RLM orchestrationrlm_mode: "token"→ Activate when context exceeds token thresholdrlm_mode: "manual"→ Only via/rlmcommand
Decision: Build as a plugin, not a fork.
Rationale:
- Anthropic frequently updates Claude Code (>300 commits)
- Plugins get updates automatically
- Core tool execution, permissions, hooks all preserved
- Community can extend/improve independently
Decision: Python REPL (not JavaScript)
Rationale:
- RLM paper uses Python; proven patterns exist
- Better string manipulation for context processing
- Claude is stronger at Python code generation
- RestrictedPython provides sandbox primitives
- Can shell out to bash for system operations
Decision: Activate RLM based on task complexity analysis, not token count.
Rationale: Token count is a proxy metric. The real question is: "Does this task require reasoning over distributed context?" A 20K token session with a complex multi-file refactor needs RLM more than a 100K token session asking "what's in package.json?"
Complexity Classifier:
@dataclass
class TaskComplexitySignals:
"""Signals extracted from user prompt + context"""
# Prompt analysis
references_multiple_files: bool # "fix the bug in auth and update tests"
requires_cross_context_reasoning: bool # "why is X happening given Y?"
involves_temporal_reasoning: bool # "what changed since last working version?"
asks_about_patterns: bool # "find all places where..."
debugging_task: bool # error traces, stack dumps in context
# Context analysis
context_has_multiple_domains: bool # frontend + backend + infra
recent_tool_outputs_large: bool # >10K tokens of test/build output
conversation_has_state_changes: bool # user changed requirements mid-session
files_span_multiple_modules: bool # auth/, api/, db/, etc.
# Historical signals
previous_turn_was_confused: bool # model asked clarifying questions
task_is_continuation: bool # "now do the same for..."
def should_activate_rlm(
prompt: str,
context: SessionContext,
config: RLMConfig
) -> tuple[bool, str]:
"""
Determine if RLM mode should activate.
Returns (should_activate, reason).
Biased toward activation—when in doubt, use RLM.
"""
signals = extract_complexity_signals(prompt, context)
# Always activate (manual override)
if context.rlm_mode_forced:
return True, "manual_override"
# Never activate (explicit simple mode)
if context.simple_mode_forced:
return False, "simple_mode_forced"
# Complexity scoring (biased toward activation)
score = 0
reasons = []
# High-signal indicators (each sufficient alone)
if signals.requires_cross_context_reasoning:
return True, "cross_context_reasoning"
if signals.debugging_task and signals.recent_tool_outputs_large:
return True, "debugging_with_large_output"
if signals.references_multiple_files and signals.files_span_multiple_modules:
return True, "multi_module_task"
# Accumulative signals
if signals.references_multiple_files:
score += 2
reasons.append("multi_file")
if signals.involves_temporal_reasoning:
score += 2
reasons.append("temporal")
if signals.asks_about_patterns:
score += 1
reasons.append("pattern_search")
if signals.context_has_multiple_domains:
score += 1
reasons.append("multi_domain")
if signals.previous_turn_was_confused:
score += 2
reasons.append("prior_confusion")
if signals.task_is_continuation:
score += 1
reasons.append("continuation")
# Token count as tiebreaker (not primary signal)
if context.total_tokens > 80000:
score += 1
reasons.append("large_context")
# Threshold: 2+ signals → activate (conservative toward RLM)
if score >= 2:
return True, f"complexity_score:{score}:{'+'.join(reasons)}"
return False, "simple_task"Prompt Analysis Implementation:
def extract_complexity_signals(prompt: str, context: SessionContext) -> TaskComplexitySignals:
"""
Lightweight signal extraction—runs on every prompt.
Must be fast (<50ms) so uses heuristics, not LLM calls.
"""
prompt_lower = prompt.lower()
# File reference patterns
file_patterns = [
r'\b\w+\.(ts|js|py|go|rs|tsx|jsx)\b', # explicit files
r'\b(and|also|plus)\s+(update|fix|change|modify)', # conjunctions
r'\b(auth|api|db|ui|test|config)\b.*\b(auth|api|db|ui|test|config)\b', # multiple modules
]
references_multiple = sum(
len(re.findall(p, prompt_lower)) > 1
for p in file_patterns
)
# Cross-context reasoning patterns
cross_context_patterns = [
r'\bwhy\b.*\b(when|if|given|since)\b',
r'\bhow\b.*\b(relate|connect|affect|impact)\b',
r'\bwhat\b.*\b(cause|led to|result)\b',
r'\b(trace|follow|track)\b.*\b(through|across)\b',
]
# Temporal patterns
temporal_patterns = [
r'\b(before|after|since|when|changed|used to|previously)\b',
r'\b(history|log|commit|version|diff)\b',
r'\blast\s+(time|session|attempt)\b',
]
# Pattern search indicators
pattern_patterns = [
r'\b(find|search|locate|grep|all|every|each)\b.*\b(where|that|which)\b',
r'\bhow many\b',
r'\blist\s+(all|every)\b',
]
# Debug indicators
debug_patterns = [
r'\b(error|exception|fail|crash|bug|issue|broken)\b',
r'\b(stack\s*trace|traceback|stderr)\b',
r'\b(debug|diagnose|investigate|troubleshoot)\b',
]
return TaskComplexitySignals(
references_multiple_files=references_multiple >= 2,
requires_cross_context_reasoning=any(
re.search(p, prompt_lower) for p in cross_context_patterns
),
involves_temporal_reasoning=any(
re.search(p, prompt_lower) for p in temporal_patterns
),
asks_about_patterns=any(
re.search(p, prompt_lower) for p in pattern_patterns
),
debugging_task=any(
re.search(p, prompt_lower) for p in debug_patterns
),
context_has_multiple_domains=len(context.active_modules) > 2,
recent_tool_outputs_large=sum(
len(o.content) for o in context.tool_outputs[-5:]
) > 10000,
conversation_has_state_changes=context.has_requirement_changes,
files_span_multiple_modules=len(set(
Path(f).parts[0] for f in context.cached_files.keys()
)) > 2,
previous_turn_was_confused=context.last_response_had_clarification,
task_is_continuation='continue' in prompt_lower or
'same' in prompt_lower or
'also' in prompt_lower[:50],
)Decision: Default max depth = 2, configurable up to 3.
Rationale:
- Depth=1 handles most cases but fails on verification chains
- Your Ananke/Maze work involves constraint propagation that benefits from deeper recursion
- Depth=2 allows: Root → Analysis → Verification (three-layer reasoning)
- Cost is manageable: ~3-5x single call, still cheaper than naive 200K+ context
Depth=2 Architecture:
class RecursiveREPL:
"""
REPL environment that can spawn child REPLs for depth=2.
"""
def __init__(self, context: Any, depth: int, max_depth: int = 2):
self.context = context
self.depth = depth
self.max_depth = max_depth
self.child_repls: List[RecursiveREPL] = []
async def recursive_query(
self,
query: str,
context: Any,
spawn_repl: bool = False # Allow sub-query to have its own REPL
) -> str:
"""
Spawn a recursive call. If spawn_repl=True AND depth < max_depth,
the child gets its own REPL environment for further decomposition.
"""
if self.depth >= self.max_depth:
# At max depth: simple completion, no REPL
return await self._simple_completion(query, context)
if spawn_repl:
# Create child REPL with isolated state
child = RecursiveREPL(
context=context,
depth=self.depth + 1,
max_depth=self.max_depth
)
self.child_repls.append(child)
return await child.run_rlm_loop(query)
else:
# Simple sub-call without REPL (faster, cheaper)
return await self._simple_completion(query, context)When to use depth=2:
| Pattern | Depth=1 | Depth=2 |
|---|---|---|
| "Find the bug" | Root analyzes, sub-calls examine files | Same |
| "Find and verify fix" | Root analyzes, may miss verification | Root → Find → Verify chain |
| "Refactor safely" | Root plans, sub-calls check files | Root → Plan → Validate each change |
| "Compare approaches" | Root struggles with parallel analysis | Root → Analyze A → Analyze B → Compare |
Decision: Hybrid mode enabled, but biased toward RLM activation.
Rationale: False negatives (missing RLM when needed) are worse than false positives (using RLM when unnecessary). RLM overhead on simple queries is ~2-3 seconds—acceptable. Missing context on complex queries causes real failures.
Simple Query Definition (conservative):
def is_definitely_simple(prompt: str, context: SessionContext) -> bool:
"""
Returns True ONLY for queries that definitely don't need RLM.
When in doubt, return False (use RLM).
"""
prompt_lower = prompt.lower().strip()
# Explicit simple patterns (exhaustive list)
simple_patterns = [
# Direct file operations
r'^(show|cat|read|view|open)\s+[\w./]+$',
# Single command execution
r'^(run|execute)\s+(npm|yarn|pnpm|cargo|go|python|pytest)\s+\w+$',
# Git status checks
r'^git\s+(status|log|diff|branch)$',
# Simple questions about single files
r'^what\'?s?\s+in\s+[\w./]+\??$',
# Acknowledgments
r'^(ok|okay|thanks|got it|understood|yes|no|sure)\.?$',
]
for pattern in simple_patterns:
if re.match(pattern, prompt_lower):
# Even if pattern matches, check context isn't complex
if context.total_tokens < 20000 and not context.has_errors:
return True
# Short prompts with no context references
if len(prompt) < 50 and context.total_tokens < 10000:
# Check for ANY complexity indicators
if not any(word in prompt_lower for word in [
'why', 'how', 'find', 'fix', 'bug', 'error', 'all',
'every', 'change', 'update', 'refactor', 'test'
]):
return True
# Default: not simple (use RLM)
return FalseExpected Distribution:
- ~10-15% of queries classified as "definitely simple"
- ~85-90% routed through RLM
- Users can override with
/simplecommand for explicit bypass
Decision: Full trajectory visibility with human-friendly streaming output.
Rationale: RLM orchestration is multi-step reasoning. Without visibility:
- Users can't debug why a query failed
- No way to learn what strategies work
- Feels like a black box (violates your constraint-driven principles)
Trajectory Display Design:
┌─────────────────────────────────────────────────────────────────────────┐
│ RLM Mode • depth=0/2 • task: multi_module_task │
├─────────────────────────────────────────────────────────────────────────┤
│ ▶ ANALYZE │ Examining context structure... │
│ ├─ conversation: 45 messages (23K tokens) │
│ ├─ files: 12 cached (67K tokens) │
│ └─ tool_outputs: 8 recent (34K tokens) │
│ │
│ ▶ REPL │ peek(tool_outputs[-1], 0, 1500) │
│ └─ [test output: FAIL auth/login.test.ts:45] │
│ │
│ ▶ REASON │ Error is 500 vs expected 401. Need to trace auth handler. │
│ │
│ ▶ REPL │ search(files, 'AuthController') │
│ └─ Found in: src/auth/controllers/auth.controller.ts │
│ │
│ ▶ RECURSE │ depth=1 │ "How does login() handle validation failures?" │
│ │ │ context: auth.controller.ts (2.3K tokens) │
│ │ │
│ │ ▶ ANALYZE │ Scanning controller implementation... │
│ │ ▶ REPL │ search(context, 'validateCredentials') │
│ │ ▶ REASON │ Found: throws generic Error, not HTTP error │
│ │ ◀ RETURN │ "Controller throws Error() not UnauthorizedError()" │
│ │
│ ▶ REASON │ Root cause identified. Preparing fix. │
│ │
│ ▶ FINAL │ Bug in auth.controller.ts:23 - throws Error() instead of │
│ │ UnauthorizedError(). Fix: change to throw new │
│ │ UnauthorizedError('Invalid credentials') │
└─────────────────────────────────────────────────────────────────────────┘
Streaming Implementation:
from dataclasses import dataclass
from enum import Enum
from typing import AsyncIterator, Optional
import asyncio
class TrajectoryEventType(Enum):
RLM_START = "rlm_start"
ANALYZE = "analyze"
REPL_EXEC = "repl_exec"
REPL_RESULT = "repl_result"
REASON = "reason"
RECURSE_START = "recurse_start"
RECURSE_END = "recurse_end"
FINAL = "final"
ERROR = "error"
TOOL_USE = "tool_use" # Normal Claude Code tool
@dataclass
class TrajectoryEvent:
type: TrajectoryEventType
depth: int
content: str
metadata: Optional[dict] = None
timestamp: float = None
def __post_init__(self):
self.timestamp = self.timestamp or time.time()
class TrajectoryRenderer:
"""
Renders trajectory events to terminal with streaming support.
Designed for human readability during execution.
"""
ICONS = {
TrajectoryEventType.RLM_START: "◆",
TrajectoryEventType.ANALYZE: "▶",
TrajectoryEventType.REPL_EXEC: "▶",
TrajectoryEventType.REPL_RESULT: "└─",
TrajectoryEventType.REASON: "▶",
TrajectoryEventType.RECURSE_START: "▶",
TrajectoryEventType.RECURSE_END: "◀",
TrajectoryEventType.FINAL: "▶",
TrajectoryEventType.ERROR: "✗",
TrajectoryEventType.TOOL_USE: "⚙",
}
LABELS = {
TrajectoryEventType.RLM_START: "RLM",
TrajectoryEventType.ANALYZE: "ANALYZE",
TrajectoryEventType.REPL_EXEC: "REPL",
TrajectoryEventType.REPL_RESULT: "",
TrajectoryEventType.REASON: "REASON",
TrajectoryEventType.RECURSE_START: "RECURSE",
TrajectoryEventType.RECURSE_END: "RETURN",
TrajectoryEventType.FINAL: "FINAL",
TrajectoryEventType.ERROR: "ERROR",
TrajectoryEventType.TOOL_USE: "TOOL",
}
def __init__(self, verbosity: str = "normal"):
"""
verbosity: "minimal" | "normal" | "verbose" | "debug"
"""
self.verbosity = verbosity
self.current_depth = 0
def render_event(self, event: TrajectoryEvent) -> str:
"""Render single event to terminal string"""
indent = "│ " * event.depth
icon = self.ICONS[event.type]
label = self.LABELS[event.type]
# Depth indicator for recursive calls
depth_indicator = ""
if event.type == TrajectoryEventType.RECURSE_START:
depth_indicator = f" │ depth={event.depth + 1} │"
# Truncate content based on verbosity
content = self._truncate_content(event.content, event.type)
# Color coding (ANSI)
color = self._get_color(event.type)
reset = "\033[0m"
dim = "\033[2m"
if event.type == TrajectoryEventType.REPL_RESULT:
# Indented result, dimmed
return f"{indent} {dim}{icon} [{content}]{reset}"
elif label:
return f"{indent}{color}{icon} {label:7}{reset}{depth_indicator} │ {content}"
else:
return f"{indent}{content}"
def _truncate_content(self, content: str, event_type: TrajectoryEventType) -> str:
"""Truncate based on verbosity and event type"""
limits = {
"minimal": {"default": 60, "repl_result": 40, "reason": 80},
"normal": {"default": 120, "repl_result": 80, "reason": 200},
"verbose": {"default": 300, "repl_result": 200, "reason": 500},
"debug": {"default": 1000, "repl_result": 1000, "reason": 1000},
}
key = "repl_result" if event_type == TrajectoryEventType.REPL_RESULT else \
"reason" if event_type == TrajectoryEventType.REASON else "default"
limit = limits[self.verbosity][key]
if len(content) <= limit:
return content
return content[:limit - 3] + "..."
def _get_color(self, event_type: TrajectoryEventType) -> str:
"""ANSI color codes for event types"""
colors = {
TrajectoryEventType.RLM_START: "\033[1;36m", # Bold cyan
TrajectoryEventType.ANALYZE: "\033[34m", # Blue
TrajectoryEventType.REPL_EXEC: "\033[33m", # Yellow
TrajectoryEventType.REASON: "\033[32m", # Green
TrajectoryEventType.RECURSE_START: "\033[35m", # Magenta
TrajectoryEventType.RECURSE_END: "\033[35m", # Magenta
TrajectoryEventType.FINAL: "\033[1;32m", # Bold green
TrajectoryEventType.ERROR: "\033[1;31m", # Bold red
TrajectoryEventType.TOOL_USE: "\033[36m", # Cyan
}
return colors.get(event_type, "")
class StreamingTrajectory:
"""
Manages streaming trajectory output during RLM execution.
"""
def __init__(self, renderer: TrajectoryRenderer):
self.renderer = renderer
self.events: List[TrajectoryEvent] = []
self.subscribers: List[asyncio.Queue] = []
async def emit(self, event: TrajectoryEvent):
"""Emit event to all subscribers"""
self.events.append(event)
rendered = self.renderer.render_event(event)
for queue in self.subscribers:
await queue.put(rendered)
def subscribe(self) -> asyncio.Queue:
"""Subscribe to trajectory stream"""
queue = asyncio.Queue()
self.subscribers.append(queue)
return queue
async def stream(self) -> AsyncIterator[str]:
"""Async iterator for streaming output"""
queue = self.subscribe()
try:
while True:
line = await queue.get()
yield line
except asyncio.CancelledError:
self.subscribers.remove(queue)
raise
def get_full_trajectory(self) -> List[TrajectoryEvent]:
"""Get complete trajectory for logging/replay"""
return self.events.copy()
def export_json(self) -> str:
"""Export trajectory as JSON for analysis"""
return json.dumps([
{
"type": e.type.value,
"depth": e.depth,
"content": e.content,
"metadata": e.metadata,
"timestamp": e.timestamp,
}
for e in self.events
], indent=2)Integration with Claude Code Output:
class RLMOrchestrator:
"""Main orchestration loop with streaming trajectory"""
async def run(
self,
query: str,
context: SessionContext
) -> AsyncIterator[Union[TrajectoryEvent, ToolUse, AssistantMessage]]:
"""
Run RLM loop, yielding events as they occur.
Consumers can render trajectory AND tool outputs together.
"""
trajectory = StreamingTrajectory(
TrajectoryRenderer(verbosity=self.config.trajectory_verbosity)
)
# Start event
await trajectory.emit(TrajectoryEvent(
type=TrajectoryEventType.RLM_START,
depth=0,
content=f"depth=0/{self.config.max_depth} • task: {self.activation_reason}",
metadata={"query": query, "context_tokens": context.total_tokens}
))
yield trajectory.events[-1]
# Main loop
repl = RecursiveREPL(context, depth=0, max_depth=self.config.max_depth)
async for event in self._rlm_loop(query, repl, trajectory):
yield event
# Check for final answer
if event.type == TrajectoryEventType.FINAL:
break
# Check for tool use (pass through to Claude Code)
if event.type == TrajectoryEventType.TOOL_USE:
# Let Claude Code handle the tool, get result
tool_result = yield event # Bidirectional generator
# Feed result back into REPL
repl.inject_tool_result(tool_result)Verbosity Levels:
| Level | Shows | Use Case |
|---|---|---|
minimal |
RECURSE, FINAL, ERROR only | Production, minimal noise |
normal |
All events, truncated content | Default, balanced visibility |
verbose |
All events, full content | Debugging specific issues |
debug |
Everything + internal state | Development, trajectory analysis |
Configuration:
{
"rlm": {
"trajectory_verbosity": "normal",
"trajectory_colors": true,
"trajectory_timestamps": false,
"trajectory_export_path": "~/.claude/rlm-trajectories/",
"trajectory_export_format": "json"
}
}| Task | Description | Deliverable |
|---|---|---|
| Context Manager | Externalize conversation/files to Python vars | context_manager.py |
| REPL Sandbox | Secure Python execution environment | repl_environment.py |
| Basic Router | Route root vs. recursive calls | router_integration.py |
| CLI Integration | /rlm command to toggle mode |
Plugin command |
Validation: NIAH benchmark on 500K token synthetic context
| Task | Description | Deliverable |
|---|---|---|
| Hook System | Automatic activation on threshold | hooks.json |
| Tool Preservation | Ensure bash/edit/read tools work in RLM mode | Integration tests |
| State Persistence | Save/restore RLM state across sessions | State manager |
| Error Handling | Graceful fallback to standard mode | Error handlers |
Validation: Real coding tasks (bug fix, feature implementation)
| Task | Description | Deliverable |
|---|---|---|
| Async Sub-calls | Parallel recursive queries | Async handler |
| Caching | Cache summarizations, frequently-accessed context | Cache layer |
| Cost Tracking | Monitor token usage in RLM mode | Cost dashboard |
| Prompt Tuning | Optimize root/recursive prompts | Prompt library |
Validation: OOLONG benchmark subset, cost comparison
| Task | Description | Deliverable |
|---|---|---|
| Depth=2 Support | Enable deeper recursion for massive contexts | Recursive REPL |
| Model Routing | Different models for different query types | Smart router |
| Learning | Adapt strategies based on success patterns | Feedback loop |
| Visualization | Trajectory viewer for debugging | Web UI |
Validation: Production dogfooding on real projects
The REPL environment must be sandboxed:
# ALLOWED
- String manipulation
- List/dict operations
- Regex (re module)
- JSON parsing
- len(), str(), sorted(), etc.
# BLOCKED
- File system access (use Claude Code's tools)
- Network access (use Claude Code's tools)
- subprocess/os.system
- import of arbitrary modules
- eval/exec of user-provided stringsImplementation: Use RestrictedPython with custom guards, or containerized execution.
| Operation | Target Latency | Strategy |
|---|---|---|
| Context externalization | <500ms | Lazy serialization |
| REPL execution | <100ms | Persistent interpreter |
| Recursive sub-call | <10s | Streaming, timeout |
| Full RLM turn | <30s | Parallel sub-calls |
RLM Cost = Root Prompt Tokens + Σ(Recursive Call Tokens)
Where:
- Root Prompt Tokens ≈ 2K (minimal prompt) + REPL outputs
- Recursive Call = Query + Context chunk (typically 4-16K)
- Expected: 2-5 recursive calls per complex query
- Breakeven: When context > 60K tokens (RLM becomes cheaper)
class RLMSession:
"""Persistent state across turns"""
conversation: List[Message] # Full history
files: Dict[str, str] # Cached files
working_memory: Dict # User-defined state
repl_state: Dict # Python interpreter state
trajectory: List[RLMStep] # For debugging/replay
def save(self, path: Path): ...
def load(cls, path: Path) -> 'RLMSession': ...| Question | Decision | Section |
|---|---|---|
| Activation strategy | Complexity-based, not token count | §6.3 |
| Recursive depth | Default=2, max=3 | §6.4 |
| Hybrid mode | Enabled, conservative simple classification | §6.5 |
| Trajectory visibility | Streaming, human-friendly, configurable verbosity | §6.6 |
-
REPL Isolation at Depth=2: When depth=1 spawns a child REPL, should it share any state with parent?
- Proposal: Fully isolated, but can pass serialized context
- Risk: Memory overhead of multiple interpreter instances
-
Tool Interleaving: When RLM decides to use a Claude Code tool (bash, edit), how does that flow?
- Option A: Yield control to Claude Code, resume RLM after
- Option B: RLM orchestrates tool use through its own handler
- Proposal: Option A (preserves Claude Code's permission model)
-
Checkpoint/Resume: Can an RLM session be paused and resumed?
- Use case: Long-running tasks, cost management
- Complexity: REPL state serialization, trajectory replay
-
False Positive Cost: What's the overhead when RLM activates unnecessarily?
- Needs: Benchmark simple queries through RLM vs direct
- Hypothesis: ~2-3 second overhead acceptable
-
Classifier Calibration: How to tune complexity signals for different codebases?
- Proposal: Project-level config for signal weights
- Alternative: Learning from user feedback (thumbs up/down)
-
Multi-Modal Signals: Should image/diagram content affect complexity scoring?
- Current: Not considered
- Future: If Claude Code adds vision, integrate
-
Terminal Width: How to handle narrow terminals?
- Proposal: Adaptive truncation, collapsible depth levels
-
Interruption: Can user interrupt mid-RLM and redirect?
- Use case: "Actually, focus on X instead"
- Needs: REPL state checkpoint, trajectory branch
-
Replay: Should users be able to replay trajectories?
- Use case: "Show me how you solved that yesterday"
- Proposal: JSON export + viewer tool (phase 4)
-
Optimal Depth=2 Strategy: When should depth=1 spawn a REPL child vs simple completion?
- Hypothesis: Spawn REPL for verification tasks, simple for extraction
- Needs: A/B testing on real coding tasks
-
Model Selection at Depth: Is Haiku sufficient for depth=2, or does it fail on verification?
- Needs: Benchmark Haiku vs Sonnet at depth=2
-
Complexity Classifier Accuracy: What's the precision/recall of the heuristic classifier?
- Needs: Labeled dataset of "simple" vs "complex" queries
-
Cost-Performance Curve: At what complexity score does RLM ROI become positive?
- Needs: Token cost tracking + success rate by complexity
| Metric | Target | Measurement |
|---|---|---|
| Context handling | 500K+ tokens | Synthetic benchmarks |
| Accuracy on long context | +30% vs baseline | OOLONG subset |
| Cost per query | ≤ baseline at 80K+ | Token tracking |
| Latency overhead | <50% vs direct | E2E timing |
| Complexity classifier precision | >85% | Labeled test set |
| Complexity classifier recall | >95% | Labeled test set (favor false positives) |
| Depth=2 utilization | 20-30% of RLM queries | Production telemetry |
| Trajectory render latency | <10ms per event | Performance profiling |
- Users understand what RLM is doing (trajectory is readable)
- Tool execution works identically to standard mode
- Session state persists correctly across RLM/standard mode switches
- Debugging trajectory is inspectable and exportable
- Community can extend via standard plugin mechanisms
- Complexity activation feels "smart" (users don't notice false positives)
- Depth=2 is used for verification, not unnecessarily
-
/simpleescape hatch works reliably
| Scenario | Target Experience |
|---|---|
| Simple query in complex session | Bypass RLM, <2s response |
| Complex multi-file task | RLM activates, trajectory visible, 30-60s |
| Interruption mid-RLM | Graceful stop, context preserved |
| Verbosity change | Immediate effect, no restart |
| Trajectory export | JSON saved, replayable in viewer |
- Recursive Language Models Paper - Zhang, Kraska, Khattab
- RLM Official Implementation
- Claude Code Documentation
- Claude Code Plugins
- Claude Code Router
- OpenCode - Alternative architecture reference
- Anthropic Context Engineering
Scenario: User asks to refactor auth module and verify no regressions.
┌─────────────────────────────────────────────────────────────────────────────┐
│ ◆ RLM │ depth=0/2 • task: multi_module_task+verification │
├─────────────────────────────────────────────────────────────────────────────┤
│ ▶ ANALYZE │ Context: 156K tokens across auth/, api/, tests/ │
│ ├─ conversation: 45 messages (23K tokens) │
│ ├─ files: 18 cached (98K tokens) - auth/*, api/*, tests/* │
│ └─ tool_outputs: 3 recent (35K tokens) - test failures │
│ │
│ ▶ REPL │ peek(tool_outputs[-1], 0, 2000) │
│ └─ [npm test: FAIL auth/login.test.ts:45 - expected 401, got 500] │
│ │
│ ▶ REASON │ Test failure indicates error handling issue. Need to: │
│ │ 1) Find the bug, 2) Fix it, 3) Verify no side effects │
│ │
│ ▶ RECURSE │ depth=1 │ "Analyze auth error handling and find 500 source" │
│ │ │ context: files['src/auth/**'] (34K tokens) │
│ │ │ spawn_repl: true │
│ │ │
│ │ ▶ ANALYZE │ Scanning 8 auth module files... │
│ │ ▶ REPL │ search(context, r'throw.*Error|catch.*Error') │
│ │ └─ [Found 12 matches across 5 files] │
│ │ ▶ REPL │ recursive_query("Which throw statements lack HTTP codes?", │
│ │ │ matches[:6], spawn_repl=False) │
│ │ │ │
│ │ │ ▶ RECURSE │ depth=2 │ simple extraction query │
│ │ │ └─ [auth.controller.ts:23 throws generic Error] │
│ │ │ └─ [auth.service.ts:89 throws generic Error] │
│ │ │ ◀ RETURN │ "2 locations throw generic Error without HTTP status" │
│ │ │
│ │ ▶ REASON │ auth.controller.ts:23 is in login path, likely culprit │
│ │ ◀ RETURN │ "Bug: auth.controller.ts:23 throws Error() not │
│ │ │ UnauthorizedError(). Also check auth.service.ts:89" │
│ │
│ ▶ REASON │ Confirmed bug location. Before fixing, verify call sites. │
│ │
│ ▶ RECURSE │ depth=1 │ "Find all callers of AuthController.login()" │
│ │ │ context: files['src/**'] (67K tokens) │
│ │ │ spawn_repl: true │
│ │ │
│ │ ▶ REPL │ search(context, r'AuthController.*login|login.*route') │
│ │ └─ [Found in: routes/auth.ts, middleware/session.ts, tests/*] │
│ │ ▶ RECURSE │ depth=2 │ "Do any callers depend on catching Error?" │
│ │ │ ▶ ANALYZE │ Checking error handling in 3 call sites... │
│ │ │ ◀ RETURN │ "No callers catch generic Error specifically. │
│ │ │ │ All use HTTP status codes. Safe to change." │
│ │ ◀ RETURN │ "3 call sites verified. Change is safe." │
│ │
│ ▶ REASON │ Verification complete. Safe to proceed with fix. │
│ │
│ ⚙ TOOL │ Edit src/auth/controllers/auth.controller.ts │
│ └─ Line 23: throw new Error(...) → throw new UnauthorizedError(...) │
│ │
│ ⚙ TOOL │ Edit src/auth/services/auth.service.ts │
│ └─ Line 89: throw new Error(...) → throw new UnauthorizedError(...) │
│ │
│ ⚙ TOOL │ bash: npm test -- --grep "auth" │
│ └─ [PASS: 12/12 tests passing] │
│ │
│ ▶ FINAL │ Fixed auth error handling: │
│ │ • auth.controller.ts:23 - now throws UnauthorizedError (401) │
│ │ • auth.service.ts:89 - now throws UnauthorizedError (401) │
│ │ Verified: All 12 auth tests passing, no regression. │
└─────────────────────────────────────────────────────────────────────────────┘
Trajectory Statistics:
Total turns: 1
RLM depth reached: 2
Recursive calls: 4 (2 at depth=1, 2 at depth=2)
REPL executions: 6
Tool uses: 3 (2 edits, 1 bash)
Total tokens: ~45K (vs ~156K for naive approach)
Estimated cost: $0.47 (vs $1.56 naive)
Time: 34 seconds
What Depth=2 Enabled:
- Depth=1 found the bug locations
- Depth=2 verified no callers depend on current behavior (safe to change)
- Without depth=2, would have either skipped verification or used expensive root-level analysis
| Approach | Pros | Cons |
|---|---|---|
| RLM Plugin | Preserves Claude Code, community extensible | Complexity in state sync |
| Claude Code Router | Simple model switching | No context decomposition |
| OpenCode Fork | Full control | Loses Anthropic updates |
| Custom Agent | Maximum flexibility | Reimplements everything |
Recommendation: RLM Plugin is optimal for Rand's stated goals of better reasoning and context handling while retaining Claude Code's core capabilities.