Complete documentation for using RLM-Claude-Code effectively.
- Understanding RLM
- REPL Environment
- Slash Commands
- Execution Modes
- Auto-Activation
- Memory System
- Reasoning Traces
- Budget Management
- Trajectory Analysis
- Strategy Learning
- Epistemic Verification
- Advanced Configuration
- Best Practices
Large Language Models have context limits. Even with 200K token windows, Claude can struggle with:
- Information overload: Too much context dilutes attention
- Cross-reference reasoning: Connecting information across distant parts
- Systematic analysis: Ensuring nothing is missed in large codebases
RLM (Recursive Language Model) solves this by decomposition:
- Context Externalization: Large contexts become Python variables
- REPL Environment: Claude writes code to explore context programmatically
- Recursive Sub-Queries: Complex questions spawn focused sub-queries
- Memory Persistence: Facts and experiences persist across sessions
- Strategy Learning: Successful patterns are remembered for similar tasks
User: "Find security vulnerabilities in the auth module"
RLM Analysis:
├─ Complexity classifier detects cross-file reasoning needed
├─ Orchestrator chooses: depth=2, model=sonnet, tools=read_only
├─ Context externalized: auth/*.py files as Python dict
├─ REPL execution:
│ ├─ peek(files['auth/handler.py'][:500])
│ ├─ search(files, 'password', regex=False)
│ └─ find_relevant(files['auth/session.py'], 'validation')
├─ Sub-queries spawned:
│ ├─ llm("Analyze input validation", files['handler.py'])
│ └─ llm("Check session management", files['session.py'])
├─ Results aggregated
└─ Final response with findings
The REPL is a sandboxed Python environment for context manipulation.
| Variable | Type | Description |
|---|---|---|
conversation |
list[dict] |
Messages with role and content |
files |
dict[str, str] |
Filename → content mapping |
tool_outputs |
list[dict] |
Tool results with tool and content |
working_memory |
dict |
Scratchpad for intermediate results |
View a slice of any context variable.
# First 500 chars of a file
peek(files['main.py'], 0, 500)
# Middle of conversation
peek(conversation, 5, 10)
# First 3 items of a dict
peek(files, 0, 3)Find patterns in context. Returns list of matches with location info.
# Find all authentication-related code
search(files, 'authenticate')
# Regex search for function definitions
search(files['utils.py'], r'def \w+\(', regex=True)
# Search conversation for error mentions
search(conversation, 'error')LLM-powered summarization via sub-call.
# Summarize a large file
summary = summarize(files['large_module.py'], max_tokens=200)Spawn a recursive sub-query.
# Simple sub-query
result = llm("What does this function do?", files['auth.py'])
# With REPL access for the sub-query
result = llm("Analyze this module", files['complex.py'], spawn_repl=True)Execute multiple LLM queries in parallel.
# Analyze multiple modules concurrently
results = llm_batch([
("Analyze auth module", files['auth.py']),
("Analyze db module", files['db.py']),
("Analyze api module", files['api.py']),
])Apply map-reduce pattern to large content.
# Analyze large file by chunks
result = map_reduce(
large_file_content,
map_prompt="Find potential bugs in this code chunk",
reduce_prompt="Combine these findings into a prioritized list",
n_chunks=4,
model="fast",
)Find sections most relevant to a query.
# Find authentication-related sections
relevant = find_relevant(
files['large_module.py'],
query="password validation",
top_k=3,
)
# Returns: [(chunk, score), ...]Parse and extract function definitions.
# Get all functions from a file
functions = extract_functions(files['utils.py'])
# Returns: [{'name': 'foo', 'args': [...], 'body': '...', 'line': 42}, ...]Execute safe subprocess commands (limited to ty, ruff).
# Type check a file
result = run_tool("ty", ["check", "src/module.py"])
# Lint a file
result = run_tool("ruff", ["check", "src/module.py"])When memory is enabled, additional functions are available:
Search stored knowledge.
# Find facts about authentication
results = memory_query("authentication patterns", limit=5)Store a fact.
# Remember a discovery
memory_add_fact("This project uses JWT for auth", confidence=0.9)Store an experience with outcome.
# Record what worked
memory_add_experience(
"Used map_reduce for large file analysis",
"Successfully identified 3 bugs",
success=True,
)Get recent/relevant context nodes.
# Get context for current work
context_nodes = memory_get_context(limit=5)Create relationships between nodes.
# Link related facts
memory_relate(fact1_id, fact2_id, "supports")| Command | Description |
|---|---|
/rlm |
Show current RLM status |
/rlm on |
Enable RLM for this session |
/rlm off |
Disable RLM mode |
/rlm status |
Show detailed configuration |
| Command | Description |
|---|---|
/rlm mode fast |
Quick, shallow analysis |
/rlm mode balanced |
Standard processing (default) |
/rlm mode thorough |
Deep, comprehensive analysis |
| Command | Description |
|---|---|
/rlm depth <0-3> |
Set maximum recursion depth |
/rlm budget $X |
Set session cost limit |
/rlm model <name> |
Force model (opus/sonnet/haiku/auto) |
/rlm tools <level> |
Tool access (none/repl/read/full) |
/rlm verbosity <level> |
Output detail (minimal/normal/verbose/debug) |
/rlm reset |
Reset all settings to defaults |
/rlm save |
Save current preferences to disk |
| Command | Description |
|---|---|
/simple |
Bypass RLM for current query only |
/trajectory <file> |
Analyze a saved trajectory file |
/test |
Run the test suite |
/bench |
Run performance benchmarks |
/code-review |
Review code changes |
/rlm mode fast
| Setting | Value |
|---|---|
| Depth | 1 |
| Model | Haiku |
| Tools | REPL only |
Best for: Quick questions, iteration, simple tasks.
/rlm mode balanced
| Setting | Value |
|---|---|
| Depth | 2 |
| Model | Sonnet |
| Tools | Read-only |
Best for: Most daily tasks, feature development, bug fixes.
/rlm mode thorough
| Setting | Value |
|---|---|
| Depth | 3 |
| Model | Opus |
| Tools | Full access |
Best for: Security audits, architecture decisions, complex debugging.
RLM analyzes each query to decide whether to activate:
- Context Size: Large contexts (>80K tokens) trigger activation
- Query Complexity: Cross-file references, debugging keywords
- Pattern Matching: Architecture questions, comparison requests
- User Preference: Manual
/rlm onoverrides everything
| Signal | Examples |
|---|---|
| Cross-file reference | "How does auth.py interact with api.py?" |
| Debugging keywords | "Why does this fail?", "trace the error" |
| Architecture questions | "How should I structure this?" |
| Comparison requests | "What's the difference between X and Y?" |
| Multi-step tasks | "Refactor and add tests" |
/rlm on # Force activation for all queries
/rlm off # Disable auto-activation
/simple # Skip activation for one query
With debug verbosity:
/rlm verbosity debug
You'll see activation reasoning:
[ACTIVATION] Analyzing query...
- Token count: 145,230 (above threshold)
- Cross-file references: 3 detected
- Complexity score: 0.87
- Decision: ACTIVATE
RLM includes a persistent memory system for cross-session learning.
| Type | Description |
|---|---|
fact |
Verified information about the codebase |
experience |
Past actions and their outcomes |
procedure |
Known working approaches |
goal |
Tracked objectives |
Memory evolves through tiers based on usage and confidence:
task → session → longterm → archive
| Tier | Lifespan | Purpose |
|---|---|---|
task |
Current task | Working memory |
session |
Current session | Short-term recall |
longterm |
Persistent | Core knowledge |
archive |
Compressed | Historical reference |
from src import MemoryStore, MemoryEvolution
# Create store
store = MemoryStore(db_path="~/.claude/rlm-memory.db")
# Store facts
fact_id = store.create_node(
node_type="fact",
content="This project uses PostgreSQL 15",
tier="task",
confidence=0.9,
)
# Create relationships
store.create_edge(
edge_type="relation",
label="uses",
members=[
{"node_id": project_id, "role": "subject", "position": 0},
{"node_id": fact_id, "role": "object", "position": 1},
],
)
# Evolve memory
evolution = MemoryEvolution(store)
evolution.consolidate(task_id="current-task") # task → session
evolution.promote(session_id="session-1") # session → longterm
evolution.decay(days_threshold=30) # old → archiveTrack decision-making for transparency and debugging.
from src import ReasoningTraces
traces = ReasoningTraces(store)
# Create a goal
goal_id = traces.create_goal(
content="Implement user authentication",
prompt="How should I implement user authentication?",
files=["src/auth.py", "src/models/user.py"],
)
# Create a decision point
decision_id = traces.create_decision(
goal_id=goal_id,
content="Choose authentication strategy",
)
# Add options
jwt_option = traces.add_option(decision_id, "Use JWT tokens")
session_option = traces.add_option(decision_id, "Use session cookies")
# Record the choice
traces.choose_option(decision_id, jwt_option)
traces.reject_option(decision_id, session_option, "JWT is more scalable for API")
# Create action and outcome
action_id = traces.create_action(decision_id, "Implementing JWT authentication")
outcome_id = traces.create_outcome(action_id, "JWT auth implemented successfully", success=True)# Get full decision tree
tree = traces.get_decision_tree(goal_id)
# Get rejected options with reasons
rejected = traces.get_rejected_options(decision_id)
for opt in rejected:
print(f"Rejected: {opt.content} - {opt.reason}")/rlm budget $5 # Session budget of $5
/rlm budget $0.50 # Budget of 50 cents
- Budgets are per-session (reset when Claude Code restarts)
- RLM tracks estimated cost of each operation
- When budget is exceeded, RLM uses simpler strategies
- You're warned before exceeding budget
from src import EnhancedBudgetTracker, BudgetLimits
tracker = EnhancedBudgetTracker()
# Set limits
tracker.set_limits(BudgetLimits(
max_cost_per_task=5.0,
max_recursive_calls=10,
max_depth=3,
max_repl_executions=50,
))
# Start tracking a task
tracker.start_task("analyze-codebase")
tracker.start_timing()
# Check before operations
allowed, reason = tracker.can_make_llm_call()
if not allowed:
print(f"Blocked: {reason}")
# Record operations
tracker.record_llm_call(
input_tokens=1000,
output_tokens=500,
model="sonnet",
component=CostComponent.RECURSIVE_CALL,
)
tracker.record_repl_execution()
tracker.record_depth(2)
# Get metrics
metrics = tracker.get_metrics()
print(f"Cost: ${metrics.total_cost_usd:.2f}")
print(f"Calls: {metrics.sub_call_count}")
print(f"Max depth: {metrics.max_depth_reached}")
# End task
tracker.stop_timing()
tracker.end_task()The tracker can trigger alerts:
tracker.set_limits(BudgetLimits(
max_cost_per_task=5.0,
warn_at_cost=4.0, # Warn at 80%
))
# Check for alerts
alerts = tracker.get_alerts()
for alert in alerts:
print(f"[{alert.level}] {alert.message}")A trajectory records RLM's reasoning process:
- Queries and sub-queries
- REPL code executed
- Results at each step
- Final answer synthesis
| Level | Shows |
|---|---|
minimal |
RECURSE, FINAL, ERROR only |
normal |
All events, truncated content |
verbose |
All events, full content |
debug |
Everything + internal state |
/trajectory ~/.claude/trajectories/session-123.json
Output:
Trajectory Analysis
───────────────────
Total events: 23
Max depth reached: 2
Recursive calls: 4
REPL executions: 8
Duration: 34.2s
Estimated cost: $0.47
Event Distribution:
ANALYZE: 3
REPL_EXEC: 8
RECURSE_START: 4
RECURSE_END: 4
FINAL: 1
RLM learns from successful trajectories.
| Strategy | Description | When Used |
|---|---|---|
| Peeking | Sample context before deep dive | Large files, unknown structure |
| Grepping | Pattern-based search | Finding specific code patterns |
| Partition+Map | Divide and conquer | Multi-file analysis |
| Programmatic | One-shot code execution | Transformations, calculations |
| Recursive | Spawn sub-queries | Verification, complex reasoning |
- Pattern Detection: Identifies strategies used in successful trajectories
- Feature Extraction: Extracts query characteristics
- Similarity Matching: Matches new queries to past successes
- Strategy Suggestion: Suggests proven approaches
With debug verbosity:
[STRATEGY] Similar query found (similarity: 0.89)
Previous: "Find all TODO comments in src/"
Strategy: grepping (effectiveness: 0.94)
Suggestion: Use search() with regex pattern
RLM includes always-on hallucination detection that verifies claims against evidence.
LLMs can exhibit "procedural hallucinations" where they:
- Have correct information but fail to use it properly
- Cite evidence that doesn't actually support their claims
- Present confident answers disconnected from provided context
Epistemic verification catches these issues by checking claims against evidence.
| Command | Description |
|---|---|
/verify |
Show verification status and configuration |
/verify on |
Enable verification for this session |
/verify off |
Disable verification |
/verify report |
Show the last verification report |
/verify claim "..." |
Verify a specific claim against context |
/verify feedback <id> correct|incorrect |
Provide accuracy feedback |
/verify stats |
Show feedback statistics |
/verify mode <mode> |
Set verification mode |
| Mode | Description | Cost |
|---|---|---|
full |
Verify all extracted claims | Highest |
sample |
Verify critical claims + 30% sample (default) | Medium |
critical |
Only verify claims marked as critical | Lowest |
Set the mode:
/verify mode sample
When RLM is active, these verification functions are available:
Verify a single claim against evidence.
result = verify_claim(
"The function returns 42",
"def func(): return 42",
threshold=0.7
)
# Returns ClaimVerification with:
# - evidence_support: 0.95
# - evidence_dependence: 0.8
# - is_flagged: FalseCheck if an answer actually depends on the evidence provided.
score = evidence_dependence(
"What color is the widget?",
"The widget is blue.",
"According to the spec, widgets are blue."
)
# Returns 0.0-1.0
# - 1.0 = answer fully depends on evidence (good)
# - 0.0 = answer unchanged without evidence (potential hallucination)Verify a chain of reasoning steps.
results = audit_reasoning(
steps=[
{"claim": "The function returns 42", "cites": ["src1"]},
{"claim": "This matches the spec", "cites": ["src2"]},
],
sources={
"src1": "def func(): return 42",
"src2": "Spec: func should return 42",
}
)Auto-detect and verify all claims in a response.
report = detect_hallucinations(
response="The function returns 42 and handles errors gracefully.",
context="def func(): return 42",
support_threshold=0.7
)
# Returns HallucinationReport with flagged claims and gapsThe verification report shows:
Verification Report
───────────────────
Response: resp-abc123
Mode: sample (30% sampling)
Claims: 5 total, 4 verified, 1 flagged
Confidence: 0.85
Flagged Claims:
[c3] "The API returns XML data"
Reason: unsupported
Suggestion: Provide supporting evidence or remove claim
Evidence Gaps:
- partial_support (c2): Claim goes beyond available evidence
Key metrics:
- Claims verified: Passed evidence support and dependence checks
- Claims flagged: Failed verification (reasons below)
- Confidence: Overall weighted score (higher = more trustworthy)
Flag reasons:
| Reason | Meaning |
|---|---|
unsupported |
No evidence supports the claim |
phantom_citation |
Cited source doesn't exist |
contradiction |
Evidence contradicts the claim |
over_extrapolation |
Claim goes beyond what evidence states |
low_dependence |
Answer unchanged without evidence |
Help improve verification accuracy by providing feedback:
/verify feedback c1 correct # Verification was accurate
/verify feedback c2 incorrect # False positive - claim was actually fine
View accuracy statistics:
/verify stats
Feedback is stored and used to calibrate thresholds over time.
In ~/.claude/rlm-config.json:
{
"verification": {
"enabled": true,
"mode": "sample",
"support_threshold": 0.7,
"dependence_threshold": 0.3,
"sample_rate": 0.3,
"on_failure": "retry",
"max_retries": 2,
"verification_model": "haiku",
"critical_model": "sonnet"
}
}| Setting | Default | Description |
|---|---|---|
enabled |
true |
Enable/disable verification |
mode |
"sample" |
full, sample, or critical_only |
support_threshold |
0.7 |
Minimum evidence support score |
dependence_threshold |
0.3 |
Minimum evidence dependence |
sample_rate |
0.3 |
Fraction to verify in sample mode |
on_failure |
"retry" |
Action on failure: flag, retry, or ask |
verification_model |
"haiku" |
Model for standard verification |
critical_model |
"sonnet" |
Model for critical claims |
Enable verification when:
- Accuracy is critical (production docs, code review)
- Working with unfamiliar codebases
- Generating technical specifications
- Claims seem wrong or suspicious
Disable verification when:
- Quick iterations where speed matters more
- Creative or exploratory tasks
- You're confident in the context
Verification adds overhead (~$0.001 per response in sample mode):
- Claim extraction: ~$0.0003
- Evidence mapping: ~$0.0002
- Per-claim verification: ~$0.0001 each
Use critical mode for lowest cost, full mode only when accuracy is paramount.
~/.claude/rlm-config.json:
{
"activation": {
"mode": "complexity",
"fallback_token_threshold": 80000,
"auto_activate": true,
"complexity_threshold": 0.6
},
"depth": {
"default": 2,
"max": 3
},
"models": {
"root_model": "opus",
"recursive_depth_1": "sonnet",
"recursive_depth_2": "haiku",
"prefer_provider": "anthropic"
},
"trajectory": {
"verbosity": "normal",
"streaming": true,
"save_to_disk": true,
"save_path": "~/.claude/trajectories"
},
"cost": {
"session_budget": 5.0,
"warn_at_percent": 80
},
"tools": {
"default_access": "read_only",
"blocked_commands": ["rm -rf", "sudo"]
}
}| Variable | Purpose |
|---|---|
ANTHROPIC_API_KEY |
Anthropic API access |
OPENAI_API_KEY |
OpenAI API access (optional) |
RLM_CONFIG_PATH |
Custom config location |
RLM_DEBUG |
Enable debug logging |
The default balanced mode works well for most tasks. Only switch to thorough for genuinely complex work.
Set a reasonable budget to prevent unexpected costs:
/rlm budget $2
For important decisions, check the trajectory to understand RLM's reasoning:
/rlm verbosity verbose
Don't waste RLM overhead on simple queries:
/simple
What's the syntax for a Python list comprehension?
Store facts about your codebase to improve future sessions:
memory_add_fact("This project uses FastAPI with SQLAlchemy", confidence=0.95)Help RLM make better decisions:
# Good - clear scope
"Analyze the authentication flow in src/auth/"
# Less good - vague
"Check the code"
For security-sensitive work:
/rlm mode thorough
Find security vulnerabilities in the payment processing code
- GitHub Issues: github.com/rand/rlm-claude-code/issues
- Getting Started: getting-started.md
- Specification: rlm-claude-code-spec.md