Commit 094027e

feat: add EvolveMem (v3.0) core code and update main README
1 parent 94ef7d7 commit 094027e

36 files changed

Lines changed: 17712 additions & 182 deletions

EvolveMem/README.md

Lines changed: 243 additions & 0 deletions
@@ -0,0 +1,243 @@

<h2 align="center"><b>EvolveMem: Self-Evolving Memory Architecture via AutoResearch</b></h2>

<p align="center">
  <b><i>Extending <a href="https://github.com/aiming-lab/SimpleMem">SimpleMem</a> with self-evolving retrieval infrastructure. The system autonomously researches its own architecture through LLM-driven closed-loop diagnosis.</i></b>
</p>

<p align="center">
  <a href="https://python.org"><img src="https://img.shields.io/badge/Python-3.10%2B-3776AB?logo=python&logoColor=white" alt="Python 3.10+"></a>
  <a href="#-results"><img src="https://img.shields.io/badge/SOTA-LoCoMo%20%7C%20MemBench-ff6f00?logo=target&logoColor=white" alt="SOTA"></a>
  <a href="#-citation"><img src="https://img.shields.io/badge/NeurIPS-2026-blue?logo=arxiv&logoColor=white" alt="NeurIPS 2026"></a>
</p>

<p align="center">
  <a href="#-quick-start">Quick Start</a> &nbsp;·&nbsp;
  <a href="#%EF%B8%8F-architecture">Architecture</a> &nbsp;·&nbsp;
  <a href="#-results">Results</a> &nbsp;·&nbsp;
  <a href="#-self-evolution-trajectory">Evolution</a> &nbsp;·&nbsp;
  <a href="#-citation">Citation</a>
</p>

---

## 💡 Key Idea

Every existing memory system evolves what it *stores* but never how it *retrieves*. EvolveMem closes this gap.

The retrieval infrastructure (fusion weights, context budgets, answer styles, per-category overrides, ...) is exposed as a **structured action space** and optimized through an autonomous closed loop:

| Step | What happens |
|:--:|:--|
| 📊 **Evaluate** | Run held-out QA, write per-question failure logs |
| 🔍 **Diagnose** | LLM reads failure logs, identifies root causes |
| 💡 **Propose** | Targeted configuration adjustments |
| 🛡️ **Guard** | Auto-revert if performance drops |

This closed-loop self-evolution realizes an **AutoResearch** process: the system conducts the observe-hypothesize-experiment-validate cycle on its own architecture.
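
The four steps in the table can be sketched as a small loop. This is a minimal illustration, not the repository's `EvolutionEngine`: the names `evolve`, `toy_evaluate`, and `toy_propose` are hypothetical, the LLM-driven Diagnose and Propose steps are collapsed into a single `propose` callback, and the guard is simplified to accept only strict improvements.

```python
from typing import Callable, Dict, Tuple

Config = Dict[str, float]


def evolve(
    config: Config,
    evaluate: Callable[[Config], float],
    propose: Callable[[Config, float], Config],
    max_rounds: int = 7,
) -> Tuple[Config, float]:
    """Run an evaluate -> propose -> guard loop and return the best config seen."""
    best_config, best_score = dict(config), evaluate(config)
    for _ in range(max_rounds):
        # Stand-in for Diagnose + Propose: derive a candidate from the best config.
        candidate = propose(best_config, best_score)
        # Evaluate: score the candidate on held-out data.
        score = evaluate(candidate)
        # Guard: keep the candidate only if it improves on the best-so-far;
        # otherwise it is dropped, which amounts to a revert.
        if score > best_score:
            best_config, best_score = candidate, score
    return best_config, best_score


def toy_evaluate(cfg: Config) -> float:
    """Hypothetical score that peaks at top_k == 15."""
    return 1.0 - abs(cfg["top_k"] - 15) / 15


def toy_propose(cfg: Config, score: float) -> Config:
    """Hypothetical proposal: always try a larger retrieval depth."""
    return {**cfg, "top_k": cfg["top_k"] + 5}


best, score = evolve({"top_k": 5}, toy_evaluate, toy_propose, max_rounds=4)
print(best, score)
```

In the toy run, the loop climbs from `top_k=5` to `top_k=15` and then rejects the overshooting proposal `top_k=20`, so the returned configuration is never worse than the starting point.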

---

## ✨ Highlights

<table>
  <tr>
    <td align="center" width="160">📈 <b>+25.7%</b><br><sub>vs. strongest baseline (LoCoMo)</sub></td>
    <td align="center" width="160">📈 <b>+18.9%</b><br><sub>vs. strongest baseline (MemBench)</sub></td>
    <td align="center" width="160">🧬 <b>Self-expanding</b><br><sub>3 new dimensions discovered</sub></td>
    <td align="center" width="160">🔄 <b>Positive transfer</b><br><sub>Cross-benchmark generalization</sub></td>
    <td align="center" width="140">⚙️ <b>7 rounds</b><br><sub>Fully autonomous</sub></td>
  </tr>
</table>

---

## 🚀 Quick Start

### Installation

```bash
git clone https://github.com/aiming-lab/SimpleMem.git
cd SimpleMem/EvolveMem
pip install -r requirements.txt
```

### Configuration

```bash
export OPENAI_API_KEY="your-key-here"
export OPENAI_API_BASE="https://api.openai.com/v1"  # or Azure endpoint
export LLM_MODEL="gpt-4o"
```

### Run Self-Evolution

```bash
# Full evolution on LoCoMo (7 rounds)
python run_evolution.py --data data/locomo10.json --max-rounds 7

# Quick 3-round evolution
python run_evolution.py --data data/locomo10.json --max-rounds 3

# Start from pre-extracted memory cache
python run_evolution.py --use-cache cache.json --max-rounds 5
```

### Run Benchmark Evaluation

```bash
# LoCoMo evaluation
python run_benchmark.py locomo --sample 0 --initial weak --max-rounds 3

# MemBench evaluation
python run_benchmark.py membench --agent FirstAgent \
    --categories simple comparative aggregative conditional \
    --initial weak --max-rounds 3
```

---

## 🏗️ Architecture

EvolveMem consists of three layers connected by a self-evolution feedback loop:

### 1. 🗄️ Structured Memory Store

| Component | Description |
|:--|:--|
| **SQLite + FTS5** | Persistent storage with full-text search |
| **LLM Extraction** | Sliding window with retry, chunk-splitting, coverage verification |
| **Consolidation** | Deduplication, importance decay, entity reinforcement |
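
To make the "SQLite + FTS5" row concrete, here is a minimal sketch of full-text search with BM25 ranking using Python's standard `sqlite3` module. The table name `memories`, its columns, and the helper `search` are assumptions made for illustration; the repository's actual schema in `MemoryStore` may differ. It also assumes the local SQLite build ships with the FTS5 extension.

```python
import sqlite3
from typing import List


def build_store() -> sqlite3.Connection:
    """Create an in-memory FTS5 table with a few hypothetical memory units."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE memories USING fts5(content, entity)")
    conn.executemany(
        "INSERT INTO memories(content, entity) VALUES (?, ?)",
        [
            ("Alice moved to Paris in 2021", "Alice"),
            ("Bob adopted a dog named Rex", "Bob"),
            ("Alice visited Paris again with Bob", "Alice"),
        ],
    )
    return conn


def search(conn: sqlite3.Connection, query: str, limit: int = 5) -> List[str]:
    """Return matching memory contents, best BM25 match first."""
    rows = conn.execute(
        # FTS5's bm25() returns smaller values for better matches,
        # so ascending order puts the most relevant rows first.
        "SELECT content FROM memories WHERE memories MATCH ? "
        "ORDER BY bm25(memories) LIMIT ?",
        (query, limit),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_store()
hits = search(conn, "paris")
print(hits)
```

The default FTS5 tokenizer is case-insensitive for ASCII text, so the query `paris` matches both rows that mention "Paris".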

### 2. 🔍 Multi-View Retrieval (Evolvable Action Space)

| View | Signal | Purpose |
|:--|:--|:--|
| 📝 **Lexical** | BM25 | Exact keyword matching |
| 🧠 **Semantic** | Dense embeddings | Conceptual similarity |
| 🏷️ **Structured** | Entity/location/person metadata | Structured filtering |

Fusion mode, per-view weights, context budgets, answer styles, and per-category overrides are all **evolvable parameters**.
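
As an illustration of how a fusion mode and per-view weights might combine the three views, the sketch below implements weighted reciprocal rank fusion (RRF), one of the fusion modes named in the system diagram. The function `rrf_fuse`, the view names, and the constant `k=60` (a commonly used default for RRF) are assumptions, not values taken from the repository.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def rrf_fuse(
    rankings: Dict[str, List[str]],
    weights: Optional[Dict[str, float]] = None,
    k: int = 60,
) -> List[str]:
    """Fuse per-view rankings: each view adds weight / (k + rank) to an item."""
    weights = weights or {}
    scores: Dict[str, float] = defaultdict(float)
    for view, ranked_ids in rankings.items():
        w = weights.get(view, 1.0)  # views without a weight default to 1.0
        for rank, mem_id in enumerate(ranked_ids, start=1):
            scores[mem_id] += w / (k + rank)
    # Sort by fused score (descending), breaking ties by id for determinism.
    return sorted(scores, key=lambda m: (-scores[m], m))


rankings = {
    "lexical": ["m1", "m2", "m3"],
    "semantic": ["m2", "m3", "m1"],
    "structured": ["m2"],
}
fused = rrf_fuse(rankings)
print(fused)
```

Here `m2` wins because it ranks near the top in all three views. Changing `weights` or `k` is the kind of adjustment the evolution loop could propose.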

### 3. 🧬 Self-Evolution Engine (AutoResearch)

The engine reads per-question failure logs, diagnoses root causes, and proposes targeted adjustments. Three safeguards ensure robustness:

| Safeguard | Trigger | Action |
|:--|:--|:--|
| 🛡️ **Revert** | Performance drops > threshold | Roll back to best-so-far |
| 🔀 **Explore** | Score plateaus for 2 rounds | Random perturbation |
| ⏹️ **Converge** | Improvement < epsilon | Terminate and return best |
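
One way to read the safeguard table is as a decision over the score history. The sketch below is hypothetical: the function `meta_decide`, the threshold values, the `patience` window used to separate Converge from Explore, and the precedence among the three safeguards are all assumptions, since the README does not specify them.

```python
from typing import List


def meta_decide(
    scores: List[float],
    drop_threshold: float = 0.005,  # assumed value
    epsilon: float = 0.002,         # assumed value
    plateau_rounds: int = 2,
    patience: int = 4,              # assumed: longer stall before terminating
) -> str:
    """Return 'revert', 'explore', 'converge', or 'apply' for the latest round."""
    if len(scores) < 2:
        return "apply"

    # Revert: the latest score fell below the previous best by more than the threshold.
    if scores[-1] < max(scores[:-1]) - drop_threshold:
        return "revert"

    def stalled(n: int) -> bool:
        """True when the last n rounds gained less than epsilon over the earlier best."""
        if len(scores) <= n:
            return False
        return max(scores[-n:]) - max(scores[:-n]) < epsilon

    if stalled(patience):
        return "converge"  # long stall: terminate and return the best config
    if stalled(plateau_rounds):
        return "explore"   # short plateau: try a random perturbation
    return "apply"


print(meta_decide([0.305, 0.358, 0.348]))
print(meta_decide([0.30, 0.40, 0.401, 0.4005]))
```

The first call uses the R0 to R2 scores from the trajectory table: the drop from 0.358 to 0.348 exceeds the assumed threshold, so the decision is a revert.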

<details>
<summary>📐 Full system diagram</summary>

```
      Raw Conversations
              │
              ▼
┌─────────────────────────────┐
│  LLM-Based Extraction       │ ← Sliding window + retry + coverage verify
│  → Typed Memory Units       │
└─────────────┬───────────────┘
              ▼
┌─────────────────────────────┐
│  Multi-View Retrieval       │
│  BM25 ∪ Semantic ∪ Struct   │ ← Evolvable fusion (sum/weighted/RRF)
│  + Entity-swap              │
│  + Query decomposition      │
└─────────────┬───────────────┘
              ▼
┌─────────────────────────────┐
│  Answer Generation          │ ← Per-category style + verification
└─────────────┬───────────────┘
              ▼
┌─────────────────────────────┐
│  Evaluation + Diagnosis     │ ← LLM reads per-question failure logs
│  → Structured proposal      │
└─────────────┬───────────────┘
              ▼
┌─────────────────────────────┐
│  Meta-Analyzer              │ ← Revert / Explore / Apply
│  → Updated config θ         │
└─────────────┬───────────────┘
              │
              └──── Loop back to Retrieval ────┘
```

</details>

---

## 📊 Results

### LoCoMo (Token-F1)

| Method | GPT-4o | GPT-5.1 |
|:--|:--:|:--:|
| MemVerse | 0.365 | 0.383 |
| Mem0 | 0.397 | 0.390 |
| A-MEM | 0.394 | 0.385 |
| MemGPT | 0.404 | 0.385 |
| SimpleMem | 0.432 | 0.418 |
| **EvolveMem** | **0.543** | **0.572** |
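
For readers unfamiliar with the metric, token-level F1 is the harmonic mean of token precision and recall between a predicted and a reference answer. The sketch below shows the usual computation. The function name `token_f1` and the normalization (lowercasing and splitting on word characters) are assumptions, and the benchmark's official scorer may normalize differently.

```python
import re
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two strings, in the range [0, 1]."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("She moved to Paris in 2021", "Paris"))
```

The example illustrates why answer style matters for this metric: a verbose but correct answer has perfect recall yet only 1/6 precision, giving an F1 of 2/7.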

### MemBench (Accuracy %)

| Method | GPT-4o | GPT-5.1 |
|:--|:--:|:--:|
| RecentMemory | 57.1 | 60.7 |
| MemGPT | 57.1 | 60.7 |
| MemoryBank | 46.4 | 64.3 |
| SCMemory | 39.3 | 32.1 |
| **EvolveMem** | **67.9** | **71.4** |

---

## 🧬 Self-Evolution Trajectory

Starting from a minimal BM25-only baseline (F1 = 30.5%), the system autonomously discovers and activates retrieval mechanisms over 7 rounds:

| Round | Stage | Automated Change | F1 (%) |
|:--:|:--:|:--|:--:|
| R0 | 🟢 start | BM25-only, k=5 | 30.5 |
| R1 | ⚙️ auto | Intent planning + RRF fusion | 35.8 |
| R2 | 🔙 revert | MMR diversity (reverted) | 34.8 |
| R3 | ⚙️ auto | Entity-swap for Cat. 5 | 37.2 |
| R4 | ⚙️ auto | Per-category answer styles | 38.5 |
| R5 | ⚙️ auto | Query decomposition for Cat. 1/4 | 38.1 |
| R6 | ⚙️ auto | Cat. 3 inferential subtypes + swap expansion | 45.4 |
| R7 | ⚙️ auto | Answer verification + hyperparameter sweep | **54.3** |

Three configuration dimensions emerged from failure diagnosis that were **not in the original design**:
- 🔀 **Query decomposition** (splitting multi-hop questions into sub-queries)
- 🔄 **Adversarial entity-swap** (stripping misleading names before retrieval)
- **Answer verification** (second-pass LLM review of low-confidence outputs)

---

## 🔄 Cross-Benchmark Transfer

| Configuration | LoCoMo (F1) | MemBench (Acc) |
|:--|:--:|:--:|
| Baseline | 0.305 | / |
| C_L (LoCoMo only) | 0.543 | 0.543 |
| C_LM (LoCoMo → MemBench) | **0.593** | **0.792** |
| C_M (MemBench only) | / | 0.679 |

> Continued evolution from a LoCoMo prior **outperforms** scratch evolution on MemBench (+16.6% relative) while also **improving** LoCoMo performance, a Pareto improvement across both benchmarks.
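
The relative gain quoted above can be checked directly from the table. The helper `relative_gain` below is a hypothetical name used only for this arithmetic.

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100


# MemBench: C_LM (0.792) vs. C_M (0.679)
membench_gain = relative_gain(0.792, 0.679)
# LoCoMo: C_LM (0.593) vs. C_L (0.543)
locomo_gain = relative_gain(0.593, 0.543)

print(f"MemBench: +{membench_gain:.1f}%")  # MemBench: +16.6%
print(f"LoCoMo:   +{locomo_gain:.1f}%")    # LoCoMo:   +9.2%
```

The same formula reproduces the Highlights figures: 0.543 vs. 0.432 on LoCoMo gives +25.7%, and 67.9 vs. 57.1 on MemBench gives +18.9% (both GPT-4o columns).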

---

## 📝 Citation

```bibtex
@article{evolvemem2026,
  title={EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents},
  author={Liu, Jiaqi and Ye, Xinyu and Xia, Peng and Zheng, Zeyu and Xie, Cihang and Ding, Mingyu and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2605.13941},
  year={2026},
  url={https://arxiv.org/abs/2605.13941}
}
```

EvolveMem/evolvemem/__init__.py

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
"""
EvolveMem — Self-Evolving Memory Architecture for LLM Agents.

A four-layer memory system with typed knowledge representation, adaptive
retrieval policy, replay-based offline evaluation, and promotion-gated
self-evolution — all backed by SQLite + FTS5.
"""

from .candidate import generate_policy_candidates
from .consolidator import MemoryConsolidator
from .diagnosis import DiagnosisReport, MemoryDiagnostics, QAResult
from .evolution import EvolutionConfig, EvolutionEngine, EvolutionResult
from .extractor import ExtractionConfig, MemoryExtractor
from .manager import MemoryManager
from .models import MemoryQuery, MemoryStatus, MemoryType, MemoryUnit
from .multi_retriever import (
    MultiViewIndex,
    RetrievalConfig,
    RetrievedMemory,
    format_context,
    retrieve_multiview,
)
from .promotion import MemoryPromotionCriteria, should_promote
from .replay import (
    MemoryReplayEvaluator,
    MemoryReplaySample,
    load_replay_samples,
    run_policy_candidate_replay,
)
from .scope import derive_memory_scope
from .self_upgrade import MemorySelfUpgradeOrchestrator
from .store import MemoryStore
from .telemetry import MemoryTelemetryStore
from .upgrade_worker import MemoryUpgradeWorker

__all__ = [
    # Core
    "MemoryManager",
    "MemoryStore",
    "MemoryConsolidator",
    "MemoryQuery",
    "MemoryStatus",
    "MemoryType",
    "MemoryUnit",
    "MemoryTelemetryStore",
    # Self-Evolution (NEW)
    "EvolutionEngine",
    "EvolutionConfig",
    "EvolutionResult",
    "MemoryExtractor",
    "ExtractionConfig",
    "MemoryDiagnostics",
    "DiagnosisReport",
    "QAResult",
    # Multi-View Retrieval (NEW)
    "MultiViewIndex",
    "RetrievalConfig",
    "RetrievedMemory",
    "retrieve_multiview",
    "format_context",
    # Policy Evolution
    "MemoryPromotionCriteria",
    "should_promote",
    "MemoryReplayEvaluator",
    "MemoryReplaySample",
    "load_replay_samples",
    "run_policy_candidate_replay",
    "MemorySelfUpgradeOrchestrator",
    "MemoryUpgradeWorker",
    "derive_memory_scope",
    "generate_policy_candidates",
]
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
"""Benchmark adapters — map heterogeneous benchmark data into a unified
`BenchmarkSample` format consumed by `EvolutionEngine`.

Each adapter provides:
  - load(...) -> list[BenchmarkSample]
  - scoring_fn(prediction, reference, qa_meta) -> float  (bounded [0,1])
  - answer_prompt(question, context, qa_meta) -> str  (format-specific)

The engine calls these via the `BenchmarkAdapter` protocol; adding a new
benchmark means adding a file here, not touching the core engine.
"""

from .base import (
    BenchmarkAdapter,
    BenchmarkSample,
    QuestionMeta,
    register_adapter,
    get_adapter,
)
from .locomo import LoCoMoAdapter
from .longmemeval import LongMemEvalAdapter
from .membench import MemBenchAdapter

# Registry — populated on import
register_adapter("locomo", LoCoMoAdapter)
register_adapter("longmemeval", LongMemEvalAdapter)
register_adapter("membench", MemBenchAdapter)

__all__ = [
    "BenchmarkAdapter",
    "BenchmarkSample",
    "QuestionMeta",
    "LoCoMoAdapter",
    "LongMemEvalAdapter",
    "MemBenchAdapter",
    "get_adapter",
    "register_adapter",
]
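
The module docstring above says a new benchmark only needs a new adapter file. A self-contained sketch of what such an adapter might look like follows. Everything in it is a stand-in written for illustration: the real `BenchmarkSample`, `BenchmarkAdapter`, `register_adapter`, and `get_adapter` live in `.base`, and their fields and exact signatures are not shown in this commit view, so the definitions below are assumptions modeled only on the three methods the docstring lists.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Protocol, Type


@dataclass
class BenchmarkSample:
    """Hypothetical stand-in; the real class may carry different fields."""
    question: str
    reference: str
    qa_meta: Dict[str, Any] = field(default_factory=dict)


class BenchmarkAdapter(Protocol):
    """The three methods listed in the docstring, with assumed signatures."""
    def load(self, path: str) -> List[BenchmarkSample]: ...
    def scoring_fn(self, prediction: str, reference: str, qa_meta: Dict[str, Any]) -> float: ...
    def answer_prompt(self, question: str, context: str, qa_meta: Dict[str, Any]) -> str: ...


_REGISTRY: Dict[str, Type[Any]] = {}


def register_adapter(name: str, cls: Type[Any]) -> None:
    """Stand-in registry: map a benchmark name to an adapter class."""
    _REGISTRY[name] = cls


def get_adapter(name: str) -> BenchmarkAdapter:
    """Stand-in lookup: instantiate the adapter registered under `name`."""
    return _REGISTRY[name]()


class ToyExactMatchAdapter:
    """Hypothetical adapter that scores answers by normalized exact match."""

    def load(self, path: str) -> List[BenchmarkSample]:
        # A real adapter would parse the file at `path`; this one returns a fixture.
        return [BenchmarkSample("Where does Alice live?", "Paris")]

    def scoring_fn(self, prediction: str, reference: str, qa_meta: Dict[str, Any]) -> float:
        return float(prediction.strip().lower() == reference.strip().lower())

    def answer_prompt(self, question: str, context: str, qa_meta: Dict[str, Any]) -> str:
        return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


register_adapter("toy", ToyExactMatchAdapter)
adapter = get_adapter("toy")
print(adapter.scoring_fn(" Paris ", "paris", {}))
```

Because the engine depends only on the protocol, an adapter like this needs no changes to the core loop; it only has to keep `scoring_fn` bounded in [0, 1] as the docstring requires.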
