| 1 | + |
| 2 | + |
| 3 | +<h2 align="center"><b>EvolveMem: Self-Evolving Memory Architecture via AutoResearch</b></h2> |
| 4 | + |
| 5 | +<p align="center"> |
| 6 | + <b><i>Extending <a href="https://github.com/aiming-lab/SimpleMem">SimpleMem</a> with self-evolving retrieval infrastructure. The system autonomously researches its own architecture through LLM-driven closed-loop diagnosis.</i></b> |
| 7 | +</p> |
| 8 | + |
| 9 | +<p align="center"> |
| 10 | + <a href="https://python.org"><img src="https://img.shields.io/badge/Python-3.10%2B-3776AB?logo=python&logoColor=white" alt="Python 3.10+"></a> |
| 11 | + <a href="#-results"><img src="https://img.shields.io/badge/SOTA-LoCoMo%20%7C%20MemBench-ff6f00?logo=target&logoColor=white" alt="SOTA"></a> |
| 12 | + <a href="#-citation"><img src="https://img.shields.io/badge/NeurIPS-2026-blue?logo=arxiv&logoColor=white" alt="NeurIPS 2026"></a> |
| 13 | +</p> |
| 14 | + |
| 15 | +<p align="center"> |
| 16 | + <a href="#-quick-start">Quick Start</a> · |
| 17 | + <a href="#%EF%B8%8F-architecture">Architecture</a> · |
| 18 | + <a href="#-results">Results</a> · |
| 19 | + <a href="#-self-evolution-trajectory">Evolution</a> · |
| 20 | + <a href="#-citation">Citation</a> |
| 21 | +</p> |
| 22 | + |
| 23 | +--- |
| 24 | + |
| 25 | +## 💡 Key Idea |
| 26 | + |
| 27 | +Every existing memory system evolves what it *stores* but never how it *retrieves*. EvolveMem closes this gap. |
| 28 | + |
| 29 | +The retrieval infrastructure (fusion weights, context budgets, answer styles, per-category overrides, ...) is exposed as a **structured action space** and optimized through an autonomous closed loop: |
| 30 | + |
| 31 | +| Step | What happens | |
| 32 | +|:--:|:--| |
| 33 | +| 📊 **Evaluate** | Run held-out QA, write per-question failure logs | |
| 34 | +| 🔍 **Diagnose** | LLM reads failure logs, identifies root causes | |
| 35 | +| 💡 **Propose** | Targeted configuration adjustments | |
| 36 | +| 🛡️ **Guard** | Auto-revert if performance drops | |
| 37 | + |
| 38 | +This closed-loop self-evolution realizes an **AutoResearch** process: the system conducts the observe-hypothesize-experiment-validate cycle on its own architecture. |
| 39 | + |
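The four steps in the table above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not EvolveMem's actual code: `RetrievalConfig`, `evolve`, `toy_evaluate`, and `toy_propose` are hypothetical names, and the LLM diagnosis step is replaced by a trivial stand-in that doubles `top_k`.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class RetrievalConfig:
    """Hypothetical stand-in for the evolvable retrieval parameters."""
    top_k: int = 5


EvalFn = Callable[[RetrievalConfig], Tuple[float, List[str]]]
ProposeFn = Callable[[RetrievalConfig, List[str]], RetrievalConfig]


def evolve(initial: RetrievalConfig, evaluate: EvalFn, propose: ProposeFn,
           max_rounds: int = 7) -> Tuple[RetrievalConfig, float]:
    """Evaluate -> diagnose/propose -> guard, keeping the best config so far."""
    best = initial
    best_score, failures = evaluate(best)                 # Evaluate
    for _ in range(max_rounds):
        candidate = propose(best, failures)               # Diagnose + Propose
        score, candidate_failures = evaluate(candidate)   # Evaluate the candidate
        if score < best_score:                            # Guard: revert on a drop
            continue
        best, best_score, failures = candidate, score, candidate_failures
    return best, best_score


def toy_evaluate(config: RetrievalConfig) -> Tuple[float, List[str]]:
    # Assumption: a synthetic score that peaks at top_k == 20.
    score = 1.0 - abs(config.top_k - 20) / 40
    failures = [] if config.top_k >= 20 else ["evidence missing from retrieved context"]
    return score, failures


def toy_propose(config: RetrievalConfig, failures: List[str]) -> RetrievalConfig:
    # Assumption: a real system would ask an LLM to read `failures` here.
    return replace(config, top_k=config.top_k * 2)


best_config, best_score = evolve(RetrievalConfig(), toy_evaluate, toy_propose)
print(best_config, best_score)
```

In this toy run the loop accepts `top_k` of 10 and then 20, and rejects every later proposal of 40 because the score drops, so the guard keeps the best configuration found so far.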
| 40 | +--- |
| 41 | + |
| 42 | +## ✨ Highlights |
| 43 | + |
| 44 | +<table> |
| 45 | +<tr> |
| 46 | +<td align="center" width="160">📈 <b>+25.7%</b><br><sub>relative gain vs. strongest baseline (LoCoMo, GPT-4o)</sub></td> |
| 47 | +<td align="center" width="160">📈 <b>+18.9%</b><br><sub>relative gain vs. strongest baseline (MemBench, GPT-4o)</sub></td> |
| 48 | +<td align="center" width="160">🧬 <b>Self-expanding</b><br><sub>3 new dimensions discovered</sub></td> |
| 49 | +<td align="center" width="160">🔄 <b>Positive transfer</b><br><sub>Cross-benchmark generalization</sub></td> |
| 50 | +<td align="center" width="140">⚙️ <b>7 rounds</b><br><sub>Fully autonomous</sub></td> |
| 51 | +</tr> |
| 52 | +</table> |
| 53 | + |
| 54 | +--- |
| 55 | + |
| 56 | +## 🚀 Quick Start |
| 57 | + |
| 58 | +### Installation |
| 59 | + |
| 60 | +```bash |
| 61 | +git clone https://github.com/aiming-lab/SimpleMem.git |
| 62 | +cd SimpleMem/EvolveMem |
| 63 | +pip install -r requirements.txt |
| 64 | +``` |
| 65 | + |
| 66 | +### Configuration |
| 67 | + |
| 68 | +```bash |
| 69 | +export OPENAI_API_KEY="your-key-here" |
| 70 | +export OPENAI_API_BASE="https://api.openai.com/v1" # or Azure endpoint |
| 71 | +export LLM_MODEL="gpt-4o" |
| 72 | +``` |
| 73 | + |
| 74 | +### Run Self-Evolution |
| 75 | + |
| 76 | +```bash |
| 77 | +# Full evolution on LoCoMo (7 rounds) |
| 78 | +python run_evolution.py --data data/locomo10.json --max-rounds 7 |
| 79 | + |
| 80 | +# Quick 3-round evolution |
| 81 | +python run_evolution.py --data data/locomo10.json --max-rounds 3 |
| 82 | + |
| 83 | +# Start from pre-extracted memory cache |
| 84 | +python run_evolution.py --use-cache cache.json --max-rounds 5 |
| 85 | +``` |
| 86 | + |
| 87 | +### Run Benchmark Evaluation |
| 88 | + |
| 89 | +```bash |
| 90 | +# LoCoMo evaluation |
| 91 | +python run_benchmark.py locomo --sample 0 --initial weak --max-rounds 3 |
| 92 | + |
| 93 | +# MemBench evaluation |
| 94 | +python run_benchmark.py membench --agent FirstAgent \ |
| 95 | + --categories simple comparative aggregative conditional \ |
| 96 | + --initial weak --max-rounds 3 |
| 97 | +``` |
| 98 | + |
| 99 | +--- |
| 100 | + |
| 101 | +## 🏗️ Architecture |
| 102 | + |
| 103 | +EvolveMem consists of three layers connected by a self-evolution feedback loop: |
| 104 | + |
| 105 | +### 1. 🗄️ Structured Memory Store |
| 106 | + |
| 107 | +| Component | Description | |
| 108 | +|:--|:--| |
| 109 | +| **SQLite + FTS5** | Persistent storage with full-text search | |
| 110 | +| **LLM Extraction** | Sliding window with retry, chunk-splitting, coverage verification | |
| 111 | +| **Consolidation** | Deduplication, importance decay, entity reinforcement | |
| 112 | + |
| 113 | +### 2. 🔍 Multi-View Retrieval (Evolvable Action Space) |
| 114 | + |
| 115 | +| View | Signal | Purpose | |
| 116 | +|:--|:--|:--| |
| 117 | +| 📝 **Lexical** | BM25 | Exact keyword matching | |
| 118 | +| 🧠 **Semantic** | Dense embeddings | Conceptual similarity | |
| 119 | +| 🏷️ **Structured** | Entity/location/person metadata | Structured filtering | |
| 120 | + |
| 121 | +Fusion mode, per-view weights, context budgets, answer styles, and per-category overrides are all **evolvable parameters**. |
| 122 | + |
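To make the role of per-view weights concrete, here is a minimal sketch of weighted reciprocal rank fusion (RRF) over the three views. It is an illustration only: the function name `rrf_fuse`, the view keys, and the constant `k=60` (a common RRF default) are assumptions, not EvolveMem's actual API or settings.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def rrf_fuse(rankings: Dict[str, List[str]],
             weights: Optional[Dict[str, float]] = None,
             k: int = 60,
             top_n: int = 5) -> List[str]:
    """Fuse per-view rankings: each view adds weight / (k + rank) per memory."""
    weights = weights or {}
    scores: Dict[str, float] = defaultdict(float)
    for view, ranked_ids in rankings.items():
        view_weight = weights.get(view, 1.0)  # unlisted views default to 1.0
        for rank, memory_id in enumerate(ranked_ids, start=1):
            scores[memory_id] += view_weight / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism.
    return sorted(scores, key=lambda m: (-scores[m], m))[:top_n]


rankings = {
    "lexical": ["m1", "m2", "m3"],     # e.g. BM25 order
    "semantic": ["m2", "m3", "m1"],    # e.g. dense-embedding order
    "structured": ["m2"],              # e.g. entity-metadata match
}
print(rrf_fuse(rankings))
print(rrf_fuse(rankings, weights={"semantic": 3.0}))
```

With equal weights the fused order is `m2, m1, m3`; tripling the semantic weight moves `m3` ahead of `m1`. A change of this kind is what an evolution round could propose when failure logs point to missed paraphrases.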
| 123 | +### 3. 🧬 Self-Evolution Engine (AutoResearch) |
| 124 | + |
| 125 | +The engine reads per-question failure logs, diagnoses root causes, and proposes targeted adjustments. Three safeguards ensure robustness: |
| 126 | + |
| 127 | +| Safeguard | Trigger | Action | |
| 128 | +|:--|:--|:--| |
| 129 | +| 🛡️ **Revert** | Performance drops > threshold | Roll back to best-so-far | |
| 130 | +| 🔀 **Explore** | Score plateaus for 2 rounds | Random perturbation | |
| 131 | +| ⏹️ **Converge** | Improvement < epsilon | Terminate and return best | |
| 132 | + |
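The safeguard table can be read as a small decision rule over the score history. The sketch below is one possible reading, not the actual Meta-Analyzer: the names `Action` and `choose_action`, the default thresholds, and the precedence (revert first, then explore on a plateau, then converge if a plateau persists after exploring) are all assumptions.

```python
from enum import Enum
from typing import List


class Action(Enum):
    APPLY = "apply"
    REVERT = "revert"
    EXPLORE = "explore"
    CONVERGE = "converge"


def choose_action(scores: List[float],
                  drop_threshold: float = 0.005,
                  epsilon: float = 0.001,
                  plateau_rounds: int = 2,
                  explored: bool = False) -> Action:
    """Pick an action from per-round scores (latest last, at least two entries)."""
    best_so_far = max(scores[:-1])
    latest = scores[-1]
    # Revert: the latest round fell below the best earlier score by more than the threshold.
    if best_so_far - latest > drop_threshold:
        return Action.REVERT
    # Plateau: each of the last `plateau_rounds` round-to-round changes is below epsilon.
    recent = scores[-(plateau_rounds + 1):]
    gains = [after - before for before, after in zip(recent, recent[1:])]
    plateau = len(gains) == plateau_rounds and all(abs(g) < epsilon for g in gains)
    if plateau:
        # Assumption: try one random perturbation before declaring convergence.
        return Action.CONVERGE if explored else Action.EXPLORE
    return Action.APPLY


print(choose_action([0.305, 0.358]))            # clear gain
print(choose_action([0.305, 0.358, 0.348]))     # drop after the best round
print(choose_action([0.400, 0.4002, 0.4003]))   # two near-flat rounds
```

The three calls return `APPLY`, `REVERT`, and `EXPLORE` respectively; passing `explored=True` on the last history returns `CONVERGE`.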
| 133 | +<details> |
| 134 | +<summary>📐 Full system diagram</summary> |
| 135 | + |
| 136 | +``` |
| 137 | +Raw Conversations |
| 138 | + │ |
| 139 | + ▼ |
| 140 | +┌─────────────────────────────┐ |
| 141 | +│ LLM-Based Extraction │ ← Sliding window + retry + coverage verify |
| 142 | +│ → Typed Memory Units │ |
| 143 | +└─────────────┬───────────────┘ |
| 144 | + ▼ |
| 145 | +┌─────────────────────────────┐ |
| 146 | +│ Multi-View Retrieval │ |
| 147 | +│ BM25 ∪ Semantic ∪ Struct │ ← Evolvable fusion (sum/weighted/RRF) |
| 148 | +│ + Entity-swap │ |
| 149 | +│ + Query decomposition │ |
| 150 | +└─────────────┬───────────────┘ |
| 151 | + ▼ |
| 152 | +┌─────────────────────────────┐ |
| 153 | +│ Answer Generation │ ← Per-category style + verification |
| 154 | +└─────────────┬───────────────┘ |
| 155 | + ▼ |
| 156 | +┌─────────────────────────────┐ |
| 157 | +│ Evaluation + Diagnosis │ ← LLM reads per-question failure logs |
| 158 | +│ → Structured proposal │ |
| 159 | +└─────────────┬───────────────┘ |
| 160 | + ▼ |
| 161 | +┌─────────────────────────────┐ |
| 162 | +│ Meta-Analyzer │ ← Revert / Explore / Apply |
| 163 | +│ → Updated config θ │ |
| 164 | +└─────────────┬───────────────┘ |
| 165 | + │ |
| 166 | + └──── Loop back to Retrieval ────┘ |
| 167 | +``` |
| 168 | + |
| 169 | +</details> |
| 170 | + |
| 171 | +--- |
| 172 | + |
| 173 | +## 📊 Results |
| 174 | + |
| 175 | +### LoCoMo (Token-F1) |
| 176 | + |
| 177 | +| Method | GPT-4o | GPT-5.1 | |
| 178 | +|:--|:--:|:--:| |
| 179 | +| MemVerse | 0.365 | 0.383 | |
| 180 | +| Mem0 | 0.397 | 0.390 | |
| 181 | +| A-MEM | 0.394 | 0.385 | |
| 182 | +| MemGPT | 0.404 | 0.385 | |
| 183 | +| SimpleMem | 0.432 | 0.418 | |
| 184 | +| **EvolveMem** | **0.543** | **0.572** | |
| 185 | + |
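For readers unfamiliar with the metric, token-level F1 is the harmonic mean of token precision and recall between a predicted and a reference answer. The sketch below uses a simple lowercase, word-character tokenizer; that normalization is an assumption, and the benchmark's own scorer may differ (for example in stemming or article removal).

```python
import re
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two answer strings."""
    pred_tokens = re.findall(r"\w+", prediction.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("in May 2023", "May 2023"))
```

Here two of the three predicted tokens match both reference tokens, so precision is 2/3, recall is 1, and F1 is 0.8.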
| 186 | +### MemBench (Accuracy %) |
| 187 | + |
| 188 | +| Method | GPT-4o | GPT-5.1 | |
| 189 | +|:--|:--:|:--:| |
| 190 | +| RecentMemory | 57.1 | 60.7 | |
| 191 | +| MemGPT | 57.1 | 60.7 | |
| 192 | +| MemoryBank | 46.4 | 64.3 | |
| 193 | +| SCMemory | 39.3 | 32.1 | |
| 194 | +| **EvolveMem** | **67.9** | **71.4** | |
| 195 | + |
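The headline gains in the Highlights are consistent with relative improvement over the strongest baseline in the GPT-4o columns of the two tables above. The short check below reproduces that arithmetic; the helper name `relative_gain` is hypothetical, and the numbers are copied from the tables.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# GPT-4o columns from the LoCoMo (Token-F1) and MemBench (accuracy %) tables.
locomo = {"MemVerse": 0.365, "Mem0": 0.397, "A-MEM": 0.394,
          "MemGPT": 0.404, "SimpleMem": 0.432, "EvolveMem": 0.543}
membench = {"RecentMemory": 57.1, "MemGPT": 57.1, "MemoryBank": 46.4,
            "SCMemory": 39.3, "EvolveMem": 67.9}

gains = {}
for name, table in {"LoCoMo": locomo, "MemBench": membench}.items():
    # Strongest baseline is the best score among methods other than EvolveMem.
    strongest = max(v for method, v in table.items() if method != "EvolveMem")
    gains[name] = round(relative_gain(table["EvolveMem"], strongest), 1)

print(gains)
```

This yields 25.7 for LoCoMo (0.543 vs. 0.432) and 18.9 for MemBench (67.9 vs. 57.1).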
| 196 | +--- |
| 197 | + |
| 198 | +## 🧬 Self-Evolution Trajectory |
| 199 | + |
| 200 | +Starting from a minimal BM25-only baseline (F1 = 30.5%), the system autonomously discovers and activates retrieval mechanisms over 7 rounds: |
| 201 | + |
| 202 | +| Round | Stage | Automated Change | F1 (%) | |
| 203 | +|:--:|:--:|:--|:--:| |
| 204 | +| R0 | 🟢 start | BM25-only, k=5 | 30.5 | |
| 205 | +| R1 | ⚙️ auto | Intent planning + RRF fusion | 35.8 | |
| 206 | +| R2 | 🔙 revert | MMR diversity (reverted) | 34.8 | |
| 207 | +| R3 | ⚙️ auto | Entity-swap for Cat. 5 | 37.2 | |
| 208 | +| R4 | ⚙️ auto | Per-category answer styles | 38.5 | |
| 209 | +| R5 | ⚙️ auto | Query decomposition for Cat. 1/4 | 38.1 | |
| 210 | +| R6 | ⚙️ auto | Cat. 3 inferential subtypes + swap expansion | 45.4 | |
| 211 | +| R7 | ⚙️ auto | Answer verification + hyperparameter sweep | **54.3** | |
| 212 | + |
| 213 | +Three configuration dimensions emerged from failure diagnosis that were **not in the original design**: |
| 214 | +- 🔀 **Query decomposition** (splitting multi-hop questions into sub-queries) |
| 215 | +- 🔄 **Adversarial entity-swap** (stripping misleading names before retrieval) |
| 216 | +- ✅ **Answer verification** (second-pass LLM review of low-confidence outputs) |
| 217 | + |
| 218 | +--- |
| 219 | + |
| 220 | +## 🔄 Cross-Benchmark Transfer |
| 221 | + |
| 222 | +| Configuration | LoCoMo (F1) | MemBench (Acc) | |
| 223 | +|:--|:--:|:--:| |
| 224 | +| Baseline | 0.305 | / | |
| 225 | +| C_L (LoCoMo only) | 0.543 | 0.543 | |
| 226 | +| C_LM (LoCoMo → MemBench) | **0.593** | **0.792** | |
| 227 | +| C_M (MemBench only) | / | 0.679 | |
| 228 | + |
| 229 | +> Continuing evolution from a LoCoMo prior **outperforms** from-scratch evolution on MemBench (0.792 vs. 0.679 accuracy, +16.6% relative) while also **improving** LoCoMo F1 (0.593 vs. 0.543): a Pareto improvement on both benchmarks. |
| 230 | + |
| 231 | +--- |
| 232 | + |
| 233 | +## 📝 Citation |
| 234 | + |
| 235 | +```bibtex |
| 236 | +@article{evolvemem2026, |
| 237 | + title={EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents}, |
| 238 | + author={Liu, Jiaqi and Ye, Xinyu and Xia, Peng and Zheng, Zeyu and Xie, Cihang and Ding, Mingyu and Yao, Huaxiu}, |
| 239 | + journal={arXiv preprint arXiv:2605.13941}, |
| 240 | + year={2026}, |
| 241 | + url={https://arxiv.org/abs/2605.13941} |
| 242 | +} |
| 243 | +``` |