- [2026/03] ContextPilot can now run on **macOS / Apple Silicon** via [llama.cpp](docs/guides/mac_llama_cpp.md).
- [2026/02] ContextPilot v0.3.2 released, supporting [PageIndex](https://github.com/VectifyAI/PageIndex) and [Mem0](https://github.com/mem0ai/mem0).
- [2026/01] ContextPilot has been accepted to MLSys 2026 🎉! See you in Bellevue, WA, USA.
@@ -28,7 +30,7 @@ Long-context workloads (RAG, memory chat, tool-augmented agents) prepend many co
ContextPilot sits between context assembly and inference to maximize prefix reuse and remove duplicates:
1. **Higher throughput & cache hits**: boosts prefill throughput and prefix cache hit ratio via context reuse.
-
2. **Drop-in solutions**: works with [PageIndex](https://github.com/VectifyAI/PageIndex), [Mem0](https://github.com/mem0ai/mem0), [LMCache](https://github.com/LMCache/LMCache), and backends like [vLLM](https://github.com/vllm-project/vllm) / [SGLang](https://github.com/sgl-project/sglang) / [llama.cpp](docs/guides/mac_llama_cpp.md).
3. **No compromise in reasoning quality**: quality can even improve with extremely long contexts.
4. **Widely tested**: validated across diverse RAG and agentic workloads.
@@ -42,53 +44,63 @@ It maintains a **Context Index** of cached content, then per request applies **R
## Performance at a Glance
-
ContextPilot is validated across three representative settings: single-node academic RAG, multi-node production MoE inference, and multi-turn memory-augmented chat. In every case it delivers significant speedups with comparable answer quality.
+
**OpenClaw Agent on RTX 5090** — 60 enterprise document analysis tasks ([claw-tasks](https://github.com/EfficientContext/ClawTasks)), Qwen3-4B-Instruct via SGLang. [Full results →](docs/benchmarks/openclaw.md)
-
**Qwen3-32B on 4×A6000** — single-node academic RAG with a 32B model on consumer GPUs.
-
-
| Benchmark | Method | Prefill TP (tok/s) | Cache Hit | F1 (%) |
**Qwen3-4B on 1×A6000** — multi-turn memory chat with [Mem0](https://github.com/mem0ai/mem0) on the [LoCoMo](https://github.com/snap-research/locomo) benchmark.
> ContextPilot results in the Mem0 table are reported without context annotation, an optional feature that adds the original importance ranking to reordered context blocks and can further improve answer quality (see [Paper](https://arxiv.org/abs/2511.03475)).
-
**Llama-3.2-1B on Apple M3 (MacBook Air, 16 GB)** — MultihopRAG on Apple Silicon with llama.cpp, no GPU server required.
+
**Llama-3.2-1B on Apple Silicon** — MultihopRAG with llama.cpp, no GPU server required.
-
| Method | Avg Latency (ms) |
-
|--------|-----------------|
-
| llama.cpp | 3,315 |
-
| **llama.cpp + ContextPilot** | **1,378** |
+
| Device | Method | Avg Latency (ms) |
+
|--------|--------|-----------------|
+
| M3 (MacBook Air, 16 GB) | llama.cpp | 3,315 |
+
| | **llama.cpp + ContextPilot** | **1,378** |
+
| M5 (MacBook Air, 32 GB) | llama.cpp | 2,157 |
+
| | **llama.cpp + ContextPilot** | **911** |
Settings: `Llama-3.2-1B-Instruct-Q4_K_M.gguf`, Metal offload (`-ngl 99`), `--cache-reuse 256`, `--parallel 4`, context 32768 tokens. See the [Mac + llama.cpp guide](docs/guides/mac_llama_cpp.md).
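For a quick sense of scale, the latencies in the table above work out to roughly a 2.4x speedup on both devices. The arithmetic can be reproduced with a short Python sketch; the dictionary and function names below are illustrative only and are not part of ContextPilot.

```python
# Average latencies (ms) copied from the Apple Silicon table:
# device -> (llama.cpp baseline, llama.cpp + ContextPilot).
LATENCY_MS = {
    "M3 (MacBook Air, 16 GB)": (3315, 1378),
    "M5 (MacBook Air, 32 GB)": (2157, 911),
}


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


for device, (baseline, optimized) in LATENCY_MS.items():
    # Relative latency reduction, as a percentage of the baseline.
    reduction = 100 * (1 - optimized / baseline)
    print(f"{device}: {speedup(baseline, optimized):.2f}x speedup, "
          f"{reduction:.0f}% lower average latency")
```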
+
We also evaluated on academic RAG (Qwen3-32B, 4×A6000) and production MoE inference (DeepSeek-R1-671B, 16×H20) — see [RAG benchmarks](docs/benchmarks/rag.md) and [paper](https://arxiv.org/abs/2511.03475).
+
## Installation
**Requirements:** Python >= 3.10

---
-
### vLLM / SGLang
+
### OpenClaw
-
ContextPilot works with both CPU and GPU backends for building the context index. The `[gpu]` extra enables GPU-accelerated distance computation (via `cupy-cuda12x`) and is faster for large batches; without it, ContextPilot falls back to the CPU backend automatically.
Then set OpenClaw's base URL to `http://localhost:8765/v1`. See the [full OpenClaw integration guide](docs/guides/openclaw.md) for UI setup, config file examples, and self-hosted model instructions.
+
+
---
+
+
### vLLM / SGLang
**From PyPI** — the vLLM and SGLang hooks are installed automatically:
```bash
@@ -135,6 +147,19 @@ Docker images are also available for both all-in-one and standalone deployment.
openclaw agent --message "Read contracts/contract_alpha_cloud.txt and summarize the liability terms."
+
openclaw agent --message "Read contracts/contract_beta_ai.txt and compare its liability with Alpha."
+
openclaw agent --message "Read contracts/contract_gamma_security.txt. Rank all three by liability exposure."
+
```
+
+
When the agent reads multiple documents sharing content (contracts from the same template, proposals with shared methodology), ContextPilot automatically deduplicates identical blocks — reducing prefill tokens by ~27% with zero accuracy loss. See the [integration guide](docs/guides/openclaw.md) and [benchmark](docs/benchmarks/openclaw.md).
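To make the deduplication idea concrete, here is a minimal sketch. It is not ContextPilot's actual implementation: it assumes blocks are paragraphs separated by blank lines and treats two blocks as duplicates only when their text matches exactly. The function name `dedupe_blocks` and the sample contract texts are hypothetical.

```python
import hashlib


def dedupe_blocks(documents: list[str]) -> tuple[list[str], float]:
    """Drop blocks whose exact text already appeared in an earlier block.

    Returns the deduplicated documents and the fraction of blocks removed.
    """
    seen: set[str] = set()
    deduped: list[str] = []
    total = kept = 0
    for doc in documents:
        kept_blocks = []
        for block in doc.split("\n\n"):  # assumption: one block per paragraph
            total += 1
            digest = hashlib.sha256(block.strip().encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # identical block was already sent once
            seen.add(digest)
            kept_blocks.append(block)
            kept += 1
        deduped.append("\n\n".join(kept_blocks))
    saved = 1 - kept / total if total else 0.0
    return deduped, saved


# Two toy contracts built from the same template (hypothetical text).
alpha = "Standard liability clause.\n\nAlpha-specific terms."
beta = "Standard liability clause.\n\nBeta-specific terms."
deduped, saved = dedupe_blocks([alpha, beta])
print(deduped[1])  # only the block unique to the second contract remains
print(saved)       # fraction of blocks dropped
```

On these toy inputs one of four blocks is dropped, so `saved` is 0.25. The ~27% figure quoted above comes from the OpenClaw benchmark, not from this sketch.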
+
+
---
+
### Quick Start with Context Ordering
Add **one call** (`cp_instance.optimize()`) before inference to rearrange context blocks so that shared content aligns into a common prefix, enabling cache reuse. An importance ranking in the prompt preserves accuracy.
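The effect of reordering can be illustrated with a toy sketch. This is not how `cp_instance.optimize()` is implemented. It assumes each request is a list of block IDs sorted by importance, and it uses a simple heuristic (most widely shared blocks first) chosen purely for illustration. The helper names `reorder_for_shared_prefix` and `common_prefix_len` are hypothetical.

```python
from collections import Counter


def reorder_for_shared_prefix(requests: list[list[str]]):
    """Reorder each request so widely shared blocks come first.

    Returns, per request, the reordered block IDs plus the original
    importance rank (1 = most important) of each reordered block.
    """
    # Count in how many requests each block appears.
    freq = Counter(block for blocks in requests for block in set(blocks))
    result = []
    for blocks in requests:
        # Most shared first; ties broken by block ID so the order is deterministic.
        ordered = sorted(blocks, key=lambda b: (-freq[b], b))
        ranking = [blocks.index(b) + 1 for b in ordered]
        result.append((ordered, ranking))
    return result


def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading blocks two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


req_a = ["doc3", "doc1", "doc2"]  # importance order for request A
req_b = ["doc2", "doc4", "doc1"]  # importance order for request B
(ord_a, rank_a), (ord_b, rank_b) = reorder_for_shared_prefix([req_a, req_b])

print(common_prefix_len(req_a, req_b))  # shared prefix before reordering
print(common_prefix_len(ord_a, ord_b))  # shared prefix after reordering
print(ord_a, rank_a)                    # ranking can be passed along in the prompt
```

Before reordering the two requests share no leading blocks. Afterwards both start with `doc1`, `doc2`, so a prefix cache could reuse those two blocks, while `rank_a` and `rank_b` record the original importance order.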