Commit 088b520

Merge pull request #34 from EfficientContext/cloud-cache-proxy
Add OpenClaw and Cloud API Support
2 parents 8dd2674 + 68088f3 commit 088b520

38 files changed

Lines changed: 7666 additions & 749 deletions

CHANGELOG.md

Lines changed: 37 additions & 0 deletions
@@ -5,6 +5,43 @@ All notable changes to ContextPilot will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.4.0] - 2026-03-29
+
+### Added
+- **Cloud prompt cache proxy** for Anthropic, OpenAI, and MiniMax — transparent prefix caching over cloud APIs
+- **HTTP intercept proxy** — drop-in reverse proxy that extracts, reorders, and deduplicates documents in LLM requests without client changes
+- **Block-level dedup** — content-defined chunking within tool results and assistant code blocks to deduplicate repeated content across turns
+- **OpenClaw integration** — tool_result reordering, `markdown_header` extraction mode, deployment files, and quick-start guide
+- **TTL-based cache eviction** policy with configurable tiers and automatic expiry
+- **Conversation tracker** for multi-turn state: parent chain tracking, per-turn document history, and cross-turn block dedup
+- `--chunk-modulus` CLI flag for tuning content-level dedup block size
+- Cache sync documentation and `how_it_works.md` guide
+- Pipeline diagram and architecture SVGs
+- M5 MacBook Air results to Apple Silicon benchmark table
+- P99 wall time to OpenClaw benchmark table
+
+### Changed
+- Renamed dedup levels: file-level → document-level, block-level → content-level, content-level → ContextBlock-level
+- Intercept parser supports multiple extraction formats (XML, numbered, separator, JSON results) with auto-detection
+- Cloud adapters inject `cache_control` breakpoints on system prompts and tool results (limited to 4 per Anthropic API)
+- Proxy forwards request metadata via headers instead of body to avoid breaking tool loops
+
+### Fixed
+- Block dedup `"\n\n".join` corrupting content at chunk boundaries (phantom blank lines)
+- `hash()` non-determinism in content-defined chunking — replaced with `hashlib.md5`
+- `_chunk_modulus` missing from global declaration (CLI flag silently ignored)
+- Proxy hardcoding `temperature=0` overwriting user values — now uses `setdefault`
+- `default_ttl_seconds=0` silently becoming 300 (falsy `or` → `is not None`)
+- `default_ttl` setter not syncing `_default_ttl_seconds`
+- `update_from_response` double-counting partial cache hits
+- Reconstruction functions using default config instead of original extraction config
+- API key leak in error responses from `aiohttp.ClientError`
+- Non-JSON upstream error crashing with `JSONDecodeError`
+- Streaming connection leak on client disconnect (missing `finally` cleanup)
+- Redundant `copy.deepcopy` doubling memory pressure per request
+- Cycle detection added to `get_conversation_chain`
+- Alpha header validation (non-numeric no longer crashes)
+
 ## [0.3.5.post2] - 2026-03-05

 ### Added
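
The block-level dedup and `hash()` entries in the changelog above can be illustrated with a minimal sketch. This is not the code from this commit: `chunk_lines`, `dedup_across_turns`, the marker text, and the `min_chars` threshold are hypothetical names chosen for illustration, and the `modulus` parameter is only assumed to play a role similar to what `--chunk-modulus` is described as tuning. The sketch closes a block whenever the MD5 digest of a line is divisible by the modulus, so boundaries depend on content alone and are stable across processes (the built-in `hash()` is salted per process for strings).

```python
import hashlib


def chunk_lines(text: str, modulus: int = 4) -> list[str]:
    """Split text into blocks at content-defined boundaries.

    A line closes the current block when the MD5 digest of its bytes,
    read as an integer, is divisible by `modulus`. Larger values give
    fewer boundaries and therefore larger blocks on average.
    """
    blocks, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        digest = hashlib.md5(line.encode("utf-8")).digest()
        if int.from_bytes(digest[:8], "big") % modulus == 0:
            blocks.append("".join(current))
            current = []
    if current:
        blocks.append("".join(current))
    return blocks


def dedup_across_turns(turns: list[str], modulus: int = 4, min_chars: int = 32):
    """Replace blocks already seen in earlier turns with a short marker.

    Returns the rewritten turns and the number of characters removed.
    Blocks shorter than `min_chars` are kept, since the marker would not
    save anything. (Hypothetical helper, for illustration only.)
    """
    seen: set[str] = set()
    rewritten, removed = [], 0
    for text in turns:
        kept = []
        for block in chunk_lines(text, modulus):
            key = hashlib.md5(block.encode("utf-8")).hexdigest()
            if key in seen and len(block) >= min_chars:
                kept.append(f"[block {key[:8]} shown earlier]\n")
                removed += len(block)
            else:
                seen.add(key)
                kept.append(block)
        # Blocks keep their own line endings, so a plain "" join is lossless.
        rewritten.append("".join(kept))
    return rewritten, removed
```

Because every block keeps its trailing newline, reassembly uses an empty-string join and cannot introduce blank lines at chunk boundaries, which is the class of problem the `"\n\n".join` entry describes.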

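Two of the fixes listed in the changelog, the `default_ttl_seconds=0` entry and the `temperature=0` entry, share one root cause: a legitimate value of `0` being treated as "not provided". The sketch below reproduces both patterns under stated assumptions. The function names are hypothetical, and the only values taken from the changelog are the fallback of 300 and the default temperature of 0.

```python
DEFAULT_TTL_SECONDS = 300  # fallback value mentioned in the changelog entry


def resolve_ttl_buggy(ttl_seconds=None):
    # Bug pattern: 0 is falsy, so an explicit 0 is replaced by the default.
    return ttl_seconds or DEFAULT_TTL_SECONDS


def resolve_ttl_fixed(ttl_seconds=None):
    # Only fall back when the caller really passed nothing.
    return ttl_seconds if ttl_seconds is not None else DEFAULT_TTL_SECONDS


def apply_sampling_defaults(body: dict) -> dict:
    """Fill in temperature=0 only when the client did not set one."""
    patched = dict(body)  # shallow copy, leaves the caller's dict untouched
    patched.setdefault("temperature", 0)
    return patched


print(resolve_ttl_buggy(0), resolve_ttl_fixed(0))
print(apply_sampling_defaults({"temperature": 0.7}))
```

The same reasoning applies to `setdefault`: it checks for the key's presence, so a client-supplied `0.0` or `0.7` both survive.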
README.md

Lines changed: 51 additions & 26 deletions
@@ -17,6 +17,8 @@

 ## News

+- [2026/03] Supports [OpenClaw](https://openclaw.ai) — [guide](docs/guides/openclaw.md) | [benchmark](docs/benchmarks/openclaw.md)
+- [2026/03] Supports cloud APIs (OpenAI, Anthropic, MiniMax) — [cache sync](docs/guides/cache_sync.md)
 - [2026/03] ContextPilot now can run on **macOS / Apple Silicon** via [llama.cpp](docs/guides/mac_llama_cpp.md).
 - [2026/02] ContextPilot v0.3.2 released, supporting [PageIndex](https://github.com/VectifyAI/PageIndex) and [Mem0](https://github.com/mem0ai/mem0).
 - [2026/01] ContextPilot has been accepted to MLSys 2026 🎉! See you in Bellevue, WA, USA.
@@ -28,7 +30,7 @@ Long-context workloads (RAG, memory chat, tool-augmented agents) prepend many co
 ContextPilot sits between context assembly and inference to maximize prefix reuse and remove duplicates:

 1. **Higher throughput & cache hits** — boosts prefill throughput and prefix cache hit ratio via context reuse.
-2. **Drop-in solutions** — works with [PageIndex](https://github.com/VectifyAI/PageIndex), [Mem0](https://github.com/mem0ai/mem0), [LMCache](https://github.com/LMCache/LMCache), and backends like [vLLM](https://github.com/vllm-project/vllm) / [SGLang](https://github.com/sgl-project/sglang) / [llama.cpp](docs/guides/mac_llama_cpp.md).
+2. **Drop-in solutions** — supports [OpenClaw](https://openclaw.ai) ([guide](docs/guides/openclaw.md)), [PageIndex](https://github.com/VectifyAI/PageIndex), [Mem0](https://github.com/mem0ai/mem0), [LMCache](https://github.com/LMCache/LMCache), [vLLM](https://github.com/vllm-project/vllm), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](docs/guides/mac_llama_cpp.md), and cloud APIs (OpenAI, Anthropic).
 3. **No compromise in reasoning quality** — can even improve with extremely long contexts.
 4. **Widely tested** — validated across diverse RAG and agentic workloads.
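
To make "maximize prefix reuse" in the README text above concrete, here is a toy sketch. It is not ContextPilot's algorithm: the README describes a Context Index and an importance ranking that this sketch does not model, and `reorder_for_prefix_sharing` and `common_prefix_len` are hypothetical helpers. The heuristic simply places documents that appear in more requests first, so requests with overlapping context start with the same documents and a prefix cache can reuse them.

```python
from collections import Counter


def reorder_for_prefix_sharing(requests: list[list[str]]) -> list[list[str]]:
    """Order each request's documents by how many requests contain them.

    Ties are broken by document ID so the ordering is deterministic.
    """
    freq = Counter(doc for docs in requests for doc in set(docs))
    return [sorted(docs, key=lambda d: (-freq[d], d)) for docs in requests]


def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading documents two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


requests = [["d3", "d1", "d2"], ["d1", "d4", "d3"]]
reordered = reorder_for_prefix_sharing(requests)
print(common_prefix_len(*requests), common_prefix_len(*reordered))  # 0 2
```

In the example the two requests share `d1` and `d3` but in different positions, so no prefix is reusable until both are moved to the front.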

@@ -42,53 +44,63 @@ It maintains a **Context Index** of cached content, then per request applies **R

 ## Performance at a Glance

-ContextPilot is validated across three representative settings: single-node academic RAG, multi-node production MoE inference, and multi-turn memory-augmented chat. In every case it delivers significant speedups with comparable answer quality.
+**OpenClaw Agent on RTX 5090** — 60 enterprise document analysis tasks ([claw-tasks](https://github.com/EfficientContext/ClawTasks)), Qwen3-4B-Instruct via SGLang. [Full results →](docs/benchmarks/openclaw.md)

-**Qwen3-32B on 4×A6000** — single-node academic RAG with a 32B model on consumer GPUs.
-
-| Benchmark | Method | Prefill TP (tok/s) | Cache Hit | F1 (%) |
-|-----------|--------|--------------------|-----------|--------|
-| MultihopRAG | SGLang | 7,290 | 4.64% | 60.42 |
-| | **SGLang + ContextPilot** | **14,214** | **33.97%** | **64.39** |
-| NarrativeQA | SGLang | 7,921 | 5.91% | 28.41 |
-| | **SGLang + ContextPilot** | **12,117** | **20.82%** | **29.64** |
-
-**DeepSeek-R1-671B on 16×H20** — production-scale 671B MoE inference on a multi-node GPU cluster.
-
-| Benchmark | Method | Prefill TP (tok/s) | Cache Hit | F1 (%) |
-|-----------|--------|--------------------|-----------|--------|
-| MultihopRAG | SGLang | 9,636 | 5.12% | 64.15 |
-| | **SGLang + ContextPilot** | **17,498** | **60.37%** | **64.68** |
-| NarrativeQA | SGLang | 8,687 | 6.08% | 40.20 |
-| | **SGLang + ContextPilot** | **13,201** | **38.24%** | **41.08** |
+| Metric | OpenClaw + SGLang | + ContextPilot | Δ |
+|--------|-------------------|----------------|---|
+| Prompt tokens / request (avg) | 45,771 | 33,622 | **-26.5%** |
+| Prompt tokens / request (P99) | 92,785 | 51,581 | **-44.4%** |
+| Wall time (avg) | 26.1s | 20.8s | **-20.4%** |
+| Wall time (P99) | 68.8s | 50.4s | **-26.6%** |
+| Accuracy | 245/245 | 245/245 | — |

 **Qwen3-4B on 1×A6000** — multi-turn memory chat with [Mem0](https://github.com/mem0ai/mem0) on the [LoCoMo](https://github.com/snap-research/locomo) benchmark.

 | Context Size | Method | TTFT (s) | LLM Judge |
 |--------------|--------|----------|-----------|
+| 5 (long context memory) | SGLang | 0.1051 | 0.418 |
+| | **SGLang + ContextPilot** | **0.0548** | 0.414 |
 | 100 memories | SGLang | 0.1012 | 0.437 |
 | | **SGLang + ContextPilot** | **0.0554** | 0.420 |

 >ContextPilot results in mem0 table are without context annotation — an optional feature that adds original importance ranking to reordered context blocks, which can further improve answer quality (see [Paper](https://arxiv.org/abs/2511.03475)).

-**Llama-3.2-1B on Apple M3 (MacBook Air, 16 GB)** — MultihopRAG on Apple Silicon with llama.cpp, no GPU server required.
+**Llama-3.2-1B on Apple Silicon** — MultihopRAG with llama.cpp, no GPU server required.

-| Method | Avg Latency (ms) |
-|--------|-----------------|
-| llama.cpp | 3,315 |
-| **llama.cpp + ContextPilot** | **1,378** |
+| Device | Method | Avg Latency (ms) |
+|--------|--------|-----------------|
+| M3 (MacBook Air, 16 GB) | llama.cpp | 3,315 |
+| | **llama.cpp + ContextPilot** | **1,378** |
+| M5 (MacBook Air, 32 GB) | llama.cpp | 2,157 |
+| | **llama.cpp + ContextPilot** | **911** |

 Settings: `Llama-3.2-1B-Instruct-Q4_K_M.gguf`, Metal offload (`-ngl 99`), `--cache-reuse 256`, `--parallel 4`, context 32768 tokens. See the [Mac + llama.cpp guide](docs/guides/mac_llama_cpp.md).

+We also evaluated on academic RAG (Qwen3-32B, 4×A6000) and production MoE inference (DeepSeek-R1-671B, 16×H20) — see [RAG benchmarks](docs/benchmarks/rag.md) and [paper](https://arxiv.org/abs/2511.03475).
+
 ## Installation

 **Requirements:** Python >= 3.10

 ---

-### vLLM / SGLang
+### OpenClaw

-ContextPilot works with both CPU and GPU backends for building the context index. The `[gpu]` extra enables GPU-accelerated distance computation (via `cupy-cuda12x`) and is faster for large batches; without it, ContextPilot falls back to the CPU backend automatically.
+```bash
+pip install contextpilot
+
+# Start proxy (points to your LLM backend)
+python -m contextpilot.server.http_server \
+    --port 8765 --infer-api-url http://localhost:30000 # SGLang
+    # or: --infer-api-url https://api.anthropic.com # Anthropic
+    # or: --infer-api-url https://api.openai.com # OpenAI
+```
+
+Then set OpenClaw's base URL to `http://localhost:8765/v1`. See the [full OpenClaw integration guide](docs/guides/openclaw.md) for UI setup, config file examples, and self-hosted model instructions.
+
+---
+
+### vLLM / SGLang

 **From PyPI** — the vLLM and SGLang hooks are installed automatically:
 ```bash
@@ -135,6 +147,19 @@ Docker images are also available for both all-in-one and standalone deployment.

 ## Getting Started

+### Quick Start with OpenClaw
+
+```bash
+# Ask OpenClaw to analyze vendor contracts (ContextPilot deduplicates shared content automatically)
+openclaw agent --message "Read contracts/contract_alpha_cloud.txt and summarize the liability terms."
+openclaw agent --message "Read contracts/contract_beta_ai.txt and compare its liability with Alpha."
+openclaw agent --message "Read contracts/contract_gamma_security.txt. Rank all three by liability exposure."
+```
+
+When the agent reads multiple documents sharing content (contracts from the same template, proposals with shared methodology), ContextPilot automatically deduplicates identical blocks — reducing prefill tokens by ~27% with zero accuracy loss. See the [integration guide](docs/guides/openclaw.md) and [benchmark](docs/benchmarks/openclaw.md).
+
+---
+
 ### Quick Start with Context Ordering

 Add **one call** (`cp_instance.optimize()`) before inference to rearrange context blocks so that shared content aligns into a common prefix, enabling cache reuse. An importance ranking in the prompt preserves accuracy.

contextpilot/__init__.py

Lines changed: 29 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -7,13 +7,13 @@
77
88
Quick Start:
99
>>> from contextpilot.pipeline import RAGPipeline
10-
>>>
10+
>>>
1111
>>> pipeline = RAGPipeline(
1212
... retriever="bm25",
1313
... corpus_path="corpus.jsonl",
1414
... model="Qwen/Qwen2.5-7B-Instruct"
1515
... )
16-
>>>
16+
>>>
1717
>>> results = pipeline.run(queries=["What is AI?"])
1818
1919
See docs/reference/api.md for detailed documentation.
@@ -38,6 +38,12 @@

 from .server.live_index import ContextPilot

+from .dedup import (
+    dedup_chat_completions,
+    dedup_responses_api,
+    DedupResult,
+)
+
 from .api import optimize, optimize_batch

 from .retriever import (
@@ -53,27 +59,28 @@

 __all__ = [
     # High-level pipeline API
-    'RAGPipeline',
-    'RetrieverConfig',
-    'OptimizerConfig',
-    'InferenceConfig',
-    'PipelineConfig',
-
+    "RAGPipeline",
+    "RetrieverConfig",
+    "OptimizerConfig",
+    "InferenceConfig",
+    "PipelineConfig",
     # Core components
-    'ContextIndex',
-    'IndexResult',
-    'IntraContextOrderer',
-    'ContextPilot',
-
+    "ContextIndex",
+    "IndexResult",
+    "IntraContextOrderer",
+    "ContextPilot",
+    # Deduplication
+    "dedup_chat_completions",
+    "dedup_responses_api",
+    "DedupResult",
     # Convenience functions
-    'optimize',
-    'optimize_batch',
-
+    "optimize",
+    "optimize_batch",
     # Retrievers
-    'BM25Retriever',
-    'FAISSRetriever',
-    'FAISS_AVAILABLE',
-    'Mem0Retriever',
-    'create_mem0_corpus_map',
-    'MEM0_AVAILABLE',
+    "BM25Retriever",
+    "FAISSRetriever",
+    "FAISS_AVAILABLE",
+    "Mem0Retriever",
+    "create_mem0_corpus_map",
+    "MEM0_AVAILABLE",
 ]

contextpilot/dedup/__init__.py

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
from .block_dedup import (
2+
dedup_chat_completions,
3+
dedup_responses_api,
4+
DedupResult,
5+
)
6+
7+
__all__ = [
8+
"dedup_chat_completions",
9+
"dedup_responses_api",
10+
"DedupResult",
11+
]
