ContextPilot is an optional optimization layer for prompt-heavy workloads with overlap, such as shared-prefix RAG, repeated system prompts, and multi-turn chat. In MoE-Infinity it can run inside the OpenAI-compatible server as middleware, or deeper in the scheduling and KV stack.
ContextPilot reorders prompt context and removes repeated content so the model does less redundant prefill work. In practice, that means lower TTFT, better KV cache reuse, and fewer prompt tokens sent through the serving path.
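As a rough illustration of the reorder-and-dedup idea, the sketch below treats a prompt as a list of text chunks. The function names and the "cached chunk" set are hypothetical and are not taken from ContextPilot's actual API; the real logic is likely more involved.

```python
from typing import Iterable, List, Set


def dedup_chunks(chunks: Iterable[str]) -> List[str]:
    """Drop repeated chunks, keeping the first occurrence of each."""
    seen: Set[str] = set()
    unique: List[str] = []
    for chunk in chunks:
        if chunk not in seen:
            seen.add(chunk)
            unique.append(chunk)
    return unique


def reorder_for_prefix_reuse(chunks: List[str], cached: Set[str]) -> List[str]:
    """Place already-cached chunks first, in a canonical (sorted) order.

    A stable, shared ordering makes it more likely that two requests
    with overlapping context start with the same token prefix.
    """
    hot = sorted(c for c in chunks if c in cached)
    cold = [c for c in chunks if c not in cached]
    return hot + cold


def optimize_context(chunks: Iterable[str], cached: Set[str]) -> List[str]:
    """Dedup first, then reorder (hypothetical pipeline)."""
    return reorder_for_prefix_reuse(dedup_chunks(chunks), cached)
```

For example, `optimize_context(["doc_b", "doc_a", "doc_b", "doc_c"], {"doc_a", "doc_b"})` yields `["doc_a", "doc_b", "doc_c"]`: the duplicate is gone and the cached chunks lead the prompt.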
MoE-Infinity supports two integration phases:
- Phase B, In-Process Middleware
  - Runs ContextPilot logic inside `api_server_v2.py` before tokenization.
  - Best when you want lower proxy overhead and direct runtime toggles.
  - Includes runtime enable/disable control and fault injection for testing.
- Phase C, Deep Scheduler Integration
  - Extends ContextPilot signals into KV allocation and request ordering.
  - Uses the CP-aware KV interface plus eviction sync hooks so terminal frees stay in sync with ContextPilot state.
  - Best gains, highest integration depth.
Core modules:
- `moe_infinity/serving/contextpilot_middleware.py`: prompt reorder, dedup, request metrics, cleanup hooks
- `moe_infinity/serving/eviction_sync.py`: maps terminal request completion and abort events to ContextPilot eviction
- `moe_infinity/serving/contextpilot_circuit_breaker.py`: protects the serving path when ContextPilot misbehaves or slows down
- `moe_infinity/serving/cp_kv_interface.py`: CP-aware KV adapter for Phase C scheduling and allocation decisions

```bash
pip install moe-infinity[contextpilot]
```

Notes:
- Real ContextPilot requires Python 3.10+.
- The middleware gracefully disables itself when the package is unavailable.
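
The graceful-disable behavior in the notes above could look roughly like the following availability check. This is a sketch under assumptions: the import name `contextpilot` and the helper name are hypothetical, and the real middleware may detect the package differently.

```python
import importlib.util
import logging
import sys
from typing import Tuple

logger = logging.getLogger("contextpilot_sketch")


def contextpilot_available(
    module_name: str = "contextpilot",  # assumed import name
    min_python: Tuple[int, int] = (3, 10),
) -> bool:
    """Return True only if the interpreter and the package both qualify."""
    if sys.version_info < min_python:
        logger.warning(
            "ContextPilot needs Python %d.%d+; disabling middleware.", *min_python
        )
        return False
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        logger.warning("Module %r not found; disabling middleware.", module_name)
        return False
    return True
```

A server could call this once at startup and skip middleware registration when it returns `False`, so requests follow the normal path unchanged.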
Use Phase B when you want ContextPilot inside the OpenAI-compatible server process, with fewer moving parts and direct runtime controls. This is the right fit once your runtime is on Python 3.10+.

```bash
python -m moe_infinity.entrypoints.openai.api_server_v2 \
  --model deepseek-ai/DeepSeek-V2-Lite-Chat \
  --offload-dir ./offload_dir \
  --enable-contextpilot
```

| Flag | Default | Purpose |
|---|---|---|
| `--enable-contextpilot` | off | Enables ContextPilot middleware before tokenization |
| `--contextpilot-debug` | off | Enables debug-only fault injection endpoint |

| Env var | Default | Purpose |
|---|---|---|
| `CONTEXTPILOT_ENABLED` | `1` | Set to `0` to force-disable ContextPilot even if `--enable-contextpilot` is passed |

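
The interaction between the CLI flag and the environment variable can be summarized as a logical AND. The helper below is a hypothetical sketch of that rule, not the server's actual code; treating any value other than `0` as "enabled" is an assumption.

```python
import os
from typing import Mapping, Optional


def contextpilot_effective(
    flag_enabled: bool, env: Optional[Mapping[str, str]] = None
) -> bool:
    """Combine --enable-contextpilot with CONTEXTPILOT_ENABLED.

    The env var defaults to "1"; "0" force-disables regardless of the flag.
    Any other value is treated as enabled here (an assumption).
    """
    env = os.environ if env is None else env
    env_enabled = env.get("CONTEXTPILOT_ENABLED", "1") != "0"
    return flag_enabled and env_enabled
```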
Enable or disable ContextPilot at runtime:

```bash
curl -X POST http://localhost:8000/contextpilot/toggle \
  -H "Content-Type: application/json" \
  -d '{"enabled": true}'
```

Inject a test fault (requires debug mode):

```bash
curl -X POST http://localhost:8000/contextpilot/inject-fault \
  -H "Content-Type: application/json" \
  -d '{"fault": "reorder_exception", "duration_s": 5}'
```

Accepted fault values are `none` and `reorder_exception`.
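
For scripted tests, the same two calls can be issued from Python with only the standard library. This is a minimal sketch: the helper names are hypothetical, the host and port mirror the curl examples, and the response body format is not specified here, so it is returned as raw text.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host/port as the curl examples
ACCEPTED_FAULTS = ("none", "reorder_exception")


def build_post(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        url=base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def toggle_request(enabled: bool) -> urllib.request.Request:
    """Request that enables or disables ContextPilot at runtime."""
    return build_post("/contextpilot/toggle", {"enabled": enabled})


def inject_fault_request(fault: str, duration_s: int = 5) -> urllib.request.Request:
    """Request that injects a test fault; the server must run in debug mode."""
    if fault not in ACCEPTED_FAULTS:
        raise ValueError(f"fault must be one of {ACCEPTED_FAULTS}, got {fault!r}")
    return build_post(
        "/contextpilot/inject-fault", {"fault": fault, "duration_s": duration_s}
    )


def send(request: urllib.request.Request, timeout_s: float = 5.0) -> str:
    """Send a prepared request and return the raw response body."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read().decode("utf-8")
```

Calling `send(toggle_request(False))` against a running server would disable ContextPilot; validating the fault name client-side avoids a round trip for typos.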
Phase C keeps the Phase B middleware path and adds CP-aware KV and scheduling hooks:
- `ContextPilotKVManager` exposes predicted prefix reuse, cached block hints, allocation notifications, free notifications, and request ordering.
- `EvictionSyncAdapter` removes ContextPilot state on terminal completion and abort, while leaving swap events alone.
- Scheduler and KV code can prioritize requests with higher predicted overlap instead of relying only on queue order.

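
Overlap-aware ordering, as described above, can be pictured as a sort on a predicted-reuse score with arrival order as the tie-breaker. The data class and field names below are hypothetical and do not reflect the real `ContextPilotKVManager` interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PendingRequest:
    request_id: str
    arrival_index: int             # position in the plain FIFO queue
    predicted_prefix_reuse: float  # assumed score in [0, 1]; higher is better


def order_by_overlap(queue: List[PendingRequest]) -> List[PendingRequest]:
    """Prefer higher predicted reuse; fall back to arrival order on ties."""
    return sorted(queue, key=lambda r: (-r.predicted_prefix_reuse, r.arrival_index))
```

A production scheduler would probably also bound how long a low-overlap request can be deferred; that safeguard is omitted from this sketch.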
| Benefit | Cost |
|---|---|
| Best TTFT and token-savings gains | Highest integration depth |
| Better overlap-aware scheduling | More coupling with KV and scheduler internals |
| Better CP state sync on terminal frees | More pieces to validate during upgrades |
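
The terminal-free sync mentioned in the table can be sketched as a small adapter that reacts to completion and abort events and ignores swaps. Class, enum, and method names here are hypothetical; only the counter names (`incoming`, `removed`, `not_found`) come from the status endpoint described in this document.

```python
from enum import Enum
from typing import Dict, Iterable, Set


class LifecycleEvent(Enum):
    FINISHED = "finished"
    ABORTED = "aborted"
    SWAPPED_OUT = "swapped_out"


# Only terminal events should evict ContextPilot state.
TERMINAL_EVENTS = {LifecycleEvent.FINISHED, LifecycleEvent.ABORTED}


class EvictionSyncSketch:
    def __init__(self, live_request_ids: Iterable[str]) -> None:
        self.live: Set[str] = set(live_request_ids)
        self.counters: Dict[str, int] = {"incoming": 0, "removed": 0, "not_found": 0}

    def on_event(self, request_id: str, event: LifecycleEvent) -> bool:
        """Return True if state for request_id was evicted."""
        if event not in TERMINAL_EVENTS:
            return False  # swap events leave ContextPilot state untouched
        self.counters["incoming"] += 1
        if request_id in self.live:
            self.live.remove(request_id)
            self.counters["removed"] += 1
            return True
        self.counters["not_found"] += 1
        return False
```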
Representative dry-run gains versus baseline:
| Phase | TTFT p50 | Token savings |
|---|---|---|
| Phase B | 21% faster | 27% |
| Phase C | 26% faster | 28% |
| Endpoint | Method | Purpose |
|---|---|---|
| `/contextpilot/toggle` | POST | Enable or disable ContextPilot at runtime |
| `/contextpilot/inject-fault` | POST | Inject a test fault, only when debug mode is enabled |
| `/contextpilot/status` | GET | Return current status, counters, and recent metrics |

Check current state:

```bash
curl http://localhost:8000/contextpilot/status
```

| Field | Meaning |
|---|---|
| `enabled` | Current runtime toggle state |
| `env_enabled` | Whether `CONTEXTPILOT_ENABLED` still allows ContextPilot |
| `circuit_breaker_state` | Summary state: `closed`, `open`, or `half_open` |
| `requests_processed` | Total requests seen by the middleware |
| `reorder_count` | Requests that went through reorder logic |
| `dedup_count` | Requests that went through dedup logic |
| `avg_reorder_latency_ms` | Average reorder latency over the in-memory sample window |
| `p99_reorder_latency_ms` | P99 reorder latency over the in-memory sample window |
| `token_savings_total` | Total prompt tokens removed so far |
| `token_savings_avg_pct` | Average percent token savings across processed requests |
| `eviction_sync` | Nested counters for terminal eviction events: `incoming`, `removed`, `not_found` |
| `cp_index_size` | Current number of live ContextPilot index entries, when available |
| `fallback_count` | Total times MoE-Infinity fell back after ContextPilot errors |
| `last_fallback_count` | Latest recorded fallback counter snapshot |
| `debug` | Whether debug endpoints are enabled |
| `fault` | Active injected fault mode |

Real ContextPilot needs Python 3.10+. On Python 3.8/3.9, the middleware auto-disables with a warning. Install the package in a Python 3.10+ environment to enable ContextPilot features.
There is a package name conflict on PyPI. One package named contextpilot is a different project and may pull in elasticsearch. If you hit that dependency path, use the known ContextPilot checkout under /tmp/ContextPilot, or install the intended package in a clean Python 3.10+ environment before enabling Phase B or Phase C.
Check CONTEXTPILOT_ENABLED. If it is set to 0, the server will ignore --enable-contextpilot. The /contextpilot/status endpoint exposes both enabled and env_enabled so you can tell which gate is active.
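
A small helper can turn the status payload into a one-line diagnosis of which gate is active. This sketch assumes `enabled` and `env_enabled` are JSON booleans and that an `open` circuit breaker means requests bypass ContextPilot; both are assumptions, and the function name is hypothetical.

```python
from typing import Mapping


def explain_contextpilot_state(status: Mapping[str, object]) -> str:
    """Summarize why ContextPilot is or is not active, given /contextpilot/status."""
    if not status.get("env_enabled", True):
        return "disabled by CONTEXTPILOT_ENABLED=0"
    if not status.get("enabled", False):
        return "disabled by runtime toggle or missing --enable-contextpilot"
    if status.get("circuit_breaker_state") == "open":
        return "enabled, but circuit breaker is open"
    return "active"
```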
Start the server with --contextpilot-debug. Without that flag, /contextpilot/inject-fault is intentionally blocked.
| Module | Role |
|---|---|
| `moe_infinity/serving/contextpilot_middleware.py` | In-process prompt reorder, dedup, request metrics, and request cleanup |
| `moe_infinity/serving/eviction_sync.py` | Sync layer that converts terminal request lifecycle events into ContextPilot eviction signals |
| `moe_infinity/serving/contextpilot_circuit_breaker.py` | Small circuit breaker used to contain repeated failures or latency spikes |
| `moe_infinity/serving/cp_kv_interface.py` | CP-aware abstraction used by KV and scheduler code in Phase C |

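
To make the circuit breaker's role concrete, here is a generic three-state breaker using the same state names the status endpoint reports (`closed`, `open`, `half_open`). It is a textbook sketch, not the contents of `contextpilot_circuit_breaker.py`; the thresholds, method names, and injectable clock are assumptions.

```python
import time
from typing import Callable


class CircuitBreakerSketch:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def allow(self) -> bool:
        """Return True if the protected call may run right now."""
        if self.state == "open" and self._clock() - self._opened_at >= self.reset_after_s:
            self.state = "half_open"  # let one probe call through
        return self.state != "open"

    def record_success(self) -> None:
        self._failures = 0
        self.state = "closed"

    def record_failure(self) -> None:
        """Count a failure; a caller may also report a too-slow call here."""
        self._failures += 1
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

While the breaker is open, the serving path would skip ContextPilot and process the prompt unchanged, which matches the fallback behavior counted by `fallback_count`.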
For lower-level design details, see: