Complete API documentation for ContextPilot.
The core class for context optimization.
import contextpilot as cp
engine = cp.ContextPilot(use_gpu=False)| Parameter | Type | Default | Description |
|---|---|---|---|
alpha |
float | 0.001 |
Distance computation parameter |
use_gpu |
bool | False |
Use GPU for distance computation |
linkage_method |
str | "average" |
Clustering method |
batch_size |
int | 128 |
Batch size for distance computation |
Reorder contexts and return ready-to-use OpenAI messages. This is the simplest way to use ContextPilot — it handles reordering and prompt assembly in one call.
messages = engine.optimize(
docs=["Doc about ML", "Doc about AI", "Doc about DL"],
query="What is ML?",
conversation_id="user_42", # optional, for multi-turn
system_instruction="Be concise.", # optional, prepended to system message
)
# messages is a list of dicts ready for client.chat.completions.create()| Parameter | Type | Default | Description |
|---|---|---|---|
docs |
List[str] | Required | List of document strings |
query |
str | Required | The user question |
conversation_id |
str | None | None |
Conversation key for multi-turn deduplication |
system_instruction |
str | None | None |
Extra instruction prepended to the system message |
Returns: List[Dict[str, str]] — a list of message dicts (role / content) ready for the OpenAI chat completions API.
Batch-optimize contexts and return messages in scheduled execution order.
messages_batch, order = engine.optimize_batch(
all_docs=[["Doc A", "Doc B"], ["Doc B", "Doc C"]],
all_queries=["What is A?", "What is C?"],
)
# messages_batch[i] corresponds to all_queries[order[i]]| Parameter | Type | Default | Description |
|---|---|---|---|
all_docs |
List[List[str]] | Required | Documents for each query |
all_queries |
List[str] | Required | One query per entry in all_docs |
system_instruction |
str | None | None |
Extra instruction prepended to every system message |
Returns: (messages_batch, original_indices) — a 2-tuple where messages_batch[i] corresponds to all_queries[original_indices[i]].
Reorder contexts for optimal KV-cache prefix sharing. Use this when you need full control over prompt construction.
reordered, indices = engine.reorder(
contexts,
conversation_id="user_42", # optional, for multi-turn dedup
)| Parameter | Type | Default | Description |
|---|---|---|---|
contexts |
List[List] | Required | List of contexts (doc IDs or strings) |
initial_tokens_per_context |
int | 0 |
Initial token budget per context |
conversation_id |
str | None | None |
Conversation key for deduplication tracking |
Returns: (reordered_contexts, original_indices) — a 2-tuple where reordered_contexts[i] corresponds to contexts[original_indices[i]].
Remove already-seen documents from follow-up conversation turns.
conversation_idis required — prevents cross-contamination between concurrent users.
results = engine.deduplicate(
contexts,
conversation_id="user_42", # REQUIRED
hint_template="See Doc {doc_id}", # optional
)| Parameter | Type | Default | Description |
|---|---|---|---|
contexts |
List[List] | Required | List of contexts to deduplicate |
conversation_id |
str | Required | Must match ID from a prior .reorder() call |
hint_template |
str | None | None |
Custom hint template with {doc_id} placeholder |
Returns: List[Dict] — one dict per context with keys:
| Key | Type | Description |
|---|---|---|
new_docs |
List | Documents not seen in prior turns |
overlapping_docs |
List | Documents already sent |
reference_hints |
List[str] | Hint strings for overlapping docs |
deduplicated_docs |
List | Alias for new_docs |
Raises:
TypeErrorifconversation_idis not providedValueErrorifconversation_idis empty or has no prior.reorder()history
ContextPilot provides an HTTP server for live index management with inference engine integration.
GET /
Root health check endpoint with basic server information.
Response:
{
"status": "ready",
"mode": "stateful",
"index_initialized": true,
"timestamp": "2025-02-15T10:30:00Z"
}GET /health
Detailed health check with index statistics.
Response:
{
"status": "ready",
"mode": "stateful",
"eviction_enabled": true,
"current_tokens": 12500,
"utilization": 0.45,
"index_stats": {
"total_nodes": 42,
"leaf_nodes": 20,
"total_docs": 150
}
}POST /reorder
Reorder contexts. Auto-detects mode based on whether an index exists:
- Stateless (no index): computes reordering without maintaining state.
- Stateful (index present): builds or incrementally updates the index. Call
POST /resetto force a fresh build.
Request:
{
"contexts": [[1, 2, 3], [2, 3, 4]],
"initial_tokens_per_context": 0,
"alpha": 0.001,
"use_gpu": false,
"linkage_method": "average",
"deduplicate": false,
"parent_request_ids": [null, null],
"hint_template": "Refer to Doc {doc_id} from Turn {turn_number}"
}Accepts both integer doc IDs and string documents — the server auto-detects input type.
| Parameter | Type | Default | Description |
|---|---|---|---|
contexts |
List[List] | Required | List of contexts (doc IDs or strings) |
initial_tokens_per_context |
int | 0 |
Initial token count per context |
alpha |
float | 0.001 |
Distance computation parameter |
use_gpu |
bool | false |
Use GPU for distance computation |
linkage_method |
str | "average" |
Clustering method |
deduplicate |
bool | false |
Enable multi-turn deduplication |
parent_request_ids |
List[str|null] | null |
Parent request IDs for deduplication |
hint_template |
str | null |
Custom template for reference hints |
Response:
{
"status": "success",
"mode": "initial",
"input_type": "integer",
"num_contexts": 2,
"matched_count": 0,
"inserted_count": 2,
"request_ids": ["contextpilot_abc123", "contextpilot_def456"],
"reordered_contexts": [[2, 3, 1], [2, 3, 4]],
"original_indices": [0, 1],
"stats": {}
}POST /deduplicate
Deduplicate contexts for multi-turn conversations without index operations.
Recommended flow:
- Turn 1: Call
/reorder(builds index, registers request in tracker) - Turn 2+: Call
/deduplicate(deduplicates only, no index ops)
Request:
{
"contexts": [[4, 3, 2], [10, 20, 30]],
"parent_request_ids": ["req_turn1_a", "req_turn1_b"],
"hint_template": "See Doc {doc_id} from Turn {turn_number}"
}| Parameter | Type | Default | Description |
|---|---|---|---|
contexts |
List[List[int]] | Required | List of contexts to deduplicate |
parent_request_ids |
List[str|null] | Required | Parent request IDs (null = new conversation) |
hint_template |
str | null |
Custom template for reference hints |
Response:
{
"status": "success",
"request_ids": ["dedup_abc123", "dedup_def456"],
"results": [
{
"request_id": "dedup_abc123",
"parent_request_id": "req_turn1_a",
"original_docs": [4, 3, 2],
"deduplicated_docs": [2],
"overlapping_docs": [4, 3],
"new_docs": [2],
"reference_hints": ["Refer to Doc 4...", "Refer to Doc 3..."],
"is_new_conversation": false
}
]
}POST /evict
Notify ContextPilot that KV cache entries have been evicted by the inference engine.
Request:
{
"request_ids": ["req-abc", "req-def"]
}POST /reset
Reset the index and conversation tracker. Clears all state and frees memory.
Response:
{
"status": "success",
"message": "Index reset to initial state",
"conversation_tracker": "reset"
}GET /stats
Get detailed index statistics.
Response:
{
"index_stats": {
"total_nodes": 256,
"leaf_nodes": 128,
"total_docs": 1500,
"unique_docs": 450,
"tree_depth": 8
},
"total_tokens": 50000,
"num_contexts": 100,
"cache_utilization": 0.75
}GET /requests
Get all tracked request IDs in the index.
Response:
{
"request_ids": ["contextpilot_abc123", "contextpilot_def456"],
"count": 2
}POST /search
Search for a context in the index and return its location.
Request:
{
"context": [1, 2, 3, 4],
"update_access": true
}Response:
{
"status": "success",
"search_path": [0, 1, 5, 12],
"node_id": 12,
"prefix_length": 3
}POST /insert
Insert a new context into the index at a specific location.
Request:
{
"context": [1, 2, 3, 4],
"search_path": [0, 1, 5],
"total_tokens": 256
}Response:
{
"status": "success",
"node_id": 42,
"search_path": [0, 1, 5, 42],
"request_id": "contextpilot_xyz789"
}Python client for the HTTP server.
from contextpilot.server.http_client import ContextPilotIndexClient
client = ContextPilotIndexClient("http://localhost:8765", timeout=1.0)
# Primary API
reordered, order = client.reorder(contexts) # returns (reordered_contexts, original_indices)
result = client.reorder_raw(contexts) # returns full server response dict
# Stateful options
result = client.reorder_raw(
contexts, deduplicate=True, parent_request_ids=[None]
)
client.deduplicate(contexts, parent_request_ids, hint_template=None)
client.evict(request_ids)
client.reset()
# Queries
client.search(context, update_access=True)
client.insert(context, search_path, total_tokens=0)
client.get_stats()
client.get_requests()
client.health()
client.is_ready()
client.close()| Method | Description |
|---|---|
reorder(contexts, ...) |
Reorder contexts, returns (reordered_contexts, original_indices) |
reorder_raw(contexts, ...) |
Reorder contexts, returns full server response dict |
deduplicate(contexts, parent_request_ids, hint_template) |
Deduplicate contexts (Turn 2+) |
evict(request_ids) |
Remove requests from index |
reset() |
Reset index and conversation tracker |
get_stats() |
Get index statistics |
get_requests() |
Get all tracked request IDs |
health() |
Health check |
is_ready() |
Check if server is ready |
close() |
Close connection |
from contextpilot.server.http_client import ContextPilotIndexClient
client = ContextPilotIndexClient("http://localhost:8765")
# Turn 1: build index
turn1 = client.reorder_raw(
contexts=[[4, 3, 1]],
deduplicate=True,
parent_request_ids=[None]
)
turn1_id = turn1["request_ids"][0]
# Turn 2+: lightweight deduplication
turn2 = client.deduplicate(
contexts=[[4, 3, 2]],
parent_request_ids=[turn1_id]
)
result = turn2["results"][0]
print(f"Overlapping: {result['overlapping_docs']}") # [4, 3]
print(f"New docs: {result['new_docs']}") # [2]
client.close()These use a shared singleton ContextPilot instance internally — no need to create one yourself.
import contextpilot as cp
messages = cp.optimize(docs, query, conversation_id="user_42")
messages_batch, order = cp.optimize_batch(all_docs, all_queries)Signatures and parameters are identical to ContextPilot.optimize() and ContextPilot.optimize_batch() above.
Configuration for LLM generation used by RAGPipeline.
from contextpilot.pipeline import InferenceConfig
config = InferenceConfig(
model_name="Qwen/Qwen2.5-7B-Instruct",
backend="sglang",
base_url="http://localhost:30000",
max_tokens=256,
temperature=0.0
)| Parameter | Type | Default | Description |
|---|---|---|---|
model_name |
str | Required | Model name/path |
backend |
str | "sglang" |
Inference backend ("sglang" or "vllm") |
base_url |
str | Required | Server URL |
max_tokens |
int | 256 |
Maximum generation tokens |
temperature |
float | 0.0 |
Sampling temperature |
High-level pipeline combining retrieval and ContextPilot optimization.
from contextpilot.pipeline import RAGPipeline, InferenceConfig
pipeline = RAGPipeline(
retriever="bm25",
corpus_path="corpus.jsonl",
use_contextpilot=True,
use_gpu=False,
inference=InferenceConfig(...)
)| Parameter | Type | Default | Description |
|---|---|---|---|
retriever |
str | Required | Retriever type: "bm25", "faiss", or custom |
corpus_path |
str | Required | Path to corpus JSONL file |
use_contextpilot |
bool | True |
Enable ContextPilot optimization |
use_gpu |
bool | False |
Use GPU for distance computation |
inference |
InferenceConfig | None |
Configuration for LLM generation |
index_path |
str | None |
Path to FAISS index (faiss retriever only) |
embedding_model |
str | None |
Embedding model name (faiss retriever only) |
embedding_base_url |
str | None |
Embedding server URL (faiss retriever only) |
results = pipeline.run(queries=["What is ML?"], top_k=20, generate_responses=True)Returns:
{
"retrieval_results": [...],
"optimized_batch": [...],
"generation_results": [...],
"metadata": {"num_queries": 1, "num_groups": 1, "total_time": 1.5}
}results = pipeline.retrieve(queries=["What is ML?"], top_k=20)optimized = pipeline.optimize(retrieval_results)generation_results = pipeline.generate(optimized)pipeline.save_results(results, "output.jsonl")result = pipeline.process_conversation_turn(
conversation_id="session_123",
query="What is ML?",
top_k=10,
enable_deduplication=True,
generate_response=True
)pipeline.reset_conversation("session_123")pipeline.reset_all_conversations()