Describe the bug
CLI headless mode: internal state accumulation causes progressive latency degradation across sessions
Summary
When running the Copilot CLI in headless mode (copilot --headless --port <port>) with BYOK (Azure OpenAI), response latency degrades progressively with each new session, even after properly disconnecting and deleting sessions via the SDK. The first request after a fresh CLI start completes in ~1–3s, but subsequent requests degrade to 17–30s. Only killing and restarting the CLI process restores performance.
Environment
- CLI version: 1.0.28
- SDK version (Python): 0.2.2 (container), 0.1.32 (host)
- OS: Debian (python:3.13-slim Docker image), Linux amd64
- Provider: Azure OpenAI (BYOK), gpt-4o-mini deployment
- Provider wire API: responses
- Azure OpenAI API version: 2025-01-01-preview
Steps to Reproduce

1. Start the CLI in headless mode inside a container:

   ```shell
   copilot --headless --port 14321
   ```

2. From an external process, connect via the SDK, create a session, send a message, disconnect, and delete the session:

   ```python
   client = CopilotClient({"cli_url": "container-host:4321"})
   await client.start()
   session = await client.create_session(
       session_id="test-1",
       model="gpt-4o-mini",
       provider={"type": "azure", "base_url": "...", "api_key": "...", "azure": {"api_version": "..."}},
       system_message={"mode": "replace", "content": "Answer briefly."},
       streaming=True,
   )
   response = await session.send_and_wait("What is 2+2?")  # ~1-3s ✅

   await session.disconnect()

   # Cleanup
   sessions = await client.list_sessions()
   for s in sessions:
       await client.delete_session(s.session_id)
   await client.stop()
   ```

3. From a new process (fresh Python interpreter, new CopilotClient instance), repeat step 2 with a different session_id:

   ```python
   # New process, new client, new session_id
   response = await session.send_and_wait("What is 2+2?")  # ~17-30s ❌
   ```

4. Each subsequent new-process request gets progressively slower, stabilizing at ~28-30s.
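The per-request numbers in the steps above can be collected with a small timing harness along these lines. This is a sketch: the stand-in coroutine below replaces the actual `session.send_and_wait(...)` call, which requires a live CLI and provider.

```python
import asyncio
import time

async def timed(awaitable):
    """Await `awaitable` and return (result, elapsed_ms)."""
    start = time.perf_counter()
    result = await awaitable
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

async def main():
    # Stand-in for `session.send_and_wait("What is 2+2?")`.
    async def fake_send_and_wait():
        await asyncio.sleep(0.05)
        return "4"

    result, ms = await timed(fake_send_and_wait())
    print(f"{result!r} in {ms:.0f} ms")

asyncio.run(main())
```

In the real reproduction, the awaitable passed to `timed` is the SDK call; everything else stays the same.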
Expected Behavior
After session.disconnect() + client.delete_session(), the CLI should fully release all resources associated with that session. New sessions — whether from the same or a different SDK client process — should have consistent latency (~1-3s for a simple prompt when the LLM backend responds in <1s).
Actual Behavior
| Request # | Same CLI instance | send_and_wait latency | Notes |
|---|---|---|---|
| 1 | Fresh start | ~1,000–3,000 ms | ✅ Fast |
| 2 | Reused | ~17,000 ms | ❌ Degraded |
| 3 | Reused | ~28,000–30,000 ms | ❌ Severely degraded |
| 4+ | Reused | ~28,000–30,000 ms | ❌ Plateaus at ~29s |
| 1 (after kill+restart) | Fresh start | ~1,000–3,000 ms | ✅ Fast again |
Key observations:
- Azure OpenAI is NOT the bottleneck: direct HTTP calls to the same Azure endpoint consistently return in ~900 ms (verified with httpx, bypassing the CLI/SDK entirely).
- delete_session() does not fully clean up: after list_sessions() + delete_session() for all sessions, list_sessions() returns empty, but the CLI process still retains internal state that causes degradation.
- Wiping session-state files doesn't help: deleting /root/.copilot/session-state/* on disk while the CLI is running has no effect; the state is held in memory.
- Only a process kill fixes it: killall copilot followed by a fresh copilot --headless --port <port> restores full performance immediately.
- Pattern is consistent: reproduced dozens of times across multiple hours. The degradation occurs even when each request uses a unique session_id and a completely fresh CopilotClient from a new OS process.
Workaround (current)
We run the CLI via an entrypoint script that auto-restarts it when killed. After each SDK request, the application kills the CLI process via docker exec killall copilot, and the entrypoint restarts it with clean state. This is fragile and adds ~3-5s of restart overhead per request.
```shell
# entrypoint.sh (simplified)
copilot --headless --port 14321 &
CLI_PID=$!
while true; do
  if ! kill -0 "$CLI_PID" 2>/dev/null; then
    rm -rf /root/.copilot/session-state/*
    copilot --headless --port 14321 &
    CLI_PID=$!
  fi
  sleep 1
done
```
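On the application side, the script's `kill -0` liveness probe has a direct Python equivalent, which is how the application decides the CLI has been restarted and is safe to reconnect to. A minimal sketch using only the standard library:

```python
import os

def process_alive(pid: int) -> bool:
    """Mirror of `kill -0 <pid>`: True if a process with this PID exists."""
    try:
        os.kill(pid, 0)  # signal 0 performs an existence check; nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but is owned by another user
    return True

print(process_alive(os.getpid()))  # the current process is alive → True
```

In the workaround, the application polls this until the freshly restarted CLI's PID appears before issuing the next SDK request.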
Impact
This makes the CLI unsuitable for any multi-request server workload (web backends, API services, chatbots) without the kill-restart hack. In production, our chat feature degrades from sub-3s responses to 30s responses after just 2 user messages.
Possible Root Cause (speculation)
The CLI appears to retain conversation context or model state in memory across sessions even after delete_session(). This accumulated context may be sent with each new request to the LLM provider, causing the provider to process increasingly large payloads (explaining the progressive slowdown). The ~29s plateau could be the provider's timeout or max-context processing time.
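If the CLI really is resending accumulated context, the observed curve (fast, then progressively slower, then a plateau) matches a payload that grows linearly with session count until it hits the provider's context limit. The toy model below illustrates the shape; every constant is an illustrative assumption, not a measurement:

```python
# Toy model: each new session's request silently carries all prior
# conversation context, so payload size grows linearly with the number of
# sessions until it saturates at the provider's context limit.
BASE_TOKENS = 200                  # prompt + system message (assumed)
LEAKED_TOKENS_PER_SESSION = 8_000  # hypothetical retained context per prior session
CONTEXT_LIMIT = 128_000            # e.g. the gpt-4o-mini context window

def tokens_sent(request_number: int) -> int:
    leaked = (request_number - 1) * LEAKED_TOKENS_PER_SESSION
    return min(BASE_TOKENS + leaked, CONTEXT_LIMIT)

for n in (1, 2, 3, 20):
    print(n, tokens_sent(n))
```

Under this model, latency grows with payload size and flattens once the limit is reached, consistent with the ~29s plateau described above.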
Reproduction Script
Full reproduction script available: creates N sequential requests with fresh SDK clients, measures latency, and optionally kills CLI between requests to demonstrate the fix.
```shell
# Set environment variables:
# COPILOT_CLI_URL=localhost:4321
# AZURE_OPENAI_ENDPOINT=https://your-endpoint.openai.azure.com/
# AZURE_OPENAI_API_KEY=your-key
# AZURE_OPENAI_API_VERSION=2025-01-01-preview
# NUM_REQUESTS=3
# KILL_CLI=0 (set to 1 to kill CLI between requests — makes all fast)
python tools/test_sdk_perf.py
```
test_sdk_perf.py
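The kind of per-request summary the script prints can be sketched as follows; the bucketing thresholds are assumptions derived from the latency table above, and the measurements shown are illustrative:

```python
def classify(latency_ms: float) -> str:
    """Bucket a send_and_wait latency against the baseline in this report."""
    if latency_ms < 3_000:
        return "fast"              # matches a fresh CLI start
    if latency_ms < 20_000:
        return "degraded"
    return "severely degraded"     # the ~28-30s plateau

measurements = [1800, 17200, 28900, 29400]  # illustrative values, in ms
for i, ms in enumerate(measurements, start=1):
    print(f"request {i}: {ms} ms -> {classify(ms)}")
```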
Affected version
No response
Steps to reproduce the behavior
No response
Expected behavior
No response
Additional context
No response