
Commit f77d097

jsbattig and claude committed
feat: Story #734 - Job-aware auto-update with graceful drain mode
Implements maintenance mode API for coordinated server shutdowns:
- MaintenanceState service with thread-safe singleton pattern
- REST API endpoints (enter/exit/status/drain-status) with admin auth
- DeploymentExecutor drain coordination with configurable timeout
- Health endpoint shows maintenance_mode status
- Comprehensive auto-update documentation

Key features:
- Auto-update waits for jobs to drain before restart (5min default timeout)
- Query endpoints remain available during maintenance
- New job submissions blocked with 503 during maintenance
- Graceful timeout logging with job details when forcing restart

New files:
- src/code_indexer/server/services/maintenance_service.py
- src/code_indexer/server/routers/maintenance_router.py
- docs/auto-update.md
- tests/unit/server/services/test_maintenance_service.py
- tests/unit/server/routers/test_maintenance_router.py
- tests/unit/server/auto_update/test_deployment_executor_drain.py

Closes #734

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 1828a3a commit f77d097

13 files changed

Lines changed: 1742 additions & 3 deletions


README.md

Lines changed: 3 additions & 0 deletions
@@ -313,6 +313,9 @@ For complete configuration reference including environment variables, daemon set
 - [AI Integration Guide](docs/ai-integration.md) - Connect AI assistants to CIDX
 - [MCP Bridge Guide](docs/mcpb/README.md) - Claude Desktop integration via MCP
 
+### Server Administration
+- [Auto-Update Guide](docs/auto-update.md) - Job-aware auto-update with graceful drain mode
+
 ### Advanced
 - [Architecture Guide](docs/architecture.md) - System design and storage architecture
 - [Migration Guide](docs/migration-to-v8.md) - Upgrading from v7.x to v8.x

docs/auto-update.md

Lines changed: 296 additions & 0 deletions
@@ -0,0 +1,296 @@
# CIDX Server Auto-Update Documentation

Story #734: Job-Aware Auto-Update with Graceful Drain Mode

## Overview

The CIDX Server includes an auto-update feature that automatically deploys updates when changes are detected in the master branch of the configured git repository. This feature includes job-aware graceful drain mode to prevent orphan jobs and data corruption during restarts.

## Architecture

### Component Overview

```
                  +-------------------+
                  | AutoUpdateService |
                  |  (Polling Loop)   |
                  +---------+---------+
                            |
        +-------------------+---------------------+
        |                   |                     |
+-------v--------+  +-------v--------+  +---------v----------+
| ChangeDetector |  | DeploymentLock |  | DeploymentExecutor |
+----------------+  +----------------+  +---------+----------+
                                                  |
                +----------------+----------------+
                |                |                |
         +------v------+  +------v------+  +------v------+
         |  git pull   |  | pip install |  | Maintenance |
         |             |  |             |  |  Mode Flow  |
         +-------------+  +-------------+  +------+------+
                                                  |
                +----------------+----------------+
                |                |                |
         +------v------+  +------v------+  +------v------+
         | Enter Maint.|  |  Wait for   |  |  systemctl  |
         | Mode (API)  |  |    Drain    |  |   restart   |
         +-------------+  +-------------+  +-------------+
```

### State Machine

The AutoUpdateService uses a state machine to manage deployment:

1. **IDLE** - Waiting for next check interval
2. **CHECKING** - Polling git remote for changes
3. **DEPLOYING** - Running git pull and pip install
4. **RESTARTING** - Executing graceful restart with drain
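
The four states can be modeled as an enum with an explicit transition table. This is a minimal sketch, not the service's actual code: the names `UpdateState`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical, and the set of legal transitions (for example, returning to IDLE when no changes are found or a deployment fails) is an assumption that this document does not spell out.

```python
from enum import Enum


class UpdateState(Enum):
    IDLE = "idle"
    CHECKING = "checking"
    DEPLOYING = "deploying"
    RESTARTING = "restarting"


# Assumed transition table: each state maps to the states it may move to.
ALLOWED_TRANSITIONS = {
    UpdateState.IDLE: {UpdateState.CHECKING},
    UpdateState.CHECKING: {UpdateState.IDLE, UpdateState.DEPLOYING},
    UpdateState.DEPLOYING: {UpdateState.IDLE, UpdateState.RESTARTING},
    UpdateState.RESTARTING: {UpdateState.IDLE},
}


def transition(current: UpdateState, target: UpdateState) -> UpdateState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Rejecting unlisted transitions makes it harder for a bug to jump straight from IDLE to RESTARTING without a deployment having run.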

## Job-Aware Drain Process

### Problem Solved

Previous versions performed a blind `systemctl restart` without checking for running jobs, causing:
- Orphan jobs left in "running" state indefinitely
- Potential data corruption from interrupted indexing operations
- Poor user experience when long-running jobs are killed

### Solution: Graceful Drain Mode

The auto-update process now uses a three-step maintenance mode flow:

1. **Enter Maintenance Mode** - Server stops accepting new jobs
2. **Wait for Drain** - Poll until all running jobs complete (with timeout)
3. **Execute Restart** - Restart the server once jobs complete or the drain timeout expires

### Drain Flow Diagram

```
Auto-Update Triggered
          |
          v
+-------------------+
| Enter Maintenance |---(POST /api/admin/maintenance/enter)
|       Mode        |
+---------+---------+
          |
          v
+-------------------+
|  Wait for Drain   |---(GET /api/admin/maintenance/drain-status)
| (poll every 10s)  |
+---------+---------+
          |
    +-----+-----+
    |           |
    v           v
 Drained   Timeout (300s)
    |           |
    |           +---> Log WARNING with job details
    |           |
    +-----+-----+
          |
          v
+-------------------+
| systemctl restart |
|    cidx-server    |
+-------------------+
```
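
The wait-for-drain step in the diagram comes down to a poll loop with a deadline. The sketch below illustrates that loop under stated assumptions and is not the `DeploymentExecutor` implementation: `wait_for_drain` and its `is_drained`, `sleep`, and `clock` parameters are hypothetical names. Only the 300 s timeout and the 10 s poll interval come from this document.

```python
import time
from typing import Callable


def wait_for_drain(
    is_drained: Callable[[], bool],
    timeout: float = 300.0,
    poll_interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True if drained before the deadline, False on timeout."""
    deadline = clock() + timeout
    while True:
        # In a real deployment this callback would query drain-status.
        if is_drained():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_interval, remaining))
```

A `False` result corresponds to the "Timeout (300s)" branch: the caller logs the remaining jobs and restarts anyway. Taking `sleep` and `clock` as parameters allows the loop to be tested without real waiting.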

## Configuration

### DeploymentExecutor Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `server_url` | `http://localhost:8000` | CIDX server URL for maintenance API |
| `drain_timeout` | `300` (5 min) | Maximum seconds to wait for drain |
| `drain_poll_interval` | `10` | Seconds between drain status checks |

### Environment Variables (systemd)

Configure in `/etc/systemd/system/cidx-auto-update.service`:

```ini
[Unit]
Description=CIDX Auto-Update Service
After=network.target cidx-server.service

[Service]
Type=simple
User=cidx
WorkingDirectory=/opt/cidx
ExecStart=/usr/bin/python3 -m code_indexer.server.auto_update.cli
Restart=always
RestartSec=30
Environment=CIDX_REPO_PATH=/opt/cidx
Environment=CIDX_CHECK_INTERVAL=300

[Install]
WantedBy=multi-user.target
```

## API Endpoints

### Maintenance Mode Endpoints

All endpoints are under `/api/admin/maintenance/`.

**Authentication Required**: All maintenance endpoints require admin authentication. Include a valid Bearer token in the Authorization header:

```bash
curl -X POST http://localhost:8000/api/admin/maintenance/enter \
  -H "Authorization: Bearer YOUR_ADMIN_TOKEN"
```

Without valid admin authentication, all endpoints return HTTP 401 Unauthorized.
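
A Python client can build the same authenticated request using only the standard library. This is a sketch: `build_request` and `BASE_URL` are hypothetical names, and the host and port are taken from the `curl` example above.

```python
import urllib.request

# Assumed base URL, matching the curl example in this guide.
BASE_URL = "http://localhost:8000/api/admin/maintenance"


def build_request(action: str, token: str, method: str = "GET") -> urllib.request.Request:
    """Build a request for enter/exit/status/drain-status with a Bearer token."""
    return urllib.request.Request(
        f"{BASE_URL}/{action}",
        method=method,
        headers={"Authorization": f"Bearer {token}"},
    )
```

Passing the result to `urllib.request.urlopen` sends the request. A 401 response is raised as `urllib.error.HTTPError`, so callers should be prepared to catch it.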

#### POST /enter

Enter maintenance mode. Stops accepting new jobs while allowing running jobs to complete.

**Response (200 OK)**:
```json
{
  "maintenance_mode": true,
  "running_jobs": 3,
  "queued_jobs": 5,
  "entered_at": "2025-01-17T10:00:00Z",
  "message": "Maintenance mode active. 3 running, 5 queued."
}
```

#### POST /exit

Exit maintenance mode. Resumes accepting new jobs.

**Response (200 OK)**:
```json
{
  "maintenance_mode": false,
  "message": "Maintenance mode deactivated."
}
```

#### GET /status

Get current maintenance mode status.

**Response (200 OK)**:
```json
{
  "maintenance_mode": true,
  "drained": false,
  "running_jobs": 2,
  "queued_jobs": 0,
  "entered_at": "2025-01-17T10:00:00Z"
}
```

#### GET /drain-status

Get detailed drain status for auto-update coordination.

**Response (200 OK)**:
```json
{
  "drained": false,
  "running_jobs": 2,
  "queued_jobs": 0,
  "estimated_drain_seconds": 120,
  "jobs": [
    {
      "job_id": "abc-123",
      "operation_type": "add_golden_repo",
      "started_at": "2025-01-17T10:00:00Z",
      "progress": 50
    }
  ]
}
```
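
A caller coordinating a restart might reduce the drain-status payload to a single line for logging. The helper below is a hypothetical sketch (`summarize_drain_status` is not part of the API) and assumes the response has exactly the fields shown in the example above.

```python
import json


def summarize_drain_status(payload: str) -> str:
    """Turn a drain-status JSON body into a short human-readable summary."""
    status = json.loads(payload)
    if status["drained"]:
        return "drained"
    # One entry per job still running, using the fields from the example response.
    jobs = ", ".join(
        f"{job['job_id']} ({job['operation_type']}, {job['progress']}%)"
        for job in status.get("jobs", [])
    )
    return f"{status['running_jobs']} running, {status['queued_jobs']} queued: {jobs}"
```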

## Health Endpoint Integration

The `/health` endpoint includes maintenance mode status:

```json
{
  "status": "healthy",
  "maintenance_mode": true,
  "uptime": 3600,
  ...
}
```

During maintenance mode, treat the server as **degraded** (not unhealthy). The `maintenance_mode` flag in the response indicates that:
- The server is operational for queries
- New jobs are rejected (returns 503)
- The system is draining for a planned restart

## Job Rejection During Maintenance

When the server is in maintenance mode, job submission endpoints return HTTP 503:

```json
{
  "error": "Server is in maintenance mode. New jobs are not accepted. Please retry after 60 seconds."
}
```

Affected operations:
- Repository add/sync via `GoldenRepoManager`
- Background jobs via `BackgroundJobManager`
- Sync jobs via `SyncJobManager`

Query operations continue to work normally.
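
The rejection rule can be sketched as a lock-guarded in-memory flag that every submission path checks first. This is an illustration only. `is_maintenance_mode()` matches the call visible in the `app.py` change in this commit, while `enter()`, `exit()`, `MaintenanceModeError`, and `submit_job` are hypothetical names, and the real service may be structured differently.

```python
import threading


class MaintenanceState:
    """In-memory maintenance flag guarded by a lock (illustrative sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active = False

    def enter(self) -> None:
        with self._lock:
            self._active = True

    def exit(self) -> None:
        with self._lock:
            self._active = False

    def is_maintenance_mode(self) -> bool:
        with self._lock:
            return self._active


class MaintenanceModeError(Exception):
    """Hypothetical error that an HTTP layer would map to a 503 response."""

    status_code = 503


def submit_job(state: MaintenanceState, job: str, queue: list) -> None:
    # Reject before touching the queue so nothing new starts while draining.
    if state.is_maintenance_mode():
        raise MaintenanceModeError("Server is in maintenance mode. New jobs are not accepted.")
    queue.append(job)
```

The flag lives only in memory, so a fresh process always starts with maintenance mode off.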

## Force Restart Logging

When drain timeout is exceeded, the system logs all running jobs at WARNING level before forcing restart:

```
WARNING: Forcing restart - running job: job_id=abc-123, operation_type=add_golden_repo, started_at=2025-01-17T10:00:00Z, progress=50%
WARNING: Drain timeout exceeded, forcing restart
```

This provides visibility into which jobs were interrupted for post-restart recovery.
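
Log lines in this format could be produced with the standard `logging` module, as sketched below. The logger name `auto_update` and the helpers `format_job` and `log_forced_restart` are hypothetical. The job dictionary is assumed to have the same fields as a drain-status entry.

```python
import logging

# Hypothetical logger name; the real module may use a different one.
logger = logging.getLogger("auto_update")


def format_job(job: dict) -> str:
    """Render one running job in the format shown in the log sample."""
    return (
        f"Forcing restart - running job: job_id={job['job_id']}, "
        f"operation_type={job['operation_type']}, "
        f"started_at={job['started_at']}, progress={job['progress']}%"
    )


def log_forced_restart(jobs: list) -> None:
    """Log every still-running job, then the timeout notice, at WARNING level."""
    for job in jobs:
        logger.warning(format_job(job))
    logger.warning("Drain timeout exceeded, forcing restart")
```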

## Startup Behavior

On server startup, the maintenance state is automatically cleared:

1. Maintenance mode is NOT persisted to disk (in-memory only)
2. Server starts in normal operation mode
3. Log message confirms: "Server started in normal operation mode"

This ensures the server recovers cleanly from crashes or forced restarts.

## Troubleshooting

### Server stuck in maintenance mode

If the server is stuck in maintenance mode:

1. Check current status: `curl -H "Authorization: Bearer YOUR_ADMIN_TOKEN" http://localhost:8000/api/admin/maintenance/status`
2. Manually exit: `curl -X POST -H "Authorization: Bearer YOUR_ADMIN_TOKEN" http://localhost:8000/api/admin/maintenance/exit`
3. Or restart the server (maintenance state is cleared on restart)

### Auto-update not working

1. Check auto-update service status: `systemctl status cidx-auto-update`
2. Check logs: `journalctl -u cidx-auto-update -f`
3. Verify git remote access: `cd /opt/cidx && git fetch origin master`

### Jobs orphaned after restart

If jobs were interrupted during a forced restart:

1. Jobs are automatically marked as CANCELLED on next server startup
2. Check job status via API: `GET /api/jobs/{job_id}/status`
3. Re-submit interrupted jobs as needed

## Best Practices

1. **Set appropriate drain timeout** - Consider your longest-running jobs when configuring `drain_timeout`
2. **Monitor during updates** - Watch logs during auto-update for any forced restarts
3. **Schedule updates during low usage** - If possible, configure check intervals to align with low-traffic periods
4. **Test recovery procedures** - Periodically verify job recovery after interruptions

src/code_indexer/server/app.py

Lines changed: 4 additions & 0 deletions
@@ -81,6 +81,8 @@
 from .routers.indexing import router as indexing_router
 from .routers.cache import router as cache_router
 from .routers.delegation_callbacks import router as delegation_callbacks_router
+from .routers.maintenance_router import router as maintenance_router
+from .services.maintenance_service import get_maintenance_state
 from .routers.groups import (
 router as groups_router,
 users_router,
@@ -2843,6 +2845,7 @@ async def health_check(
 "failed_jobs": failed_jobs,
 },
 "started_at": get_server_start_time(),
+"maintenance_mode": get_maintenance_state().is_maintenance_mode(),
 }
 
 # Add version if available
@@ -7557,6 +7560,7 @@ async def get_repository_info(
 app.include_router(users_router)
 app.include_router(audit_router)
 app.include_router(delegation_callbacks_router)
+app.include_router(maintenance_router)
 
 # Mount Web Admin UI routes and static files
 from fastapi.staticfiles import StaticFiles
