|
| 1 | +# CIDX Server Auto-Update Documentation |
| 2 | + |
| 3 | +Story #734: Job-Aware Auto-Update with Graceful Drain Mode |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The CIDX Server includes an auto-update feature that deploys updates when it detects changes on the master branch of the configured git repository. A job-aware graceful drain mode prevents orphaned jobs and data corruption during restarts. |
| 8 | + |
| 9 | +## Architecture |
| 10 | + |
| 11 | +### Component Overview |
| 12 | + |
| 13 | +``` |
| 14 | +                  +-------------------+ |
| 15 | +                  | AutoUpdateService | |
| 16 | +                  |  (Polling Loop)   | |
| 17 | +                  +---------+---------+ |
| 18 | +                            | |
| 19 | +        +-------------------+---------------------+ |
| 20 | +        |                   |                     | |
| 21 | ++-------v--------+  +-------v--------+  +---------v----------+ |
| 22 | +| ChangeDetector |  | DeploymentLock |  | DeploymentExecutor | |
| 23 | ++----------------+  +----------------+  +---------+----------+ |
| 24 | +                                                  | |
| 25 | +                                 +----------------+----------------+ |
| 26 | +                                 |                |                | |
| 27 | +                          +------v------+  +------v------+  +------v------+ |
| 28 | +                          |  git pull   |  | pip install |  | Maintenance | |
| 29 | +                          |             |  |             |  |  Mode Flow  | |
| 30 | +                          +-------------+  +-------------+  +------+------+ |
| 31 | +                                                                   | |
| 32 | +                                                  +----------------+----------------+ |
| 33 | +                                                  |                |                | |
| 34 | +                                           +------v------+  +------v------+  +------v------+ |
| 35 | +                                           |Enter Maint. |  |  Wait for   |  |  systemctl  | |
| 36 | +                                           | Mode (API)  |  |    Drain    |  |   restart   | |
| 37 | +                                           +-------------+  +-------------+  +-------------+ |
| 38 | +``` |
| 39 | + |
| 40 | +### State Machine |
| 41 | + |
| 42 | +The AutoUpdateService uses a state machine to manage deployment: |
| 43 | + |
| 44 | +1. **IDLE** - Waiting for next check interval |
| 45 | +2. **CHECKING** - Polling git remote for changes |
| 46 | +3. **DEPLOYING** - Running git pull and pip install |
| 47 | +4. **RESTARTING** - Executing graceful restart with drain |
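
The four states above can be modelled as an enum with an explicit transition table. This is only a minimal sketch: the names `UpdateState`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical, and the set of allowed transitions is an assumption inferred from the state descriptions, not taken from the AutoUpdateService source.

```python
from enum import Enum


class UpdateState(Enum):
    """The four states listed above (enum name is hypothetical)."""
    IDLE = "idle"
    CHECKING = "checking"
    DEPLOYING = "deploying"
    RESTARTING = "restarting"


# Assumed transitions: each step either advances or falls back to IDLE.
# The real service may allow other paths (for example on errors).
ALLOWED_TRANSITIONS = {
    UpdateState.IDLE: {UpdateState.CHECKING},
    UpdateState.CHECKING: {UpdateState.IDLE, UpdateState.DEPLOYING},
    UpdateState.DEPLOYING: {UpdateState.IDLE, UpdateState.RESTARTING},
    UpdateState.RESTARTING: {UpdateState.IDLE},
}


def transition(current: UpdateState, target: UpdateState) -> UpdateState:
    """Return the target state, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Keeping the table explicit makes an unexpected jump, such as IDLE straight to RESTARTING, fail loudly instead of silently restarting the server.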
| 48 | + |
| 49 | +## Job-Aware Drain Process |
| 50 | + |
| 51 | +### Problem Solved |
| 52 | + |
| 53 | +Previous versions performed a blind `systemctl restart` without checking for running jobs, causing: |
| 54 | +- Orphan jobs left in "running" state indefinitely |
| 55 | +- Potential data corruption from interrupted indexing operations |
| 56 | +- Poor user experience when long-running jobs are killed |
| 57 | + |
| 58 | +### Solution: Graceful Drain Mode |
| 59 | + |
| 60 | +The auto-update process now uses a three-step maintenance mode flow: |
| 61 | + |
| 62 | +1. **Enter Maintenance Mode** - Server stops accepting new jobs |
| 63 | +2. **Wait for Drain** - Poll until all running jobs complete (with timeout) |
| 64 | +3. **Execute Restart** - Restart the server after jobs complete |
| 65 | + |
| 66 | +### Drain Flow Diagram |
| 67 | + |
| 68 | +``` |
| 69 | +Auto-Update Triggered |
| 70 | +          | |
| 71 | +          v |
| 72 | ++-------------------+ |
| 73 | +| Enter Maintenance |---(POST /api/admin/maintenance/enter) |
| 74 | +| Mode              | |
| 75 | ++---------+---------+ |
| 76 | +          | |
| 77 | +          v |
| 78 | ++-------------------+ |
| 79 | +| Wait for Drain    |---(GET /api/admin/maintenance/drain-status) |
| 80 | +| (poll every 10s)  | |
| 81 | ++---------+---------+ |
| 82 | +          | |
| 83 | +    +-----+-----+ |
| 84 | +    |           | |
| 85 | +    v           v |
| 86 | + Drained   Timeout (300s) |
| 87 | +    |           | |
| 88 | +    |           +---> Log WARNING with job details |
| 89 | +    |           | |
| 90 | +    +-----+-----+ |
| 91 | +          | |
| 92 | +          v |
| 93 | ++-------------------+ |
| 94 | +| systemctl restart | |
| 95 | +| cidx-server       | |
| 96 | ++-------------------+ |
| 97 | +``` |
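
The "Wait for Drain" step in the diagram amounts to a poll loop with a deadline. The sketch below is illustrative rather than the DeploymentExecutor implementation: `wait_for_drain` and its `fetch_status` callback are hypothetical names, and the callback is assumed to return the parsed JSON body of `GET /api/admin/maintenance/drain-status`. The sleep and clock functions are injectable so the loop can be exercised without real waiting.

```python
import time
from typing import Any, Callable, Dict


def wait_for_drain(
    fetch_status: Callable[[], Dict[str, Any]],
    timeout: float = 300.0,        # mirrors the documented drain_timeout default
    poll_interval: float = 10.0,   # mirrors the documented drain_poll_interval default
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the server reports drained.

    Returns True if drained before the deadline, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status.get("drained"):
            return True
        if clock() >= deadline:
            # Caller is expected to log the running jobs and force the restart.
            return False
        sleep(poll_interval)
```

Both outcomes lead to the restart, as in the diagram; the return value only tells the caller whether to log the interrupted jobs first.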
| 98 | + |
| 99 | +## Configuration |
| 100 | + |
| 101 | +### DeploymentExecutor Parameters |
| 102 | + |
| 103 | +| Parameter | Default | Description | |
| 104 | +|-----------|---------|-------------| |
| 105 | +| `server_url` | `http://localhost:8000` | CIDX server URL for maintenance API | |
| 106 | +| `drain_timeout` | `300` (5 min) | Maximum seconds to wait for drain | |
| 107 | +| `drain_poll_interval` | `10` | Seconds between drain status checks | |
| 108 | + |
| 109 | +### Environment Variables (systemd) |
| 110 | + |
| 111 | +Configure in `/etc/systemd/system/cidx-auto-update.service`: |
| 112 | + |
| 113 | +```ini |
| 114 | +[Unit] |
| 115 | +Description=CIDX Auto-Update Service |
| 116 | +After=network.target cidx-server.service |
| 117 | + |
| 118 | +[Service] |
| 119 | +Type=simple |
| 120 | +User=cidx |
| 121 | +WorkingDirectory=/opt/cidx |
| 122 | +ExecStart=/usr/bin/python3 -m code_indexer.server.auto_update.cli |
| 123 | +Restart=always |
| 124 | +RestartSec=30 |
| 125 | +Environment=CIDX_REPO_PATH=/opt/cidx |
| 126 | +Environment=CIDX_CHECK_INTERVAL=300 |
| 127 | + |
| 128 | +[Install] |
| 129 | +WantedBy=multi-user.target |
| 130 | +``` |
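
For illustration, the two `Environment=` values in the unit file could be read on the Python side roughly as follows. `AutoUpdateSettings` and `load_settings` are hypothetical names, and the fallback values simply mirror the unit file above; the actual CLI may parse its environment differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AutoUpdateSettings:
    """Hypothetical container for the two variables set in the unit file."""
    repo_path: str
    check_interval: int  # seconds between checks


def load_settings(env: Mapping[str, str] = os.environ) -> AutoUpdateSettings:
    """Read CIDX_REPO_PATH and CIDX_CHECK_INTERVAL from the environment.

    The fallbacks are assumptions that mirror the unit file values.
    """
    repo_path = env.get("CIDX_REPO_PATH", "/opt/cidx")
    check_interval = int(env.get("CIDX_CHECK_INTERVAL", "300"))
    if check_interval <= 0:
        raise ValueError("CIDX_CHECK_INTERVAL must be a positive integer")
    return AutoUpdateSettings(repo_path, check_interval)
```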
| 131 | + |
| 132 | +## API Endpoints |
| 133 | + |
| 134 | +### Maintenance Mode Endpoints |
| 135 | + |
| 136 | +All endpoints are under `/api/admin/maintenance/`. |
| 137 | + |
| 138 | +**Authentication Required**: All maintenance endpoints require admin authentication. Include a valid Bearer token in the Authorization header: |
| 139 | + |
| 140 | +```bash |
| 141 | +curl -X POST http://localhost:8000/api/admin/maintenance/enter \ |
| 142 | + -H "Authorization: Bearer YOUR_ADMIN_TOKEN" |
| 143 | +``` |
| 144 | + |
| 145 | +Without valid admin authentication, all endpoints return HTTP 401 Unauthorized. |
| 146 | + |
| 147 | +#### POST /enter |
| 148 | + |
| 149 | +Enter maintenance mode. Stops accepting new jobs while allowing running jobs to complete. |
| 150 | + |
| 151 | +**Response (200 OK)**: |
| 152 | +```json |
| 153 | +{ |
| 154 | + "maintenance_mode": true, |
| 155 | + "running_jobs": 3, |
| 156 | + "queued_jobs": 5, |
| 157 | + "entered_at": "2025-01-17T10:00:00Z", |
| 158 | + "message": "Maintenance mode active. 3 running, 5 queued." |
| 159 | +} |
| 160 | +``` |
| 161 | + |
| 162 | +#### POST /exit |
| 163 | + |
| 164 | +Exit maintenance mode. Resumes accepting new jobs. |
| 165 | + |
| 166 | +**Response (200 OK)**: |
| 167 | +```json |
| 168 | +{ |
| 169 | + "maintenance_mode": false, |
| 170 | + "message": "Maintenance mode deactivated." |
| 171 | +} |
| 172 | +``` |
| 173 | + |
| 174 | +#### GET /status |
| 175 | + |
| 176 | +Get current maintenance mode status. |
| 177 | + |
| 178 | +**Response (200 OK)**: |
| 179 | +```json |
| 180 | +{ |
| 181 | + "maintenance_mode": true, |
| 182 | + "drained": false, |
| 183 | + "running_jobs": 2, |
| 184 | + "queued_jobs": 0, |
| 185 | + "entered_at": "2025-01-17T10:00:00Z" |
| 186 | +} |
| 187 | +``` |
| 188 | + |
| 189 | +#### GET /drain-status |
| 190 | + |
| 191 | +Get detailed drain status for auto-update coordination. |
| 192 | + |
| 193 | +**Response (200 OK)**: |
| 194 | +```json |
| 195 | +{ |
| 196 | +  "drained": false, |
| 197 | +  "running_jobs": 2, |
| 198 | +  "queued_jobs": 0, |
| 199 | +  "estimated_drain_seconds": 120, |
| 200 | +  "jobs": [ |
| 201 | +    { |
| 202 | +      "job_id": "abc-123", |
| 203 | +      "operation_type": "add_golden_repo", |
| 204 | +      "started_at": "2025-01-17T10:00:00Z", |
| 205 | +      "progress": 50 |
| 206 | +    } |
| 207 | +  ] |
| 208 | +} |
| 209 | +``` |
| 210 | + |
| 211 | +## Health Endpoint Integration |
| 212 | + |
| 213 | +The `/health` endpoint includes maintenance mode status: |
| 214 | + |
| 215 | +```json |
| 216 | +{ |
| 217 | + "status": "degraded", |
| 218 | + "maintenance_mode": true, |
| 219 | + "uptime": 3600, |
| 220 | + ... |
| 221 | +} |
| 222 | +``` |
| 223 | + |
| 224 | +During maintenance mode, the health status is **degraded** (not unhealthy) to indicate: |
| 225 | +- Server is operational for queries |
| 226 | +- New jobs are rejected with HTTP 503 |
| 227 | +- System is draining for planned restart |
| 228 | + |
| 229 | +## Job Rejection During Maintenance |
| 230 | + |
| 231 | +When the server is in maintenance mode, job submission endpoints return HTTP 503: |
| 232 | + |
| 233 | +```json |
| 234 | +{ |
| 235 | + "error": "Server is in maintenance mode. New jobs are not accepted. Please retry after 60 seconds." |
| 236 | +} |
| 237 | +``` |
| 238 | + |
| 239 | +Affected operations: |
| 240 | +- Repository add/sync via `GoldenRepoManager` |
| 241 | +- Background jobs via `BackgroundJobManager` |
| 242 | +- Sync jobs via `SyncJobManager` |
| 243 | + |
| 244 | +Query operations continue to work normally. |
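
A client that submits jobs can treat the 503 as a signal to wait and retry. The helper below is a sketch only: `submit_with_retry` is a hypothetical name, `submit` stands for whatever function performs the HTTP call and returns a `(status_code, body)` pair, and the 60-second delay is taken from the error message above.

```python
import time
from typing import Any, Callable, Dict, Tuple

Response = Tuple[int, Dict[str, Any]]


def submit_with_retry(
    submit: Callable[[], Response],
    max_attempts: int = 3,
    retry_delay: float = 60.0,  # the error message suggests retrying after 60 seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call submit() until it returns something other than HTTP 503.

    Returns the last response, which is still a 503 if every attempt was rejected.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = submit()
        if status != 503:
            break
        if attempt < max_attempts:
            sleep(retry_delay)
    return status, body
```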
| 245 | + |
| 246 | +## Force Restart Logging |
| 247 | + |
| 248 | +When the drain timeout is exceeded, the system logs every running job at WARNING level before forcing the restart: |
| 249 | + |
| 250 | +``` |
| 251 | +WARNING: Forcing restart - running job: job_id=abc-123, operation_type=add_golden_repo, started_at=2025-01-17T10:00:00Z, progress=50% |
| 252 | +WARNING: Drain timeout exceeded, forcing restart |
| 253 | +``` |
| 254 | + |
| 255 | +This provides visibility into which jobs were interrupted for post-restart recovery. |
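
As a rough illustration, log lines in this shape could be built from the `jobs` array of the drain-status response. The function names and the logger name `cidx.auto_update` are hypothetical; only the message format is taken from the example above.

```python
import logging
from typing import Any, Dict, Iterable, List

logger = logging.getLogger("cidx.auto_update")  # hypothetical logger name


def forced_restart_messages(jobs: Iterable[Dict[str, Any]]) -> List[str]:
    """Build one message per running job, followed by the final timeout notice."""
    messages = [
        "Forcing restart - running job: "
        f"job_id={job['job_id']}, operation_type={job['operation_type']}, "
        f"started_at={job['started_at']}, progress={job['progress']}%"
        for job in jobs
    ]
    messages.append("Drain timeout exceeded, forcing restart")
    return messages


def log_forced_restart(jobs: Iterable[Dict[str, Any]]) -> None:
    """Emit the messages at WARNING level."""
    for message in forced_restart_messages(jobs):
        logger.warning(message)
```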
| 256 | + |
| 257 | +## Startup Behavior |
| 258 | + |
| 259 | +On server startup, the maintenance state is automatically cleared: |
| 260 | + |
| 261 | +1. Maintenance mode is NOT persisted to disk (in-memory only) |
| 262 | +2. Server starts in normal operation mode |
| 263 | +3. Log message confirms: "Server started in normal operation mode" |
| 264 | + |
| 265 | +This ensures the server recovers cleanly from crashes or forced restarts. |
| 266 | + |
| 267 | +## Troubleshooting |
| 268 | + |
| 269 | +### Server stuck in maintenance mode |
| 270 | + |
| 271 | +If the server is stuck in maintenance mode: |
| 272 | + |
| 273 | +1. Check current status: `curl -H "Authorization: Bearer YOUR_ADMIN_TOKEN" http://localhost:8000/api/admin/maintenance/status` |
| 274 | +2. Manually exit: `curl -X POST -H "Authorization: Bearer YOUR_ADMIN_TOKEN" http://localhost:8000/api/admin/maintenance/exit` |
| 275 | +3. Or restart the server (maintenance state is cleared on restart) |
| 276 | + |
| 277 | +### Auto-update not working |
| 278 | + |
| 279 | +1. Check auto-update service status: `systemctl status cidx-auto-update` |
| 280 | +2. Check logs: `journalctl -u cidx-auto-update -f` |
| 281 | +3. Verify git remote access: `cd /opt/cidx && git fetch origin master` |
| 282 | + |
| 283 | +### Jobs orphaned after restart |
| 284 | + |
| 285 | +If jobs were interrupted during a forced restart: |
| 286 | + |
| 287 | +1. Jobs are automatically marked as CANCELLED on next server startup |
| 288 | +2. Check job status via API: `GET /api/jobs/{job_id}/status` |
| 289 | +3. Re-submit the cancelled jobs as needed |
| 290 | + |
| 291 | +## Best Practices |
| 292 | + |
| 293 | +1. **Set appropriate drain timeout** - Consider your longest-running jobs when configuring `drain_timeout` |
| 294 | +2. **Monitor during updates** - Watch logs during auto-update for any forced restarts |
| 295 | +3. **Schedule updates during low usage** - If possible, merge to master during low-traffic periods, since an update is deployed once a change is detected on that branch |
| 296 | +4. **Test recovery procedures** - Periodically verify job recovery after interruptions |