jsbattig
diff --git a/README.md b/README.md
Lines changed: 3 additions & 0 deletions
diff --git a/docs/auto-update.md b/docs/auto-update.md
Lines changed: 296 additions & 0 deletions
diff --git a/src/code_indexer/server/app.py b/src/code_indexer/server/app.py
Lines changed: 4 additions & 0 deletions
@@ -313,6 +313,9 @@ For complete configuration reference including environment variables, daemon set
 - [AI Integration Guide](docs/ai-integration.md) - Connect AI assistants to CIDX
 - [MCP Bridge Guide](docs/mcpb/README.md) - Claude Desktop integration via MCP
 
+### Server Administration
+- [Auto-Update Guide](docs/auto-update.md) - Job-aware auto-update with graceful drain mode
+
 ### Advanced
 - [Architecture Guide](docs/architecture.md) - System design and storage architecture
 - [Migration Guide](docs/migration-to-v8.md) - Upgrading from v7.x to v8.x
 
@@ -0,0 +1,296 @@
+# CIDX Server Auto-Update Documentation
+
+Story #734: Job-Aware Auto-Update with Graceful Drain Mode
+
+## Overview
+
+The CIDX Server includes an auto-update feature that deploys updates automatically when it detects changes on the master branch of the configured git repository. A job-aware graceful drain mode prevents orphaned jobs and data corruption during restarts.
+
+## Architecture
+
+### Component Overview
+
+```
+                   +--------------------+
+                   | AutoUpdateService  |
+                   |   (Polling Loop)   |
+                   +---------+----------+
+                             |
+              +--------------+--------------+
+              |              |              |
+    +---------v----+  +------v-------+  +---v--------------+
+    |ChangeDetector|  |DeploymentLock|  |DeploymentExecutor|
+    +--------------+  +--------------+  +------+-----------+
+                                               |
+                           +-------------------+-------------------+
+                           |                   |                   |
+                    +------v------+     +------v------+     +------v------+
+                    | git pull    |     | pip install |     | Maintenance |
+                    |             |     |             |     | Mode Flow   |
+                    +-------------+     +-------------+     +------+------+
+                                                                   |
+                                              +--------------------+--------------------+
+                                              |                    |                    |
+                                       +------v------+      +------v------+      +------v------+
+                                       |Enter Maint. |      |Wait for     |      |systemctl    |
+                                       |Mode (API)   |      |Drain        |      |restart      |
+                                       +-------------+      +-------------+      +-------------+
+```
+
+### State Machine
+
+The AutoUpdateService uses a state machine to manage deployment:
+
+1. **IDLE** - Waiting for next check interval
+2. **CHECKING** - Polling git remote for changes
+3. **DEPLOYING** - Running git pull and pip install
+4. **RESTARTING** - Executing graceful restart with drain
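
The four states above could be modeled as an enum with an explicit transition table. This is a minimal sketch, not the code in this PR: the names `UpdateState`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical, and the edges between states are an assumption, since the document lists the states but not which transitions are legal.

```python
from enum import Enum


class UpdateState(Enum):
    IDLE = "idle"              # waiting for next check interval
    CHECKING = "checking"      # polling git remote for changes
    DEPLOYING = "deploying"    # running git pull and pip install
    RESTARTING = "restarting"  # graceful restart with drain


# Assumed edges: every state may fall back to IDLE, and the
# forward path is IDLE -> CHECKING -> DEPLOYING -> RESTARTING.
ALLOWED_TRANSITIONS = {
    UpdateState.IDLE: {UpdateState.CHECKING},
    UpdateState.CHECKING: {UpdateState.IDLE, UpdateState.DEPLOYING},
    UpdateState.DEPLOYING: {UpdateState.IDLE, UpdateState.RESTARTING},
    UpdateState.RESTARTING: {UpdateState.IDLE},
}


def transition(current: UpdateState, target: UpdateState) -> UpdateState:
    """Return the target state, or raise if the edge is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Rejecting illegal edges up front makes a skipped step, such as restarting without deploying, fail loudly instead of silently.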
+
+## Job-Aware Drain Process
+
+### Problem Solved
+
+Previous versions performed blind `systemctl restart` without checking for running jobs, causing:
+- Orphan jobs left in "running" state indefinitely
+- Potential data corruption from interrupted indexing operations
+- Poor user experience when long-running jobs are killed
+
+### Solution: Graceful Drain Mode
+
+The auto-update process now uses a three-step maintenance mode flow:
+
+1. **Enter Maintenance Mode** - Server stops accepting new jobs
+2. **Wait for Drain** - Poll until all running jobs complete (with timeout)
+3. **Execute Restart** - Restart the server after jobs complete
+
+### Drain Flow Diagram
+
+```
+Auto-Update Triggered
+        |
+        v
++-------------------+
+| Enter Maintenance |---(POST /api/admin/maintenance/enter)
+| Mode              |
++---------+---------+
+          |
+          v
++-------------------+
+| Wait for Drain    |---(GET /api/admin/maintenance/drain-status)
+| (poll every 10s)  |
++---------+---------+
+          |
+    +-----+-----+
+    |           |
+    v           v
+ Drained    Timeout (300s)
+    |           |
+    |           +---> Log WARNING with job details
+    |           |
+    +-----+-----+
+          |
+          v
++-------------------+
+| systemctl restart |
+| cidx-server       |
++-------------------+
+```
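
The wait-for-drain step in the diagram amounts to a poll loop with a deadline. The sketch below illustrates that loop under stated assumptions: `wait_for_drain` and its parameters are hypothetical names, and `fetch_status` stands in for whatever performs the `GET /api/admin/maintenance/drain-status` call and returns the parsed JSON. Only the 10 s interval, the 300 s timeout, and the `drained` field come from this document.

```python
import time
from typing import Any, Callable, Dict


def wait_for_drain(
    fetch_status: Callable[[], Dict[str, Any]],
    timeout: float = 300.0,
    poll_interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the server reports drained or the timeout elapses.

    Returns True if drained, False if the deadline passed first.
    The caller restarts the server in both cases, logging a warning
    when the result is False.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status.get("drained"):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

Injecting `sleep` and `clock` keeps the loop testable without waiting in real time; a monotonic clock avoids surprises if the wall clock is adjusted during the drain.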
+
+## Configuration
+
+### DeploymentExecutor Parameters
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `server_url` | `http://localhost:8000` | CIDX server URL for maintenance API |
+| `drain_timeout` | `300` (5 min) | Maximum seconds to wait for drain |
+| `drain_poll_interval` | `10` | Seconds between drain status checks |
+
+### Environment Variables (systemd)
+
+Configure in `/etc/systemd/system/cidx-auto-update.service`:
+
+```ini
+[Unit]
+Description=CIDX Auto-Update Service
+After=network.target cidx-server.service
+
+[Service]
+Type=simple
+User=cidx
+WorkingDirectory=/opt/cidx
+ExecStart=/usr/bin/python3 -m code_indexer.server.auto_update.cli
+Restart=always
+RestartSec=30
+Environment=CIDX_REPO_PATH=/opt/cidx
+Environment=CIDX_CHECK_INTERVAL=300
+
+[Install]
+WantedBy=multi-user.target
+```
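
For illustration, the two variables set in the unit file could be read on the Python side roughly as follows. How the actual CLI consumes them is not shown in this diff, so `AutoUpdateSettings` and `load_settings` are hypothetical, and falling back to the unit file's values when a variable is unset is an assumption.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class AutoUpdateSettings:
    repo_path: Path
    check_interval: int  # seconds between change checks


def load_settings(env: Mapping[str, str] = os.environ) -> AutoUpdateSettings:
    """Read CIDX_REPO_PATH and CIDX_CHECK_INTERVAL from the environment."""
    # Assumed fallbacks: the same values the unit file above sets.
    repo_path = Path(env.get("CIDX_REPO_PATH", "/opt/cidx"))
    check_interval = int(env.get("CIDX_CHECK_INTERVAL", "300"))
    if check_interval <= 0:
        raise ValueError("CIDX_CHECK_INTERVAL must be a positive integer")
    return AutoUpdateSettings(repo_path=repo_path, check_interval=check_interval)
```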
+
+## API Endpoints
+
+### Maintenance Mode Endpoints
+
+All endpoints are under `/api/admin/maintenance/`.
+
+**Authentication Required**: All maintenance endpoints require admin authentication. Include a valid Bearer token in the Authorization header:
+
+```bash
+curl -X POST http://localhost:8000/api/admin/maintenance/enter \
+  -H "Authorization: Bearer YOUR_ADMIN_TOKEN"
+```
+
+Without valid admin authentication, all endpoints return HTTP 401 Unauthorized.
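
The same authenticated call can be prepared from Python with the standard library. This is a sketch only: `maintenance_request` is a hypothetical helper, and the mapping of `enter`/`exit` to POST and the other endpoints to GET follows the endpoint list in this document. It builds the request without sending it; pass the result to `urllib.request.urlopen` to execute it.

```python
from urllib.request import Request


def maintenance_request(
    action: str,
    token: str,
    server_url: str = "http://localhost:8000",
) -> Request:
    """Build an authenticated request for /api/admin/maintenance/<action>."""
    # enter/exit change state and use POST; status/drain-status use GET.
    method = "POST" if action in ("enter", "exit") else "GET"
    url = f"{server_url.rstrip('/')}/api/admin/maintenance/{action}"
    return Request(url, method=method, headers={"Authorization": f"Bearer {token}"})
```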
+
+#### POST /enter
+
+Enter maintenance mode. Stops accepting new jobs while allowing running jobs to complete.
+
+**Response (200 OK)**:
+```json
+{
+  "maintenance_mode": true,
+  "running_jobs": 3,
+  "queued_jobs": 5,
+  "entered_at": "2025-01-17T10:00:00Z",
+  "message": "Maintenance mode active. 3 running, 5 queued."
+}
+```
+
+#### POST /exit
+
+Exit maintenance mode. Resumes accepting new jobs.
+
+**Response (200 OK)**:
+```json
+{
+  "maintenance_mode": false,
+  "message": "Maintenance mode deactivated."
+}
+```
+
+#### GET /status
+
+Get current maintenance mode status.
+
+**Response (200 OK)**:
+```json
+{
+  "maintenance_mode": true,
+  "drained": false,
+  "running_jobs": 2,
+  "queued_jobs": 0,
+  "entered_at": "2025-01-17T10:00:00Z"
+}
+```
+
+#### GET /drain-status
+
+Get detailed drain status for auto-update coordination.
+
+**Response (200 OK)**:
+```json
+{
+  "drained": false,
+  "running_jobs": 2,
+  "queued_jobs": 0,
+  "estimated_drain_seconds": 120,
+  "jobs": [
+    {
+      "job_id": "abc-123",
+      "operation_type": "add_golden_repo",
+      "started_at": "2025-01-17T10:00:00Z",
+      "progress": 50
+    }
+  ]
+}
+```
+
+## Health Endpoint Integration
+
+The `/health` endpoint includes maintenance mode status:
+
+```json
+{
+  "status": "degraded",
+  "maintenance_mode": true,
+  "uptime": 3600,
+  ...
+}
+```
+
+During maintenance mode, the health status is **degraded** (not unhealthy) to indicate:
+- Server is operational for queries
+- New jobs are rejected (returns 503)
+- System is draining for planned restart
+
+## Job Rejection During Maintenance
+
+When the server is in maintenance mode, job submission endpoints return HTTP 503:
+
+```json
+{
+  "error": "Server is in maintenance mode. New jobs are not accepted. Please retry after 60 seconds."
+}
+```
+
+Affected operations:
+- Repository add/sync via `GoldenRepoManager`
+- Background jobs via `BackgroundJobManager`
+- Sync jobs via `SyncJobManager`
+
+Query operations continue to work normally.
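
A client that submits jobs can treat the 503 as a signal to back off and retry. The sketch below assumes a hypothetical `submit` callable that returns an `(http_status, body)` pair; the 60-second wait mirrors the hint in the error message above, and the attempt limit is an arbitrary illustrative choice.

```python
import time
from typing import Any, Callable, Tuple


def submit_with_retry(
    submit: Callable[[], Tuple[int, Any]],
    retry_after: float = 60.0,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Call submit(), retrying while the server answers 503.

    Returns the last (status, body) pair, which is still a 503
    response if every attempt was rejected.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = submit()
        if status != 503:
            return status, body
        if attempt < max_attempts:
            sleep(retry_after)  # wait before the next attempt
    return status, body
```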
+
+## Force Restart Logging
+
+When the drain timeout is exceeded, the system logs every running job at WARNING level before forcing the restart:
+
+```
+WARNING: Forcing restart - running job: job_id=abc-123, operation_type=add_golden_repo, started_at=2025-01-17T10:00:00Z, progress=50%
+WARNING: Drain timeout exceeded, forcing restart
+```
+
+This provides visibility into which jobs were interrupted for post-restart recovery.
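
As an illustration, the warning lines shown above can be derived from a `/drain-status` payload. The function name `forced_restart_messages` and the logger name are hypothetical; the field names and message wording are taken from the examples in this document.

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger("cidx.auto_update")  # hypothetical logger name


def forced_restart_messages(drain_status: Dict[str, Any]) -> List[str]:
    """Build one message per running job, then the final timeout message."""
    messages = [
        "Forcing restart - running job: "
        f"job_id={job['job_id']}, operation_type={job['operation_type']}, "
        f"started_at={job['started_at']}, progress={job['progress']}%"
        for job in drain_status.get("jobs", [])
    ]
    messages.append("Drain timeout exceeded, forcing restart")
    return messages


def log_forced_restart(drain_status: Dict[str, Any]) -> None:
    """Emit the messages at WARNING level."""
    for message in forced_restart_messages(drain_status):
        logger.warning(message)
```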
+
+## Startup Behavior
+
+On server startup, the maintenance state is automatically cleared:
+
+1. Maintenance mode is NOT persisted to disk (in-memory only)
+2. Server starts in normal operation mode
+3. Log message confirms: "Server started in normal operation mode"
+
+This ensures the server recovers cleanly from crashes or forced restarts.
+
+## Troubleshooting
+
+### Server stuck in maintenance mode
+
+If the server is stuck in maintenance mode:
+
+1. Check current status: `curl -H "Authorization: Bearer YOUR_ADMIN_TOKEN" http://localhost:8000/api/admin/maintenance/status`
+2. Manually exit: `curl -X POST -H "Authorization: Bearer YOUR_ADMIN_TOKEN" http://localhost:8000/api/admin/maintenance/exit`
+3. Or restart the server (maintenance state is cleared on restart)
+
+### Auto-update not working
+
+1. Check auto-update service status: `systemctl status cidx-auto-update`
+2. Check logs: `journalctl -u cidx-auto-update -f`
+3. Verify git remote access: `cd /opt/cidx && git fetch origin master`
+
+### Jobs orphaned after restart
+
+If jobs were interrupted during a forced restart:
+
+1. Jobs are automatically marked as CANCELLED on next server startup
+2. Check job status via API: `GET /api/jobs/{job_id}/status`
+3. Re-submit the cancelled jobs as needed
+
+## Best Practices
+
+1. **Set appropriate drain timeout** - Consider your longest-running jobs when configuring `drain_timeout`
+2. **Monitor during updates** - Watch logs during auto-update for any forced restarts
+3. **Schedule updates during low usage** - If possible, configure check intervals to align with low-traffic periods
+4. **Test recovery procedures** - Periodically verify job recovery after interruptions
@@ -81,6 +81,8 @@
 from .routers.indexing import router as indexing_router
 from .routers.cache import router as cache_router
 from .routers.delegation_callbacks import router as delegation_callbacks_router
+from .routers.maintenance_router import router as maintenance_router
+from .services.maintenance_service import get_maintenance_state
 from .routers.groups import (
     router as groups_router,
     users_router,
@@ -2843,6 +2845,7 @@ async def health_check(
                     "failed_jobs": failed_jobs,
                 },
                 "started_at": get_server_start_time(),
+                "maintenance_mode": get_maintenance_state().is_maintenance_mode(),
             }
 
             # Add version if available
@@ -7557,6 +7560,7 @@ async def get_repository_info(
     app.include_router(users_router)
     app.include_router(audit_router)
     app.include_router(delegation_callbacks_router)
+    app.include_router(maintenance_router)
 
     # Mount Web Admin UI routes and static files
     from fastapi.staticfiles import StaticFiles