CIDX Server Auto-Update Documentation

Story #734: Job-Aware Auto-Update with Graceful Drain Mode

Overview

The CIDX Server includes an auto-update feature that deploys updates automatically when changes are detected on the master branch of the configured Git repository. It uses a job-aware graceful drain mode to prevent orphaned jobs and data corruption during restarts.

Architecture

Component Overview

                      +-------------------+
                      | AutoUpdateService |
                      |  (Polling Loop)   |
                      +---------+---------+
                                |
            +-------------------+---------------------+
            |                   |                     |
    +-------v--------+  +-------v--------+  +---------v----------+
    | ChangeDetector |  | DeploymentLock |  | DeploymentExecutor |
    +----------------+  +----------------+  +---------+----------+
                                                      |
                                  +-------------------+-------------------+
                                  |                   |                   |
                           +------v------+     +------v------+     +------v------+
                           | git pull    |     | pip install |     | Maintenance |
                           |             |     |             |     | Mode Flow   |
                           +-------------+     +-------------+     +------+------+
                                                                          |
                                                     +--------------------+--------------------+
                                                     |                    |                    |
                                              +------v------+      +------v------+      +------v------+
                                              |Enter Maint. |      |Wait for     |      |systemctl    |
                                              |Mode (API)   |      |Drain        |      |restart      |
                                              +-------------+      +-------------+      +-------------+

State Machine

The AutoUpdateService uses a state machine to manage deployment:

IDLE - Waiting for next check interval
CHECKING - Polling git remote for changes
DEPLOYING - Running git pull and pip install
RESTARTING - Executing graceful restart with drain
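
The four states above can be modelled as a small enum with an explicit transition table. This is a minimal sketch, not the service's actual code: the names `UpdateState`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical, and the edges between states are an assumption, since only the states themselves are listed here.

```python
from enum import Enum


class UpdateState(Enum):
    IDLE = "idle"              # waiting for next check interval
    CHECKING = "checking"      # polling git remote for changes
    DEPLOYING = "deploying"    # running git pull and pip install
    RESTARTING = "restarting"  # graceful restart with drain


# Assumed edges: the documentation names the states but not the transitions.
ALLOWED_TRANSITIONS = {
    UpdateState.IDLE: {UpdateState.CHECKING},
    UpdateState.CHECKING: {UpdateState.IDLE, UpdateState.DEPLOYING},
    UpdateState.DEPLOYING: {UpdateState.IDLE, UpdateState.RESTARTING},
    UpdateState.RESTARTING: {UpdateState.IDLE},
}


def transition(current: UpdateState, target: UpdateState) -> UpdateState:
    """Return the target state if the move is allowed, else raise ValueError."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Keeping the table explicit means an unexpected jump, such as IDLE straight to RESTARTING, fails loudly instead of silently restarting the server.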

Job-Aware Drain Process

Problem Solved

Previous versions performed a blind systemctl restart without checking for running jobs, which caused:

Orphaned jobs left in "running" state indefinitely
Potential data corruption from interrupted indexing operations
Poor user experience when long-running jobs are killed

Solution: Graceful Drain Mode

The auto-update process now uses a three-step maintenance mode flow:

Enter Maintenance Mode - Server stops accepting new jobs
Wait for Drain - Poll until all running jobs complete (with timeout)
Execute Restart - Restart the server after jobs complete

Drain Flow Diagram

Auto-Update Triggered
        |
        v
+-------------------+
| Enter Maintenance |---(POST /api/admin/maintenance/enter)
| Mode              |
+---------+---------+
          |
          v
+-------------------+
| Wait for Drain    |---(GET /api/admin/maintenance/drain-status)
| (poll every 10s)  |
+---------+---------+
          |
    +-----+-----+
    |           |
    v           v
 Drained    Timeout (300s)
    |           |
    |           +---> Log WARNING with job details
    |           |
    +-----+-----+
          |
          v
+-------------------+
| systemctl restart |
| cidx-server       |
+-------------------+
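
The flow in the diagram can be sketched as a polling loop with a deadline. This is an illustration under stated assumptions rather than the DeploymentExecutor's real implementation: `wait_for_drain` and `graceful_restart` are hypothetical names, and the HTTP calls and the systemctl invocation are passed in as callables so the logic can be exercised without a running server.

```python
import time
from typing import Callable


def wait_for_drain(
    fetch_status: Callable[[], dict],
    timeout: float = 300.0,
    poll_interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll the drain status until drained; return False if the timeout hits."""
    deadline = clock() + timeout
    while True:
        # fetch_status stands in for GET /api/admin/maintenance/drain-status
        if fetch_status().get("drained", False):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)


def graceful_restart(
    enter: Callable[[], None],
    fetch_status: Callable[[], dict],
    restart: Callable[[], None],
    **drain_kwargs,
) -> bool:
    """Enter maintenance, wait for drain, then restart in either case."""
    enter()  # stands in for POST /api/admin/maintenance/enter
    drained = wait_for_drain(fetch_status, **drain_kwargs)
    restart()  # stands in for `systemctl restart cidx-server`
    return drained
```

The restart runs on both branches, matching the diagram: a timeout only changes what is logged beforehand, not whether the restart happens.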

Configuration

DeploymentExecutor Parameters

Parameter	Default	Description
`server_url`	`http://localhost:8000`	CIDX server URL for maintenance API
`drain_timeout`	`300` (5 min)	Maximum seconds to wait for drain
`drain_poll_interval`	`10`	Seconds between drain status checks
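
As a rough illustration of how these defaults fit together, the parameters can be grouped in a small dataclass. The class name `DrainConfig` and the helper property are hypothetical; only the three parameter names and their defaults come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrainConfig:
    server_url: str = "http://localhost:8000"
    drain_timeout: int = 300        # seconds
    drain_poll_interval: int = 10   # seconds

    def __post_init__(self) -> None:
        if self.drain_timeout <= 0 or self.drain_poll_interval <= 0:
            raise ValueError("drain_timeout and drain_poll_interval must be positive")

    @property
    def intervals_before_timeout(self) -> int:
        """Number of whole poll intervals that fit inside the drain timeout."""
        return self.drain_timeout // self.drain_poll_interval
```

With the defaults, `DrainConfig().intervals_before_timeout` is 30, so a job that needs longer than about five minutes to finish would be interrupted unless `drain_timeout` is raised.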

Environment Variables (systemd)

Configure in /etc/systemd/system/cidx-auto-update.service:

[Unit]
Description=CIDX Auto-Update Service
After=network.target cidx-server.service

[Service]
Type=simple
User=cidx
WorkingDirectory=/opt/cidx
ExecStart=/usr/bin/python3 -m code_indexer.server.auto_update.cli
Restart=always
RestartSec=30
Environment=CIDX_REPO_PATH=/opt/cidx
Environment=CIDX_CHECK_INTERVAL=300

[Install]
WantedBy=multi-user.target
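
A sketch of how a process started by this unit might read the two variables. The function `load_settings` is hypothetical, and the fallback values simply mirror the unit file above; whether the real CLI applies defaults or validates the interval is an assumption.

```python
import os
from typing import Mapping, Tuple


def load_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, int]:
    """Return (repo_path, check_interval_seconds) from the environment."""
    # Fallbacks mirror the unit file; they are assumptions, not documented defaults.
    repo_path = env.get("CIDX_REPO_PATH", "/opt/cidx")
    raw_interval = env.get("CIDX_CHECK_INTERVAL", "300")
    try:
        interval = int(raw_interval)
    except ValueError:
        raise ValueError(
            f"CIDX_CHECK_INTERVAL must be an integer, got {raw_interval!r}"
        ) from None
    if interval <= 0:
        raise ValueError("CIDX_CHECK_INTERVAL must be positive")
    return repo_path, interval
```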

API Endpoints

Maintenance Mode Endpoints

All endpoints are under /api/admin/maintenance/.

Authentication Required: All maintenance endpoints require admin authentication. Include a valid Bearer token in the Authorization header:

curl -X POST http://localhost:8000/api/admin/maintenance/enter \
  -H "Authorization: Bearer YOUR_ADMIN_TOKEN"

Without valid admin authentication, all endpoints return HTTP 401 Unauthorized.
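
The same authenticated call can be prepared from Python with the standard library. This is a minimal sketch: `maintenance_request` is a hypothetical helper, and the token value is a placeholder, as in the curl example.

```python
import urllib.request

# Base path taken from this section; host and port match the documented default.
BASE_URL = "http://localhost:8000/api/admin/maintenance"


def maintenance_request(
    action: str, token: str, method: str = "GET", base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) an authenticated maintenance API request."""
    return urllib.request.Request(
        f"{base_url}/{action}",
        method=method,
        headers={"Authorization": f"Bearer {token}"},
    )
```

Sending it with `urllib.request.urlopen(req, timeout=10)` raises `urllib.error.HTTPError` on a 401 response, so callers can tell an authentication failure apart from a connection problem.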

POST /enter

Enter maintenance mode. Stops accepting new jobs while allowing running jobs to complete.

Response (200 OK):

{
  "maintenance_mode": true,
  "running_jobs": 3,
  "queued_jobs": 5,
  "entered_at": "2025-01-17T10:00:00Z",
  "message": "Maintenance mode active. 3 running, 5 queued."
}

POST /exit

Exit maintenance mode. Resumes accepting new jobs.

Response (200 OK):

{
  "maintenance_mode": false,
  "message": "Maintenance mode deactivated."
}

GET /status

Get current maintenance mode status.

Response (200 OK):

{
  "maintenance_mode": true,
  "drained": false,
  "running_jobs": 2,
  "queued_jobs": 0,
  "entered_at": "2025-01-17T10:00:00Z"
}

GET /drain-status

Get detailed drain status for auto-update coordination.

Response (200 OK):

{
  "drained": false,
  "running_jobs": 2,
  "queued_jobs": 0,
  "estimated_drain_seconds": 120,
  "jobs": [
    {
      "job_id": "abc-123",
      "operation_type": "add_golden_repo",
      "started_at": "2025-01-17T10:00:00Z",
      "progress": 50
    }
  ]
}
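
A client that consumes this response might parse the job entries into typed records. The sketch below assumes the field names shown in the example payload; `RunningJob` and `parse_jobs` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class RunningJob:
    job_id: str
    operation_type: str
    started_at: datetime
    progress: int


def parse_jobs(payload: dict) -> List[RunningJob]:
    """Convert the "jobs" array of a drain-status payload into RunningJob records."""
    jobs = []
    for item in payload.get("jobs", []):
        # Older Python versions do not accept a trailing "Z" in fromisoformat,
        # so it is rewritten as an explicit UTC offset first.
        started = datetime.fromisoformat(item["started_at"].replace("Z", "+00:00"))
        jobs.append(
            RunningJob(item["job_id"], item["operation_type"], started, item["progress"])
        )
    return jobs
```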

Health Endpoint Integration

The /health endpoint includes maintenance mode status:

{
  "status": "degraded",
  "maintenance_mode": true,
  "uptime": 3600,
  ...
}

During maintenance mode, the health status is reported as degraded (not unhealthy) to indicate that:

Server is operational for queries
New jobs are rejected with HTTP 503
System is draining for planned restart

Job Rejection During Maintenance

When the server is in maintenance mode, job submission endpoints return HTTP 503:

{
  "error": "Server is in maintenance mode. New jobs are not accepted. Please retry after 60 seconds."
}

Affected operations:

Repository add/sync via GoldenRepoManager
Background jobs via BackgroundJobManager
Sync jobs via SyncJobManager

Query operations continue to work normally.
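
On the client side, a 503 during maintenance can be handled with a simple retry loop. This is a sketch under assumptions: `submit_with_retry` is a hypothetical helper, `submit` is any callable returning a `(status_code, body)` pair, and the 60-second delay is taken from the error message above.

```python
import time
from typing import Callable, Tuple


def submit_with_retry(
    submit: Callable[[], Tuple[int, dict]],
    max_attempts: int = 5,
    retry_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, dict]:
    """Retry a job submission while the server answers 503 (maintenance mode)."""
    for attempt in range(1, max_attempts + 1):
        status, body = submit()
        # Stop on any non-503 answer, or when the attempts are used up.
        if status != 503 or attempt == max_attempts:
            return status, body
        sleep(retry_delay)
    raise ValueError("max_attempts must be at least 1")
```

Because the server restarts once the drain completes, a submission retried after the delay normally reaches the updated server in normal operation mode.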

Force Restart Logging

When the drain timeout is exceeded, the system logs all running jobs at WARNING level before forcing the restart:

WARNING: Forcing restart - running job: job_id=abc-123, operation_type=add_golden_repo, started_at=2025-01-17T10:00:00Z, progress=50%
WARNING: Drain timeout exceeded, forcing restart

This provides visibility into which jobs were interrupted for post-restart recovery.
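
Log lines of this shape could be produced from the `jobs` array of the drain-status response, as in the following sketch. The logger name `auto_update` and the function `log_forced_restart` are hypothetical; the `WARNING:` prefix in the sample output would come from the configured log formatter.

```python
import logging
from typing import Iterable

logger = logging.getLogger("auto_update")  # hypothetical logger name


def log_forced_restart(jobs: Iterable[dict], log: logging.Logger = logger) -> None:
    """Log each still-running job, then the forced-restart notice."""
    for job in jobs:
        log.warning(
            "Forcing restart - running job: job_id=%s, operation_type=%s, "
            "started_at=%s, progress=%s%%",
            job["job_id"],
            job["operation_type"],
            job["started_at"],
            job["progress"],
        )
    log.warning("Drain timeout exceeded, forcing restart")
```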

Startup Behavior

On server startup, the maintenance state is automatically cleared:

Maintenance mode is NOT persisted to disk (in-memory only)
Server starts in normal operation mode
Log message confirms: "Server started in normal operation mode"

This ensures the server recovers cleanly from crashes or forced restarts.
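
The in-memory behaviour can be pictured as a small state holder that is simply recreated on every process start. This is a sketch, not the server's code: the class `MaintenanceState` and its methods are hypothetical.

```python
import threading
from datetime import datetime, timezone
from typing import Optional


class MaintenanceState:
    """Maintenance flag held in memory only; a new process starts inactive."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active = False
        self._entered_at: Optional[datetime] = None

    def enter(self) -> None:
        with self._lock:
            if not self._active:
                self._active = True
                self._entered_at = datetime.now(timezone.utc)

    def exit(self) -> None:
        with self._lock:
            self._active = False
            self._entered_at = None

    @property
    def active(self) -> bool:
        with self._lock:
            return self._active

    @property
    def entered_at(self) -> Optional[datetime]:
        with self._lock:
            return self._entered_at
```

Since nothing is written to disk, a crash or forced restart cannot leave the server locked in maintenance mode.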

Troubleshooting

Server stuck in maintenance mode

If the server is stuck in maintenance mode:

Check current status: curl -H "Authorization: Bearer YOUR_ADMIN_TOKEN" http://localhost:8000/api/admin/maintenance/status
Manually exit: curl -X POST -H "Authorization: Bearer YOUR_ADMIN_TOKEN" http://localhost:8000/api/admin/maintenance/exit
Or restart the server (maintenance state is cleared on restart)

Auto-update not working

Check auto-update service status: systemctl status cidx-auto-update
Check logs: journalctl -u cidx-auto-update -f
Verify git remote access: cd /opt/cidx && git fetch origin master

Jobs orphaned after restart

If jobs were interrupted during a forced restart:

Jobs are automatically marked as CANCELLED on next server startup
Check job status via API: GET /api/jobs/{job_id}/status
Re-submit the cancelled jobs as needed

Best Practices

Set appropriate drain timeout - Consider your longest-running jobs when configuring drain_timeout
Monitor during updates - Watch logs during auto-update for any forced restarts
Schedule updates during low usage - If possible, configure check intervals to align with low-traffic periods
Test recovery procedures - Periodically verify job recovery after interruptions
