The Robust Startup System implements retry-with-timeout and graceful degradation patterns to prevent endless loops and ensure coding infrastructure starts reliably even when some services fail.
Before the robust startup system, coding/bin/coding had several critical reliability problems:
- Endless Loops: When VKB server failed to start, the health monitor would wait indefinitely without retry limits
- No Timeout Protection: Services could hang during startup without any timeout mechanism
- All-or-Nothing: If any service failed, the entire startup would block or fail ambiguously
- Port Checks Only: Health verification only checked if ports were listening, not if services were actually functional
- No Degraded Mode: Optional services that failed would block Claude startup unnecessarily
Users experienced:
- Waiting indefinitely for coding to start when VKB server had issues
- Claude sessions blocked by non-critical service failures
- No clear visibility into which services failed and why
- Unable to use coding tools while waiting for optional services
A reusable module, `lib/service-starter.js`, provides:
```javascript
import { startServiceWithRetry } from '../lib/service-starter.js';

const result = await startServiceWithRetry(
  'VKB Server',
  async () => startVKBServer(),  // Start function
  async () => checkHealth(),     // Health check function
  {
    required: false,             // Optional service - degrade gracefully
    maxRetries: 3,               // Try 3 times then give up
    timeout: 30000,              // 30 second timeout per attempt
    exponentialBackoff: true     // 2s, 4s, 8s delays
  }
);
```

Features:
- Configurable retry limits (default: 3 attempts)
- Timeout protection per attempt (default: 30 seconds)
- Exponential backoff between retries (2s → 4s → 8s)
- Actual health verification (HTTP endpoints, not just ports)
- Service classification (required vs optional)
- Graceful degradation for optional services
A service orchestrator, `scripts/start-services-robust.js`, that:
- Starts services in logical order (required first, then optional)
- Applies retry logic independently per service
- Kills unhealthy processes before retry
- Reports clear status for each service
- Exits with proper codes (0 = success, 1 = critical failure)
An entry point, `start-services.sh`, that:
- Defaults to robust mode (`ROBUST_MODE=true`)
- Allows fallback to legacy mode if needed
- Provides clear user feedback
- Transcript Monitor - Essential for LSL system
- Live Logging Coordinator - Essential for session tracking
If these fail after all retries → Claude startup blocked with clear error
- VKB Server - Knowledge visualization (port 8080)
- Constraint Monitor - Live guardrails system (Docker mode aware)
- Semantic Analysis - MCP semantic analysis server
- Memgraph - Code Graph RAG AST analysis (Docker mode aware)
If these fail after all retries → Continue in DEGRADED mode with warning
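The required/optional rule above can be sketched as a single decision: a failed required service throws (blocking startup), a failed optional service yields a degraded result instead. The helper name `resolveFailure` is illustrative, not the module's actual API:

```javascript
// Sketch of the required-vs-optional classification rule.
function resolveFailure(service) {
  if (service.required) {
    // Required service exhausted all retries: block startup
    throw new Error(`CRITICAL: ${service.name} failed - blocking startup`);
  }
  // Optional service: record degraded status and continue
  return { name: service.name, status: 'degraded' };
}
```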
When CODING_DOCKER_MODE=true (set by both launch-claude.sh and launch-copilot.sh), start-services-robust.js automatically skips launching standalone containers that are already provided by the coding-services Docker container:
- Constraint Monitor: Skips `docker-compose up -d` for standalone Redis + Qdrant containers (they run inside `coding-services` as `coding-redis` and `coding-qdrant`)
- Memgraph: Skips `docker-compose up -d` for standalone Memgraph container (runs inside `coding-services`)
This prevents duplicate containers, port conflicts (both bind 6379, 6333), and ~540MB wasted RAM. Health checks in Docker mode verify the coding-services health endpoint or TCP port instead of counting standalone containers.
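The skip decision itself is simple; a minimal sketch (the helper name and service list here are illustrative, not the orchestrator's real API):

```javascript
// In Docker mode, coding-services already provides these backends,
// so no standalone containers should be launched.
function standaloneContainersToStart(env) {
  const standalone = ['redis', 'qdrant', 'memgraph'];
  return env.CODING_DOCKER_MODE === 'true' ? [] : standalone;
}
```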
Figure: Unified Docker mode flow shared by both Claude and CoPilot launchers
Before starting any services, the robust startup system automatically cleans up dangling processes from crashed or abnormally terminated sessions. This ensures a clean slate and prevents startup failures due to zombie processes.
Figure: Complete service startup flow showing pre-startup cleanup, PSM integration, and session cleanup
Problem: When VSCode or Claude Code crashes:
- Transcript monitor processes remain running
- Live logging coordinator processes continue orphaned
- Process State Manager (PSM) contains stale entries
- New startup attempts fail due to competing processes
Solution: Automatic pre-startup cleanup runs before every service startup.
```javascript
async function cleanupDanglingProcesses() {
  // 1. Kill all orphaned transcript monitors
  await execAsync('pkill -f "enhanced-transcript-monitor.js"');

  // 2. Kill all orphaned live-logging coordinators
  await execAsync('pkill -f "live-logging-coordinator.js"');

  // 3. Clean up stale PSM entries
  await psm.cleanupStaleServices();
}
```

Normal session shutdowns are recorded in `.data/session-shutdowns.json`:
```json
{
  "claude-12345-1699364825": {
    "timestamp": "2025-11-17T12:34:56.789Z",
    "type": "graceful",
    "pid": 12345
  }
}
```

Benefits:
- Detects abnormal terminations (missing graceful shutdown record)
- Enables crash analytics
- Provides session continuity information
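Abnormal-termination detection falls out of the record format: a session with no graceful-shutdown entry is treated as crashed. A sketch, with `detectTermination` as a hypothetical helper name:

```javascript
// A graceful entry means the session shut down normally; a missing
// (or non-graceful) entry means the session crashed or was killed.
function detectTermination(shutdownRecords, sessionId) {
  const record = shutdownRecords[sessionId];
  if (record && record.type === 'graceful') return 'graceful';
  return 'abnormal';
}
```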
```
═══════════════════════════════════════════════════════════════════════
🚀 STARTING CODING SERVICES (ROBUST MODE)
═══════════════════════════════════════════════════════════════════════
🧹 Pre-startup cleanup: Checking for dangling processes...
Found 21 dangling transcript monitor process(es)
Terminating dangling transcript monitors...
✅ Cleaned up transcript monitors
Found 25 dangling live-logging coordinator process(es)
Terminating dangling live-logging coordinators...
✅ Cleaned up live-logging coordinators
Cleaning up stale Process State Manager entries...
✅ PSM cleanup complete
✅ Pre-startup cleanup complete - system ready for fresh start
📋 Starting REQUIRED services (Live Logging System)...
```
File: scripts/start-services-robust.js
The cleanup function runs at the start of startAllServices():
```javascript
async function startAllServices() {
  console.log('🚀 STARTING CODING SERVICES (ROBUST MODE)');

  // Clean up any dangling processes from crashed sessions
  await cleanupDanglingProcesses();

  // Then proceed with normal service startup...
}
```

Related Files:
- `scripts/psm-session-cleanup.js` - Records graceful shutdowns
- `scripts/process-state-manager.js` - PSM with `cleanupStaleServices()` method
- `.data/session-shutdowns.json` - Graceful shutdown tracking
For manual cleanup between sessions or when automatic cleanup isn't sufficient:
```bash
# Preview what would be cleaned
./bin/cleanup-orphans --dry-run

# Clean up orphaned processes manually
./bin/cleanup-orphans
```

The cleanup-orphans utility provides targeted cleanup of:
- Transcript monitors without valid project paths
- Stuck ukb/vkb operations
- Orphaned qdrant-sync processes
- Old shell snapshot processes
See: Process Management Analysis for detailed documentation
For each service attempt (1 to `maxRetries`):
1. Start service with timeout protection
2. Wait 2 seconds for initialization
3. Run health check with timeout
4. If healthy → SUCCESS
5. If unhealthy:
- Kill the unhealthy process
- Wait with exponential backoff (2^attempt seconds)
- Retry
If all retries exhausted:
- Required service → THROW ERROR (blocks startup)
- Optional service → RETURN DEGRADED STATUS (continue)
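The steps above can be condensed into a single loop. This is an illustrative sketch, not the actual `lib/service-starter.js` implementation: the kill-unhealthy-process step is omitted for brevity, and `backoffBaseMs` is a hypothetical knob added so the schedule can be shortened in tests:

```javascript
// Condensed retry-with-backoff loop for one service.
async function startWithRetry(name, startFn, healthCheckFn,
    { required = true, maxRetries = 3, backoffBaseMs = 2000 } = {}) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const service = await startFn();
      if (await healthCheckFn(service)) {
        return { name, status: 'healthy', attempt }; // SUCCESS
      }
    } catch (err) {
      // startup error or timeout: fall through to retry
    }
    if (attempt < maxRetries) {
      // exponential backoff: base, 2*base, 4*base, ...
      await new Promise(r => setTimeout(r, backoffBaseMs * 2 ** (attempt - 1)));
    }
  }
  if (required) throw new Error(`${name} failed after ${maxRetries} attempts`);
  return { name, status: 'degraded' };
}
```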
| Attempt | Delay Before Retry |
|---|---|
| 1 | 0s (immediate) |
| 2 | 2s |
| 3 | 4s |
| 4 | 8s |
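The schedule in the table reduces to a formula: no delay before attempt 1, then doubling from 2 seconds:

```javascript
// Delay applied before the given attempt number: 0s, 2s, 4s, 8s, ...
function backoffDelayMs(attempt) {
  return attempt <= 1 ? 0 : 2000 * 2 ** (attempt - 2);
}
```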
Each service startup attempt has two timeouts:
- Startup Timeout: Process must start within timeout (default: 30s)
- Health Check Timeout: Health verification must complete within 10s
If either timeout expires → Attempt fails, retry or degrade
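Both timeouts can be enforced with the same pattern: race the operation against a timer. `withTimeout` is an illustrative helper, not the module's documented API:

```javascript
// Reject if the promise does not settle within ms milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timeout after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the pending timer
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```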
```bash
# PROBLEM: Port open ≠ server working
if lsof -i :8080 > /dev/null; then
  echo "✅ VKB running"  # FALSE POSITIVE!
fi
```

```javascript
// SOLUTION: Actual HTTP health endpoint verification
const healthCheck = createHttpHealthCheck(8080, '/health');
const healthy = await healthCheck(); // true only if HTTP 200 OK
```

VKB Server Health Endpoint (`/health`):
```json
{
  "status": "healthy",
  "timestamp": 1729764523.456,
  "server": {
    "port": 8080,
    "pid": 12345,
    "uptime": 45.2
  }
}
```

For background processes without HTTP endpoints:
```javascript
const healthCheck = createPidHealthCheck();
const serviceInfo = await startService();
const healthy = await healthCheck(serviceInfo); // Checks if PID is running
```

`./start-services.sh` output example:
```
🚀 Starting Coding Services (Robust Mode)...
✨ Using robust startup mode with retry logic and graceful degradation
═══════════════════════════════════════════════════════════════════════
🚀 STARTING CODING SERVICES (ROBUST MODE)
═══════════════════════════════════════════════════════════════════════
📋 Starting REQUIRED services (Live Logging System)...
[ServiceStarter] ✅ Transcript Monitor started successfully on attempt 1/3
[ServiceStarter] ✅ Live Logging Coordinator started successfully on attempt 1/3
🔵 Starting OPTIONAL services (graceful degradation enabled)...
[ServiceStarter] ✅ VKB Server started successfully on attempt 1/3
[ServiceStarter] ⚠️ Constraint Monitor failed after 2 attempts - continuing in DEGRADED mode
═══════════════════════════════════════════════════════════════════════
📊 SERVICES STATUS SUMMARY
═══════════════════════════════════════════════════════════════════════
✅ Successfully started: 3 services
   - Transcript Monitor
   - Live Logging Coordinator
   - VKB Server
⚠️ Degraded (optional failed): 1 services
   - Constraint Monitor: Docker not running - required for Constraint Monitor
🎉 Startup complete in DEGRADED mode!
ℹ️ Some optional services are unavailable:
   - Constraint Monitor will not be available this session
═══════════════════════════════════════════════════════════════════════
```
```bash
ROBUST_MODE=false ./start-services.sh
```

Use only for debugging or if robust mode has issues.
Before Robust System:
- Would wait indefinitely or block startup
- User couldn't use coding tools
- No clear error message
With Robust System:
```
[ServiceStarter] 📍 Attempt 1/3 for VKB Server...
[ServiceStarter] ❌ VKB Server attempt 1/3 failed: Startup timeout
[ServiceStarter] Waiting 2000ms before retry...
[ServiceStarter] 📍 Attempt 2/3 for VKB Server...
[ServiceStarter] ❌ VKB Server attempt 2/3 failed: Health check failed
[ServiceStarter] Waiting 4000ms before retry...
[ServiceStarter] 📍 Attempt 3/3 for VKB Server...
[ServiceStarter] ❌ VKB Server attempt 3/3 failed: Port not listening
[ServiceStarter] ⚠️ VKB Server failed after 3 attempts - continuing in DEGRADED mode
🎉 Startup complete in DEGRADED mode!
ℹ️ VKB Server will not be available this session
```
Result: Claude starts successfully without VKB visualization
```
[ServiceStarter] 📍 Attempt 1/3 for Transcript Monitor...
[ServiceStarter] ❌ Transcript Monitor attempt 1/3 failed
[ServiceStarter] 📍 Attempt 2/3 for Transcript Monitor...
[ServiceStarter] ❌ Transcript Monitor attempt 2/3 failed
[ServiceStarter] 📍 Attempt 3/3 for Transcript Monitor...
[ServiceStarter] ❌ Transcript Monitor attempt 3/3 failed
💥 CRITICAL: Transcript Monitor failed after 3 attempts - BLOCKING startup
═══════════════════════════════════════════════════════════════════════
❌ Failed (required): 1 services
   - Transcript Monitor: Process failed to start
💥 CRITICAL: Required services failed - BLOCKING startup
═══════════════════════════════════════════════════════════════════════
```
Result: Claude startup blocked with clear error message
```
✅ Successfully started: 4 services
   - Transcript Monitor
   - Live Logging Coordinator
   - VKB Server
   - Constraint Monitor
🎉 Startup complete in FULL mode!
```
Result: All features available
Edit scripts/start-services-robust.js:
```javascript
const SERVICE_CONFIGS = {
  vkbServer: {
    name: 'VKB Server',
    required: false,   // Optional - degrade gracefully
    maxRetries: 3,     // Try 3 times
    timeout: 30000,    // 30 second timeout per attempt
    startFn: async () => { /* ... */ },
    healthCheckFn: createHttpHealthCheck(8080, '/health')
  }
};
```

Environment variables:
- `ROBUST_MODE=true` - Enable robust startup (default)
- `ROBUST_MODE=false` - Use legacy startup mode
- `VKB_DATA_SOURCE=combined` - Data source for VKB server
- `CODING_DOCKER_MODE=true` - Skip standalone containers already in coding-services
- Clear retry limits prevent indefinite waiting
- Timeouts protect against hanging services
- Optional services don't block Claude startup
- Clear communication about what's available/unavailable
- Fast startup even when some services fail
- Clear status reporting
- Can use coding tools immediately
- Exponential backoff prevents overwhelming failing services
- Health verification ensures services actually work
- Process cleanup prevents zombie processes
- Detailed logs for each retry attempt
- Clear failure reasons
- Distinct exit codes
| Code | Meaning | Example |
|---|---|---|
| 0 | Success - all services started (FULL or DEGRADED mode) | VKB failed but optional |
| 1 | Critical failure - required service failed | Transcript Monitor failed |
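The table's rule is a one-line predicate over the per-service results: only a failed required service produces exit code 1. A sketch (helper name and result shape are illustrative):

```javascript
// Exit 1 only when a required service ended up failed; optional
// failures still exit 0 (DEGRADED mode counts as success).
function startupExitCode(results) {
  return results.some(r => r.required && r.status === 'failed') ? 1 : 0;
}
```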
```bash
# Check VKB logs
tail -f /tmp/vkb-server.log

# Test VKB health endpoint
curl http://localhost:8080/health

# Check if port is blocked
lsof -i :8080
```

```bash
# Check Docker status
docker info

# Containers live inside the coding-services stack
docker ps --filter "name=coding"

# Container logs
docker compose -f docker/docker-compose.yml logs coding-services
```

- Check retry limits: Increase `maxRetries` if needed
- Check timeout: Increase `timeout` for slow-starting services
- Check health endpoint: Verify `/health` returns 200 OK
- Check dependencies: Ensure Docker, Node.js, Python are available
The launcher automatically manages Docker Desktop availability via scripts/ensure-docker.sh, eliminating the need for users to manually start Docker before running coding.
When the launcher detects that Docker is not running:
- Check Docker client — Verifies the `docker` CLI is installed
- Check daemon responsiveness — Runs `docker ps` with a 5-second timeout
- Start Docker Desktop — Launches via `open -F -a "Docker"` and waits for the process to appear
- Wait for daemon — Polls every second for up to 45 seconds with progress updates every 10 seconds
Docker Desktop can enter a "process running but daemon unresponsive" state (common after failed updates). The launcher detects and auto-recovers:
- Graceful quit — `osascript -e 'quit app "Docker"'` (2s wait)
- Force kill — `killall` for Docker Desktop, com.docker.backend, com.docker.vmnetd (3s wait)
- Verify gone — Loop up to 5s checking processes
- Final force kill — `pkill -9` for stubborn processes (2s wait)
- Relaunch — `open -F -a "Docker"` and wait for daemon readiness
| Phase | Timeout | Purpose |
|---|---|---|
| Daemon check | 5s | Quick responsiveness test |
| Initial wait | 45s (configurable via `DOCKER_TIMEOUT`) | Fresh launch startup |
| Smart elapsed | Remaining from 45s | Accounts for time already spent |
| Restart recovery | +30s | Additional time after auto-restart |
| Minimum fallback | 10s | Always gives at least 10s more |
If Docker still isn't ready after all timeouts, the launcher continues with a warning — it does not block startup.
On Linux, the launcher uses systemctl start docker if systemd is available. If not, it displays the appropriate manual command.
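The "smart elapsed" and "minimum fallback" rows of the timeout table reduce to simple arithmetic. A sketch (illustrative; the real logic lives in `scripts/ensure-docker.sh`, in shell):

```javascript
// Budget left from the 45s window, but never less than the 10s minimum.
function remainingWaitSeconds(elapsed, total = 45, minimum = 10) {
  return Math.max(total - elapsed, minimum);
}
```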
The launcher ensures reliable operation across all network environments via scripts/detect-network.sh, which is sourced during agent-common-setup.sh initialization.
The system is tested and works in all combinations:
| Environment | CN Detection | Proxy Handling | Behavior |
|---|---|---|---|
| Corporate + proxy | SSH/HTTPS probe | Auto-configured | Full access via proxy |
| Corporate, no proxy | SSH/HTTPS probe | Warning issued | Degraded (no external access) |
| Public + proxy set | Skipped | Uses existing `HTTP_PROXY` | Full access |
| Public, no proxy | Skipped | None needed | Direct access |
Corporate Network Detection (3 layers with fallback):
- Environment override: `CODING_FORCE_CN=true|false` — Instant, bypasses probing
- SSH probe: Tests SSH access to corporate GitHub (5s timeout, case-insensitive response matching)
- HTTPS fallback: Tests HTTPS access to corporate GitHub (5s timeout)
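The three layers form a fallback chain. A sketch in JavaScript (the real implementation is shell that runs ssh/curl with 5s timeouts; here the probes are injected callbacks):

```javascript
// Layered CN detection: env override, then SSH probe, then HTTPS probe.
async function detectCorporateNetwork(env, sshProbe, httpsProbe) {
  if (env.CODING_FORCE_CN === 'true') return true;   // layer 1: override
  if (env.CODING_FORCE_CN === 'false') return false;
  if (await sshProbe()) return true;                 // layer 2: SSH probe
  return await httpsProbe();                         // layer 3: HTTPS fallback
}
```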
Proxy Auto-Configuration (conditional on CN detection):
- Check if `HTTP_PROXY` is already set and working → skip
- Test external access (google.de) → if it works, no proxy needed
- Probe `127.0.0.1:3128` for proxydetox service → auto-configure if found
- Verify proxy works after configuration → warn if still failing
All network probes use strict 5-second timeouts to prevent hangs on unreliable networks. The launcher never blocks indefinitely on network operations.
Network detection issues never block startup:
- CN detection failure → assumes public network, proceeds
- Proxy configuration failure → warns and proceeds in degraded mode
- External access unavailable → warns that Docker pulls/npm installs may fail
Comprehensive end-to-end tests validate all environment combinations via tests/integration/launcher-e2e.sh.
| # | CN | Proxy | Agent | Verified |
|---|---|---|---|---|
| 1 | Yes | Yes | Claude | Output assertions |
| 2 | Yes | Yes | CoPilot | Output assertions |
| 3 | Yes | No | Claude | Warning assertions |
| 4 | Yes | No | CoPilot | Warning assertions |
| 5 | No | Yes | Claude | Output assertions |
| 6 | No | Yes | CoPilot | Output assertions |
| 7 | No | No | Claude | Output assertions |
| 8 | No | No | CoPilot | Output assertions |
- Agent flag equivalence (`--copi` = `--copilot`)
- `--claude` flag behavior
- Invalid agent rejection
- `--help` output
- `--verbose` logging
- Docker auto-start logic
- Dry-run markers
- `CODING_FORCE_CN` override (true and false)
- Uses `--dry-run` flag to skip blocking operations (Docker wait, service start, agent launch)
- Uses `CODING_FORCE_CN` environment variable to mock CN detection without network access
- Output assertion functions: `assert_output_contains`, `assert_output_not_contains`, `assert_exit_code`
- Color-coded results with pass/fail/skip counters
```bash
# Run full test suite
./tests/integration/launcher-e2e.sh

# Verbose mode (shows output from passing tests too)
./tests/integration/launcher-e2e.sh --verbose
```

- Add parallel service startup (currently sequential)
- Add service dependency management (start A before B)
- Add automatic service recovery/restart on crash
- Add metrics collection for startup times
- Add circuit breaker pattern for repeatedly failing services
- Add health monitoring dashboard
- `lib/service-starter.js` - Core retry logic module
- `scripts/start-services-robust.js` - Service orchestrator (Docker mode aware)
- `start-services.sh` - Entry point script
- `scripts/ensure-docker.sh` - Docker auto-start and recovery
- `scripts/detect-network.sh` - Corporate network and proxy detection
- `scripts/agent-common-setup.sh` - Shared agent initialization
- `scripts/launch-claude.sh` - Claude launcher (Docker mode detection & startup, tmux wrapping)
- `scripts/launch-copilot.sh` - CoPilot launcher (Docker mode detection & startup, tmux wrapping)
- `scripts/tmux-session-wrapper.sh` - Shared tmux wrapper for unified status bar across all agents
- `tests/integration/launcher-e2e.sh` - E2E test suite
- `memory-visualizer/api-server.py` - VKB health endpoint implementation

