What's Implemented (Layer 4):
- ✅ ProcessStateManager with PID tracking and service registry
- ✅ Database health monitoring (Level DB lock detection, Qdrant availability)
- ✅ Enhanced Transcript Monitor (per-session monitoring)
- ✅ StatusLine Health Monitor (session discovery and display)
- ✅ VKB server PSM integration
- ✅ Fail-fast database initialization
What's Missing (Layers 1-3):
- ❌ Layer 1: Watchdog (global service monitoring and recovery)
- ❌ Layer 2: Coordinator (active multi-project coordination)
- ❌ Layer 3: Verifier (comprehensive health verification)
Purpose: Individual session monitoring at the lowest level
Components:
- Enhanced Transcript Monitor (
scripts/enhanced-transcript-monitor.js) - ProcessStateManager (
scripts/process-state-manager.js) - Database health checks
Responsibilities:
- Monitor individual Claude Code session transcripts
- Track process PIDs and health status
- Detect database locks and conflicts
- Provide real-time health metrics
- Register/unregister services (global, per-project, per-session)
Data Flow:
Session Activity → Transcript Monitor → ProcessStateManager → Health Status
Database Operations → Lock Detection → Health Report
Purpose: Comprehensive health verification and proactive issue detection
Components to Build:
scripts/health-verifier.js- Main verification enginescripts/health-rules-engine.js- Configurable health rulesscripts/health-reporter.js- Health report generation
Responsibilities:
-
Periodic Health Verification
- Run comprehensive health checks every N seconds (configurable)
- Verify Layer 4 data integrity and consistency
- Cross-validate ProcessStateManager registry with actual processes
-
Proactive Issue Detection
- Detect stale PIDs (processes no longer running)
- Identify resource exhaustion (disk space, memory, CPU)
- Find orphaned processes not in PSM registry
- Detect zombie transcript monitors
-
Health Rule Evaluation
- Configurable rules for what constitutes "healthy"
- Thresholds for warnings vs errors
- Service-specific health criteria (e.g., VKB must be accessible, databases must be unlocked)
-
Health Reporting
- Generate structured health reports (JSON format)
- Categorize issues by severity (info, warning, error, critical)
- Provide actionable remediation steps
- Store health history for trend analysis
Key Health Checks:
// Health verification checklist
const healthChecks = {
// Database health
levelDB: {
check: () => checkDatabaseLock(),
severity: 'critical',
remediation: 'Stop VKB server or kill lock holder'
},
// Process health
registeredServices: {
check: () => verifyAllRegisteredProcessesAlive(),
severity: 'warning',
remediation: 'Clean up dead processes from registry'
},
// Resource health
diskSpace: {
check: () => checkDiskSpace('.data'),
threshold: '90%',
severity: 'error',
remediation: 'Clear old logs or expand storage'
},
// Service health
vkbServer: {
check: () => testHTTPConnectivity('http://localhost:8080'),
severity: 'warning',
remediation: 'Restart VKB server'
}
};Data Flow:
Layer 4 Data → Health Verifier → Rules Engine → Health Report
↓
Issue Detection
↓
Layer 2 (Coordinator)
Output Format:
{
"timestamp": "2025-11-08T10:30:00Z",
"overallStatus": "degraded",
"issues": [
{
"type": "database_lock",
"severity": "critical",
"component": "levelDB",
"message": "Level DB locked by unregistered process (PID: 12345)",
"remediation": "kill 12345 && ukb restart",
"affectedServices": ["ukb", "graph-database"]
}
],
"healthMetrics": {
"services": {
"total": 5,
"healthy": 3,
"unhealthy": 2
},
"databases": {
"levelDB": { "status": "locked", "lockedBy": 12345 },
"qdrant": { "status": "unavailable" }
}
}
}Purpose: Active coordination and remediation across multiple projects
Components to Build:
scripts/health-coordinator.js- Main coordination enginescripts/remediation-actions.js- Automated remediation libraryscripts/coordination-strategies.js- Multi-project coordination
Responsibilities:
-
Consume Health Reports from Layer 3
- Subscribe to health verification results
- Prioritize issues by severity
- Track issue resolution progress
-
Automated Remediation
- Execute remediation actions for known issues
- Clean up stale PIDs from registry
- Restart failed services automatically
- Resolve database lock conflicts
- Free orphaned resources
-
Multi-Project Coordination
- Coordinate database access across projects
- Prevent VKB/UKB conflicts (enforce mutual exclusion)
- Manage shared resource allocation
- Balance health monitoring load across projects
-
Recovery Orchestration
- Implement recovery workflows for common failure scenarios
- Coordinate service restarts in dependency order
- Validate successful recovery
- Escalate to Layer 1 if recovery fails
Remediation Actions:
const remediationActions = {
// Database conflicts
'database_lock': async (issue) => {
const pid = issue.details.lockedBy;
const service = await psm.getServiceByPid(pid);
if (service?.name === 'vkb-server') {
// Graceful shutdown
await vkb.stop();
} else {
// Force kill unregistered process
process.kill(pid, 'SIGTERM');
}
// Wait and verify lock released
await waitForLockRelease();
return { success: true, action: 'killed_lock_holder' };
},
// Dead process cleanup
'stale_pid': async (issue) => {
await psm.cleanupDeadProcesses();
return { success: true, action: 'cleaned_registry' };
},
// Service restart
'service_down': async (issue) => {
const serviceName = issue.details.service;
await restartService(serviceName);
await verifyServiceHealthy(serviceName);
return { success: true, action: 'restarted_service' };
}
};Coordination Strategies:
// VKB/UKB mutual exclusion
class DatabaseAccessCoordinator {
async requestDatabaseAccess(requester, operation) {
// Check if VKB server is running
if (await psm.isServiceRunning('vkb-server', 'global')) {
if (operation === 'write') {
// UKB needs write access - stop VKB first
await vkb.stop();
await waitForLockRelease();
return { granted: true, stoppedVKB: true };
} else {
// Read-only operations can coexist
return { granted: true, stoppedVKB: false };
}
}
return { granted: true, stoppedVKB: false };
}
async releaseDatabaseAccess(requester, options) {
if (options.stoppedVKB && options.autoRestart) {
// Restart VKB if we stopped it
await vkb.start({ foreground: false });
}
}
}Data Flow:
Layer 3 Health Report → Coordinator → Remediation Actions
↓
Recovery Workflows
↓
Verify Success (back to Layer 3)
↓
Escalate to Layer 1 if failed
Purpose: Ultimate failsafe monitoring - "Who watches the watchmen?"
Existing Component:
scripts/system-monitor-watchdog.js(EXISTS but not integrated with launchd)
What's Missing:
- launchd/cron integration for automatic startup
- Self-recovery mechanisms
- Alert/notification system
- Integration with Layer 2 Coordinator
Enhanced Responsibilities:
-
Monitor Layer 2 Coordinator
- Detect if Coordinator process crashes or hangs
- Restart Coordinator automatically
- Track restart count and prevent infinite restart loops
-
System-Level Health Checks
- Verify entire monitoring stack is operational
- Detect system-wide issues (out of disk, out of memory)
- Monitor system resources (CPU, RAM, disk I/O)
-
Escalation and Alerting
- Send notifications when automatic recovery fails
- Create incident reports for manual intervention
- Provide system administrator dashboard
-
Cannot Be Killed
- Run as system service (launchd on macOS)
- Restart automatically on system reboot
- Protected from user process termination
launchd Integration:
<!-- ~/Library/LaunchAgents/com.coding.system-watchdog.plist -->
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.coding.system-watchdog</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/bin/node</string>
<string>/Users/q284340/Agentic/coding/scripts/system-monitor-watchdog.js</string>
<string>--daemon</string>
</array>
<key>StartInterval</key>
<integer>60</integer> <!-- Run every 60 seconds -->
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<dict>
<key>SuccessfulExit</key>
<false/>
</dict>
<key>StandardOutPath</key>
<string>/Users/q284340/Agentic/coding/.logs/watchdog-stdout.log</string>
<key>StandardErrorPath</key>
<string>/Users/q284340/Agentic/coding/.logs/watchdog-stderr.log</string>
</dict>
</plist>Installation Commands:
# Install watchdog
launchctl load ~/Library/LaunchAgents/com.coding.system-watchdog.plist
# Check status
launchctl list | grep com.coding.system-watchdog
# Uninstall
launchctl unload ~/Library/LaunchAgents/com.coding.system-watchdog.plistEnhanced Watchdog Logic:
class SystemMonitorWatchdog {
async run() {
// 1. Check Layer 2 Coordinator health
const coordinatorHealthy = await this.checkCoordinatorHealth();
if (!coordinatorHealthy) {
this.warn('Layer 2 Coordinator is unhealthy or missing');
// Attempt restart (with backoff)
const restarted = await this.restartCoordinator();
if (!restarted) {
this.error('Failed to restart Coordinator after 3 attempts');
await this.sendAlert('CRITICAL: Coordinator restart failed');
return;
}
}
// 2. Check system resources
const resources = await this.checkSystemResources();
if (resources.diskUsage > 95) {
this.error('Disk usage critical: ' + resources.diskUsage + '%');
await this.sendAlert('CRITICAL: Disk space low');
}
// 3. Verify entire stack health
const stackHealth = await this.verifyStackHealth();
if (stackHealth.status !== 'healthy') {
this.warn('Health monitoring stack degraded: ' + JSON.stringify(stackHealth));
}
// 4. Log success
this.log('Watchdog check completed - system healthy');
}
async sendAlert(message) {
// TODO: Implement notification system
// - Email
// - Slack webhook
// - macOS notification center
console.error(`ALERT: ${message}`);
}
}Data Flow:
System Timer (every 60s) → Watchdog
↓
Check Layer 2 Coordinator
↓
Restart if dead
↓
Check system resources
↓
Send alerts if critical
Why First: Provides comprehensive health visibility that Layer 2 needs
Tasks:
- Create
scripts/health-verifier.js - Implement health check registry with configurable rules
- Add comprehensive database health checks
- Generate structured health reports (JSON)
- Integrate with ProcessStateManager
- Test health verification accuracy
Estimated Effort: 2-3 sessions
Why Second: Enables automated remediation based on Layer 3 reports
Tasks:
- Create
scripts/health-coordinator.js - Implement remediation action library
- Add database access coordination (VKB/UKB mutual exclusion)
- Create recovery workflows for common failures
- Add escalation logic to Layer 1
- Test automated remediation
Estimated Effort: 3-4 sessions
Why Last: Only needed when Layers 2-4 are fully operational
Tasks:
- Complete
scripts/system-monitor-watchdog.js - Create launchd plist configuration
- Add alert/notification system
- Implement restart backoff logic
- Add system resource monitoring
- Test launchd integration
Estimated Effort: 2-3 sessions
Layer 3 Success Criteria:
- ✅ Detects all database lock conflicts within 5 seconds
- ✅ Identifies stale PIDs with 100% accuracy
- ✅ Generates actionable remediation steps
- ✅ Health reports are structured and machine-parseable
Layer 2 Success Criteria:
- ✅ Automatically resolves 80%+ of database conflicts
- ✅ Prevents VKB/UKB conflicts through coordination
- ✅ Restarts failed services within 10 seconds
- ✅ Escalates unresolvable issues to Layer 1
Layer 1 Success Criteria:
- ✅ Detects Coordinator failures within 60 seconds
- ✅ Restarts Coordinator automatically with 95%+ success rate
- ✅ Sends alerts for critical failures
- ✅ Survives system reboots and resumes monitoring
With Full 4-Layer Implementation:
- Self-Healing: System automatically recovers from most failures
- No Silent Failures: Every issue is detected and reported
- Actionable Errors: Clear instructions for manual intervention when needed
- High Availability: Multiple layers of redundancy prevent downtime
- Observability: Complete visibility into system health at all times
The Stack:
Layer 1: Watchdog (System-level monitoring)
↓
Layer 2: Coordinator (Automated remediation)
↓
Layer 3: Verifier (Health verification)
↓
Layer 4: Monitor (Process/database monitoring)
↓
Actual Services (VKB, UKB, GraphDB, etc.)
START WITH LAYER 3 - The Verifier is the missing link that will:
- Give visibility into what's actually broken
- Provide structured data for Layer 2 to act on
- Replace manual debugging with automated health checks
- Prevent the "Level DB unavailable" silent failures you experienced
Once Layer 3 is operational, you'll have clear, actionable health reports that make implementing Layer 2 (automated remediation) straightforward.