4-Layer Health Monitoring Architecture - Implementation Plan

Current State Assessment

What's Implemented (Layer 4):

✅ ProcessStateManager with PID tracking and service registry
✅ Database health monitoring (Level DB lock detection, Qdrant availability)
✅ Enhanced Transcript Monitor (per-session monitoring)
✅ StatusLine Health Monitor (session discovery and display)
✅ VKB server PSM integration
✅ Fail-fast database initialization

What's Missing (Layers 1-3):

⚠️ Layer 1: Watchdog (global service monitoring and recovery; script exists but is not yet integrated)
❌ Layer 2: Coordinator (active multi-project coordination)
❌ Layer 3: Verifier (comprehensive health verification)

Layer 4: Monitor (IMPLEMENTED ✅)

Purpose: Individual session monitoring at the lowest level

Components:

Enhanced Transcript Monitor (scripts/enhanced-transcript-monitor.js)
ProcessStateManager (scripts/process-state-manager.js)
Database health checks

Responsibilities:

Monitor individual Claude Code session transcripts
Track process PIDs and health status
Detect database locks and conflicts
Provide real-time health metrics
Register/unregister services (global, per-project, per-session)

Data Flow:

Session Activity → Transcript Monitor → ProcessStateManager → Health Status
Database Operations → Lock Detection → Health Report
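
The register/unregister responsibility can be pictured as a registry keyed by scope. The sketch below is a hypothetical in-memory stand-in and not the actual ProcessStateManager API: the class name `ServiceRegistry`, its method names, and the key format are all assumptions made for illustration.

```javascript
// Hypothetical in-memory stand-in for a scoped service registry.
// This is NOT the real ProcessStateManager API.
const SCOPES = ['global', 'project', 'session'];

class ServiceRegistry {
  constructor() {
    this.services = new Map(); // key -> { name, pid, scope, scopeId }
  }

  // Build a unique key so the same service name can exist in several scopes.
  static key(name, scope, scopeId = '') {
    if (!SCOPES.includes(scope)) throw new Error(`Unknown scope: ${scope}`);
    return `${scope}:${scopeId}:${name}`;
  }

  register(name, pid, scope, scopeId) {
    const key = ServiceRegistry.key(name, scope, scopeId);
    this.services.set(key, { name, pid, scope, scopeId: scopeId ?? null });
    return key;
  }

  // Returns true if an entry was removed, false if nothing matched.
  unregister(name, scope, scopeId) {
    return this.services.delete(ServiceRegistry.key(name, scope, scopeId));
  }

  // List every entry, or only the entries in one scope.
  list(scope) {
    return [...this.services.values()].filter((s) => !scope || s.scope === scope);
  }
}
```

Keying on scope plus scope id lets one transcript monitor per session coexist with a single global VKB server entry.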

Layer 3: Verifier (PLANNED ❌)

Purpose: Comprehensive health verification and proactive issue detection

Components to Build:

scripts/health-verifier.js - Main verification engine
scripts/health-rules-engine.js - Configurable health rules
scripts/health-reporter.js - Health report generation

Responsibilities:

Periodic Health Verification
- Run comprehensive health checks every N seconds (configurable)
- Verify Layer 4 data integrity and consistency
- Cross-validate ProcessStateManager registry with actual processes
Proactive Issue Detection
- Detect stale PIDs (processes no longer running)
- Identify resource exhaustion (disk space, memory, CPU)
- Find orphaned processes not in PSM registry
- Detect zombie transcript monitors
Health Rule Evaluation
- Configurable rules for what constitutes "healthy"
- Thresholds for warnings vs errors
- Service-specific health criteria (e.g., VKB must be accessible, databases must be unlocked)
Health Reporting
- Generate structured health reports (JSON format)
- Categorize issues by severity (info, warning, error, critical)
- Provide actionable remediation steps
- Store health history for trend analysis
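
Stale-PID detection, as listed above, could be sketched with Node's `process.kill(pid, 0)`, which checks whether a process exists without delivering a signal. The registry shape (`{ name, pid }` entries) and the function names are assumptions, not the ProcessStateManager's actual interface.

```javascript
// Signal 0 performs error checking only: no signal is delivered.
function isProcessAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    // ESRCH (or any other error): treat the process as not running.
    return err.code === 'EPERM';
  }
}

// `registry` is a hypothetical array of { name, pid } entries.
// `isAlive` is injectable so the logic can be tested without real processes.
function findStaleServices(registry, isAlive = isProcessAlive) {
  return registry.filter((service) => !isAlive(service.pid));
}
```

A liveness probe alone cannot tell whether a PID has been reused by an unrelated process, so a real verifier would probably also compare the start time or command line.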

Key Health Checks:

// Health verification checklist
const healthChecks = {
  // Database health
  levelDB: {
    check: () => checkDatabaseLock(),
    severity: 'critical',
    remediation: 'Stop VKB server or kill lock holder'
  },

  // Process health
  registeredServices: {
    check: () => verifyAllRegisteredProcessesAlive(),
    severity: 'warning',
    remediation: 'Clean up dead processes from registry'
  },

  // Resource health
  diskSpace: {
    check: () => checkDiskSpace('.data'),
    threshold: '90%',
    severity: 'error',
    remediation: 'Clear old logs or expand storage'
  },

  // Service health
  vkbServer: {
    check: () => testHTTPConnectivity('http://localhost:8080'),
    severity: 'warning',
    remediation: 'Restart VKB server'
  }
};
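
A runner for a checklist of this shape might look like the sketch below. It assumes each `check` resolves to a truthy value when healthy and resolves falsy or throws when unhealthy. The checklist above does not state that contract, and `runHealthChecks` is a hypothetical name.

```javascript
const SEVERITIES = ['info', 'warning', 'error', 'critical'];

// Runs every check in parallel and returns the failing ones as issues.
// Assumed contract: truthy result = healthy; falsy result or throw = unhealthy.
async function runHealthChecks(checks) {
  const names = Object.keys(checks);
  // The async wrapper turns synchronous throws into rejections.
  const outcomes = await Promise.allSettled(
    names.map(async (name) => checks[name].check())
  );

  const issues = [];
  outcomes.forEach((outcome, i) => {
    const name = names[i];
    const { severity, remediation } = checks[name];
    const healthy = outcome.status === 'fulfilled' && Boolean(outcome.value);
    if (healthy) return;
    issues.push({
      component: name,
      severity,
      remediation,
      message: outcome.status === 'rejected'
        ? `Check threw: ${outcome.reason?.message ?? outcome.reason}`
        : 'Check reported unhealthy',
    });
  });

  // Most severe issues first.
  issues.sort((a, b) => SEVERITIES.indexOf(b.severity) - SEVERITIES.indexOf(a.severity));
  return issues;
}
```

Using `Promise.allSettled` keeps one failing or throwing check from hiding the results of the others.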

Data Flow:

Layer 4 Data → Health Verifier → Rules Engine → Health Report
                                             ↓
                                    Issue Detection
                                             ↓
                               Layer 2 (Coordinator)

Output Format:

{
  "timestamp": "2025-11-08T10:30:00Z",
  "overallStatus": "degraded",
  "issues": [
    {
      "type": "database_lock",
      "severity": "critical",
      "component": "levelDB",
      "message": "Level DB locked by unregistered process (PID: 12345)",
      "details": { "lockedBy": 12345 },
      "remediation": "kill 12345 && ukb restart",
      "affectedServices": ["ukb", "graph-database"]
    }
  ],
  "healthMetrics": {
    "services": {
      "total": 5,
      "healthy": 3,
      "unhealthy": 2
    },
    "databases": {
      "levelDB": { "status": "locked", "lockedBy": 12345 },
      "qdrant": { "status": "unavailable" }
    }
  }
}
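
One way to assemble a report of this shape is sketched below. The rule for `overallStatus` is an assumption: only "degraded" appears in the sample, so the "healthy" and "unhealthy" values and the function names are illustrative.

```javascript
// Assumed rule for overallStatus; only "degraded" appears in the sample report,
// so "healthy" and "unhealthy" are illustrative status names.
function deriveOverallStatus(issues, services) {
  if (services.total > 0 && services.healthy === 0) return 'unhealthy';
  const hasProblem = issues.some((issue) => issue.severity !== 'info');
  return hasProblem || services.unhealthy > 0 ? 'degraded' : 'healthy';
}

function buildHealthReport(issues, healthMetrics, now = new Date()) {
  return {
    // Drop milliseconds to match the timestamp style of the sample report.
    timestamp: now.toISOString().replace(/\.\d{3}Z$/, 'Z'),
    overallStatus: deriveOverallStatus(issues, healthMetrics.services),
    issues,
    healthMetrics,
  };
}
```

With the sample's metrics (3 of 5 services healthy and one critical issue) this rule yields "degraded", which matches the report above.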

Layer 2: Coordinator (PLANNED ❌)

Purpose: Active coordination and remediation across multiple projects

Components to Build:

scripts/health-coordinator.js - Main coordination engine
scripts/remediation-actions.js - Automated remediation library
scripts/coordination-strategies.js - Multi-project coordination

Responsibilities:

Consume Health Reports from Layer 3
- Subscribe to health verification results
- Prioritize issues by severity
- Track issue resolution progress
Automated Remediation
- Execute remediation actions for known issues
- Clean up stale PIDs from registry
- Restart failed services automatically
- Resolve database lock conflicts
- Free orphaned resources
Multi-Project Coordination
- Coordinate database access across projects
- Prevent VKB/UKB conflicts (enforce mutual exclusion)
- Manage shared resource allocation
- Balance health monitoring load across projects
Recovery Orchestration
- Implement recovery workflows for common failure scenarios
- Coordinate service restarts in dependency order
- Validate successful recovery
- Escalate to Layer 1 if recovery fails
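
Restarting services in dependency order amounts to a topological sort. The sketch below is one possible approach. The `dependencies` map, the function name `restartOrder`, and the service relationships in the usage note are hypothetical.

```javascript
// `dependencies` maps a service to the services that must be running first.
// Returns the services ordered so that every dependency precedes its dependents.
function restartOrder(dependencies) {
  const order = [];
  const state = new Map(); // service -> 'visiting' | 'done'

  const visit = (service) => {
    if (state.get(service) === 'done') return;
    if (state.get(service) === 'visiting') {
      throw new Error(`Dependency cycle detected at ${service}`);
    }
    state.set(service, 'visiting');
    for (const dep of dependencies[service] ?? []) visit(dep);
    state.set(service, 'done');
    order.push(service);
  };

  Object.keys(dependencies).forEach((service) => visit(service));
  return order;
}
```

For example, if `vkb-server` is assumed to depend on `graph-database`, which in turn depends on `qdrant`, then `restartOrder({ 'vkb-server': ['graph-database'], 'graph-database': ['qdrant'], qdrant: [] })` returns `['qdrant', 'graph-database', 'vkb-server']`.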

Remediation Actions:

const remediationActions = {
  // Database conflicts
  'database_lock': async (issue) => {
    const pid = issue.details.lockedBy;
    const service = await psm.getServiceByPid(pid);
    let action;

    if (service?.name === 'vkb-server') {
      // Registered VKB server: shut it down gracefully
      await vkb.stop();
      action = 'stopped_vkb_server';
    } else {
      // Unregistered process: ask it to terminate (SIGTERM, not a forced SIGKILL)
      process.kill(pid, 'SIGTERM');
      action = 'killed_lock_holder';
    }

    // Wait and verify lock released
    await waitForLockRelease();
    return { success: true, action };
  },

  // Dead process cleanup
  'stale_pid': async (issue) => {
    await psm.cleanupDeadProcesses();
    return { success: true, action: 'cleaned_registry' };
  },

  // Service restart
  'service_down': async (issue) => {
    const serviceName = issue.details.service;
    await restartService(serviceName);
    await verifyServiceHealthy(serviceName);
    return { success: true, action: 'restarted_service' };
  }
};
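
A dispatcher for a table like `remediationActions` could look like the sketch below. The `remediate` function and the `escalate` callback (standing in for the hand-off to Layer 1) are hypothetical names, and the escalation reasons are illustrative strings.

```javascript
// `actions` has the same shape as remediationActions; `escalate` is a
// hypothetical callback standing in for the hand-off to Layer 1.
async function remediate(issue, actions, escalate) {
  const action = actions[issue.type];
  if (!action) {
    await escalate(issue, 'no_remediation_registered');
    return { success: false, action: null };
  }
  try {
    const result = await action(issue);
    if (!result.success) await escalate(issue, 'remediation_reported_failure');
    return result;
  } catch (err) {
    // A thrown error is treated as a failed remediation, not a crash.
    await escalate(issue, `remediation_threw: ${err.message}`);
    return { success: false, action: null, error: err.message };
  }
}
```

Catching errors here keeps one failed action from stopping the Coordinator's processing of the remaining issues.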

Coordination Strategies:

// VKB/UKB mutual exclusion
class DatabaseAccessCoordinator {
  async requestDatabaseAccess(requester, operation) {
    // Check if VKB server is running
    if (await psm.isServiceRunning('vkb-server', 'global')) {
      if (operation === 'write') {
        // UKB needs write access - stop VKB first
        await vkb.stop();
        await waitForLockRelease();
        return { granted: true, stoppedVKB: true };
      } else {
        // Read-only operations can coexist
        return { granted: true, stoppedVKB: false };
      }
    }

    return { granted: true, stoppedVKB: false };
  }

  async releaseDatabaseAccess(requester, options) {
    if (options.stoppedVKB && options.autoRestart) {
      // Restart VKB if we stopped it
      await vkb.start({ foreground: false });
    }
  }
}
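
A caller would typically pair the request and release calls so that the release runs even when the database work fails. The helper below is a hypothetical sketch of that pattern: `withDatabaseAccess` is not part of the plan, and it accepts any object exposing the two coordinator methods.

```javascript
// `coordinator` is any object with requestDatabaseAccess/releaseDatabaseAccess,
// such as a DatabaseAccessCoordinator instance. `work` is the caller's async task.
async function withDatabaseAccess(coordinator, requester, operation, work) {
  const grant = await coordinator.requestDatabaseAccess(requester, operation);
  if (!grant.granted) throw new Error(`Database access denied for ${requester}`);
  try {
    return await work();
  } finally {
    // Runs on success and on failure, so a stopped VKB server is restarted either way.
    await coordinator.releaseDatabaseAccess(requester, {
      stoppedVKB: grant.stoppedVKB,
      autoRestart: true,
    });
  }
}
```

Passing `grant.stoppedVKB` back into the release call is what lets the coordinator know whether it needs to restart VKB.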

Data Flow:

Layer 3 Health Report → Coordinator → Remediation Actions
                                   ↓
                          Recovery Workflows
                                   ↓
                    Verify Success (back to Layer 3)
                                   ↓
                 Escalate to Layer 1 if failed

Layer 1: Watchdog (PARTIALLY IMPLEMENTED ⚠️)

Purpose: Ultimate failsafe monitoring - "Who watches the watchmen?"

Existing Component:

scripts/system-monitor-watchdog.js (EXISTS but not integrated with launchd)

What's Missing:

launchd/cron integration for automatic startup
Self-recovery mechanisms
Alert/notification system
Integration with Layer 2 Coordinator

Enhanced Responsibilities:

Monitor Layer 2 Coordinator
- Detect if Coordinator process crashes or hangs
- Restart Coordinator automatically
- Track restart count and prevent infinite restart loops
System-Level Health Checks
- Verify entire monitoring stack is operational
- Detect system-wide issues (out of disk, out of memory)
- Monitor system resources (CPU, RAM, disk I/O)
Escalation and Alerting
- Send notifications when automatic recovery fails
- Create incident reports for manual intervention
- Provide system administrator dashboard
Resilient to Termination
- Run as a launchd-managed job (launchd on macOS)
- Start automatically at login (LaunchAgent) or at boot (LaunchDaemon)
- Relaunched by launchd if the process crashes or is killed
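
The system resource checks could be sketched with Node's built-in `os` and `fs` modules. This assumes Node.js 18.15 or later for `fs.promises.statfs`. The function names and the returned field names (`diskUsage`, `memoryUsage`, `loadAverage`) are assumptions, not the existing watchdog's interface.

```javascript
const os = require('node:os');
const fs = require('node:fs');

// Percentage of filesystem blocks in use, from a statfs-style result.
function diskUsagePercent({ blocks, bavail }) {
  if (blocks === 0) return 0;
  return Math.round(((blocks - bavail) / blocks) * 100);
}

// Percentage of physical memory in use.
function memoryUsagePercent(total = os.totalmem(), free = os.freemem()) {
  return Math.round(((total - free) / total) * 100);
}

// Assumes Node.js 18.15+ (fs.promises.statfs is not available in older releases).
async function checkSystemResources(path = '.') {
  const stats = await fs.promises.statfs(path);
  return {
    diskUsage: diskUsagePercent(stats),
    memoryUsage: memoryUsagePercent(),
    loadAverage: os.loadavg()[0], // 1-minute load average (always 0 on Windows)
  };
}
```

Using `bavail` rather than `bfree` measures the space available to unprivileged processes, which is usually the relevant figure for a user-level service.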

launchd Integration:

<?xml version="1.0" encoding="UTF-8"?>
<!-- ~/Library/LaunchAgents/com.coding.system-watchdog.plist -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.coding.system-watchdog</string>

  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/node</string>
    <string>/Users/q284340/Agentic/coding/scripts/system-monitor-watchdog.js</string>
    <string>--daemon</string>
  </array>

  <key>StartInterval</key>
  <integer>60</integer> <!-- Run every 60 seconds -->

  <key>RunAtLoad</key>
  <true/>

  <key>KeepAlive</key>
  <dict>
    <key>SuccessfulExit</key>
    <false/>
  </dict>

  <key>StandardOutPath</key>
  <string>/Users/q284340/Agentic/coding/.logs/watchdog-stdout.log</string>

  <key>StandardErrorPath</key>
  <string>/Users/q284340/Agentic/coding/.logs/watchdog-stderr.log</string>
</dict>
</plist>

Installation Commands:

# Install watchdog
launchctl load ~/Library/LaunchAgents/com.coding.system-watchdog.plist

# Check status
launchctl list | grep com.coding.system-watchdog

# Uninstall
launchctl unload ~/Library/LaunchAgents/com.coding.system-watchdog.plist

Enhanced Watchdog Logic:

class SystemMonitorWatchdog {
  async run() {
    // 1. Check Layer 2 Coordinator health
    const coordinatorHealthy = await this.checkCoordinatorHealth();

    if (!coordinatorHealthy) {
      this.warn('Layer 2 Coordinator is unhealthy or missing');

      // Attempt restart (with backoff)
      const restarted = await this.restartCoordinator();

      if (!restarted) {
        this.error('Failed to restart Coordinator after 3 attempts');
        await this.sendAlert('CRITICAL: Coordinator restart failed');
        return;
      }
    }

    // 2. Check system resources
    const resources = await this.checkSystemResources();

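    const diskCritical = resources.diskUsage > 95;
    if (diskCritical) {
      this.error('Disk usage critical: ' + resources.diskUsage + '%');
      await this.sendAlert('CRITICAL: Disk space low');
    }

    // 3. Verify entire stack health
    const stackHealth = await this.verifyStackHealth();
    const stackHealthy = stackHealth.status === 'healthy';

    if (!stackHealthy) {
      this.warn('Health monitoring stack degraded: ' + JSON.stringify(stackHealth));
    }

    // 4. Log the outcome; report healthy only when every check passed
    if (!diskCritical && stackHealthy) {
      this.log('Watchdog check completed - system healthy');
    } else {
      this.warn('Watchdog check completed - issues detected');
    }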
  }

  async sendAlert(message) {
    // TODO: Implement notification system
    // - Email
    // - Slack webhook
    // - macOS notification center
    console.error(`ALERT: ${message}`);
  }
}
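
The "attempt restart (with backoff)" step is not spelled out above. A minimal sketch, assuming exponential backoff and a limit of three attempts, might look like this. `restartWithBackoff`, `startFn`, and `isHealthyFn` are hypothetical names.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// `startFn` launches the Coordinator and `isHealthyFn` probes it; both are
// hypothetical callbacks. Resolves true on the first healthy start, false otherwise.
async function restartWithBackoff(
  startFn,
  isHealthyFn,
  { maxAttempts = 3, baseDelayMs = 1000 } = {}
) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await startFn();
      if (await isHealthyFn()) return true;
    } catch (err) {
      // A failed start counts as a failed attempt; fall through to the delay.
    }
    if (attempt < maxAttempts) {
      // Delay doubles after each failed attempt: baseDelayMs, 2x, 4x, ...
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
  return false;
}
```

Capping the attempts bounds the restart loop, so a Coordinator that can never start ends in an alert instead of an endless cycle.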

Data Flow:

System Timer (every 60s) → Watchdog
                              ↓
                  Check Layer 2 Coordinator
                              ↓
                       Restart if dead
                              ↓
                  Check system resources
                              ↓
                    Send alerts if critical

Implementation Priority

Phase 1: Layer 3 - Verifier (HIGHEST PRIORITY)

Why First: Provides comprehensive health visibility that Layer 2 needs

Tasks:

Create scripts/health-verifier.js
Implement health check registry with configurable rules
Add comprehensive database health checks
Generate structured health reports (JSON)
Integrate with ProcessStateManager
Test health verification accuracy

Estimated Effort: 2-3 sessions

Phase 2: Layer 2 - Coordinator (MEDIUM PRIORITY)

Why Second: Enables automated remediation based on Layer 3 reports

Tasks:

Create scripts/health-coordinator.js
Implement remediation action library
Add database access coordination (VKB/UKB mutual exclusion)
Create recovery workflows for common failures
Add escalation logic to Layer 1
Test automated remediation

Estimated Effort: 3-4 sessions

Phase 3: Layer 1 - Watchdog (LOWER PRIORITY)

Why Last: Only needed when Layers 2-4 are fully operational

Tasks:

Complete scripts/system-monitor-watchdog.js
Create launchd plist configuration
Add alert/notification system
Implement restart backoff logic
Add system resource monitoring
Test launchd integration

Estimated Effort: 2-3 sessions

Success Metrics

Layer 3 Success Criteria:

✅ Detects all database lock conflicts within 5 seconds
✅ Identifies stale PIDs with 100% accuracy
✅ Generates actionable remediation steps
✅ Health reports are structured and machine-parseable

Layer 2 Success Criteria:

✅ Automatically resolves 80%+ of database conflicts
✅ Prevents VKB/UKB conflicts through coordination
✅ Restarts failed services within 10 seconds
✅ Escalates unresolvable issues to Layer 1

Layer 1 Success Criteria:

✅ Detects Coordinator failures within 60 seconds
✅ Restarts Coordinator automatically with 95%+ success rate
✅ Sends alerts for critical failures
✅ Survives system reboots and resumes monitoring

Architecture Benefits

With Full 4-Layer Implementation:

Self-Healing: System automatically recovers from most failures
No Silent Failures: Every issue is detected and reported
Actionable Errors: Clear instructions for manual intervention when needed
High Availability: Multiple layers of redundancy prevent downtime
Observability: Complete visibility into system health at all times

The Stack:

Layer 1: Watchdog (System-level monitoring)
   ↓
Layer 2: Coordinator (Automated remediation)
   ↓
Layer 3: Verifier (Health verification)
   ↓
Layer 4: Monitor (Process/database monitoring)
   ↓
Actual Services (VKB, UKB, GraphDB, etc.)

Current Recommendation

START WITH LAYER 3 - The Verifier is the missing link that will:

Give visibility into what's actually broken
Provide structured data for Layer 2 to act on
Replace manual debugging with automated health checks
Prevent the "Level DB unavailable" silent failures you experienced

Once Layer 3 is operational, you'll have clear, actionable health reports that make implementing Layer 2 (automated remediation) straightforward.
