Name	Name	Last commit message	Last commit date
parent directory ..
4-layer-architecture-implementation-plan.md	4-layer-architecture-implementation-plan.md
README.md	README.md
enhanced-health-monitoring.md	enhanced-health-monitoring.md
monitoring-system.md	monitoring-system.md
process-management-analysis.md	process-management-analysis.md
robust-startup-system.md	robust-startup-system.md
status-line.md	status-line.md
ukb-workflow-system.md	ukb-workflow-system.md

Health System

Automatic system health monitoring and self-healing that ensures a stable development environment.

Phase 33 architecture (current). The 6-layer multi-supervisor design described below is the historical model. Phase 33 (plan 33-04) consolidated lifecycle ownership into a single health coordinator at :3034 running inside the coding-services container; reporters (ETM, health-verifier verify CLI) POST signals and consumers (statusline, dashboard, prompt hook) read state. The retired daemons (HealthVerifier daemon mode, StatusLineHealthMonitor, GlobalProcessSupervisor, GlobalLSLCoordinator, SystemMonitorWatchdog) and on-disk artifacts (.health/verification-status.json, .logs/statusline-health-status.txt, .lsl/global-registry.json) are no longer in use. See docs-content/architecture/health-monitoring.md and docs-content/guides/status-line.md for the current model.

What It Provides

The Health System provides failsafe monitoring with automatic verification and recovery:

Pre-Prompt Checks - Verifies system health before every Claude prompt
Self-Monitoring - The health system monitors itself
Auto-Healing - Automatically restarts failed services
Status Line - Real-time indicators in Claude Code status bar
Dashboard - Visual monitoring at http://localhost:3032 with 3-card system (Databases, Services, Processes)

Architecture

Core Components

The health system is built on interconnected components with active supervision:

Component	File	Purpose
GlobalProcessSupervisor	`global-process-supervisor.js`	Active supervision - 30s checks, restarts dead services, OS-level re-registration
GlobalServiceCoordinator	`global-service-coordinator.js`	Constraint service management with spawn guards and rate limiting
HealthVerifier	`health-verifier.js`	Core verification engine with dynamic discovery & auto-healing
StatusLineHealthMonitor	`statusline-health-monitor.js`	Health aggregation for tmux status bar
CombinedStatusLine	`combined-status-line.js`	Status display + fallback supervisor (GPS heartbeat-gated)
StatusLineFastPath	`status-line-fast.cjs`	Ultra-fast CJS cache reader (~60ms) for tmux `status-right`
EnhancedTranscriptMonitor	`enhanced-transcript-monitor.js`	Real-time per-project transcript monitoring
LiveLoggingCoordinator	`live-logging-coordinator.js`	Logging orchestration with multi-user support
ProcessStateManager	`process-state-manager.js`	Unified registry with atomic file locking (used by all)
AutoRestartWatcher	`auto-restart-watcher.js`	File-change detection — daemons exit cleanly on code change, supervision restarts with new code

Supervision Architecture

Cache Layer - status-line-fast.cjs serves pre-rendered cache in ~60ms (CJS, no ESM overhead) Display + Fallback Layer - CombinedStatusLine renders full status, writes cache; ensure* functions gated by GPS heartbeat Active Supervision Layer - GlobalProcessSupervisor actively monitors and restarts dead services with OS-level fallback discovery Service Coordination Layer - GlobalServiceCoordinator manages constraint services with spawn guards Verification Layer - HealthVerifier runs periodic checks with dynamic project discovery Status Aggregation Layer - StatusLineHealthMonitor + CombinedStatusLine display health Per-Project Layer - EnhancedTranscriptMonitor + LiveLoggingCoordinator per session Core Infrastructure - ProcessStateManager provides unified process registry

Spawn Storm Prevention

The supervision architecture includes multiple guards to prevent runaway process spawning:

Guard	Component	Mechanism
GPS heartbeat gate	CombinedStatusLine	ensure* functions skip when GPS heartbeat <60s old
OS-level dup check	GlobalServiceCoordinator	`findRunningProcessesByScript()` before every spawn
Orphan kill	GlobalServiceCoordinator	Kills spawned process if post-spawn health check fails
Cooldown	GPS (5min), Coordinator (2min)	Per-service cooldown between restart attempts
Rate limiting	GPS (10/hr), Coordinator (6/hr)	Maximum restarts per service per hour
OS-level re-registration	GlobalProcessSupervisor	Re-registers running-but-unregistered services instead of respawning
Health file staleness	GlobalProcessSupervisor	120s threshold (2× write interval) prevents false-positive "dead" detection

Key Features

CJS Fast-Path Cache - status-line-fast.cjs serves status in ~60ms via pre-rendered cache file, eliminating ESM module loading overhead (was 2-18s under load)
Dynamic Discovery - Discovers ALL projects from PSM, health files, and Claude transcript directories
Active Supervision - GlobalProcessSupervisor actively restarts dead monitors within 30 seconds
OS-Level Fallback - GPS re-registers running-but-unregistered services (prevents blind respawn cycles)
Auto-Restart on Code Change - Daemons watch their own source files and exit cleanly when code changes, relying on supervision to restart with updated code
Spawn Storm Prevention - Multi-layered guards: GPS heartbeat gating, OS-level duplicate checks, orphan kill, cooldowns, and rate limiting
Cooldown Protection - GPS: 5-minute cooldown per service; Coordinator: 2-minute cooldown
Rate Limiting - GPS: Max 10 restarts/hour; Coordinator: Max 6 restarts/hour per service
Fallback Supervision - CombinedStatusLine provides backup restart only when GPS heartbeat is stale
Active Session Gating - Transcript monitors only spawned for sessions with transcript activity in the last 2 minutes
Multi-Agent Detection - Detects Claude, Copilot, and OpenCode sessions via process scanning
Agent Age Cap - Running agent's display age capped at monitor uptime; transcripts with status: 'not_found' (e.g., OpenCode) correctly show as inactive instead of falsely green
GPS/Coordinator Deference - GPS defers transcript monitor management to GlobalLSLCoordinator when active (prevents dual-supervisor race condition)
Prompt Hook Safety Net - HealthPromptHook spawns coordinator when GPS has exhausted restart budget (rate-limited 1/min)
Idle Timeout Tmux Guard - Monitor checks for active tmux session before idle-exit (prevents restart budget waste)
Intentional Stop Markers - Graceful shutdown marks project as stopped, preventing restart loops

Component Details

GlobalProcessSupervisor (`scripts/global-process-supervisor.js`) - Active Supervision

Active supervision of ALL transcript monitors and global services
30-second supervision loop with dynamic project discovery
Discovers projects from: PSM registry, health files, Claude transcript directories
OS-level fallback: When PSM says "not registered", checks OS process table via findRunningProcessesByScript() — re-registers alive services instead of blind respawn
Health file staleness threshold: 120 seconds (2× write interval, prevents false-positive "dead" detection at boundary)
Defers to GlobalLSLCoordinator: checks for coordinator daemon via pgrep before restarting transcript monitors (prevents dual-supervisor race)
5-minute cooldown per service prevents restart storms
Max 10 restarts per hour per service (safety limit)
Respects intentional stop markers (skips projects that were gracefully shut down)
Auto-restarts on code change via AutoRestartWatcher
Heartbeat file: .health/supervisor-heartbeat.json
Started via: start-services-robust.js or manually

ProcessStateManager (`scripts/process-state-manager.js`) - Core Infrastructure

Unified registry for all system processes
Atomic file operations via proper-lockfile
Session-aware process tracking (global, per-project, per-session)
Storage: .live-process-registry.json

HealthVerifier (`scripts/health-verifier.js`) - Verification Layer

Core verification engine with 60-second periodic checks
Dynamic discovery of ALL projects (enabled via dynamic_discovery: true in config)
Checks databases (LevelDB, Qdrant, SQLite, Memgraph), services, processes
Generates health scores (0-100) per service
Triggers auto-healing via HealthRemediationActions

StatusLineHealthMonitor (`scripts/statusline-health-monitor.js`) - Status Aggregation

Health aggregation for tmux status bar
15-second update interval with auto-healing
Detects all agent types: Claude, Copilot, OpenCode (via process scanning)
Agent age cap: running agent's age capped at monitor uptime — fresh sessions start green and cool naturally
not_found transcript guard: agents without Claude-compatible transcripts (e.g., OpenCode) correctly show as inactive (⚫) instead of falsely active (🟢)
Sessions removed only when agent process exits, never hidden
Outputs to: .logs/statusline-health-status.txt

CombinedStatusLine (`scripts/combined-status-line.js`) - Display + Fallback Supervisor

Display-first: renders full status bar with all segments
GPS heartbeat gate: ensure* functions only run when GPS heartbeat is stale (>60s) — eliminates overlapping supervision
When GPS is running (normal case): CSL is display-only, writes cache to .logs/combined-status-line-cache.txt
When GPS is dead: CSL acts as fallback supervisor for GPS, SHM, and transcript monitors
2-minute active session gating — only spawns monitors for projects with transcript activity in last 2 min
Respects intentional stop markers (prevents restart loops after graceful shutdown)

StatusLineFastPath (`scripts/status-line-fast.cjs`) - Cache Reader

Ultra-fast CJS cache reader (~60ms) — invoked by tmux status-right every 5 seconds
CommonJS module (no ESM overhead) — eliminates the 2-18 second ESM module resolution penalty under system load
Reads pre-rendered status from .logs/combined-status-line-cache.txt
If cache <60s old: serves immediately
If cache >20s old: triggers background CSL refresh (detached, non-blocking)
If cache missing/stale: falls back to synchronous full CSL execution
Configured in tmux-session-wrapper.sh for all new sessions

EnhancedTranscriptMonitor (`scripts/enhanced-transcript-monitor.js`) - Per-Project

Real-time transcript monitoring per project
2-second check interval for prompt detection
Writes health files to centralized .health/ directory
Generates LSL files in .specstory/history/
Auto-restarts on code change via AutoRestartWatcher
Marks project as intentionally stopped on graceful shutdown (prevents restart loops)

LiveLoggingCoordinator (`scripts/live-logging-coordinator.js`) - Logging

Orchestrates live logging components
Manages LSLFileManager and operational logging
Multi-user support with user hash tracking
Performance metrics collection

Supporting Components

Pre-Prompt Hook (scripts/health-prompt-hook.js)

Runs automatically before every Claude prompt
Returns cached status if fresh (<5 minutes)
Spawns async verification if stale

Auto-Healing (scripts/health-remediation-actions.js)

Automatic service restart capabilities
Database lock cleanup
Zombie process termination

Crash Recovery (scripts/start-services-robust.js)

Pre-startup cleanup of dangling processes
Automatic cleanup after VSCode/Claude crashes
Graceful shutdown tracking for crash detection

Orphan Cleanup (bin/cleanup-orphans)

Manual cleanup utility for orphaned processes
Targets stuck ukb/vkb operations, invalid transcript monitors
Dry-run mode for safe previewing

Status Line Display (scripts/combined-status-line.js)

Reads from StatusLineHealthMonitor output
Real-time indicators in Claude Code status bar
Multi-session support with smart abbreviations

Dashboard (integrations/system-health-dashboard/)

React-based real-time visualization at port 3032
3-card monitoring system (Databases, Services, Processes)
UKB Workflow Monitor with visual workflow graph of agent execution
Service status indicators
Auto-healing history
Manual restart controls

What It Monitors

Databases

LevelDB - Knowledge graph storage
Qdrant - Vector database (port 6333)
SQLite - Analytics database
Memgraph - Code graph database (port 7687 Bolt protocol, port 3100 Lab UI)
CGR Cache - Code graph index staleness (commit tracking)

Services

VKB Server - Knowledge visualization (port 8080)
Constraint Monitor Dashboard - Code quality UI (port 3030)
Constraint Monitor API - Code quality enforcement (port 3031)
System Health Dashboard UI - Health monitoring UI (port 3032)
System Health Dashboard API - Health monitoring API (port 3033)

Docker MCP Servers (Docker Mode Only)

When running in Docker mode, the health system also monitors MCP SSE servers:

Semantic Analysis SSE - AI analysis server (port 3848)
Constraint Monitor SSE - Constraint enforcement MCP (port 3849)
Code Graph RAG SSE - Code graph analysis MCP (port 3850)

Processes

Stale PID detection
Zombie cleanup
Resource monitoring

Transcript Monitor

LSL Health - Verifies transcript monitor is running and processing
Exchange Activity - Tracks exchange count and last processed UUID
Suspicious Activity - Detects stuck or stale monitors

How It Works

Quick Flow:

User issues Claude prompt
Pre-prompt hook fires
Check health status (<5 min old? Use cache : Spawn verification)
Background verification checks all systems
Auto-healing triggers if failures detected
Status updated in dashboard and status line

Detailed Flow: See Enhanced Health Monitoring

CGR Cache Staleness

The .git directory is not mounted into the coding-services container (for performance), so the CGR cache staleness check cannot count commits behind. The verifier reads cache-metadata.json directly and the dashboard displays the cached commit, e.g. coding @ abc123 (staleness unknown).

Service Supervision Hierarchy

The health system uses a multi-layer resilience architecture with spawn storm prevention:

Layer	Component	Trigger	What It Supervises	Spawn Guards
Cache	StatusLineFastPath	Every 5s (tmux)	N/A (read-only)	N/A
1	CombinedStatusLine	Cache miss	GPS, SHM (fallback only)	GPS heartbeat gate
2	GlobalProcessSupervisor	30s loop	HealthVerifier, SHM, TranscriptMonitors	OS dup check, re-registration, 120s staleness
2	GlobalServiceCoordinator	15s loop	Constraint services (api, dashboard)	OS dup check, orphan kill, 2m cooldown, 6/hr limit
3	HealthVerifier	60s loop	Databases, Services, Processes	N/A

Key Guarantee: If any service dies, it will be restarted within:

30 seconds (by GlobalProcessSupervisor)
Or the next cache miss (by CombinedStatusLine as fallback supervisor, only if GPS is also dead)

Quick Start

View Dashboard

# Dashboard automatically available at:
http://localhost:3032

# Or start manually:
cd integrations/system-health-dashboard
npm run dev

Check System Health

# Manual health check
node scripts/health-verifier.js

# View status
cat .health/verification-status.json | jq '.'

Status Line

The status line appears automatically in Claude Code:

Native Mode:

[C🟢 UT🟤] [🔒67% 🔍EX] [Gq$0FEB A$0 O$0 X$25] [📚✅] [🏥✅] 📋17-18

Docker Mode:

[🐳] [🐳MCP:✅] [C🟢 UT🟤] [🔒67% 🔍EX] [Gq$0FEB A$0 O$0 X$25] [📚✅] [🏥✅] 📋17-18

Components:

[🐳] - Docker mode indicator (only shown when running in Docker mode)
[🐳MCP:✅] - Docker MCP health: SA=Semantic Analysis, CM=Constraint Monitor, CGR=Code Graph RAG
[C🟢 UT🟤] - Active sessions with activity icons (all sessions shown, 💤 for sleeping)
[🔒67%] - Constraint compliance percentage (with optional 🟡N violations sub-segment when non-zero)
[📚✅] - Knowledge system status (icons only, no counts)
[🏥✅] - Unified health (GCM + Health Verifier + Enforcement)
📋17-18 - LSL time window (HHMM-HHMM)

See Status Line System for complete documentation.

Detailed Documentation

In-Depth Guides

Status Line System - Complete status line documentation including architecture, state diagrams, multi-session support, and configuration
Enhanced Health Monitoring - Comprehensive health monitoring system with auto-recovery, plug'n'play behavior, and session management
4-Layer Architecture - Detailed architectural design and implementation plan
Monitoring System - Core monitoring components and integration patterns
Process Management - Process lifecycle management and recovery mechanisms
Robust Startup System - Service startup, initialization, and failsafe mechanisms

Related Systems

System Health Dashboard - Dashboard UI and API documentation
LSL - Health events logged in session logs
Constraints - Constraint monitor is a monitored service
Knowledge Management - Monitors VKB, LevelDB, Qdrant

Key Files

Core Health Components:

scripts/global-process-supervisor.js - Active supervision of all services (with OS-level re-registration)
scripts/global-service-coordinator.js - Constraint service management (with spawn guards)
scripts/process-state-manager.js - Unified process registry with atomic locking
scripts/health-verifier.js - Core verification with dynamic discovery & auto-healing
scripts/statusline-health-monitor.js - Health aggregation daemon
scripts/combined-status-line.js - Status display + fallback supervisor
scripts/status-line-fast.cjs - Ultra-fast CJS cache reader (~60ms) for tmux
scripts/enhanced-transcript-monitor.js - Per-project monitoring
scripts/live-logging-coordinator.js - Logging orchestration

Supporting Scripts:

scripts/auto-restart-watcher.js - File-change detection for daemon code reloading
scripts/health-prompt-hook.js - Pre-prompt integration
scripts/health-remediation-actions.js - Auto-healing actions
scripts/start-services-robust.js - Service startup with supervisor
scripts/tmux-session-wrapper.sh - Tmux session wrapper (configures status-line-fast.cjs)

Data Files:

.live-process-registry.json - ProcessStateManager registry
.health/supervisor-heartbeat.json - GlobalProcessSupervisor heartbeat
.health/verification-status.json - HealthVerifier output
.health/*-transcript-monitor-health.json - Per-project health files
.logs/statusline-health-status.txt - StatusLineHealthMonitor output
.logs/combined-status-line-cache.txt - Pre-rendered status cache (served by fast-path)

Configuration:

config/health-verification-rules.json - Health check rules with dynamic_discovery flag

Dashboard:

integrations/system-health-dashboard/server.js - API server (port 3033)
integrations/system-health-dashboard/src/ - React UI (port 3032)

Troubleshooting

Dashboard not loading?

curl http://localhost:3033/api/health
PORT=3030 npm run dashboard

Services stuck unhealthy?

node scripts/health-verifier.js
cat .health/verification-status.json | jq '.'

Auto-healing not working?

cat logs/health-remediation.log
cat scripts/health-checks-config.json | jq '.vkb_server.auto_heal'

For comprehensive troubleshooting, see Enhanced Health Monitoring.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Health System

What It Provides

Architecture

Core Components

Supervision Architecture

Spawn Storm Prevention

Key Features

Component Details

GlobalProcessSupervisor (`scripts/global-process-supervisor.js`) - Active Supervision

ProcessStateManager (`scripts/process-state-manager.js`) - Core Infrastructure

HealthVerifier (`scripts/health-verifier.js`) - Verification Layer

StatusLineHealthMonitor (`scripts/statusline-health-monitor.js`) - Status Aggregation

CombinedStatusLine (`scripts/combined-status-line.js`) - Display + Fallback Supervisor

StatusLineFastPath (`scripts/status-line-fast.cjs`) - Cache Reader

EnhancedTranscriptMonitor (`scripts/enhanced-transcript-monitor.js`) - Per-Project

LiveLoggingCoordinator (`scripts/live-logging-coordinator.js`) - Logging

Supporting Components

What It Monitors

Databases

Services

Docker MCP Servers (Docker Mode Only)

Processes

Transcript Monitor

How It Works

CGR Cache Staleness

Service Supervision Hierarchy

Quick Start

View Dashboard

Check System Health

Status Line

Detailed Documentation

In-Depth Guides

Related Systems

Key Files

Troubleshooting

FilesExpand file tree

health-system

Directory actions

More options

Directory actions

More options

Latest commit

History

health-system

Folders and files

parent directory

README.md

Health System

What It Provides

Architecture

Core Components

Supervision Architecture

Spawn Storm Prevention

Key Features

Component Details

GlobalProcessSupervisor (scripts/global-process-supervisor.js) - Active Supervision

ProcessStateManager (scripts/process-state-manager.js) - Core Infrastructure

HealthVerifier (scripts/health-verifier.js) - Verification Layer

StatusLineHealthMonitor (scripts/statusline-health-monitor.js) - Status Aggregation

CombinedStatusLine (scripts/combined-status-line.js) - Display + Fallback Supervisor

StatusLineFastPath (scripts/status-line-fast.cjs) - Cache Reader

EnhancedTranscriptMonitor (scripts/enhanced-transcript-monitor.js) - Per-Project

LiveLoggingCoordinator (scripts/live-logging-coordinator.js) - Logging

Supporting Components

What It Monitors

Databases

Services

Docker MCP Servers (Docker Mode Only)

Processes

Transcript Monitor

How It Works

CGR Cache Staleness

Service Supervision Hierarchy

Quick Start

View Dashboard

Check System Health

Status Line

Detailed Documentation

In-Depth Guides

Related Systems

Key Files

Troubleshooting

GlobalProcessSupervisor (`scripts/global-process-supervisor.js`) - Active Supervision

ProcessStateManager (`scripts/process-state-manager.js`) - Core Infrastructure

HealthVerifier (`scripts/health-verifier.js`) - Verification Layer

StatusLineHealthMonitor (`scripts/statusline-health-monitor.js`) - Status Aggregation

CombinedStatusLine (`scripts/combined-status-line.js`) - Display + Fallback Supervisor

StatusLineFastPath (`scripts/status-line-fast.cjs`) - Cache Reader

EnhancedTranscriptMonitor (`scripts/enhanced-transcript-monitor.js`) - Per-Project

LiveLoggingCoordinator (`scripts/live-logging-coordinator.js`) - Logging