Created: 2025-11-20
Status: Draft
Purpose: Define heartbeat detection, automatic restart, anomaly detection, escalation, and quarantine protocols for resilient multi-agent operations.
Related: ../multi_agent_orchestration.md, ../agents/lifecycle_management.md, ../workflows/task_queue_management.md, ../integration/mcp_servers.md
This document details heartbeat detection, automatic restart, anomaly detection, escalation, and quarantine protocols for resilient multi-agent operations.
Parent Document: Multi-Agent Orchestration Requirements
Monitors SHALL process agent heartbeats and send acknowledgments; missing acknowledgments MUST trigger retries and be observable.
Default thresholds:
- IDLE TTL: 30s
- RUNNING TTL: 15s
- MONITOR TTL (watchdog layer): 15s (heartbeat every 5s)
Sequence numbers SHALL detect lost heartbeats; 3 consecutive misses → UNRESPONSIVE.
All checks SHALL tolerate clock skew up to 2s; elapsed-time comparisons MUST use a monotonic clock where available.
1 missed heartbeat → warn; 2 consecutive misses → DEGRADED; 3 consecutive misses → UNRESPONSIVE and initiate restart.
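A minimal sketch of the miss ladder and TTL check described above. It assumes the 2s skew allowance is simply added to the TTL and that callers pass readings from a monotonic clock such as `time.monotonic()`. The names `classify_misses`, `is_expired`, and `lost_heartbeats` are hypothetical and not defined by this specification; `WARN` is a label for the warning step, not an agent status.

```python
# Defaults copied from the thresholds above; the dictionary itself is illustrative.
TTL_SECONDS = {"IDLE": 30.0, "RUNNING": 15.0, "MONITOR": 15.0}
CLOCK_SKEW_TOLERANCE_S = 2.0


def classify_misses(consecutive_misses: int) -> str:
    """Map a count of consecutive missed heartbeats onto the escalation ladder."""
    if consecutive_misses >= 3:
        return "UNRESPONSIVE"  # the caller is expected to initiate a restart
    if consecutive_misses == 2:
        return "DEGRADED"
    if consecutive_misses == 1:
        return "WARN"
    return "OK"


def is_expired(last_seen: float, now: float, state: str) -> bool:
    """True when the TTL for `state` has elapsed, allowing for clock skew.

    Both timestamps are assumed to come from the same monotonic clock.
    """
    return (now - last_seen) > TTL_SECONDS[state] + CLOCK_SKEW_TOLERANCE_S


def lost_heartbeats(last_seq: int, new_seq: int) -> int:
    """Number of heartbeats skipped between two received sequence numbers."""
    return max(0, new_seq - last_seq - 1)
```

For example, receiving sequence number 7 directly after 4 indicates two lost heartbeats, which would place the agent in DEGRADED under this ladder if the monitor counts each gap as a miss.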
- Graceful stop (10s timeout)
- Force terminate if the graceful stop does not complete
- Spawn replacement with same config
- Reassign incomplete tasks
- Emit AGENT_RESTARTED event with cause chain
After restart, apply RESTART_COOLDOWN (default 60s) before considering further restarts for the same lineage.
Restarts are capped at MAX_RESTART_ATTEMPTS per ESCALATION_WINDOW (default 3 per rolling hour); exceeding the cap escalates to the Guardian.
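The cooldown and restart cap can be sketched as a rolling-window counter kept per lineage. This is an illustrative helper under stated assumptions, not the specified implementation: `RestartBudget`, its `decide` method, and the returned labels are hypothetical, and `now` is assumed to be a monotonic timestamp in seconds.

```python
from collections import deque
from typing import Deque


class RestartBudget:
    """Tracks restart attempts for one agent lineage (hypothetical helper)."""

    def __init__(self, max_attempts: int = 3, window_s: float = 3600.0,
                 cooldown_s: float = 60.0) -> None:
        self.max_attempts = max_attempts  # MAX_RESTART_ATTEMPTS
        self.window_s = window_s          # ESCALATION_WINDOW
        self.cooldown_s = cooldown_s      # RESTART_COOLDOWN
        self._attempts: Deque[float] = deque()

    def decide(self, now: float) -> str:
        """Return "RESTART", "WAIT" (cooldown active), or "ESCALATE"."""
        # Drop attempts that have aged out of the rolling window.
        while self._attempts and now - self._attempts[0] >= self.window_s:
            self._attempts.popleft()
        # A further restart would exceed the cap: hand over to the guardian.
        if len(self._attempts) >= self.max_attempts:
            return "ESCALATE"
        # Respect the cooldown that follows the most recent restart.
        if self._attempts and now - self._attempts[-1] < self.cooldown_s:
            return "WAIT"
        self._attempts.append(now)
        return "RESTART"
```

With the defaults, three restarts inside an hour are allowed (subject to the 60s cooldown) and a fourth request in the same window yields `ESCALATE`.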
anomaly_score ∈ [0,1] computed from:
- Latency deviation (z-score)
- Error rate trend (EMA)
- Resource skew (CPU/Memory vs baseline)
- Queue impact (blocked dependents)
ANOMALY_THRESHOLD is configurable (default 0.8).
Baselines SHALL be learned per agent type and phase; decayed after resurrection.
Require K consecutive anomalous readings (ANOMALY_CONSECUTIVE, default 3) OR peer confirmation from the watchdog before quarantine.
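A minimal sketch of the quarantine gate. It assumes that peer confirmation only counts when the latest reading is itself anomalous, which the text above does not state explicitly; `should_quarantine` and its parameters are hypothetical names.

```python
from typing import Sequence


def should_quarantine(scores: Sequence[float], threshold: float = 0.8,
                      consecutive_k: int = 3,
                      peer_confirmed: bool = False) -> bool:
    """Decide whether an agent qualifies for quarantine.

    `scores` holds the agent's anomaly scores, oldest first.
    """
    if not scores:
        return False
    # Assumption: a watchdog confirmation short-circuits the K-reading rule,
    # but only if the most recent reading is at or above the threshold.
    if peer_confirmed and scores[-1] >= threshold:
        return True
    recent = list(scores)[-consecutive_k:]
    return len(recent) == consecutive_k and all(s >= threshold for s in recent)
```

A single dip below the threshold resets the run, so `[0.9, 0.5, 0.9]` does not qualify without peer confirmation.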
- SEV-1: Multiple CRITICAL tasks blocked or guardian unavailable
- SEV-2: Repeated restarts > threshold
- SEV-3: Single agent chronic anomalies
Publish to: Guardian, On-call channel, Incident feed; include trace IDs, last N events, config snapshot, and remediation hints.
For SEV-1, require human acknowledgment within ACK_SLA (default 5m); auto-mitigations proceed regardless to contain blast radius.
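The severity ladder and the acknowledgment SLA can be sketched as two small predicates. The function names and input flags are assumptions made for illustration; in particular, "multiple" is read here as more than one blocked CRITICAL task.

```python
from typing import Optional


def classify_severity(critical_tasks_blocked: int, guardian_available: bool,
                      restart_cap_exceeded: bool,
                      chronic_anomaly: bool) -> Optional[str]:
    """Return the highest applicable severity, or None if nothing applies."""
    if critical_tasks_blocked > 1 or not guardian_available:
        return "SEV-1"
    if restart_cap_exceeded:
        return "SEV-2"
    if chronic_anomaly:
        return "SEV-3"
    return None


def ack_overdue(created_at_s: float, now_s: float, acknowledged: bool,
                ack_sla_s: float = 300.0) -> bool:
    """True when a SEV-1 notice has gone unacknowledged past ACK_SLA (5m)."""
    return (not acknowledged) and (now_s - created_at_s) > ack_sla_s
```

Because auto-mitigations proceed regardless of acknowledgment, `ack_overdue` would only drive paging or re-notification in this sketch, never block containment.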
Quarantined agents SHALL receive no new tasks; current task is checkpointed or aborted safely.
Preserve memory, logs, metrics, and recent event stream; generate immutable case bundle.
Guardian MAY clear quarantine with evidence; system MUST re-validate upon re-entry (smoke test + short TTL).
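One way to picture the re-entry check is a gate that refuses to clear quarantine without evidence and a passing smoke test. Everything here is hypothetical: the function name, the result shape, and especially `PROBATION_TTL_S`, since this document does not give a value for the "short TTL".

```python
from typing import Any, Callable, Dict, Optional

# Placeholder value only; the specification does not define the short TTL.
PROBATION_TTL_S = 5.0


def try_clear_quarantine(evidence_uri: Optional[str],
                         smoke_test: Callable[[], bool]) -> Dict[str, Any]:
    """Clear quarantine only with evidence and a passing smoke test."""
    if not evidence_uri:
        return {"cleared": False, "reason": "missing evidence"}
    if not smoke_test():
        return {"cleared": False, "reason": "smoke test failed"}
    # Re-entered agents run under a shortened heartbeat TTL at first.
    return {"cleared": True, "heartbeat_ttl_s": PROBATION_TTL_S}
```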
P95 time to detect (TTD) for missed heartbeats: < 20s for workers; < 10s for monitors (as detected by the watchdog).
Mean time to recover from single agent failure < 60s.
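As a rough illustration of how the detection target might be checked, the sketch below computes a nearest-rank P95 over observed detection times. The nearest-rank method is an assumption; this document does not say how the percentile is to be calculated, and both function names are hypothetical.

```python
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer `pct` in 1..100."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def meets_ttd_slo(ttd_seconds: Sequence[float], limit_s: float = 20.0) -> bool:
    """True when the P95 time to detect is strictly below `limit_s`."""
    return percentile_nearest_rank(ttd_seconds, 95) < limit_s
```

Pass `limit_s=10.0` to check the monitor target instead of the worker target.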
All escalations, quarantines, and restarts MUST be audited with actor and reason.
| Parameter | Default | Notes |
|---|---|---|
| RESTART_COOLDOWN | 60s | Per lineage |
| MAX_RESTART_ATTEMPTS | 3 | Per hour |
| ESCALATION_WINDOW | 1h | Rolling window |
| ANOMALY_THRESHOLD | 0.8 | Composite score |
| ANOMALY_CONSECUTIVE | 3 | Readings before action |
| ACK_SLA | 5m | For SEV-1 |
If anomaly_score >= ANOMALY_THRESHOLD for ANOMALY_CONSECUTIVE readings AND DIAG_ON_ANOMALY=true, THE SYSTEM SHALL start a Diagnosis Agent for the affected agent to collect evidence and recommendations.
If restart attempts exceed MAX_RESTART_ATTEMPTS within ESCALATION_WINDOW AND DIAG_ON_RESTART_ESCALATION=true, THE SYSTEM SHALL start a Diagnosis Agent focusing on systemic failure causes.
| Parameter | Default | Notes |
|---|---|---|
| DIAG_ON_ANOMALY | true | Enable diagnosis when anomalies persist |
| DIAG_ON_RESTART_ESCALATION | true | Enable diagnosis after restart escalation |
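The two trigger rules above can be sketched as a single predicate over a configuration object. `DiagnosisConfig`, `diagnosis_triggers`, and the returned labels are hypothetical names chosen for illustration; the defaults mirror the configuration tables.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class DiagnosisConfig:
    anomaly_threshold: float = 0.8            # ANOMALY_THRESHOLD
    anomaly_consecutive: int = 3              # ANOMALY_CONSECUTIVE
    max_restart_attempts: int = 3             # MAX_RESTART_ATTEMPTS
    diag_on_anomaly: bool = True              # DIAG_ON_ANOMALY
    diag_on_restart_escalation: bool = True   # DIAG_ON_RESTART_ESCALATION


def diagnosis_triggers(cfg: DiagnosisConfig, recent_scores: Sequence[float],
                       restarts_in_window: int) -> List[str]:
    """Return which Diagnosis Agent runs should start, if any."""
    triggers: List[str] = []
    tail = list(recent_scores)[-cfg.anomaly_consecutive:]
    persistent = (len(tail) == cfg.anomaly_consecutive
                  and all(s >= cfg.anomaly_threshold for s in tail))
    if cfg.diag_on_anomaly and persistent:
        triggers.append("agent_anomaly")      # evidence for the affected agent
    if (cfg.diag_on_restart_escalation
            and restarts_in_window > cfg.max_restart_attempts):
        triggers.append("systemic_failure")   # focus on systemic causes
    return triggers
```

Both triggers can fire in the same evaluation, in which case this sketch would start two diagnosis runs with different focus areas.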
from __future__ import annotations

from datetime import datetime
from enum import Enum
from typing import Any, Dict, List, Optional

from pydantic import BaseModel, Field


class AgentStatus(str, Enum):
    IDLE = "IDLE"
    RUNNING = "RUNNING"
    DEGRADED = "DEGRADED"
    UNRESPONSIVE = "UNRESPONSIVE"
    QUARANTINED = "QUARANTINED"
    TERMINATED = "TERMINATED"


class SeverityEnum(str, Enum):
    CRITICAL = "CRITICAL"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


class HeartbeatMessage(BaseModel):
    agent_id: str
    timestamp: datetime
    sequence_number: int
    status: AgentStatus
    current_task_id: Optional[str] = None
    health_metrics: Dict[str, Any] = Field(default_factory=dict)
    checksum: str


class RestartEvent(BaseModel):
    agent_id: str
    reason: str
    graceful_attempt_ms: int = 10000  # 10s graceful-stop budget
    forced: bool = False  # True when force termination was needed
    spawned_agent_id: Optional[str] = None
    reassigned_tasks: List[str] = Field(default_factory=list)
    occurred_at: datetime


class AnomalyReading(BaseModel):
    agent_id: str
    timestamp: datetime
    latency_z: float = 0.0
    error_rate_ema: float = 0.0
    resource_skew: float = 0.0
    queue_impact: float = 0.0

    @property
    def composite_score(self) -> float:
        # Example non-binding reference implementation.
        # Clamped at both ends so the result stays within [0, 1]
        # even if an input component is negative.
        raw = (
            0.35 * abs(self.latency_z)
            + 0.30 * self.error_rate_ema
            + 0.20 * self.resource_skew
            + 0.15 * self.queue_impact
        )
        return max(0.0, min(1.0, raw))


class EscalationNotice(BaseModel):
    severity: SeverityEnum
    agent_ids: List[str] = Field(default_factory=list)
    summary: str
    trace_ids: List[str] = Field(default_factory=list)
    recent_events: List[Dict[str, Any]] = Field(default_factory=list)
    config_snapshot: Dict[str, Any] = Field(default_factory=dict)
    created_at: datetime


class QuarantineRecord(BaseModel):
    agent_id: str
    initiated_at: datetime
    reason: str
    evidence_bundle_uri: str
    cleared_at: Optional[datetime] = None
    cleared_by: Optional[str] = None

- Agent Lifecycle Management Requirements
- Task Queue Management Requirements
- MCP Integration Requirements
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-11-16 | AI Spec Agent | Initial draft |