The Marcus Error Framework is a comprehensive, autonomous agent-optimized error handling system that provides structured exception hierarchies, intelligent recovery strategies, circuit breaker patterns, and real-time monitoring capabilities. Unlike traditional error handling systems designed for human-operated applications, this framework is specifically engineered for autonomous agent environments where errors must be self-diagnosing, self-recovering, and provide actionable intelligence for both automated retry logic and human escalation.
The Error Framework consists of four primary modules operating in concert:
Marcus Error Framework Architecture
├── error_framework.py (Core Exception System)
│ ├── MarcusBaseError (Base Exception Class)
│ ├── ErrorContext (Rich Contextual Information)
│ ├── RemediationSuggestion (Recovery Guidance)
│ └── Error Type Hierarchy (Domain-Specific Exceptions)
├── error_strategies.py (Recovery Strategies)
│ ├── RetryHandler (Exponential Backoff & Jitter)
│ ├── CircuitBreaker (Cascade Failure Prevention)
│ ├── FallbackHandler (Graceful Degradation)
│ └── ErrorAggregator (Batch Operation Handling)
├── error_responses.py (Format Adapters)
│ ├── MCP Protocol Responses
│ ├── JSON API Responses
│ ├── User-Friendly Messages
│ └── Logging & Monitoring Formats
└── error_monitoring.py (Intelligence & Analytics)
├── Pattern Detection Engine
├── Correlation Analysis
├── Health Scoring Algorithm
└── Alert Management System
The framework implements a sophisticated six-tier error classification system:
Tier 1: Transient Errors (Auto-recoverable)
NetworkTimeoutError: Network operations exceeding time limitsServiceUnavailableError: Temporary external service outagesRateLimitError: API rate limit violations with retry timingTemporaryResourceError: Temporary system resource exhaustion
Tier 2: Configuration Errors (User-resolvable)
MissingCredentialsError: Absent authentication credentialsInvalidConfigurationError: Malformed configuration valuesMissingDependencyError: Required dependencies not installedEnvironmentError: Incorrect environment setup
Tier 3: Business Logic Errors (Logic violations)
TaskAssignmentError: Task allocation conflicts or impossibilitiesWorkflowViolationError: Workflow state machine violationsValidationError: Data validation failuresStateConflictError: System state inconsistenciesTaskValidationError: Task-specific validation failures (e.g., invalid task fields)ProjectRootNotFoundError: Unable to locate the project root directory
Tier 4: Integration Errors (External service issues)
KanbanIntegrationError: Kanban board connectivity/operation failuresAIProviderError: AI service integration failuresAuthenticationError: External service authentication failuresExternalServiceError: Generic external service errors
Tier 5: Security Errors (Critical security events)
AuthorizationError: Permission/authorization violationsWorkspaceSecurityError: Workspace isolation breachesPermissionError: File/resource permission violations
Tier 6: System Errors (Critical infrastructure failures)
ResourceExhaustionError: System resource depletionCorruptedStateError: Data corruption detectionDatabaseError: Database operation failuresCriticalDependencyError: Essential system component failures
The Error Framework operates as a cross-cutting concern throughout the Marcus ecosystem:
- MCP Server Layer: All MCP tool calls are wrapped with error handling for consistent response formatting
- Agent Workflow Layer: Task assignment, progress reporting, and blocker handling utilize error context and retry strategies
- Integration Layer: External service calls (Kanban, AI providers) are protected by circuit breakers and fallback mechanisms
- Core Processing Layer: Business logic operations leverage validation and state conflict detection
- Monitoring Layer: All errors feed into the monitoring system for pattern analysis and health scoring
The Error Framework intercepts and enhances error handling at key workflow stages:
Typical Agent Workflow Error Integration:
create_project → register_agent → request_next_task → report_progress → report_blocker → finish_task
↓ ↓ ↓ ↓ ↓ ↓
Configuration Business Logic Integration Business Logic Integration Integration
Errors Errors Errors Errors Errors Errors
↓ ↓ ↓ ↓ ↓ ↓
No Retry Validation & Circuit Breaker Context Logging AI-Powered Final Cleanup
State Checking & Retry Logic & Monitoring Suggestions & Reporting
Every Marcus error carries comprehensive context through the ErrorContext class:
@dataclass
class ErrorContext:
# Operation identification
operation: str = "" # What was being attempted
operation_id: str = uuid4() # Unique operation identifier
# Agent context
agent_id: Optional[str] = None # Which agent encountered the error
task_id: Optional[str] = None # Current task being processed
agent_state: Optional[Dict] = None # Agent's current state snapshot
# System context
timestamp: datetime = now() # When error occurred
correlation_id: str = uuid4() # For tracing related operations
system_state: Optional[Dict] = None # System resource state
# Integration context
integration_name: Optional[str] = None # External service involved
integration_state: Optional[Dict] = None # Service-specific state
# Extensible context
user_context: Optional[Dict] = None # User-specific information
custom_context: Optional[Dict] = None # Operation-specific dataThe framework provides automatic context injection through the error_context context manager:
with error_context("kanban_sync", agent_id="agent_123", task_id="task_456"):
# Any MarcusBaseError raised here automatically includes context
sync_task_with_kanban()The retry system implements sophisticated backoff strategies.
Production implementation is in src/core/resilience.py:
@dataclass
class RetryConfig:
max_attempts: int = 3 # Maximum retry attempts
base_delay: float = 1.0 # Initial delay in seconds
max_delay: float = 60.0 # Maximum delay cap
exponential_base: float = 2.0 # Backoff multiplier
jitter: bool = True # Add randomizationNote:
src/core/error_strategies.pycontains a more fully-featuredRetryConfigwith additional fields (retry_on,stop_on,multiplier), but this module is NOT integrated into the production codebase. Theresilience.pyversion above is what is actually used.
Jitter Algorithm (actual implementation using secrets.SystemRandom()):
delay *= 0.5 + secure_random.random() — gives 50%–150% of the calculated delay,
where secure_random = secrets.SystemRandom() (cryptographically secure).
Benefits:
- Prevents thundering herd problems
- Configurable per operation type
- Cryptographically secure jitter
Prevents cascading failures through intelligent service protection.
Production implementation is the CircuitBreaker class in src/core/resilience.py.
State is tracked as a plain string attribute ("closed", "open", "half-open").
Note:
src/core/error_strategies.pyhas aCircuitBreakerStateenum (singular), but that module is NOT integrated into production. There is no class calledCircuitBreakerStates(plural) anywhere in the codebase.
State Transitions:
"closed"→"open": Afterfailure_thresholdconsecutive failures"open"→"half-open": Afterrecovery_timeoutduration expires"half-open"→"closed": On successful test call"half-open"→"open": On any failure during testing
Autonomous Benefits:
- Prevents agents from hammering failing services
- Automatic recovery testing
- Service health awareness
- Resource conservation
Graceful degradation through priority-ordered fallback functions:
fallback_handler = FallbackHandler("task_creation")
fallback_handler.add_fallback(create_task_locally, priority=1) # Try first
fallback_handler.add_fallback(queue_for_later, priority=2) # Then this
fallback_handler.add_fallback(use_cached_template, priority=3) # Finally thisFallback Strategy Selection:
- Primary function attempted
- On failure, fallbacks tried in priority order
- First successful fallback result returned
- If all fail, cached results used if available
- If no cache, enhanced error with exhausted fallback information
The monitoring system identifies four categories of error patterns:
1. Frequency Patterns: Same error type occurring repeatedly
Threshold: 5+ occurrences within 10 minutes
Detection: Error type fingerprinting
Action: Pattern alert with error type analysis
2. Burst Patterns: High error volume in short timeframe
Threshold: 10+ errors within 5 minutes (any type)
Detection: Time-window error counting
Action: System stability alert
3. Agent-Specific Patterns: High error rate from individual agents
Threshold: 20+ errors from single agent within 30 minutes
Detection: Agent ID error aggregation
Action: Agent health check recommendation
4. Cascade Patterns: Related errors occurring in sequence
Threshold: 3+ similar errors with 70%+ similarity within 5 minutes
Detection: Multi-dimensional error similarity scoring
Action: Root cause investigation trigger
The framework calculates error similarity using weighted factors:
def calculate_similarity(error1, error2) -> float:
factors = []
if error1.error_type == error2.error_type: factors.append(0.4) # 40% weight
if error1.operation == error2.operation: factors.append(0.3) # 30% weight
if error1.integration == error2.integration: factors.append(0.2) # 20% weight
if abs(error1.timestamp - error2.timestamp) < 60s: factors.append(0.1) # 10% weight
return sum(factors) # 0.0 to 1.0 similarity scoreSystem health calculated as weighted score (0-100):
health_score = 100
if error_rate_per_minute > 10: health_score -= 30
elif error_rate_per_minute > 5: health_score -= 15
elif error_rate_per_minute > 2: health_score -= 5
if critical_errors > 0: health_score -= 25
health_score -= active_patterns * 10 # 10 points per active pattern
health_score = max(0, health_score)Health Status Mapping:
- 90-100: Excellent
- 75-89: Good
- 50-74: Fair
- 25-49: Poor
- 0-24: Critical
Optimized for Claude Code agent consumption:
{
"success": false,
"error": {
"code": "KANBAN_INTEGRATION_ERROR",
"message": "Failed to create task on board 'Development'",
"type": "KanbanIntegrationError",
"severity": "medium",
"retryable": true,
"context": {
"operation": "create_task",
"correlation_id": "corr_abc123",
"agent_id": "agent_dev_001",
"task_id": "task_456"
},
"remediation": {
"immediate": "Retry task creation with exponential backoff",
"fallback": "Create task locally and sync when service recovers",
"retry": "Automatic retry in 2.5 seconds (attempt 2/3)"
}
}
}Human-readable error presentation:
Unable to create task on Kanban board due to service timeout.
💡 What to do: The system will retry automatically in 30 seconds
🔄 Alternative: Task has been created locally and will sync when the service recovers
🔁 Retry: This is attempt 2 of 3 - if all attempts fail, the task will remain in local queue
Structured for log analysis and debugging:
{
"level": "error",
"timestamp": "2025-07-14T15:30:45.123Z",
"error_code": "KANBAN_INTEGRATION_ERROR",
"error_type": "KanbanIntegrationError",
"correlation_id": "corr_abc123",
"operation": "create_task",
"agent_id": "agent_dev_001",
"task_id": "task_456",
"integration": "planka_board",
"retryable": true,
"severity": "medium",
"caused_by": "requests.exceptions.Timeout",
"custom_context": {
"board_id": "board_789",
"task_title": "Implement user authentication"
}
}Error Types: Configuration, Validation Handling:
- Immediate validation of project parameters
- Configuration error detection and user guidance
- No retries (user input required)
- Clear remediation instructions
Error Types: Business Logic, System Handling:
- Agent ID validation and conflict detection
- Capability matching verification
- State initialization error recovery
- Agent registry consistency checking
Error Types: Integration, Business Logic, Transient Handling:
- Kanban integration with circuit breaker protection
- Task assignment algorithm error recovery
- AI-powered task matching with fallbacks
- Dependency conflict resolution
Error Types: Integration, Validation, Transient Handling:
- Progress validation against task constraints
- Kanban sync with retry logic
- State consistency verification
- Context preservation for correlation
Error Types: Integration, AI Provider, Business Logic Handling:
- AI-powered suggestion generation with fallbacks
- Blocker classification and severity assessment
- Escalation path determination
- Context aggregation for pattern analysis
Error Types: Integration, System, Validation Handling:
- Task completion validation
- Final state synchronization
- Cleanup operation error handling
- Correlation group closure
Error Strategy:
- Basic retry logic (3 attempts, 1s base delay)
- Simple circuit breaker (5 failure threshold)
- Minimal context collection
- Standard monitoring
@with_retry(RetryConfig(max_attempts=3, base_delay=1.0))
async def handle_simple_task():
# Lightweight error handling
passError Strategy:
- Enhanced retry logic (5 attempts, 2s base delay)
- Sensitive circuit breaker (3 failure threshold)
- Rich context collection including dependency state
- Enhanced monitoring with pattern correlation
@with_retry(RetryConfig(max_attempts=5, base_delay=2.0))
async def handle_complex_task():
with error_context("complex_task",
agent_id=agent_id,
task_id=task_id,
custom_context={"dependencies": dep_list}):
# Enhanced error handling with dependency awareness
passComplex Task Enhancements:
- Dependency state tracking in error context
- Multi-agent error correlation
- Cascade failure prevention
- Enhanced pattern detection sensitivity
The Error Framework integrates with Marcus's Kanban provider abstraction layer:
Planka Provider Errors:
- Connection timeouts: 30s timeout with 3 retries
- Authentication failures: No retry, immediate credential refresh
- Rate limiting: Exponential backoff with provider-specific limits
- Board access errors: Permission validation and fallback board selection
Generic Kanban Errors:
- Provider detection and capability matching
- Failover between multiple configured providers
- Provider-specific error code translation
- Board synchronization conflict resolution
Error Scenarios:
- Task creation conflicts during multi-agent operations
- Board state drift during network partitions
- Concurrent modification conflicts
- Board access permission changes
Resolution Strategies:
- Optimistic locking with conflict detection
- Last-writer-wins with conflict notification
- Manual merge conflict resolution
- Fallback to local state with delayed sync
The error_context context manager is implemented in src/core/error_framework.py as a plain
@contextmanager. It modifies a MarcusBaseError in-place if one is raised within the block.
There is no ContextVar, no current_error_context variable, and no token-based reset.
@contextmanager
def error_context(operation: str, **context_kwargs):
try:
yield
except MarcusBaseError as e:
# Inject context fields into the error in-place
if not e.context.operation:
e.context.operation = operation
for key, value in context_kwargs.items():
setattr(e.context, key, value)
raiseThe monitoring system implements intelligent memory management:
class ErrorMonitor:
def __init__(self):
self.error_history: deque = deque(maxlen=10000) # Ring buffer
self.pattern_cleanup_threshold = timedelta(days=7)
self.correlation_timeout = timedelta(hours=24)
def _cleanup_old_data(self):
# Automatic cleanup of old patterns and correlations
# Prevents memory leaks in long-running agents
passThe framework provides seamless async/sync compatibility:
def with_retry(config: RetryConfig = None):
def decorator(func):
@wraps(func)
async def async_wrapper(*args, **kwargs):
# Async implementation
pass
@wraps(func)
def sync_wrapper(*args, **kwargs):
# Sync implementation using asyncio.run()
pass
return async_wrapper if asyncio.iscoroutinefunction(func) else sync_wrapper
return decoratorAll error data structures are designed for safe JSON serialization:
@dataclass
class ErrorContext:
def to_dict(self) -> Dict[str, Any]:
return {
'operation': self.operation,
'timestamp': self.timestamp.isoformat(),
'custom_context': self.custom_context or {}
}1. Autonomous Agent Optimization
- Self-diagnosing errors with actionable remediation
- Automatic retry and recovery strategies
- Context-aware error handling
- Pattern detection for proactive issue identification
2. Comprehensive Error Intelligence
- Rich contextual information for debugging
- Multi-dimensional error correlation
- Real-time health monitoring and scoring
- Predictive pattern analysis
3. Integration-Friendly Design
- Multiple response format adapters
- Seamless legacy code integration
- Configurable retry and circuit breaker policies
- Extensible error type hierarchy
4. Production-Ready Features
- Memory-efficient monitoring with cleanup
- Thread-safe and async-compatible
- Structured logging integration
- Security-conscious sensitive data handling
5. Developer Experience
- Decorator-based easy integration
- Context manager automatic error enhancement
- Clear error type classification
- Comprehensive remediation suggestions
1. Complexity Overhead
- Substantial codebase complexity for simple applications
- Learning curve for new developers
- Additional memory footprint for monitoring
- Configuration complexity for advanced features
2. Performance Considerations
- Context collection overhead on every error
- Monitoring system background processing
- Pattern detection computational cost
- Serialization overhead for error responses
3. Framework Lock-in
- Marcus-specific error types create vendor lock-in
- Migration complexity from existing error handling
- Dependency on Marcus ecosystem components
- Framework-specific debugging knowledge required
4. Configuration Complexity
- Multiple configuration layers (retry, circuit breaker, monitoring)
- Environment-specific tuning requirements
- Provider-specific error mapping complexity
- Fine-tuning required for optimal performance
1. Autonomous Agent Requirements Traditional error handling systems assume human operators who can interpret error messages and take corrective action. Marcus agents require:
- Machine-interpretable error classifications
- Automatic recovery strategies
- Rich context for correlation across operations
- Predictive pattern analysis for proactive issue resolution
2. Microservices-Style Error Handling The framework treats each component (Kanban integration, AI providers, task assignment) as independent services requiring:
- Circuit breaker protection against cascade failures
- Service-specific retry strategies
- Fallback mechanisms for graceful degradation
- Health monitoring and automatic service discovery
3. Observable System Design Error handling as a first-class observability concern:
- Every error contributes to system health understanding
- Pattern detection enables proactive issue resolution
- Error correlation provides root cause analysis
- Health scoring guides system optimization
4. Developer Experience Priority Balancing power with usability:
- Decorator-based integration for minimal code changes
- Context managers for automatic error enhancement
- Clear error type hierarchy for easy classification
- Multiple response formats for different consumption patterns
1. Simple Exception Hierarchy Rejected: Insufficient for autonomous agent needs
- No automatic retry logic
- No context preservation
- No pattern detection capabilities
- No service protection mechanisms
2. External Error Management Service Rejected: Added complexity and latency
- Network dependency for error handling
- Additional service to maintain and monitor
- Latency impact on error processing
- Single point of failure
3. Framework-Agnostic Error Handling Rejected: Generic solutions lack domain specificity
- No Marcus-specific error types
- No integration with agent workflow
- No Kanban provider awareness
- No AI provider error handling
1. Enhanced AI Integration
- GPT-4 powered error analysis and remediation suggestions
- Automatic root cause analysis using error correlations
- Predictive error modeling based on historical patterns
- Context-aware error severity adjustment
2. Advanced Pattern Detection
- Machine learning-based pattern recognition
- Seasonal and cyclical error pattern detection
- Cross-agent error correlation analysis
- Predictive failure forecasting
3. Performance Optimizations
- Streaming error data processing
- Compressed error history storage
- Lazy error context evaluation
- Background pattern analysis
1. Distributed Error Management
- Multi-instance error correlation
- Distributed circuit breaker coordination
- Global system health aggregation
- Cross-deployment error pattern sharing
2. Self-Healing Capabilities
- Automatic configuration adjustment based on error patterns
- Dynamic retry strategy optimization
- Self-tuning circuit breaker thresholds
- Autonomous remediation action execution
3. Advanced Monitoring Integration
- Prometheus metrics export
- Grafana dashboard templates
- AlertManager integration
- Custom metric collection and analysis
1. Predictive Error Prevention
- Pre-error condition detection
- Proactive remediation action triggering
- Resource usage prediction and scaling
- Failure cascade prevention
2. Cross-System Error Learning
- Error pattern sharing between Marcus instances
- Community-driven error knowledge base
- Automated error handling best practice evolution
- Cross-domain error pattern recognition
3. Advanced Recovery Strategies
- AI-powered custom recovery strategy generation
- Dynamic fallback chain optimization
- Context-aware recovery strategy selection
- Self-evolving error handling policies
async def mcp_create_task(arguments: Dict[str, Any]) -> Dict[str, Any]:
try:
with error_context("mcp_create_task",
custom_context={"tool": "create_task", "args": arguments}):
result = await task_service.create_task(arguments)
return {"success": True, "result": result}
except Exception as e:
return handle_mcp_tool_error(e, "create_task", arguments)@with_retry(RetryConfig(max_attempts=3))
@with_circuit_breaker("kanban_service")
async def sync_agent_progress(agent_id: str, task_id: str, progress: int):
with error_context("progress_sync", agent_id=agent_id, task_id=task_id):
await kanban_provider.update_task_progress(task_id, progress)
record_agent_event("progress_updated", agent_id, {"task_id": task_id, "progress": progress})# Before: Basic error handling
try:
result = external_service_call()
return {"success": True, "data": result}
except Exception as e:
logger.error(f"Service call failed: {e}")
return {"success": False, "error": str(e)}
# After: Marcus Error Framework
@with_retry()
@with_circuit_breaker("external_service")
async def safe_external_service_call():
with error_context("external_service_call"):
return await external_service_call()
try:
result = await safe_external_service_call()
return {"success": True, "data": result}
except Exception as e:
return create_error_response(e, ResponseFormat.MCP)The Marcus Error Framework represents a paradigm shift from reactive error handling to proactive error intelligence. By treating errors as valuable system intelligence rather than mere exceptions, the framework enables autonomous agents to operate more reliably, recover more intelligently, and provide better visibility into system health.
The framework's multi-tiered approach—from simple retry logic to sophisticated pattern detection—allows it to scale from basic error recovery to advanced system intelligence. Its integration with the broader Marcus ecosystem ensures that error handling is not an afterthought but a core system capability that enhances every aspect of autonomous agent operation.
As Marcus continues to evolve toward more sophisticated autonomous operation, the Error Framework provides the foundation for self-healing, self-monitoring, and self-optimizing agent systems that can operate reliably in complex, distributed environments with minimal human intervention.