- Overview
- Architecture
- Ecosystem Integration
- Workflow Integration
- What Makes This System Special
- Technical Implementation
- Pros and Cons
- Design Rationale
- Future Evolution
- Task Complexity Handling
- Board-Specific Considerations
- Cato Integration
- Typical Scenario Integration
The Marcus Testing Framework is a comprehensive, multi-layered testing system designed specifically for autonomous agent environments. It provides structured test organization, rich error handling, and intelligent test automation to ensure reliable operation of the Marcus ecosystem.
The Testing Framework provides:
- Structured Test Organization: Clear separation of unit, integration, performance, and future feature tests
- Autonomous Agent Testing: Specialized fixtures and utilities for testing agent interactions
- Rich Error Context: Integration with Marcus Error Framework for detailed failure analysis
- Intelligent Test Discovery: Automated test placement based on dependency analysis
- Performance Benchmarking: Comprehensive performance testing with scaling metrics
- TDD Support: Future features testing for test-driven development workflows
Marcus Testing Framework
├── Test Organization Layer
│ ├── Unit Tests (Isolated, < 100ms)
│ ├── Integration Tests (Real services)
│ ├── Performance Tests (Benchmarking)
│ └── Future Features (TDD)
├── Test Infrastructure Layer
│ ├── Fixtures & Factories
│ ├── Mock Systems
│ └── Base Test Classes
├── Automation Layer
│ ├── Test Discovery
│ ├── Async Test Management
│ └── Coverage Analysis
└── Integration Layer
├── MCP Protocol Testing
├── Kanban Integration Testing
└── Error Framework Testing
The Testing Framework deeply integrates with all Marcus core systems:
Error Framework Integration:
# tests/unit/core/test_error_framework.py
from src.core.error_framework import (
MarcusBaseError, ErrorContext, RemediationSuggestion,
KanbanIntegrationError, AIProviderError
)
class TestErrorFramework:
def test_error_context_creation(self):
"""Test creating error context with agent tracking"""
context = ErrorContext(
operation="task_assignment",
agent_id="agent-001",
task_id="TASK-123"
)
assert context.correlation_id is not NoneMCP Server Integration:
# tests/unit/mcp/test_marcus_server_complete.py
@pytest.mark.anyio
@pytest.mark.parametrize("anyio_backend", ["asyncio"])
async def test_mcp_tool_execution():
"""Test MCP tool execution with proper context"""
server = MarcusServer()
result = await server.call_tool("register_agent", {
"agent_id": "test-001",
"name": "Test Agent",
"role": "Developer"
})
assert result["success"] is TrueAI Engine Integration:
# tests/integration/ai/test_prd_parser_real_ai.py
@pytest.mark.integration
async def test_ai_prd_analysis():
"""Test AI engine with real provider integration"""
engine = AIAnalysisEngine()
result = await engine.analyze_prd(sample_prd_text)
assert result.confidence > 0.8Kanban Provider Testing:
# Shared fixture in conftest.py
@pytest.fixture
async def mcp_session() -> AsyncGenerator[ClientSession, None]:
"""MCP session connected to Kanban server"""
server_params = StdioServerParameters(
command="/opt/homebrew/bin/node",
args=["/Users/lwgray/dev/kanban-mcp/dist/index.js"],
env={
"PLANKA_BASE_URL": "http://localhost:3333",
"PLANKA_AGENT_EMAIL": "demo@demo.demo",
"PLANKA_AGENT_PASSWORD": "demo" # pragma: allowlist secret
}
)
async with stdio_client(server_params) as (read, write):
async with ClientSession(read, write) as session:
await session.initialize()
yield sessionTest Board Management:
@pytest.fixture
async def test_board(mcp_session, test_project_id, test_board_name):
"""Auto-managed test board with cleanup"""
result = await mcp_session.call_tool("mcp_kanban_project_board_manager", {
"action": "create_board",
"projectId": test_project_id,
"name": test_board_name
})
yield board_data
# Automatic cleanup after testThe Testing Framework integrates into the Marcus workflow at multiple points:
create_project → register_agent → request_next_task → report_progress → report_blocker → finish_task
↓ ↓ ↓ ↓ ↓ ↓
Unit Tests Integration Performance Tests Error Testing Recovery Tests E2E Tests
Pre-Development: Future feature tests guide TDD implementation During Development: Unit tests provide rapid feedback Integration Phase: Integration tests verify component interactions Performance Validation: Benchmarking ensures scalability Error Handling: Error framework tests validate recovery paths Deployment: E2E tests verify complete workflows
The framework uses an intelligent test placement system:
┌─────────────────────────────────────────────────────────────┐
│ START: New Test Needed │
└──────────────────────────┬──────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ Q1: Does it need external services? │
│ (Database, API, Network, File System) │
└────────────┬─────────────────┬───────────┘
│ NO │ YES
▼ ▼
┌───────────────────┐ ┌────────────────────────┐
│ Q2: Single unit? │ │ Q3: Future feature? │
│ (class/function) │ │ (TDD/unimplemented) │
└──┬──────────┬─────┘ └──┬───────────────┬────┘
│ YES │ NO │ YES │ NO
▼ ▼ ▼ ▼
tests/unit/ tests/unit/ tests/future_ tests/integration/
[component]/ test_*.py features/ [type]/
Unlike traditional testing frameworks, Marcus's system is designed for autonomous agents:
# tests/fixtures/factories.py
class AgentFactory:
"""Factory for creating WorkerStatus objects for testing"""
@classmethod
def create(cls, **kwargs) -> WorkerStatus:
"""Create agent with realistic autonomous behavior patterns"""
defaults = {
'worker_id': f'agent-{cls._counter:04d}',
'current_tasks': [],
'performance_score': 1.0,
'skills': ['python', 'autonomous-execution'],
'capacity': 40
}
return WorkerStatus(**defaults)Deep integration with the Error Framework provides unprecedented debugging capability:
def test_error_context_propagation(self):
"""Test error context flows through autonomous systems"""
with error_context("task_assignment", agent_id="agent-001"):
try:
raise KanbanIntegrationError(
board_name="test_board",
operation="create_task"
)
except MarcusBaseError as e:
assert e.context.agent_id == "agent-001"
assert e.context.operation == "task_assignment"Tests operate against real external services with automatic cleanup:
@pytest.mark.integration
async def test_real_kanban_workflow(test_board):
"""Test complete workflow with real Kanban backend"""
# Test operates on real board, auto-cleaned after test
task_result = await create_task(test_board["id"], task_data)
progress_result = await update_progress(task_result["id"], 50)
assert progress_result["status"] == "in_progress"Built-in performance benchmarking with scaling analysis:
# tests/performance/benchmark_scaling.py
@dataclass
class BenchmarkResult:
scenario: str
total_connections: int
successful_requests: int
avg_response_time: float
p95_response_time: float
requests_per_second: float
memory_usage_mb: float
cpu_usage_percent: floatTDD support for unimplemented features guides development:
# tests/future_features/ai/core/test_ai_engine.py
class TestMarcusAIEngine:
"""Test the core AI engine with hybrid intelligence"""
async def test_hybrid_decision_making(self, ai_engine):
"""Test AI+rule hybrid decision framework"""
# This test drives implementation of hybrid AI system
decision = await ai_engine.make_hybrid_decision(context)
assert decision.confidence > 0.9
assert decision.reasoning is not Nonetests/
├── unit/ # 181 tests - 100% passing
│ ├── ai/ # AI component tests
│ ├── core/ # Core logic tests
│ ├── mcp/ # MCP protocol tests
│ └── visualization/ # UI component tests
├── integration/ # Real service tests
│ ├── e2e/ # End-to-end workflows
│ ├── api/ # API integrations
│ ├── external/ # 3rd party services
│ └── diagnostics/ # Connection/debug tests
├── performance/ # Benchmarks and load tests
│ ├── benchmarks/ # Speed benchmarks
│ └── load/ # Concurrent load tests
├── future_features/ # TDD for unimplemented
│ └── [mirrors src structure]
└── fixtures/ # Shared test data
├── factories.py # Object factories
└── __init__.py
Due to MCP protocol requirements, the framework uses sophisticated async handling:
# conftest.py
@pytest.fixture(scope="session")
def event_loop() -> asyncio.AbstractEventLoop:
"""Session-scoped event loop for MCP connections"""
loop = asyncio.get_event_loop_policy().new_event_loop()
yield loop
loop.close()
# Test implementation
@pytest.mark.anyio
@pytest.mark.parametrize("anyio_backend", ["asyncio"])
async def test_mcp_workflow():
"""Async test with proper MCP protocol handling"""
result = await mcp_client.call_tool("register_agent", params)
assert result is not NoneConsistent, realistic test data generation:
class TaskFactory:
"""Factory for creating Task objects for testing"""
@classmethod
def create(cls, **kwargs) -> Task:
"""Create task with auto-incrementing ID and realistic defaults"""
cls._counter += 1
defaults = {
'id': f"TASK-{cls._counter:04d}",
'name': f"Task {cls._counter}",
'status': TaskStatus.TODO,
'priority': Priority.MEDIUM,
'estimated_hours': 4.0,
'dependencies': [],
'labels': []
}
defaults.update(kwargs)
return Task(**defaults)Comprehensive mocking for external dependencies:
# tests/utils/base.py
class BaseTestCase:
def create_mock_kanban_client(self) -> AsyncMock:
"""Create properly configured mock kanban client"""
client = AsyncMock()
client.get_available_tasks = AsyncMock(return_value=[])
client.update_task_progress = AsyncMock()
client.get_board_summary = AsyncMock(return_value={
'totalCards': 0,
'doneCount': 0,
'inProgressCount': 0
})
return client# Requirements enforced by framework
- Minimum coverage: 80%
- Unit tests must run in < 100ms
- All external dependencies mocked in unit tests
- Integration tests use real services
- Future feature tests guide TDD implementationAutonomous Agent Specialization:
- Purpose-built for agent testing scenarios
- Rich context tracking for debugging agent interactions
- Real-world integration testing with automatic cleanup
Comprehensive Coverage:
- Four-layer test organization (unit/integration/performance/future)
- 181 unit tests with 100% pass rate
- Performance benchmarking with scaling analysis
Developer Experience:
- Intelligent test placement guidance
- Rich factory pattern for test data
- Comprehensive fixture ecosystem
Quality Assurance:
- 80% minimum coverage requirement
- Automatic async handling for MCP protocol
- Integration with Marcus Error Framework
Complexity:
- Four-layer organization can be overwhelming for simple projects
- Async test management requires understanding of event loops
- Factory pattern adds indirection for simple test cases
External Dependencies:
- Integration tests require running Kanban MCP server
- Performance tests need stable external services
- Real-world testing can be flaky due to network issues
Learning Curve:
- Test placement decision tree requires understanding
- MCP protocol testing has specific requirements
- Future feature testing paradigm is non-standard
Maintenance Overhead:
- Test board cleanup requires careful lifecycle management
- Factory patterns need maintenance as models evolve
- Async fixtures are complex to debug
Autonomous Agent Requirements: Traditional testing frameworks don't account for autonomous agents that make independent decisions and interact with external services. Marcus needed testing that could handle:
- Agent state tracking across operations
- Error context propagation through autonomous systems
- Real-world integration testing with external boards
MCP Protocol Complexity: The Model Context Protocol requires sophisticated async handling and session management that standard pytest async plugins couldn't handle reliably:
# Standard pytest-asyncio had introspection issues
# Solution: pytest-anyio with explicit backend selection
@pytest.mark.anyio
@pytest.mark.parametrize("anyio_backend", ["asyncio"])
async def test_mcp_operation():
# Reliable async test executionTDD for AI Systems: AI system development benefits enormously from writing tests first, as it forces clarification of expected behavior before implementation:
# tests/future_features/ai/core/test_ai_engine.py
async def test_hybrid_decision_making(self, ai_engine):
"""This test drives AI engine development"""
# Define expected AI behavior before implementation
decision = await ai_engine.make_hybrid_decision(context)
assert decision.confidence > 0.9Four-Layer Organization: The separation into unit/integration/performance/future reflects the realities of autonomous agent development:
- Unit: Fast feedback for individual components
- Integration: Real-world behavior validation
- Performance: Scaling requirements for multi-agent systems
- Future: TDD guidance for complex AI features
AI-Powered Test Generation:
# Future: AI generates tests from code changes
class AITestGenerator:
async def generate_tests_for_commit(self, commit_diff):
"""Generate comprehensive tests for code changes"""
analysis = await self.analyze_code_changes(commit_diff)
return await self.generate_test_suite(analysis)Distributed Testing:
# Future: Multi-node test execution
class DistributedTestRunner:
async def run_tests_across_agents(self, test_suite):
"""Distribute tests across multiple Marcus agents"""
results = await self.coordinate_test_execution(test_suite)
return self.aggregate_results(results)Real-Time Test Feedback:
# Future: Live test results during development
class LiveTestRunner:
async def watch_code_changes(self):
"""Run relevant tests on code changes"""
async for change in self.watch_filesystem():
relevant_tests = await self.find_affected_tests(change)
await self.run_tests_live(relevant_tests)Predictive Test Selection:
# Future: ML-powered test selection
class PredictiveTestSelector:
async def select_optimal_tests(self, change_context):
"""Use ML to select tests most likely to catch issues"""
risk_analysis = await self.analyze_change_risk(change_context)
return await self.select_high_value_tests(risk_analysis)Microservice Test Architecture: As Marcus scales, testing will evolve to support distributed microservice architectures with service-specific test suites and cross-service integration testing.
Continuous Testing Pipeline: Integration with CI/CD for continuous test execution, automated test generation, and intelligent test result analysis.
Performance Prediction: ML models to predict performance impacts from code changes and automatically trigger appropriate performance test suites.
For simple tasks like configuration updates or single-function implementations:
# Simple task testing approach
def test_config_update():
"""Test simple configuration change"""
config = load_config()
config.update_setting("max_agents", 10)
assert config.max_agents == 10Characteristics:
- Single unit test
- Minimal mocking
- Fast execution (< 50ms)
- Direct assertion
For complex tasks like multi-agent coordination or AI system integration:
# Complex task testing approach
@pytest.mark.integration
@pytest.mark.e2e
class TestMultiAgentCoordination:
async def test_coordinated_task_execution(self):
"""Test complex multi-agent task coordination"""
# Setup multiple agents
agents = [AgentFactory.create() for _ in range(3)]
# Create interdependent tasks
tasks = TaskFactory.create_dependency_chain(3)
# Test coordination
coordinator = TaskCoordinator()
results = await coordinator.execute_coordinated_tasks(agents, tasks)
# Verify coordination behavior
assert all(r.success for r in results)
assert results[0].completion_time < results[1].start_timeCharacteristics:
- Multiple test layers (unit + integration + e2e)
- Complex setup with factories
- Real service integration
- Multi-step verification
For AI-powered tasks requiring intelligence and decision-making:
# AI task testing approach
@pytest.mark.integration
@pytest.mark.ai
class TestAITaskAnalysis:
async def test_intelligent_task_breakdown(self, ai_engine):
"""Test AI-powered task analysis and breakdown"""
# Provide complex project description
project_description = load_test_prd("complex_ecommerce.md")
# Test AI analysis
analysis = await ai_engine.analyze_and_breakdown(project_description)
# Verify AI reasoning
assert analysis.confidence > 0.8
assert len(analysis.subtasks) >= 5
assert analysis.estimated_complexity == "high"
# Test dependency inference
dependencies = analysis.inferred_dependencies
assert any(d.relationship == "blocks" for d in dependencies)The testing framework has special handling for different Kanban board configurations:
# Board-specific test configuration
@pytest.fixture
def kanban_board_config():
"""Configuration for different board types"""
return {
"simple_board": {
"columns": ["To Do", "In Progress", "Done"],
"complexity": "low"
},
"advanced_board": {
"columns": ["Backlog", "Analysis", "Development", "Testing", "Review", "Done"],
"complexity": "high"
}
}
@pytest.mark.parametrize("board_type", ["simple_board", "advanced_board"])
async def test_board_specific_behavior(board_type, kanban_board_config):
"""Test behavior adapts to board configuration"""
config = kanban_board_config[board_type]
board = await create_test_board(config)
# Test that Marcus adapts to board structure
task_flow = await analyze_board_task_flow(board)
assert len(task_flow.stages) == len(config["columns"])Special tests for board quality and structure validation:
# tests/unit/detection/test_board_analyzer.py
class TestBoardAnalyzer:
def test_board_quality_assessment(self):
"""Test board quality analysis"""
analyzer = BoardAnalyzer()
# Test high-quality board
good_board = create_well_structured_board()
quality = analyzer.assess_board_quality(good_board)
assert quality.score > 0.8
assert quality.issues == []
# Test low-quality board
poor_board = create_poorly_structured_board()
quality = analyzer.assess_board_quality(poor_board)
assert quality.score < 0.4
assert len(quality.issues) > 0Currently, the Marcus Testing Framework doesn't have direct Cato integration, but it's designed to support it:
# Future Cato integration
class CatoTestIntegration:
"""Integration layer for Cato testing"""
async def test_cato_decision_quality(self, decision_context):
"""Test Cato's decision-making quality"""
cato = CatoEngine()
decision = await cato.make_decision(decision_context)
# Test decision quality metrics
assert decision.confidence > 0.8
assert decision.reasoning_steps >= 3
assert decision.considers_alternatives
async def test_cato_marcus_collaboration(self, marcus_context):
"""Test collaboration between Cato and Marcus"""
collaboration = CatoMarcusCollaboration()
result = await collaboration.coordinate_decision(marcus_context)
assert result.marcus_execution_plan is not None
assert result.cato_oversight_active is TrueMarcus Testing Framework
├── Core Testing (Current)
├── MCP Integration (Current)
├── Kanban Integration (Current)
└── Cato Integration (Planned)
├── Decision Quality Tests
├── Collaboration Tests
└── Override Scenario Tests
The Testing Framework integrates into the standard Marcus workflow at each phase:
# tests/integration/project_creation/test_create_project_workflow.py
@pytest.mark.integration
async def test_complete_project_creation():
"""Test project creation end-to-end"""
project_spec = {
"name": "E-commerce Platform",
"description": "Full-featured online store",
"complexity": "high"
}
result = await create_project(project_spec)
assert result.project_id is not None
assert len(result.initial_tasks) >= 10
assert result.board_created is TrueTesting Focus: Project initialization, task generation, board setup
# tests/integration/e2e/test_marcus_workflows.py
@pytest.mark.integration
async def test_agent_registration_workflow():
"""Test agent registration and capability matching"""
agent_spec = {
"agent_id": "agent-001",
"name": "Senior Developer",
"role": "Full Stack Developer",
"skills": ["python", "react", "postgresql"]
}
result = await register_agent(agent_spec)
assert result.registered is True
assert result.capability_score > 0.7
assert result.initial_assignment is not NoneTesting Focus: Agent registration, skill matching, initial assignment
async def test_intelligent_task_assignment():
"""Test AI-powered task assignment logic"""
agent = AgentFactory.create(skills=["python", "testing"])
available_tasks = TaskFactory.create_batch(5, priority="high")
assignment = await request_next_task(agent.worker_id)
assert assignment.task_id is not None
assert assignment.skill_match_score > 0.8
assert assignment.estimated_completion_time is not NoneTesting Focus: Task matching, AI assignment logic, dependency resolution
async def test_progress_reporting_workflow():
"""Test progress reporting and tracking"""
task = TaskFactory.create(status=TaskStatus.IN_PROGRESS)
progress_data = {
"task_id": task.id,
"progress": 75,
"status": "in_progress",
"message": "API endpoints implemented, working on frontend"
}
result = await report_progress(progress_data)
assert result.progress_recorded is True
assert result.next_milestone is not None
assert result.risk_assessment.level == "low"Testing Focus: Progress tracking, milestone detection, risk assessment
async def test_blocker_reporting_and_resolution():
"""Test blocker reporting and AI-powered resolution"""
blocker_data = {
"task_id": "TASK-001",
"agent_id": "agent-001",
"blocker_description": "Database connection failing in test environment",
"severity": "medium"
}
result = await report_blocker(blocker_data)
assert result.blocker_recorded is True
assert len(result.ai_suggestions) >= 3
assert result.escalation_needed is FalseTesting Focus: Blocker analysis, AI suggestions, escalation logic
async def test_task_completion_workflow():
"""Test task completion and knowledge capture"""
completion_data = {
"task_id": "TASK-001",
"agent_id": "agent-001",
"status": "completed",
"completion_notes": "Feature implemented with comprehensive tests",
"artifacts": ["src/api/users.py", "tests/test_users.py"]
}
result = await finish_task(completion_data)
assert result.task_completed is True
assert result.knowledge_captured is True
assert result.next_task_suggested is not NoneTesting Focus: Completion validation, knowledge capture, workflow continuation
@pytest.mark.integration
@pytest.mark.e2e
@pytest.mark.slow
async def test_complete_marcus_workflow():
"""Test complete workflow from project creation to task completion"""
# 1. Create project
project = await create_project(sample_project_spec)
# 2. Register agent
agent_result = await register_agent(sample_agent_spec)
# 3. Request first task
task_assignment = await request_next_task(agent_result.agent_id)
# 4. Report progress
await report_progress({
"task_id": task_assignment.task_id,
"progress": 50,
"status": "in_progress"
})
# 5. Complete task
completion = await finish_task({
"task_id": task_assignment.task_id,
"status": "completed"
})
# Verify complete workflow
assert project.created is True
assert agent_result.registered is True
assert task_assignment.assigned is True
assert completion.completed is True
# Verify system state
project_state = await get_project_status(project.project_id)
assert project_state.tasks_completed == 1
assert project_state.agents_active == 1This comprehensive end-to-end test validates the entire Marcus workflow, ensuring all systems work together correctly and that the testing framework provides visibility into each phase of the autonomous agent lifecycle.
The Marcus Testing Framework represents a sophisticated approach to testing autonomous agent systems, providing the structure, tools, and patterns necessary to ensure reliable operation in complex, multi-agent environments. Its deep integration with Marcus's core systems, combined with its support for TDD and real-world testing scenarios, makes it an essential component of the Marcus ecosystem.