Commit f04876c

chaos engineering
1 parent bf23498 commit f04876c

13 files changed

Lines changed: 3025 additions & 0 deletions

chaos_engineering/Dockerfile

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
# Multi-stage Dockerfile for Chaos Engineering Test Suite

# Build stage
FROM python:3.11-slim AS builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    python3-dev \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements
COPY requirements.txt .

# Install Python dependencies
RUN pip install --user --no-cache-dir -r requirements.txt

# Runtime stage
FROM python:3.11-slim

# Install runtime dependencies
RUN apt-get update && apt-get install -y \
    procps \
    && rm -rf /var/lib/apt/lists/*

# Create non-root user
RUN useradd -m -u 1000 chaos && \
    mkdir -p /app/chaos_reports && \
    chown -R chaos:chaos /app

WORKDIR /app

# Copy Python dependencies from builder
COPY --from=builder /root/.local /home/chaos/.local

# Copy application code
COPY --chown=chaos:chaos . .

# Switch to non-root user
USER chaos

# Update PATH
ENV PATH=/home/chaos/.local/bin:$PATH

# Default configuration
ENV PYTHONUNBUFFERED=1
ENV LOG_LEVEL=INFO

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD python -c "import sys; sys.exit(0)"

# Entry point
ENTRYPOINT ["python", "main.py"]

# Default arguments
CMD ["--scenario", "cascade-failure"]
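
The health check in this Dockerfile only confirms that the Python interpreter starts, so it would still pass if the test runner had hung. Below is a minimal sketch of a more informative probe. It assumes, hypothetically, that `main.py` touches a heartbeat file on each loop iteration; neither the heartbeat file, the `HEARTBEAT_FILE` variable, nor the `heartbeat_is_fresh` helper exists in this commit.

```python
import os
import time


def heartbeat_is_fresh(path: str, max_age_seconds: float = 60.0) -> bool:
    """Return True if `path` exists and was modified within `max_age_seconds`."""
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        # A missing or unreadable heartbeat file counts as unhealthy.
        return False
    return age <= max_age_seconds


def main() -> int:
    """Return a process exit code: 0 for healthy, 1 for unhealthy."""
    # Hypothetical location; adjust to wherever the runner writes its heartbeat.
    heartbeat = os.environ.get("HEARTBEAT_FILE", "/app/chaos_reports/heartbeat")
    return 0 if heartbeat_is_fresh(heartbeat) else 1
```

Saved as a script that ends with `sys.exit(main())`, this could replace the inline `python -c` command in the `HEALTHCHECK` instruction.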

chaos_engineering/Readme.md

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
# AI Agent Chaos Engineering Test Suite

A production-ready chaos engineering framework for testing AI agent resilience, recovery mechanisms, and data integrity under catastrophic failures.

## Overview

This test suite simulates real-world failure scenarios to validate that your multi-agent AI system can:
- Detect failures within 1 second
- Recover failed agents within 2 seconds
- Maintain 100% data integrity during failures

## Features

- **Multiple Chaos Injection Types**:
  - Network failures (packet loss, latency)
  - Agent timeouts and crashes
  - Memory pressure and leaks
  - CPU throttling
  - Cascading failures

- **Comprehensive Monitoring**:
  - Real-time failure detection
  - Recovery time measurement
  - Data integrity validation
  - Performance metrics collection

- **Production-Ready**:
  - Async/await architecture
  - Configurable test scenarios
  - Detailed logging and tracing
  - Integration with observability platforms

## Architecture

```
chaos_engineering/
├── __init__.py
├── agent.py            # Agent implementation with health states
├── chaos_injector.py   # Failure injection mechanisms
├── monitor.py          # Health monitoring and detection
├── orchestrator.py     # Recovery orchestration
├── metrics.py          # Metrics collection and reporting
├── test_scenarios.py   # Predefined test scenarios
├── main.py             # Main test runner
└── config.yaml         # Configuration file
```

## Quick Start

1. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

2. Configure your test scenario in `config.yaml`

3. Run the chaos test:
   ```bash
   python main.py --scenario cascade-failure
   ```

## Test Scenarios

### 1. Cascade Failure
Tests system resilience when multiple agents fail simultaneously.

### 2. Network Partition
Simulates network splits and communication failures.

### 3. Resource Exhaustion
Tests behavior under memory and CPU pressure.

### 4. Slow Death
Simulates gradual degradation that leads to eventual failure.

## Success Criteria

| Metric | Threshold | Description |
|--------|-----------|-------------|
| Detection Time | < 1s | Time to detect agent failure |
| Recovery Time | < 2s | Time to restore failed agents |
| Data Integrity | 100% | No lost tasks during failure |
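
The thresholds in the table can be expressed as a small pass/fail check. The sketch below is illustrative only: `Thresholds` and `evaluate` are hypothetical names rather than part of this package, and the task counts in the usage example are made up. The timing values are taken from the sample output in the Monitoring section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Pass/fail limits taken from the Success Criteria table."""
    max_detection_s: float = 1.0
    max_recovery_s: float = 2.0
    min_integrity: float = 1.0  # 100% of submitted tasks must complete


def evaluate(detection_s: float, recovery_s: float,
             tasks_submitted: int, tasks_completed: int,
             limits: Thresholds = Thresholds()) -> dict[str, bool]:
    """Return a per-metric verdict (True means the metric passed)."""
    # With no submitted tasks there is nothing to lose, so integrity is 100%.
    integrity = tasks_completed / tasks_submitted if tasks_submitted else 1.0
    return {
        "detection_time": detection_s < limits.max_detection_s,
        "recovery_time": recovery_s < limits.max_recovery_s,
        "data_integrity": integrity >= limits.min_integrity,
    }


# Illustrative inputs: 0.892s detection, 1.2s recovery, 50 of 50 tasks completed.
verdict = evaluate(detection_s=0.892, recovery_s=1.2,
                   tasks_submitted=50, tasks_completed=50)
print(verdict)
# → {'detection_time': True, 'recovery_time': True, 'data_integrity': True}
```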

## Integration

### With Observability Platforms

```python
# Example: W&B Weave integration
import asyncio

import weave
from chaos_engineering import ChaosTest  # not exported by __init__.py in this commit

weave.init('chaos-test')

async def main():
    test = ChaosTest(trace_enabled=True)
    return await test.run_scenario('cascade-failure')

results = asyncio.run(main())
```

### With CI/CD

```yaml
# Example: GitHub Actions
- name: Run Chaos Tests
  run: |
    python main.py --scenario all --report junit
```

## Monitoring

The suite logs each chaos, detection, and recovery event in real time and prints a resilience report at the end of the run:

```
[19:52:41] CHAOS: Initiating cascade failure scenario
[19:52:41] CHAOS: Forcing TIMEOUT state for Agent E
[19:52:41] CHAOS: Forcing TIMEOUT state for Agent F
[19:52:41] MONITOR: Failure detected in 0.892s
[19:52:41] RECOVERY: Initiating recovery protocol
[19:52:42] RECOVERY: Recovery complete in 1.2s
[19:52:45] TEST: All tasks processed successfully

┌───────────────────────────────────┐
│         RESILIENCE REPORT         │
└───────────────────────────────────┘
• Failure Detection: 0.892s [PASSED]
• Recovery Time: 1.2s [PASSED]
• Data Integrity: 100% [PASSED]
```
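
A figure such as "Failure detected in 0.892s" is the gap between the moment chaos is injected and the moment the monitor notices. The following is a minimal sketch of that measurement using a polling monitor. `FakeAgent`, `detect_failure`, and `run_once` are hypothetical stand-ins; the real `HealthMonitor` API is not shown in this README and may work differently.

```python
import asyncio
import time


class FakeAgent:
    """Stand-in for a real agent; `healthy` flips to False when chaos hits."""

    def __init__(self) -> None:
        self.healthy = True


async def detect_failure(agent: FakeAgent, poll_interval_s: float) -> float:
    """Poll until the agent is unhealthy; return the monotonic detection time."""
    while agent.healthy:
        await asyncio.sleep(poll_interval_s)
    return time.monotonic()


async def run_once(poll_interval_s: float = 0.05) -> float:
    """Inject one failure and return the detection latency in seconds."""
    agent = FakeAgent()
    monitor = asyncio.create_task(detect_failure(agent, poll_interval_s))
    await asyncio.sleep(0.1)          # let the monitor poll a few times first
    injected_at = time.monotonic()    # the moment chaos is injected
    agent.healthy = False
    detected_at = await monitor
    return detected_at - injected_at


detection_s = asyncio.run(run_once())
print(f"Failure detected in {detection_s:.3f}s")
```

With polling, the worst-case detection latency is roughly one poll interval, so a sub-second detection target implies polling well under once per second.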

## Configuration

Edit `config.yaml` to customize test parameters:

```yaml
chaos:
  network:
    packet_loss: 0.4        # 40% packet loss
    latency_ms: 500         # Additional latency
  agents:
    failure_count: 2        # Number of agents to fail
    failure_type: timeout
  duration_seconds: 60
```
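
Validating these values before a run catches mistakes such as a packet loss fraction above 1. The sketch below assumes that `network` and `agents` nest under `chaos` and that `duration_seconds` sits directly under `chaos`. `ChaosConfig` and `parse_chaos_config` are hypothetical names, not part of this package, and the input mapping stands in for what `yaml.safe_load` would return for the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChaosConfig:
    """Typed view of the `chaos` section (hypothetical helper)."""
    packet_loss: float
    latency_ms: int
    failure_count: int
    failure_type: str
    duration_seconds: int


def parse_chaos_config(raw: dict) -> ChaosConfig:
    """Validate the `chaos` section of an already-parsed config mapping."""
    chaos = raw["chaos"]
    network, agents = chaos["network"], chaos["agents"]
    cfg = ChaosConfig(
        packet_loss=float(network["packet_loss"]),
        latency_ms=int(network["latency_ms"]),
        failure_count=int(agents["failure_count"]),
        failure_type=str(agents["failure_type"]),
        duration_seconds=int(chaos["duration_seconds"]),
    )
    if not 0.0 <= cfg.packet_loss <= 1.0:
        raise ValueError("packet_loss must be a fraction between 0 and 1")
    if cfg.latency_ms < 0 or cfg.failure_count < 0:
        raise ValueError("latency_ms and failure_count must be non-negative")
    if cfg.duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return cfg


# Mirrors the YAML example; in practice this mapping would come from a YAML parser.
raw = {"chaos": {"network": {"packet_loss": 0.4, "latency_ms": 500},
                 "agents": {"failure_count": 2, "failure_type": "timeout"},
                 "duration_seconds": 60}}
config = parse_chaos_config(raw)
print(config)
```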

## Development

To add new chaos scenarios:

1. Create a new scenario in `test_scenarios.py`
2. Implement the injection logic in `chaos_injector.py`
3. Add monitoring rules in `monitor.py`
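
Step 1 might look like the sketch below. The names `TestScenario`, `SCENARIOS`, and `get_scenario` are exported by the package, but their real definitions are not shown in this README, so the fields, the `register` helper, the chaos type strings, and the `slow-death` scenario name used here are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TestScenario:
    """Hypothetical shape; the real class in test_scenarios.py may differ."""
    name: str
    description: str
    chaos_types: tuple[str, ...] = ()


# Registry keyed by the name passed to `--scenario` (assumed convention).
SCENARIOS: dict[str, TestScenario] = {}


def register(scenario: TestScenario) -> TestScenario:
    """Add a scenario to the registry, rejecting duplicate names."""
    if scenario.name in SCENARIOS:
        raise ValueError(f"Scenario {scenario.name!r} is already registered")
    SCENARIOS[scenario.name] = scenario
    return scenario


def get_scenario(name: str) -> TestScenario:
    """Look up a scenario by name with a helpful error for unknown names."""
    try:
        return SCENARIOS[name]
    except KeyError:
        known = ", ".join(sorted(SCENARIOS)) or "none"
        raise KeyError(f"Unknown scenario {name!r}; known: {known}") from None


register(TestScenario(
    name="slow-death",
    description="Gradual degradation leading to eventual failure",
    chaos_types=("memory_leak", "cpu_throttle"),  # illustrative labels only
))
print(get_scenario("slow-death").description)
```

Keeping scenarios in a name-keyed registry lets the runner resolve a `--scenario` argument without importing each scenario explicitly.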

## License

MIT License

chaos_engineering/__init__.py

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
"""
AI Agent Chaos Engineering Test Suite

A production-ready framework for testing AI agent resilience through chaos engineering.
"""

__version__ = "1.0.0"
__author__ = "AI Resilience Team"

from .agent import Agent, AgentPool, AgentState, Task
from .chaos_injector import ChaosInjector, ChaosType, ChaosEvent
from .monitor import HealthMonitor, HealthStatus, HealthMetrics, FailureEvent
from .orchestrator import RecoveryOrchestrator, RecoveryStrategy, RecoveryEvent
from .metrics import MetricsCollector, TestResult, TestReporter
from .test_scenarios import TestScenario, get_scenario, SCENARIOS

__all__ = [
    # Agent components
    "Agent",
    "AgentPool",
    "AgentState",
    "Task",

    # Chaos injection
    "ChaosInjector",
    "ChaosType",
    "ChaosEvent",

    # Monitoring
    "HealthMonitor",
    "HealthStatus",
    "HealthMetrics",
    "FailureEvent",

    # Recovery
    "RecoveryOrchestrator",
    "RecoveryStrategy",
    "RecoveryEvent",

    # Metrics and reporting
    "MetricsCollector",
    "TestResult",
    "TestReporter",

    # Test scenarios
    "TestScenario",
    "get_scenario",
    "SCENARIOS",
]
