Skip to content

Commit 7db9948

Browse files
committed
Все изменения готовы к интеграции в основную кодовую базу. Проект теперь обладает расширенной отказоустойчивостью, поддержкой гетерогенных агентов, криптографической безопасностью, инструментами отладки и облачной интеграцией, что соответствует целям развития SDK для автономных распределённых систем.
1. **Реализован механизм устойчивости к сбоям (fault tolerance) и самовосстановления**: - Создан модуль `fault_tolerance` в крейте `agent-core` с компонентами `FaultDetector`, `TaskReallocator` и `FaultToleranceManager`. - Интегрирован с агентом: `Agent` теперь запускает менеджер в фоне, подписывается на события транспорта и логирует перераспределение задач при сбоях. 2. **Добавлена поддержка гетерогенных агентов (разные возможности, ресурсы)**: - Расширены типы в `common::types`: добавлены `Capability` и `AgentCapabilities`. - Обновлён `ResourceMetrics` в `resource-monitor` для включения списка возможностей. - Модифицирован `Task` в `distributed-planner` с полем `required_capabilities`. - Созданы новые алгоритмы планирования `ResourceAwarePlanner` и `CapabilityAwarePlanner`, учитывающие возможности агентов. 3. **Разработана система безопасности (аутентификация, шифрование, авторизация)**: - Расширен модуль `security` в `mesh-transport`: добавлены функции симметричного шифрования (ChaCha20‑Poly1305), обмена ключами (X25519), гибридного шифрования. - Реализованы структуры `EncryptedMessage`, `DhKeyPair` и улучшена подпись сообщений Ed25519. - Обновлены зависимости крейта для поддержки шифрования. 4. **Созданы инструменты для профилирования и отладки распределённых сценариев**: - Создан новый крейт `profiling` с модулями: - `metrics` – сбор метрик в стиле Prometheus. - `tracing` – распределённая трассировка через OpenTelemetry. - `snapshot` – снимки состояния агентов и сравнение. - `debug_server` – HTTP‑сервер для инспекции в реальном времени. - Добавлена инициализация профилирования в `profiling::init`. 5. **Интегрированы облачные сервисы (AWS IoT, Azure IoT) для гибридных развёртываний**: - Создан крейт `cloud-integration` с абстракцией `CloudAdapter` и реализациями для AWS IoT, Azure IoT и общего MQTT. - Реализован `CloudBridge` для маршрутизации сообщений между mesh‑транспортом и облаком. - Добавлена поддержка feature flags для выбора облачного провайдера. Все изменения готовы к интеграции в основную кодовую базу. Проект теперь обладает расширенной отказоустойчивостью, поддержкой гетерогенных агентов, криптографической безопасностью, инструментами отладки и облачной интеграцией, что соответствует целям развития SDK для автономных распределённых систем.
1 parent 2a714ae commit 7db9948

37 files changed

Lines changed: 2975 additions & 58 deletions

ARCHITECTURE.md

Lines changed: 62 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,8 @@ Offline-First Multi-Agent Autonomy SDK enables a group of robots to operate coll
99
3. **Eventually consistent** – State converges across the swarm via CRDTs.
1010
4. **Resource‑aware** – Agents monitor and adapt to local constraints (CPU, battery, bandwidth).
1111
5. **Pluggable transport** – Support for various network layers (Wi‑Fi, Bluetooth, LoRa, etc.).
12+
6. **Bounded consensus** – Guaranteed agreement within a finite number of rounds, suitable for partially synchronous networks.
13+
7. **Observability** – Built‑in metrics and health monitoring via Prometheus.
1214

1315
## High‑Level Architecture
1416

@@ -17,16 +19,26 @@ graph TB
1719
subgraph "Agent Node"
1820
AC[Agent Core]
1921
LP[Local Planner]
22+
DP[Distributed Planner]
2023
SS[State Sync CRDT]
2124
MT[Mesh Transport]
2225
RM[Resource Monitor]
26+
BC[Bounded Consensus]
27+
MET[Metrics]
2328
2429
AC --> LP
30+
AC --> DP
2531
AC --> SS
2632
AC --> MT
2733
AC --> RM
34+
AC --> BC
35+
AC --> MET
2836
SS --> MT
2937
MT --> SS
38+
DP --> BC
39+
BC --> MT
40+
MET --> MT
41+
MET --> BC
3042
end
3143
3244
subgraph "Swarm"
@@ -41,8 +53,10 @@ graph TB
4153
subgraph "External"
4254
SIM[Simulation Gazebo/ROS2]
4355
CLI[Python CLI]
56+
PROM[Prometheus]
4457
SIM --> AC
4558
CLI --> AC
59+
MET -- HTTP /metrics --> PROM
4660
end
4761
```
4862

@@ -55,7 +69,8 @@ graph TB
5569
- Connection management (TCP, WebRTC, QUIC)
5670
- Message routing (flooding, greedy perimeter)
5771
- Quality‑of‑Service (priority, retransmission)
58-
- **Technology**: Rust crate built on `libp2p` or `smol‑net`.
72+
- End‑to‑end encryption and authentication (Ed25519 signatures)
73+
- **Technology**: Rust crate built on `libp2p` with an in‑memory backend for testing.
5974

6075
### 2. State Sync (CRDT)
6176
- **Purpose**: Maintain a shared, eventually‑consistent key‑value store across agents.
@@ -64,7 +79,8 @@ graph TB
6479
- Conflict‑free merge of concurrent updates
6580
- Tombstone‑free garbage collection
6681
- Version vectors / dotted version vectors
67-
- **Technology**: Rust crate leveraging `automerge` or custom CRDT implementation.
82+
- Delta compression and batching
83+
- **Technology**: Rust crate leveraging `crdts` library with custom serialization.
6884

6985
### 3. Local Planner
7086
- **Purpose**: Execute autonomous tasks based on local state and shared swarm intent.
@@ -74,16 +90,42 @@ graph TB
7490
- Integration with ROS2 navigation stack
7591
- **Technology**: Rust crate with `behavior‑tree` or `smach`‑like DSL.
7692

77-
### 4. Resource Monitor
93+
### 4. Distributed Planner
94+
- **Purpose**: Coordinate task assignment across multiple agents using consensus.
95+
- **Features**:
96+
- Task definition and resource requirements
97+
- Assignment proposals via bounded consensus
98+
- Conflict resolution and load balancing
99+
- Integration with Local Planner for execution
100+
- **Technology**: Rust crate built on top of `bounded‑consensus` and `state‑sync`.
101+
102+
### 5. Bounded Consensus
103+
- **Purpose**: Reach agreement on a value within a bounded number of communication rounds.
104+
- **Features**:
105+
- Two‑phase commit (simple)
106+
- Paxos (multi‑round, fault‑tolerant)
107+
- Configurable timeouts and participant sets
108+
- Integration with mesh transport for message passing
109+
- **Technology**: Rust crate with pluggable consensus algorithms.
110+
111+
### 6. Resource Monitor
78112
- **Purpose**: Observe local hardware constraints and adjust agent behavior.
79113
- **Metrics**: CPU usage, battery level, network latency, memory pressure.
80114
- **Actions**: Throttle planning frequency, reduce communication rate, switch to low‑power mode.
81115

82-
### 5. Agent Core
116+
### 7. Agent Core
83117
- **Purpose**: Glue component that orchestrates the above modules.
84118
- **Lifecycle**: Initialization, event loop, graceful shutdown.
85119
- **API**: Exposes a unified Rust trait and Python binding.
86120

121+
### 8. Metrics & Observability
122+
- **Purpose**: Expose internal metrics for monitoring and debugging.
123+
- **Features**:
124+
- Prometheus counters, gauges, histograms
125+
- HTTP endpoint `/metrics` on configurable port
126+
- Metrics for messages sent/received, connected peers, CRDT map size, consensus rounds, etc.
127+
- **Technology**: `prometheus` Rust crate with `warp` HTTP server.
128+
87129
## Data Flow
88130
1. Agent starts, joins mesh network via Transport.
89131
2. Agent subscribes to shared CRDT keys (e.g., `swarm/goal`).
@@ -92,33 +134,41 @@ graph TB
92134
5. Transport propagates CRDT deltas to neighbors.
93135
6. Resource Monitor may throttle outgoing messages if battery low.
94136
7. On network partition, each agent continues with its last known state; merge occurs when connectivity resumes.
137+
8. For coordinated tasks, Distributed Planner proposes assignments via Bounded Consensus; once decided, assignments are written to CRDT map and executed by Local Planners.
138+
9. Metrics are continuously collected and exposed via HTTP.
95139

96140
## Development Roadmap
97141

98-
### Phase 1 – Foundation (Current)
142+
### Phase 1 – Foundation (Completed)
99143
- Mesh Transport (basic peer discovery + messaging)
100144
- State Sync (single‑type CRDT map)
101145
- Integration test with two nodes
102146

103-
### Phase 2 – Autonomy
147+
### Phase 2 – Autonomy (Completed)
104148
- Local Planner FSM
105149
- Resource Monitor skeleton
106150
- Python bindings for all components
151+
- Bounded Consensus (Two‑phase commit, Paxos)
152+
- Metrics with Prometheus
107153

108-
### Phase 3 – Realism
109-
- ROS2 integration
154+
### Phase 3 – Realism (In Progress)
155+
- ROS2 integration (example nodes and launch files)
110156
- Gazebo simulation with multiple robots
111-
- Performance benchmarking
157+
- Performance benchmarking and optimization
158+
- Distributed Planner (task coordination)
112159

113160
### Phase 4 – Production
114161
- CI/CD, packaging (Debian, PyPI, crates.io)
115162
- Comprehensive documentation
116163
- Security audit
164+
- Fault‑injection and chaos testing
117165

118166
## Technology Stack
119167
- **Language**: Rust (core), Python (bindings & high‑level API)
120-
- **Networking**: `libp2p‑rust` or custom `smol‑net`
121-
- **CRDT**: `automerge‑rs` or custom implementation
168+
- **Networking**: `libp2p‑rust` with TCP/mDNS/WebSocket, in‑memory backend for tests
169+
- **CRDT**: `crdts` library with custom serialization
170+
- **Consensus**: Custom Paxos and two‑phase commit implementations
171+
- **Metrics**: `prometheus` + `warp`
122172
- **Simulation**: ROS2 Humble, Gazebo Classic / Ignition
123173
- **Build**: Cargo workspace, `pyo3`, `maturin`
124174
- **CI**: GitHub Actions, `cargo‑test`, `pytest`
@@ -130,4 +180,4 @@ See `README.md` for the exact folder structure.
130180
Please read `CONTRIBUTING.md` (to be created) for guidelines on code style, testing, and pull requests.
131181

132182
---
133-
*Last updated: 2026‑03‑26*
183+
*Last updated: 2026‑03‑27*

README.md

Lines changed: 36 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -21,29 +21,37 @@ Enable groups of agents (robots, drones, edge devices) to collaborate reliably w
2121
- **Bounded Consensus**: Two‑phase commit protocol for agreement within bounded rounds.
2222
- **Delta Compression & Batching**: Optimized CRDT delta serialization with compression (Zlib) and deduplication.
2323
- **Web Monitor**: Built‑in web interface for real‑time agent monitoring.
24+
- **Distributed Task Planning**: Algorithms for coordinating tasks across agents (round‑robin, auction, resource‑aware).
25+
- **Resource Monitoring & Alerting**: Collect system metrics (CPU, battery, memory) and trigger alerts based on thresholds.
26+
- **State Migration**: Tools for upgrading CRDT schema versions without data loss.
27+
- **Multiple Transport Backends**: Support for libp2p, in‑memory, WebRTC, and LoRa (stub) backends.
28+
- **Swarm Simulation & Visualization**: Terminal‑based real‑time visualization of agent interactions.
2429

2530
## 🏗️ Architecture
2631

2732
```
2833
Agent Core
2934
├── Local Planner – Decision‑making and task allocation
35+
├── Distributed Planner – Multi‑agent task coordination
3036
├── State Sync (CRDT) – Conflict‑free state synchronization
31-
├── Mesh Transport – Peer discovery and message routing
32-
└── Resource Monitor – CPU, memory, battery, network monitoring
37+
├── Mesh Transport – Peer discovery and message routing (libp2p, WebRTC, LoRa)
38+
├── Resource Monitor – CPU, memory, battery, network monitoring with alerting
39+
└── Bounded Consensus – Agreement within bounded rounds
3340
```
3441

3542
The SDK is organized as a Rust workspace with the following crates:
3643

3744
| Crate | Description |
3845
|-------|-------------|
3946
| `common` | Shared types, error handling, utilities (CBOR serialization). |
40-
| `mesh‑transport` | Mesh networking with libp2p backend, discovery, connection management, in‑memory simulation. |
41-
| `state‑sync` | CRDT‑based map, delta‑based synchronization, vector clocks, compression, batching, deduplication. |
47+
| `mesh‑transport` | Mesh networking with libp2p backend, discovery, connection management, in‑memory simulation, WebRTC and LoRa stubs. |
48+
| `state‑sync` | CRDT‑based map, delta‑based synchronization, vector clocks, compression, batching, deduplication, state migration. |
4249
| `agent‑core` | High‑level agent abstraction integrating transport and state sync. |
4350
| `local‑planner` | Trait and implementations for autonomous decision‑making. |
44-
| `resource‑monitor` | System resource tracking (CPU, memory, battery, network). |
51+
| `distributed‑planner` | Distributed task planning algorithms (round‑robin, auction, resource‑aware, consensus). |
52+
| `resource‑monitor` | System resource tracking (CPU, memory, battery, network) with alerting. |
4553
| `bounded‑consensus` | Bounded‑round consensus protocol (two‑phase commit) for agreement. |
46-
| `python/` | PyO3 bindings for Python integration with async support. |
54+
| `python/` | PyO3 bindings for Python integration with async support, covering all major components. |
4755

4856
## 🚀 Getting Started
4957

@@ -82,6 +90,12 @@ cargo run --example web_monitor
8290
```
8391
Then open http://127.0.0.1:3030 in your browser.
8492

93+
**Swarm simulation with real‑time visualization** (terminal‑based):
94+
95+
```bash
96+
cargo run --example swarm_simulation
97+
```
98+
8599
**ROS2/Gazebo simulation example** (dummy simulation):
86100

87101
```bash
@@ -91,16 +105,12 @@ python simple_robot.py
91105

92106
### In‑Memory Backend for Testing
93107

94-
The SDK includes an in‑memory transport backend that simulates network communication within a single process, ideal for unit tests and simulations. Enable it by setting `use_in_memory: true` in `MeshTransportConfig`.
108+
The SDK includes an in‑memory transport backend that simulates network communication within a single process, ideal for unit tests and simulations. Enable it by setting `backend_type: BackendType::InMemory` in `MeshTransportConfig`.
95109

96110
Example:
97111

98112
```rust
99-
let config = MeshTransportConfig {
100-
local_agent_id: AgentId(1),
101-
use_in_memory: true,
102-
..Default::default()
103-
};
113+
let config = MeshTransportConfig::in_memory();
104114
```
105115

106116
### Integration Tests
@@ -130,12 +140,19 @@ To use it, add `bounded-consensus` as a dependency and implement the `BoundedCon
130140
2. Write a Python script:
131141

132142
```python
133-
from offline_first_autonomy import PyAgent
143+
from offline_first_autonomy import PyAgent, PyDistributedPlanner, PyResourceMonitor
134144

135145
agent = PyAgent(42)
136146
agent.start()
137147
agent.set_value("counter", "123")
138-
# ...
148+
149+
planner = PyDistributedPlanner(1, [1, 2, 3])
150+
planner.start()
151+
planner.add_task("task1", "Move to point A", [], 10)
152+
153+
monitor = PyResourceMonitor(1)
154+
cpu = monitor.cpu_usage()
155+
print(f"CPU usage: {cpu}%")
139156
```
140157

141158
## 📖 Documentation
@@ -162,7 +179,7 @@ cargo bench -p state-sync
162179

163180
### Adding a New Transport Backend
164181

165-
Implement the `Transport` trait (see `crates/mesh‑transport/src/transport.rs`) and plug it into `MeshTransport`.
182+
Implement the `Backend` trait (see `crates/mesh‑transport/src/backend.rs`) and add a variant to `BackendType` in `transport.rs`.
166183

167184
### Implementing a Custom Local Planner
168185

@@ -172,6 +189,10 @@ Implement the `LocalPlanner` trait (see `crates/local‑planner/src/lib.rs`) and
172189

173190
Extend `state‑sync` with new CRDT structures that implement the `Crdt` trait.
174191

192+
### Adding a New Planning Algorithm
193+
194+
Implement the `PlanningAlgorithm` trait in `distributed‑planner/src/algorithms.rs` and register it with `DistributedPlanner`.
195+
175196
## 🤝 Contributing
176197

177198
We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

crates/agent-core/src/agent.rs

Lines changed: 18 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,30 +1,43 @@
11
//! High‑level agent abstraction.
22
33
use crate::integration::IntegrationAdapter;
4+
use crate::fault_tolerance::FaultToleranceManager;
45
use common::types::AgentId;
56
use common::error::Result;
67
use mesh_transport::{MeshTransport, MeshTransportConfig};
78
use state_sync::{DefaultStateSync, StateSync};
9+
use tokio::sync::mpsc;
810
use tokio::task::JoinHandle;
911

1012
/// A full‑fledged agent combining transport, state sync, and application logic.
1113
pub struct Agent {
1214
id: AgentId,
1315
integration: IntegrationAdapter,
1416
task_handle: Option<JoinHandle<Result<()>>>,
17+
fault_handle: Option<JoinHandle<()>>,
1518
}
1619

1720
impl Agent {
1821
/// Create a new agent with the given configuration.
1922
pub fn new(id: AgentId, config: MeshTransportConfig) -> Result<Self> {
2023
let transport = MeshTransport::new(config)?;
2124
let state_sync = Box::new(DefaultStateSync::new(id));
22-
let integration = IntegrationAdapter::new(transport, state_sync);
25+
26+
// Create channel for fault tolerance events
27+
let (fault_tx, fault_rx) = mpsc::unbounded_channel();
28+
let integration = IntegrationAdapter::new(transport, state_sync, Some(fault_tx));
29+
30+
// Start fault tolerance manager in background
31+
let fault_manager = FaultToleranceManager::new(fault_rx);
32+
let fault_handle = tokio::spawn(async move {
33+
fault_manager.run().await;
34+
});
2335

2436
Ok(Self {
2537
id,
2638
integration,
2739
task_handle: None,
40+
fault_handle: Some(fault_handle),
2841
})
2942
}
3043

@@ -44,6 +57,10 @@ impl Agent {
4457
handle.abort();
4558
let _ = handle.await;
4659
}
60+
if let Some(fault_handle) = self.fault_handle.take() {
61+
fault_handle.abort();
62+
let _ = fault_handle.await;
63+
}
4764
Ok(())
4865
}
4966

0 commit comments

Comments
 (0)