This document explains the core architecture of the ECS framework for LLM agents. It covers the motivation behind using an Entity Component System and details how the different parts of the system interact.
Traditional agent frameworks rely on complex class hierarchies like `Agent`, `ToolAgent`, and `MemoryAgent`. This approach often creates rigid code where adding new features requires changing the base class or managing multiple inheritance.
The Entity Component System (ECS) paradigm shifts the focus from what an agent is to what an agent has. In this architecture, an agent is simply a unique ID. Its capabilities come from the data attached to it. If you want an agent to use tools, you add a tool component. You can change these capabilities at runtime by adding or removing components. This flexibility makes it easier to test individual systems and run logic in parallel.
Benefits include:
- Composition over Inheritance: Build complex agents by combining simple data components.
- Runtime Capability Changes: Dynamically add or remove features like memory or tool access while the agent is running.
- Isolated Testability: Test business logic in systems independently from agent state.
- Parallel Execution: Run independent systems simultaneously to improve performance.
The World acts as the central container that coordinates all other parts of the system.
```mermaid
graph TD
    World --> EntityIdGenerator
    World --> ComponentStore
    World --> SystemExecutor
    World --> EventBus
    World --> Query
    EntityIdGenerator -->|Produces| EntityId
    ComponentStore -->|Stores| Component[Data]
    SystemExecutor -->|Runs| System[Logic]
    Query -->|Filters| EntityId
```
In this model:
- Entities are unique identifiers used to group data.
- Components are pure data structures with no logic.
- Systems contain the logic and operate on entities that have specific component sets.
The system operates in discrete steps called ticks. Each tick represents a full pass through the active logic.
```mermaid
sequenceDiagram
    participant R as Runner
    participant W as World
    participant SE as SystemExecutor
    participant S as Systems
    participant Q as Query
    participant CS as ComponentStore
    R->>W: process()
    W->>SE: execute_systems()
    loop By Priority
        SE->>S: run in TaskGroup
        S->>Q: get entities
        Q->>CS: check component sets
        CS-->>Q: return components
        Q-->>S: filtered entities
        S->>CS: update data
    end
    W-->>R: check TerminalComponent
```
Typical tick cycle steps:
- The Runner initiates the process by calling `world.process()`.
- The `SystemExecutor` organizes registered systems by their priority values.
- Systems with the same priority run at the same time using `asyncio.TaskGroup`.
- Systems use the `Query` tool to find entities that match specific component requirements.
- After processing logic, systems update data in the `ComponentStore` or trigger events through the `EventBus`.
- The Runner checks for a `TerminalComponent` at the end of every tick to decide whether to stop the execution loop.
Interactive runtimes can use the opt-in `TerminalCleanupSystem` to clear specific terminal reasons after reasoning and before the next tick proceeds. This helper does not change core Runner semantics; it is a scoped system-level policy that is typically registered at `priority=1` and defaults to clearing only `reasoning_complete`.
Systems run based on their priority. Lower numbers execute first. We recommend the following priority ordering for agent workflows:
- Priority -10: `RAGSystem`, `UserInputSystem`. Pre-processing: vector search retrieval and user input before reasoning.
- Priority -5: `ToolApprovalSystem`. Filters tool calls before execution.
- Priority 0: `ReasoningSystem` or `PlanningSystem`. These systems usually handle LLM calls to decide the next action.
- Priority 1: `TerminalCleanupSystem`. Optional post-reasoning cleanup for interactive flows that must continue after `reasoning_complete`.
- Priority 5: `ToolExecutionSystem` or `MessageBusSystem`. These handle the actual work or interactions with other agents.
- Priority 7: `ReplanningSystem`. This checks results and updates the plan if needed.
- Priority 10: `MemorySystem`. This persists important information to long-term storage.
- Priority 15+: `CheckpointSystem`. Creates state snapshots.
- Priority 20+: `CompactionSystem`. Compresses conversation when needed.
- Priority 99: `ErrorHandlingSystem`. This runs last to catch and process any issues that occurred during the tick.
The EventBus provides a typed pub/sub mechanism for asynchronous communication between systems. Subscribers register for specific event type classes. When an event is published, the bus triggers all handlers associated with that exact type.
Common event types include:
- `ConversationTruncatedEvent`: Triggered when context limits are reached.
- `ErrorOccurredEvent`: Published when a system encounters an exception.
- `MessageBusPublishedEvent`, `MessageBusDeliveredEvent`, `MessageBusResponseEvent`, `MessageBusTimeoutEvent`: Message bus lifecycle events.
- `PlanStepCompletedEvent`: Published after a plan step finishes successfully.
- `PlanRevisedEvent`: Triggered when an agent changes its strategy.
- `ToolApprovalRequestedEvent`, `ToolApprovedEvent`, `ToolDeniedEvent`: Tool approval workflow.
- `MCTSNodeScoredEvent`: Tree search scoring.
- `StreamStartEvent`, `StreamReasoningDeltaEvent`, `StreamReasoningEndEvent`, `StreamContentStartEvent`, `StreamContentDeltaEvent`, `StreamEndEvent`: System-level streaming phases.
- `CheckpointCreatedEvent`, `CheckpointRestoredEvent`: Checkpoint lifecycle.
- `CompactionCompleteEvent`: Conversation compaction.
- `RAGRetrievalCompletedEvent`: RAG retrieval.
- `UserInputRequestedEvent`: User input requests.
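A typed pub/sub bus that dispatches on the exact event class can be sketched as below. This is a minimal model, not the framework's `EventBus` implementation; only the `ErrorOccurredEvent` name comes from the list above.

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass

@dataclass(slots=True)
class ErrorOccurredEvent:
    message: str

class EventBus:
    # Handlers are keyed by the exact event class; publishing an event
    # awaits every handler registered for type(event) and nothing else.
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type: type, handler) -> None:
        self._handlers[event_type].append(handler)

    async def publish(self, event) -> None:
        for handler in self._handlers[type(event)]:
            await handler(event)

seen: list[str] = []

async def main() -> None:
    bus = EventBus()

    async def on_error(event: ErrorOccurredEvent) -> None:
        seen.append(event.message)

    bus.subscribe(ErrorOccurredEvent, on_error)
    await bus.publish(ErrorOccurredEvent("tool call failed"))

asyncio.run(main())
```

Note that dispatching on `type(event)` means subscribers to a base event class will not receive subclass instances; each concrete event type needs its own subscription.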
Several key decisions guide the implementation of this framework:
- Performance: Components use `@dataclass(slots=True)` to keep memory usage low and access fast.
- Asynchrony: Every system implements the `async def process(self, world: World) -> None` protocol to ensure non-blocking execution.
- Storage: `ComponentStore` stores data in a `dict[type[Any], dict[EntityId, Any]]` structure. This provides O(1) lookup speed for any component given its type and entity ID.
- Filtering: `Query` operations find entities by intersecting sets of component keys. This ensures efficient filtering even as the number of entities grows.
- Termination: If an LLM provider runs out of responses, the system adds a `TerminalComponent` with a `provider_exhausted` reason.
- Error Pattern: Systems catch exceptions and add an `ErrorComponent`. The `ErrorHandlingSystem` (priority 99) logs the issue, publishes an `ErrorOccurredEvent`, and removes the component to prevent infinite error loops.