Thread/async safety: unprotected global mutable state causes race conditions in multi-agent deployments #1167

@MervinPraison


Summary

Multiple critical concurrency bugs violate the "multi-agent + async safe by default" principle. These can cause RuntimeError, lost data, and crashes in production multi-agent deployments.

Specific Issues

1. Unprotected global dicts in agents/agents.py (lines 33-35)

# NO lock protection — contrast with agent.py which HAS _server_lock
_agents_server_started = {}
_agents_registered_endpoints = {}
_agents_shared_apps = {}

Multiple agents starting API servers concurrently can race on these dicts. agent/agent.py correctly uses _server_lock = threading.Lock() for the same pattern — agents.py does not.

2. Race condition in global ToolRegistry singleton (tools/registry.py, lines 256-261)

_global_registry: Optional[ToolRegistry] = None

def get_registry() -> ToolRegistry:
    global _global_registry
    if _global_registry is None:        # Thread A reads None
        _global_registry = ToolRegistry()  # Thread B also reads None → two registries created
    return _global_registry

No lock around initialization. Two threads can create separate registries, causing tools registered in one to be invisible in the other.

Fix: Add double-checked locking:

import threading

_registry_lock = threading.Lock()

def get_registry() -> ToolRegistry:
    global _global_registry
    if _global_registry is None:            # fast path: no lock once initialized
        with _registry_lock:
            if _global_registry is None:    # re-check: another thread may have won the race
                _global_registry = ToolRegistry()
    return _global_registry
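The fix can be checked with a small contention test. The sketch below uses a stand-in `ToolRegistry` class (only object identity matters here), and the helper `distinct_registries` is hypothetical:

```python
import threading
from typing import Optional


class ToolRegistry:
    """Stand-in for the real class; only identity matters in this check."""


_global_registry: Optional[ToolRegistry] = None
_registry_lock = threading.Lock()


def get_registry() -> ToolRegistry:
    global _global_registry
    if _global_registry is None:
        with _registry_lock:
            if _global_registry is None:
                _global_registry = ToolRegistry()
    return _global_registry


def distinct_registries(n_threads: int = 16) -> int:
    """Call get_registry() from n_threads at once; count distinct instances."""
    barrier = threading.Barrier(n_threads)
    seen = []

    def worker():
        barrier.wait()  # release all threads together to maximise contention
        seen.append(get_registry())

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len({id(r) for r in seen})
```

With the lock in place, `distinct_registries()` returns 1. Against the unlocked version the same test may still pass on most runs, because the race window is narrow. A passing run therefore does not prove the unlocked code is safe.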

3. asyncio.run() inside potentially-async context (agent/agent.py, line 5067)

if hasattr(backend, 'request_approval_sync'):
    decision = backend.request_approval_sync(request)
else:
    decision = asyncio.run(backend.request_approval(request))  # 💥 RuntimeError if event loop running

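When _check_tool_approval_sync() is called during achat() or other async execution, an event loop is already running in that thread, so asyncio.run() raises RuntimeError: asyncio.run() cannot be called from a running event loop. Scheduling the coroutine with asyncio.get_running_loop().create_task() is not enough on its own, because a synchronous function cannot wait for the task's result without blocking the loop that has to run it. The code should detect a running loop and either run the coroutine on a separate loop in a worker thread, or provide an async approval path that async callers await directly.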
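One possible way to bridge the sync/async boundary is sketched below. The helper `run_coro_blocking` is hypothetical. It assumes the approval coroutine does not depend on objects bound to the caller's event loop, since it may run on a fresh loop in another thread:

```python
import asyncio
import concurrent.futures


def run_coro_blocking(coro_factory):
    """Run the coroutine returned by coro_factory() to completion from sync code.

    With no running loop in this thread, asyncio.run() is safe. Otherwise the
    coroutine runs on a fresh loop in a worker thread and we block on its result.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is allowed.
        return asyncio.run(coro_factory())

    # A loop is already running here: use a separate thread with its own loop.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(lambda: asyncio.run(coro_factory())).result()
```

In the snippet above, the call site would become something like `decision = run_coro_blocking(lambda: backend.request_approval(request))`. The calling event loop is blocked while the worker thread runs, so this is a stopgap. Letting async callers await `request_approval` directly avoids the blocking entirely.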

4. Unprotected _pending_approvals dict (agent/agent.py, lines 1630, 8334-8384)

self._pending_approvals = {}  # No lock

# Write (async method):
self._pending_approvals[tracking_id] = {...}

# Read+Delete (concurrent method):
for tid, info in self._pending_approvals.items():  # RuntimeError: dict changed size
    del self._pending_approvals[tid]

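As written, deleting entries while iterating over .items() raises RuntimeError: dictionary changed size during iteration, even without concurrency. Concurrent tasks (when the loop body awaits) or other threads that add entries mid-iteration trigger the same error. Iterating over a snapshot, such as list(self._pending_approvals.items()), while holding a lock avoids both cases.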
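A sketch of a guarded container for the pending approvals follows. The class `PendingApprovals` and its method names are hypothetical. In the actual code the same pattern could be applied directly to `self._pending_approvals` with a lock stored on the agent:

```python
import threading


class PendingApprovals:
    """Hypothetical lock-protected wrapper around a pending-approvals dict."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = {}

    def add(self, tracking_id, info):
        with self._lock:
            self._items[tracking_id] = info

    def pop_matching(self, predicate):
        """Remove and return every entry whose info satisfies predicate."""
        with self._lock:
            # Collect matches into a separate dict first, then delete, so the
            # dict being iterated is never the one being mutated.
            matched = {
                tid: info
                for tid, info in self._items.items()
                if predicate(info)
            }
            for tid in matched:
                del self._items[tid]
        return matched
```

The lock is held only for short, non-awaiting sections, so it is also usable from async code. The predicate must not call back into the same `PendingApprovals` instance, because `threading.Lock` is not reentrant.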

Impact

  • Production crashes in multi-agent async deployments
  • Silent data loss (tools registered to wrong registry instance)
  • Intermittent failures that are hard to reproduce and debug

Expected Behavior

Per the stated principle: "Multi-agent + async safe by default" — all shared mutable state should be lock-protected, and async/sync boundaries should be handled correctly.


Labels

bug, claude
