Skip to content

Commit 7880183

Browse files
committed
proposed fix for gaia2 notification management
1 parent 27e1ecf commit 7880183

7 files changed

Lines changed: 322 additions & 73 deletions

File tree

CHANGELOG.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
### Fixed
11+
12+
- Fixed GAIA2 multi-turn notification loop: `wait_for_notification()` no longer terminates the agent prematurely, enabling correct behavior for `time` and `adaptability` scenarios that require the agent to wait for simulation events and resume (PR: #PR_NUMBER_PLACEHOLDER)
13+
- Added `Gaia2Environment.poll_notifications()` convenience method for custom agent implementations to drain the notification queue without needing ARE-internal imports (PR: #PR_NUMBER_PLACEHOLDER)
14+
1015
### Added
1116

1217
**Benchmarks**

docs/benchmark/gaia2.md

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -108,6 +108,45 @@ tasks = load_tasks(capability="time", limit=10)
108108
tasks = load_tasks(limit=50)
109109
```
110110

111+
## Multi-Turn Notification Loop
112+
113+
GAIA2 uses an **event-driven** multi-turn architecture, not user-turn interaction. Unlike Tau2 (where a user simulator drives multi-turn), GAIA2 scenarios have scheduled events (e.g., "calendar events added at t=240s", "friend replies at t=300s") that the agent must wait for and react to.
114+
115+
The benchmark invokes the agent **once**. The agent handles multi-turn internally via the notification loop:
116+
117+
1. Agent calls `SystemApp__wait_for_notification(timeout=N)` as a normal tool.
118+
2. The ARE environment processes scheduled events, advances simulation time, and queues resulting notifications — all synchronously during the tool call.
119+
3. The tool returns. The agent's loop continues (it does **not** terminate).
120+
4. Before the next LLM call, the agent polls `environment.poll_notifications()` to retrieve messages that arrived during the wait.
121+
5. The agent injects those messages into its context and continues reasoning.
122+
6. Eventually the agent calls `AgentUserInterface__send_message_to_user` — the **only** termination signal.
123+
124+
### What custom agents must implement
125+
126+
The ARE tools handle all environment-side mechanics automatically (event processing, time advancement, notification queuing). No callbacks or hooks required. Custom agents must handle two things:
127+
128+
**1. Do not terminate on `wait_for_notification`.** Treat it as a regular tool call. Only terminate on `AgentUserInterface__send_message_to_user`.
129+
130+
**2. Poll notifications between steps.** After `wait_for_notification` returns, new messages are in the queue. Call `environment.poll_notifications()` to drain them:
131+
132+
```python
133+
# Between agent steps (e.g., before each LLM call):
134+
user_msgs, env_notifs, has_stop = environment.poll_notifications()
135+
136+
# Inject into agent context (format matches ARE's convention):
137+
if user_msgs:
138+
content = "\n".join(user_msgs)
139+
messages.append({"role": "user", "content": f"User messages updates:\n***\n{content}\n***\n"})
140+
if env_notifs:
141+
content = "\n".join(env_notifs)
142+
messages.append({"role": "user", "content": f"Environment notifications updates:\n***\n{content}\n***\n"})
143+
if has_stop:
144+
# Environment signalled simulation end — stop the agent loop
145+
break
146+
```
147+
148+
See `DefaultGaia2Agent` source for the canonical single-loop implementation.
149+
111150
## Key Differences from Tau2
112151

113152
| Aspect | Gaia2 | Tau2 |

maseval/benchmark/gaia2/environment.py

Lines changed: 59 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@
55
Reference Paper: "GAIA-2: A Controllable Multi-Turn Conversational Benchmark for Agents"
66
"""
77

8-
from typing import Any, Dict, List, Optional
8+
from typing import Any, Dict, List, Optional, Tuple
99

1010
from maseval import Environment
1111

@@ -229,6 +229,64 @@ def get_notification_system(self) -> Any:
229229
return None
230230
return getattr(self._are_env, "notification_system", None)
231231

232+
def poll_notifications(self) -> Tuple[List[str], List[str], bool]:
233+
"""Poll pending notifications from the ARE notification system.
234+
235+
Drains all pending messages from the notification queue and returns
236+
them as pre-formatted strings. Call this between agent steps to
237+
receive messages that arrived during ``wait_for_notification()`` or
238+
from background simulation events.
239+
240+
GAIA2 uses an event-driven multi-turn architecture. When the agent
241+
calls ``SystemApp__wait_for_notification``, the ARE environment
242+
processes scheduled events, advances simulation time, and queues
243+
notifications. After the tool returns, call this method to retrieve
244+
those notifications and inject them into the agent's context before
245+
the next LLM call.
246+
247+
ARE agents/default_agent/steps/are_simulation.py:26-62
248+
249+
Returns:
250+
Tuple of ``(user_messages, env_notifications, has_stop_message)``.
251+
``user_messages`` and ``env_notifications`` contain pre-formatted
252+
strings ready to inject into agent context. ``has_stop_message``
253+
is True when the environment has signalled the simulation is over.
254+
"""
255+
notification_system = self.get_notification_system()
256+
if notification_system is None:
257+
return [], [], False
258+
259+
try:
260+
from datetime import datetime, timezone
261+
262+
from are.simulation.notification_system import MessageType # type: ignore[import-not-found]
263+
264+
timestamp = datetime.now(tz=timezone.utc)
265+
unhandled = notification_system.message_queue.get_by_timestamp(timestamp=timestamp)
266+
267+
if not unhandled:
268+
return [], [], False
269+
270+
# Separate by message type, matching ARE steps/are_simulation.py:34-61
271+
user_messages: List[str] = []
272+
env_notifications: List[str] = []
273+
has_stop = False
274+
275+
for notif in unhandled:
276+
msg_type = getattr(notif, "message_type", None)
277+
if msg_type == MessageType.USER_MESSAGE:
278+
user_messages.append(notif.message)
279+
elif msg_type == MessageType.ENVIRONMENT_NOTIFICATION:
280+
ts = notif.timestamp.strftime("%Y-%m-%d %H:%M:%S") if notif.timestamp else ""
281+
env_notifications.append(f"[{ts}] {notif.message}")
282+
elif msg_type == MessageType.ENVIRONMENT_STOP:
283+
has_stop = True
284+
285+
return user_messages, env_notifications, has_stop
286+
287+
except Exception:
288+
return [], [], False
289+
232290
def get_start_time(self) -> Optional[float]:
233291
"""Get the scenario start time.
234292

maseval/benchmark/gaia2/gaia2.py

Lines changed: 79 additions & 66 deletions
Original file line numberDiff line numberDiff line change
@@ -73,14 +73,49 @@ class Gaia2Benchmark(Benchmark):
7373
MASEval orchestration, tracing, and agent flexibility.
7474
7575
The ARE simulation runs internally; agents interact purely via tool calls.
76-
Time control happens through SystemApp.wait_for_notification().
76+
Time control happens through ``SystemApp__wait_for_notification``.
7777
7878
Subclasses must implement:
79-
- setup_agents(): Create agents for the task
80-
- get_model_adapter(): Provide model adapters
79+
80+
- ``setup_agents()`` — Create agents for the task
81+
- ``get_model_adapter()`` — Provide model adapters
82+
83+
Multi-Turn Notification Loop:
84+
GAIA2 uses an **event-driven** multi-turn architecture, not user-turn
85+
interaction. The benchmark invokes the agent **once**; the agent
86+
handles multi-turn internally via the notification loop described below.
87+
88+
**How it works:**
89+
90+
1. The agent calls ``SystemApp__wait_for_notification(timeout=N)`` as
91+
a normal tool. This triggers ``ARE.Environment.wait_for_next_notification()``
92+
synchronously: the simulation processes scheduled events, advances
93+
time, and queues any resulting notifications.
94+
2. The tool returns an observation (the agent's loop continues).
95+
3. **Before the next LLM call**, the agent must poll the notification
96+
queue to retrieve messages that arrived during the wait.
97+
4. The agent injects those messages into its context and continues
98+
reasoning.
99+
5. Eventually the agent calls ``AgentUserInterface__send_message_to_user``
100+
which is the **only** termination signal.
101+
102+
**What custom agents must do:**
103+
104+
- **Do NOT terminate** when ``wait_for_notification`` returns. Treat
105+
it as a regular tool call whose observation is added to context.
106+
- **Poll notifications** between steps by calling
107+
``environment.poll_notifications()``, which returns pre-formatted
108+
``(user_messages, env_notifications, has_stop_message)`` without
109+
requiring any ARE imports.
110+
- **Terminate only** on ``AgentUserInterface__send_message_to_user``.
111+
112+
See the default agent implementation for the canonical single-loop
113+
approach, and ``Gaia2Environment.poll_notifications()`` for the
114+
polling API.
81115
"""
82116

83-
# Single-turn by default (ARE handles time internally via tools)
117+
# The benchmark invokes the agent once; multi-turn is the agent's
118+
# responsibility (see "Multi-Turn Notification Loop" above).
84119
MAX_INVOCATIONS = 1
85120

86121
def __init__(
@@ -355,11 +390,14 @@ def _get_offset_from_time_config_mode(
355390
# Stop sequences for text-based action parsing
356391
_STOP_SEQUENCES = ["<end_action>", "Observation:"]
357392

358-
# Termination tool names - agent terminates when these are called
393+
# Termination tool names — agent terminates when these are called.
394+
# wait_for_notification is NOT a termination tool: it pauses the agent while
395+
# the ARE environment processes events and advances simulation time, then the
396+
# agent resumes by polling notifications. See Gaia2Benchmark docstring for
397+
# the full multi-turn contract.
359398
_TERMINATION_TOOLS = frozenset(
360399
{
361400
"AgentUserInterface__send_message_to_user",
362-
"SystemApp__wait_for_notification",
363401
}
364402
)
365403

@@ -659,7 +697,9 @@ class DefaultGaia2Agent:
659697
- Default max_iterations: 80 (ARE are_simulation_agent_config.py:36)
660698
- Invalid format retry: up to 10 times (ARE base_agent.py:347)
661699
- Iteration counter incremented EVERY loop (including errors) (ARE base_agent.py:849)
662-
- Terminates on send_message_to_user or wait_for_notification
700+
- Terminates on send_message_to_user only
701+
- wait_for_notification pauses: ARE processes events, then agent polls
702+
notifications and continues the loop
663703
- Max-iterations sends message to user via tool (ARE are_simulation.py:109-116)
664704
- Pre-step notification polling (ARE steps/are_simulation.py:26-62)
665705
"""
@@ -751,58 +791,35 @@ def run(self, query: str) -> str:
751791
def _pull_notifications(self) -> None:
752792
"""Pull messages from the ARE notification system.
753793
794+
Delegates to ``Gaia2Environment.poll_notifications()`` which drains
795+
the notification queue and returns pre-formatted strings.
796+
754797
Matches ARE's pre-step notification polling behavior.
755798
ARE agents/default_agent/steps/are_simulation.py:26-62
756799
"""
757800
if self.environment is None:
758801
return
759802

760-
notification_system = self.environment.get_notification_system()
761-
if notification_system is None:
762-
return
803+
user_messages, env_notifications, _ = self.environment.poll_notifications()
763804

764-
try:
765-
from datetime import datetime, timezone
766-
767-
timestamp = datetime.now(tz=timezone.utc)
768-
unhandled = notification_system.message_queue.get_by_timestamp(timestamp=timestamp)
769-
770-
if not unhandled:
771-
return
772-
773-
# Separate user messages from environment notifications
774-
# ARE steps/are_simulation.py:34-61
775-
user_messages = []
776-
env_notifications = []
777-
for notif in unhandled:
778-
msg_type = getattr(notif, "message_type", None)
779-
if msg_type is not None:
780-
# Import MessageType at call time to avoid import errors
781-
from are.simulation.notification_system import MessageType # type: ignore[import-not-found]
782-
783-
if msg_type == MessageType.USER_MESSAGE:
784-
user_messages.append(notif.message)
785-
elif msg_type == MessageType.ENVIRONMENT_NOTIFICATION:
786-
ts = notif.timestamp.strftime("%Y-%m-%d %H:%M:%S") if notif.timestamp else ""
787-
env_notifications.append(f"[{ts}] {notif.message}")
788-
789-
# Inject into message history matching ARE's format
790-
# ARE base_agent.py:107: "User messages updates:\n***\n{content}\n***\n"
791-
if user_messages:
792-
content = "\n".join(user_messages)
793-
self._messages.append({"role": "user", "content": f"User messages updates:\n***\n{content}\n***\n"})
794-
795-
# ARE base_agent.py:110-112: "Environment notifications updates:\n***\n{content}\n***\n"
796-
if env_notifications:
797-
content = "\n".join(env_notifications)
798-
self._messages.append({"role": "user", "content": f"Environment notifications updates:\n***\n{content}\n***\n"})
799-
except Exception:
800-
pass # Notification polling is best-effort
805+
# Inject into message history matching ARE's format
806+
# ARE base_agent.py:107: "User messages updates:\n***\n{content}\n***\n"
807+
if user_messages:
808+
content = "\n".join(user_messages)
809+
self._messages.append({"role": "user", "content": f"User messages updates:\n***\n{content}\n***\n"})
810+
811+
# ARE base_agent.py:110-112: "Environment notifications updates:\n***\n{content}\n***\n"
812+
if env_notifications:
813+
content = "\n".join(env_notifications)
814+
self._messages.append({"role": "user", "content": f"Environment notifications updates:\n***\n{content}\n***\n"})
801815

802816
def _check_environment_stop(self) -> bool:
803817
"""Check if the environment has sent a stop message.
804818
805-
Matches ARE's termination_condition_are_simulation check.
819+
Uses a lightweight peek via the ARE notification system's
820+
``has_environment_stop_message()`` when available, falling back to
821+
``poll_notifications()``.
822+
806823
ARE agents/default_agent/termination_methods/are_simulation.py:105-107
807824
808825
Returns:
@@ -811,14 +828,15 @@ def _check_environment_stop(self) -> bool:
811828
if self.environment is None:
812829
return False
813830

831+
# Prefer the non-draining peek when available
814832
notification_system = self.environment.get_notification_system()
815-
if notification_system is None:
816-
return False
833+
if notification_system is not None:
834+
try:
835+
return notification_system.message_queue.has_environment_stop_message()
836+
except Exception:
837+
pass
817838

818-
try:
819-
return notification_system.message_queue.has_environment_stop_message()
820-
except Exception:
821-
return False
839+
return False
822840

823841
def _pause_env(self) -> None:
824842
"""Pause the ARE environment before LLM generation.
@@ -919,23 +937,18 @@ def _react_loop(self) -> str:
919937
if self.verbose >= 1:
920938
print(f"[Iteration {self._iteration_count}] Tool: {tool_name}")
921939

922-
# Check for termination tools
940+
# Check for termination tool (send_message_to_user)
923941
# ARE agents/default_agent/termination_methods/are_simulation.py:76-118
924942
if tool_name in _TERMINATION_TOOLS:
925943
self._terminated = True
926-
927-
# Execute the termination tool
928944
observation = self._execute_tool(tool_name, tool_args)
945+
self._final_message = tool_args.get("content", str(observation)) if isinstance(tool_args, dict) else str(observation)
946+
return self._final_message
929947

930-
# For send_message_to_user, capture the message
931-
if tool_name == "AgentUserInterface__send_message_to_user":
932-
self._final_message = tool_args.get("content", str(observation)) if isinstance(tool_args, dict) else str(observation)
933-
return self._final_message
934-
935-
# For wait_for_notification, return the observation
936-
return str(observation)
937-
938-
# Execute tool
948+
# Execute tool (includes wait_for_notification, which pauses
949+
# the agent while ARE processes events and advances time;
950+
# the next iteration's _pull_notifications() picks up any
951+
# new messages that arrived during the wait)
939952
observation = self._execute_tool(tool_name, tool_args)
940953

941954
# Add observation in ARE's format

tests/test_benchmarks/test_gaia2/conftest.py

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -468,11 +468,17 @@ def gaia2_model_termination() -> DummyModelAdapter:
468468

469469
@pytest.fixture
470470
def gaia2_model_wait_notification() -> DummyModelAdapter:
471-
"""Create a model that waits for notification."""
471+
"""Create a model that waits for notification then terminates.
472+
473+
wait_for_notification is NOT a termination signal — the agent must
474+
continue its loop. This fixture provides two responses: the wait call
475+
followed by the real termination call (send_message_to_user).
476+
"""
472477
return DummyModelAdapter(
473478
model_id="test-wait-model",
474479
responses=[
475480
'Thought: I need to wait for a notification.\n\nAction:\n{"action": "SystemApp__wait_for_notification", "action_input": {"timeout_seconds": 30}}<end_action>',
481+
'Thought: Done waiting, reporting back.\n\nAction:\n{"action": "AgentUserInterface__send_message_to_user", "action_input": {"content": "Finished waiting for notification."}}<end_action>',
476482
],
477483
)
478484

0 commit comments

Comments
 (0)