maseval
diff --git a/‎ARECRITICALASSESSMENT.md‎
Lines changed: 143 additions & 0 deletions b/‎ARECRITICALASSESSMENT.md‎
Lines changed: 143 additions & 0 deletions
diff --git a/‎AREISSUES.md‎
Lines changed: 109 additions & 0 deletions b/‎AREISSUES.md‎
Lines changed: 109 additions & 0 deletions
diff --git a/‎CHANGELOG.md‎
Lines changed: 6 additions & 0 deletions b/‎CHANGELOG.md‎
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,143 @@
+## Critical Assessment: ARE Integration in MASEval
+
+I've read every line of the ARE integration (`are.py`, `are_tool_wrapper.py`), the base `Environment` class, the primary consumer (`Gaia2Environment`), the underlying ARE library (`AppTool`, `AppToolAdapter`, `Environment`, `EnvironmentConfig`), and all tests. Here's my honest assessment.
+
+---
+
+### What works well
+
+**1. The tool wrapper is genuinely useful.** `AREToolWrapper` adds real value: simulation-time-aware tracing, invocation history, JSON schema extraction, and a clean `__call__` interface. Without it, every developer would need to write their own tracing and metadata extraction. The 3-layer design (ARE tool → maseval wrapper → framework-specific) is correct and well-documented.
+
+**2. Notification polling is well-designed.** The two variants (`poll_notifications` for consuming, `get_turn_notifications` for peeking/re-queuing) solve a real problem. Notification handling is one of the trickiest parts of ARE, and having pre-built methods that categorize by `MessageType` saves developers from re-implementing this.
+
+**3. Tracing and config gathering are well-integrated.** `gather_traces()` and `gather_config()` compose cleanly with the base class and feed into maseval's tracing infrastructure. Per-tool invocation history with sim-time metadata is exactly what you'd need for benchmark analysis.
+
+**4. Test coverage is excellent.** Three test files covering unit tests, mock-based tests, and real integration tests against the ARE stack. The integration tests are particularly valuable — they exercise the full lifecycle with real apps.
+
+---
+
+### Problems and concerns
+
+**1. `Gaia2Environment` completely bypasses `AREEnvironment.setup_state()` — raising the question of whether `AREEnvironment` is actually reusable.**
+
+This is the most important finding. `Gaia2Environment.setup_state()` at `maseval/benchmark/gaia2/environment.py:62` does NOT call `super().setup_state()`. It reimplements everything: creating `EnvironmentConfig`, creating the notification system, calling `self._are_env = AREEnv(...)`, calling `self._are_env.run(...)`, setting `self._scenario`. 
+
+The parent class's `setup_state()` was designed to be the reusable path, but the only real consumer skips it entirely because it needs `preprocess_scenario()` (which handles oracle, judge creation, turns, etc.) — things the parent can't anticipate. This means:
+
+- The "shorthand path" (`apps` + `events`) has **zero real-world usage** — it's only tested in tests.
+- The generic `_run_oracle_mode()` in `AREEnvironment` is also unused by Gaia2 (which uses ARE's own `preprocess_scenario()` oracle run).
+- The `_create_notification_system()` helper is bypassed (Gaia2 creates `VerboseNotificationSystem()` directly).
+
+**Implication:** `AREEnvironment` is essentially a bag of utilities (`create_tools()`, `poll_notifications()`, lifecycle methods, accessors) rather than a coherent base class with a usable setup path. The `setup_state()` code is tested but not battle-tested through actual benchmark use.
+
+**2. Invented defaults in the shorthand path violate scientific integrity guidelines.**
+
+At `maseval/interface/environments/are.py:170-173`:
+```python
+duration = environment_data.get("duration", 1800)
+seed = environment_data.get("seed", 0)
+start_time = environment_data.get("start_time", 0)
+time_increment = environment_data.get("time_increment_in_seconds", 1)
+```
+
+Per AGENTS.md's scientific integrity section: *"If a researcher would need to report a parameter in a paper's Experimental Setup section, do not invent a default for it."* Duration and seed are experimental parameters — a researcher would absolutely need to report them. The `1800` default was copied from ARE's `MAX_SCENARIO_DURATION`, but `seed=0` is invented. This is exactly the pattern AGENTS.md warns against.
+
+**3. Excessive use of `getattr` with fallbacks masks errors.**
+
+At `maseval/interface/environments/are.py:134`: `getattr(scenario, "time_increment_in_seconds", 1)` — if the scenario doesn't have this attribute, silently using `1` could lead to incorrect simulation behavior. Same at line 136: `getattr(scenario, "start_time", None)`. Per AGENTS.md: *"Pass through directly, let errors surface."*
+
+This is a pattern throughout: `getattr(scenario, "scenario_id", None)`, `getattr(scenario, "seed", None)`, etc. at lines 147-151. If `scenario` doesn't have these attributes, something is wrong and should fail loudly.
+
+**4. The `filter_aui_tools` flag mutates external state as a side effect.**
+
+At `maseval/interface/environments/are.py:254-255`:
+```python
+if self._filter_aui_tools and hasattr(app, "wait_for_user_response"):
+    app.wait_for_user_response = False
+```
+
+This mutates the ARE app's internal state as a side effect of creating tools. This is surprising — calling `create_tools()` shouldn't modify app behavior. A developer reusing the same app instance elsewhere would get unexpected behavior.
+
+**5. The `AREToolWrapper` duplicates what `AppToolAdapter` already provides.**
+
+Looking at the ARE source, `AppToolAdapter` already:
+- Extracts `name`, `description`, `inputs`, `output_type`
+- Has a `forward()` method that delegates to `AppTool.__call__`
+- Converts types to HuggingFace format
+
+`AREToolWrapper` wraps an `AppTool`, then immediately creates an `AppToolAdapter` just to read metadata from it, then never uses it again. The schema extraction in `_extract_schema()` at `maseval/interface/environments/are_tool_wrapper.py:78-106` duplicates what `AppTool.to_open_ai()` already does natively.
+
+Maseval adds tracing on top, which is good — but the metadata extraction layer is redundant. You could simplify by composing with `AppToolAdapter` rather than duplicating.
+
+**6. `Any` types everywhere erode the type safety promise.**
+
+`_are_env: Any`, `_scenario: Any`, `get_are_environment() -> Any`, `get_scenario() -> Any`, `get_notification_system() -> Any`. The AGENTS.md philosophy says types should "provide better IDE autocomplete and error detection." With `Any` returns, developers get zero autocomplete on the most important objects. 
+
+If the concern is that ARE is an optional dependency, you can use `TYPE_CHECKING` imports (which `are_tool_wrapper.py` already does!) to provide proper types that only resolve during type-checking.
+
+**7. Silent no-ops are dangerous for simulation code.**
+
+Every lifecycle method silently does nothing if `_are_env is None`:
+```python
+def start(self) -> None:
+    if self._are_env is not None and self._scenario is not None:
+        self._are_env.run(...)
+```
+
+If someone calls `start()` on a misconfigured environment, nothing happens and nothing is logged. In simulation code, silent failures can waste hours of debugging. These should either raise or at minimum warn.
+
+**8. The cleanup swallows all exceptions.**
+
+At `maseval/interface/environments/are.py:441-444`:
+```python
+def cleanup(self) -> None:
+    if self._are_env is not None:
+        try:
+            self._are_env.stop()
+        except Exception:
+            pass
+```
+
+Blanket exception swallowing makes debugging impossible. At minimum, log the exception.
+
+---
+
+### Does it solve real developer problems?
+
+**Partially.** The integration solves the right problems in theory:
+- Tool wrapping with tracing (**yes, genuinely helpful**)
+- Notification handling (**yes, saves real boilerplate**)
+- Lifecycle management (**mostly — but lifecycle is already simple in ARE**)
+- Config/trace gathering (**yes, good maseval integration**)
+
+But in practice, the only real consumer (Gaia2) bypasses the setup path entirely, which suggests the abstraction doesn't match what developers actually need when building ARE-based benchmarks. The "reusable base class" framing oversells what you get — what you really get is a collection of useful utilities attached to a partially-useful setup flow.
+
+### Does it give flexibility?
+
+**Mixed.** The shorthand path provides a nice quick-start experience for simple scenarios. But for any real benchmark (like Gaia2), developers need to override `setup_state()` completely because:
+- They need `preprocess_scenario()` (ARE's own setup pipeline)
+- They need custom judge configuration
+- They need benchmark-specific oracle mode handling
+- The parent's oracle mode is too simplistic
+
+The fact that the primary consumer had to bypass the parent's setup entirely suggests the abstraction layer is at the wrong level.
+
+### What would make this better?
+
+1. **Accept that `AREEnvironment` is a utility mixin, not a full base class.** The lifecycle methods, notification polling, tool wrapping, and accessors are genuinely useful. But `setup_state()` should either be truly generic (just "give me an ARE Environment and a Scenario and I'll set up tools + accessors") or be left abstract so subclasses always implement it.
+
+2. **Remove the shorthand path or demote it to a factory function.** It has no real consumers and introduces invented defaults.
+
+3. **Remove `getattr` fallbacks.** Let errors surface.
+
+4. **Use proper types via `TYPE_CHECKING`.** You already do this in the tool wrapper — extend it to `AREEnvironment`.
+
+5. **Don't mutate app state in `create_tools()`.** The AUI filtering should be separate from tool creation.
+
+6. **Make lifecycle methods explicit about failure.** `start()` on a `None` environment should raise, not silently succeed.
+
+---
+
+### Bottom line
+
+The ARE integration is a **solid 70%**. The tool wrapper, notification handling, and tracing infrastructure are genuinely useful and well-tested. But the base class `setup_state()` path is an unused abstraction — the only real consumer bypasses it completely. The code would be stronger if it acknowledged its actual role (utilities + tool wrapping) rather than pretending to be a complete, reusable environment setup pipeline. The invented defaults and silent no-ops are the most actionable issues — they directly contradict the project's own scientific integrity and "let errors surface" guidelines.
@@ -0,0 +1,109 @@
+# AREEnvironment Issues
+
+Issues identified by comparing `AREEnvironment` (generic ARE wrapper in `maseval/interface/environments/`) with `Gaia2Environment` (dataset-specific implementation in `maseval/benchmark/gaia2/`).
+
+Reviewed through scientific coding principles: silent failures that could produce wrong-but-plausible benchmark results are treated as high priority.
+
+**Design principle:** maseval's `Benchmark` class provides `fail_on_task_error`, `fail_on_setup_error`, `fail_on_evaluation_error`, etc. flags that give users explicit control over error tolerance. Environment and tool code must propagate errors so the benchmark runner can classify them (agent fault vs. infrastructure fault) and respect the user's `fail_on_*` settings. Silent swallowing inside environment code bypasses this mechanism entirely — the user asked for strict mode but gets silent degradation.
+
+## High Priority
+
+### 1. Oracle mode silently degrades to empty data
+
+`maseval/interface/environments/are.py:200-203` uses `hasattr` checks that silently fall back to empty dicts/lists:
+
+```python
+oracle_traces = {
+    "apps_state": oracle_env.get_apps_state() if hasattr(oracle_env, "get_apps_state") else {},
+    "world_logs": oracle_env.get_world_logs() if hasattr(oracle_env, "get_world_logs") else [],
+}
+```
+
+If the ARE API changes or the methods are missing, oracle traces will be `{}` and `[]` — structurally valid but scientifically empty. Downstream evaluation code will run without error and produce meaningless scores. This is a textbook "wrong result that looks right" scenario.
+
+`Gaia2Environment` avoids this entirely by delegating to ARE's canonical `preprocess_scenario()`.
+
+No tests cover oracle mode, so this cannot be caught by CI.
+
+**Fix:** Remove the `hasattr` fallbacks. Call the methods directly — if they don't exist, the crash immediately tells you the ARE API changed. Alternatively, delegate to `preprocess_scenario()`. Add oracle mode tests.
+
+### 2. Missing simulation time tracking in AREToolWrapper
+
+`Gaia2GenericTool` records simulation time before/after each tool call (`maseval/benchmark/gaia2/tool_wrapper.py:123-150`). `AREToolWrapper` only records wall-clock time (`maseval/interface/environments/are_tool_wrapper.py:122-128`).
+
+Simulation time is the temporal coordinate of the experiment. Without it, any analysis of agent behavior in time-sensitive ARE scenarios is done against wall-clock time, which conflates LLM latency with simulated environment dynamics. Results would be unreproducible across different hardware or API providers.
+
+**Fix:** Add `_get_simulation_time()` helper and record `simulation_time_before`, `simulation_time_after`, `simulation_time_elapsed` in the invocation `meta` dict, matching `Gaia2GenericTool`.
+
+### 3. Schema extraction silently fabricates defaults
+
+`AREToolWrapper._extract_schema()` (`maseval/interface/environments/are_tool_wrapper.py:90-95`):
+
+```python
+"type": getattr(arg, "arg_type", "string"),   # fabricates "string" if missing
+if not getattr(arg, "has_default", True):      # assumes optional if missing
+```
+
+If an ARE tool arg is missing `arg_type`, it silently becomes `"string"`. If `has_default` is missing, it silently becomes optional. The schema will look valid but will be wrong — agents will receive incorrect type information and parameter requirements, producing tool calls that either fail in hard-to-trace ways or succeed with wrong arguments.
+
+`Gaia2GenericTool` accesses these directly (`arg.arg_type`, `arg.has_default`) — it will crash immediately if the structure is unexpected, which is the correct behavior.
+
+**Fix:** Remove the silent defaults. Access `arg.arg_type` and `arg.has_default` directly. If ARE tools don't have these attributes, that's a real problem that should surface immediately. Additionally, use ARE's `AppToolAdapter` as the canonical source of tool metadata, as `Gaia2GenericTool` does.
+
+### 4. `poll_notifications()` silently swallows all exceptions
+
+Both `AREEnvironment` (`are.py:330-331`) and `Gaia2Environment` (`environment.py:328-329`):
+
+```python
+except Exception:
+    return [], [], False
+```
+
+If the notification system has a bug, returns corrupt data, or the ARE API changes, this catch-all returns "no notifications, simulation not stopped" — a plausible-looking empty result. The agent loop continues as if nothing happened, missing user messages and environment events. The benchmark produces scores based on an agent that never received its inputs.
+
+This is distinct from lifecycle cleanup (where swallowing errors is acceptable). Notification polling is part of the data path — it directly affects what the agent sees and does.
+
+Additionally, this bypasses maseval's `fail_on_*_error` mechanism: even if the user configured strict mode via `fail_on_task_error` or `fail_on_setup_error` to catch infrastructure failures, these errors are swallowed before the benchmark runner ever sees them.
+
+**Fix:** Remove the bare `except Exception`. Catch only the specific exceptions that represent expected transient conditions (if any). Let unexpected errors propagate — the benchmark runner classifies them (`ENVIRONMENT_ERROR` during execution, `SETUP_FAILED` during setup) and the user's `fail_on_*` settings decide whether to abort or continue.
+
+## Medium Priority
+
+### 5. AUI tool filtering missing from AREEnvironment
+
+`Gaia2Environment` filters out `AgentUserInterface` message-retrieval tools and sets `wait_for_user_response = False` (`maseval/benchmark/gaia2/environment.py:178-213`), matching ARE's default agent behavior. `AREEnvironment` doesn't.
+
+Anyone using `AREEnvironment` with ARE's notification-based message delivery will get tools that block waiting for a response or duplicate user messages. This is a correctness issue for interactive scenarios, not just a convenience gap.
+
+**Fix:** Add an opt-in parameter (e.g. `filter_aui_tools=True`) to `AREEnvironment.__init__`, or at minimum document the behavior prominently so users of the generic wrapper know they must handle this themselves.
+
+### 6. Inconsistent error handling in AREEnvironment lifecycle methods
+
+`cleanup()` wraps in try/except (`are.py:360-364`), but `pause()` and `resume_with_offset()` propagate exceptions (`are.py:268-280`). `Gaia2Environment` wraps all lifecycle methods consistently.
+
+For lifecycle methods, the question is: "if this fails, can the experiment still produce correct results?"
+
+- `cleanup()`: Swallowing is acceptable — the task is already done, we're tearing down.
+- `pause()` / `resume_with_offset()`: These control simulation time during agent execution. A failure here means time is advancing when it shouldn't be (or vice versa), directly affecting the experimental conditions. The current inconsistency means `pause()` will crash the benchmark run while `cleanup()` won't — but neither behavior is fully correct.
+
+Crucially, swallowing these errors removes the user's ability to control error tolerance via `fail_on_task_error` / `fail_on_setup_error`. The user opted into strict mode precisely to catch conditions like "simulation time wasn't paused during LLM generation." Silent swallowing overrides that choice.
+
+**Fix:** Let `pause()` and `resume_with_offset()` propagate exceptions. The benchmark runner classifies them and the user's `fail_on_*` settings decide the outcome. `cleanup()` is the only lifecycle method where swallowing is acceptable (teardown after task completion).
+
+### 7. No `get_turn_notifications()` on AREEnvironment
+
+`Gaia2Environment` has `get_turn_notifications()` (`maseval/benchmark/gaia2/environment.py:331-380`) which re-queues environment notifications for the inner agent loop — essential for multi-turn ARE scenarios. Without it, environment notifications are consumed by the outer turn loop and never reach the agent's step loop.
+
+Anyone building a multi-turn agent on `AREEnvironment` will silently lose environment notifications between turns.
+
+**Fix:** Add `get_turn_notifications()` to `AREEnvironment`, or document that `poll_notifications()` alone is insufficient for multi-turn agent loops and that users must implement re-queuing themselves.
+
+## Low Priority
+
+### 8. Metadata extraction should use AppToolAdapter
+
+`Gaia2GenericTool` delegates metadata extraction to ARE's `AppToolAdapter` (`maseval/benchmark/gaia2/tool_wrapper.py:68-75`) — the canonical source of truth for tool name, description, inputs, and output_type. `AREToolWrapper` reads these attributes directly from the ARE tool object.
+
+This works today but couples `AREToolWrapper` to ARE's internal tool structure rather than its public adapter API.
+
+**Fix:** Use `AppToolAdapter` in `AREToolWrapper` for metadata extraction.
@@ -9,8 +9,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 
+- Generic `AREEnvironment` in `maseval.interface.environments` for Meta's ARE (Agent Research Environments) platform with lifecycle control (start/stop/pause/resume), notification polling, AUI tool filtering, and oracle mode support. Install with `pip install maseval[are]`. (PR: #55)
+- Generic `AREToolWrapper` in `maseval.interface.environments` wraps any ARE `AppTool` with simulation time tracking, invocation history, JSON schema extraction, and MASEval tracing integration. (PR: #55)
+- Optional dependency extra `are`: `pip install maseval[are]` (PR: #55)
+
 ### Changed
 
+- `Gaia2Environment` now inherits from `AREEnvironment` and only retains GAIA2-specific setup (preprocessing + judge) and trace/config gathering as overrides. (PR: #55)
+- `Gaia2GenericTool` is now an alias for `AREToolWrapper`. (PR: #55)
 - Renamed `task_data` parameter to `environment_data` across all environment constructors, test fixtures, and examples for consistency with the base class API. (PR: #58)
 
 ### Fixed