[Feat] EASI-ER CLI #31
Open
oscarqjh wants to merge 242 commits into EvolvingLMMs-Lab:main from
Conversation
Implement the easi Python library for orchestrating simulator-based embodied reasoning evaluation. Includes subprocess isolation via filesystem IPC, versioned simulator management (conda+uv), agent interface, task/benchmark framework, and CLI. Components: core abstractions, dummy/AI2-THOR simulators, dummy task, dummy agent, LLM client + dummy server, full test suite (44 tests).
Add embodied agent evaluation pipeline with real simulator support:
- ReAct agent with multi-action buffering and PromptBuilder protocol
- EB-Alfred task support (6 splits via multi-split YAML discovery)
- AI2-THOR v2.1.0 bridge with skill-based actions and state tracking
- EvaluationRunner with structured output: <output_dir>/<task>/<run_id>/
- Per-episode artifacts: result.json, trajectory.jsonl, rgb_*.png
- Centralized logging (print -> logger), --verbosity CLI option
- Subprocess observability: bridge output streaming, Ctrl+C cleanup
- LLM API client (OpenAI-compatible) and dummy LLM server
- 106 tests passing
Split the monolithic bridge into a generic AI2ThorBridge base class (simulator layer) and an EBAlfredBridge subclass (task layer), so that future benchmarks using ai2thor==2.1.0 can reuse the simulator bridge without rewriting controller management, IPC, or navigation helpers.
- Extract EB-Alfred goal evaluation and task loading into easi/tasks/ebalfred/thor_utils.py
- Trim easi/simulators/ai2thor/v2_1_0/thor_utils.py to generic-only constants and object query utilities
- Refactor bridge.py from EBAlfredBridge (1062 lines) to a generic AI2ThorBridge (~314 lines) with configurable simulator_kwargs
- Create easi/tasks/ebalfred/bridge.py with an EBAlfredBridge subclass containing all skill execution, state tracking, and goal evaluation
- Add get_bridge_script_path() and simulator_kwargs to BaseTask and TaskProtocol; override in EBAlfredTask
- Update EvaluationRunner to prefer task-specific bridge paths and forward simulator_kwargs
- Add simulator_kwargs to all EB-Alfred YAML configs
- Add 29 tests covering imports, inheritance, method separation, bridge path resolution, and simulator_kwargs
Unified client with generate() and generate_structured() methods, lazy imports, and cumulative usage tracking (tokens + cost).
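The cumulative usage tracking can be sketched as follows. This is a minimal illustration, not the actual easi LLMClient: the backend call is injected as a stub so the sketch runs offline, and the field names (prompt_tokens, completion_tokens, cost) are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Usage:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cost: float = 0.0

    def add(self, other: "Usage") -> None:
        # Accumulate token counts and cost across calls.
        self.prompt_tokens += other.prompt_tokens
        self.completion_tokens += other.completion_tokens
        self.cost += other.cost

class LLMClientSketch:
    """Toy stand-in for a unified client that tracks cumulative usage."""

    def __init__(self, backend_call):
        self._backend_call = backend_call  # injected stub; the real client calls a backend
        self.total_usage = Usage()

    def generate(self, messages):
        text, usage = self._backend_call(messages)
        self.total_usage.add(usage)  # per-call usage rolls up into run totals
        return text

# Stub backend: pretend every call costs 10 prompt + 5 completion tokens.
def fake_backend(messages):
    return "ok", Usage(prompt_tokens=10, completion_tokens=5, cost=0.001)

client = LLMClientSketch(fake_backend)
client.generate([{"role": "user", "content": "hi"}])
client.generate([{"role": "user", "content": "again"}])
```

The same accumulator can then be reported per-episode and per-run, as the runner commits describe.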
Manages start/stop, port checking, health polling with timeout, and context manager support. Extensible for future backends.
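The start/stop plus health-polling lifecycle described here can be sketched like this. It is a hypothetical simplification: the real ServerManager supervises a subprocess and checks a port, while this sketch injects the health check as a callable so it is testable offline.

```python
import time

class ServerManagerSketch:
    """Toy server manager: start/stop, health polling with timeout,
    and context-manager support."""

    def __init__(self, health_check, timeout=5.0, poll_interval=0.01):
        self._health_check = health_check  # stand-in for an HTTP /health probe
        self._timeout = timeout
        self._poll_interval = poll_interval
        self.running = False

    def start(self):
        self.running = True
        deadline = time.monotonic() + self._timeout
        while time.monotonic() < deadline:
            if self._health_check():
                return  # server came up within the timeout
            time.sleep(self._poll_interval)
        self.stop()
        raise TimeoutError("server did not become healthy in time")

    def stop(self):
        self.running = False

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *exc):
        self.stop()  # guarantees cleanup even on error
        return False

# Health check that succeeds on the third poll.
calls = {"n": 0}
def flaky_health():
    calls["n"] += 1
    return calls["n"] >= 3

with ServerManagerSketch(flaky_health) as mgr:
    was_running = mgr.running
```

The context-manager form is what lets the runner auto-start vLLM and still tear it down on Ctrl+C.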
New arguments for `easi run` to select LLM backend and configure inference server. Backward compatible with existing --llm-url.
Runner now resolves backend, auto-starts vLLM when needed, creates LLMClient for non-legacy backends, wraps structured output, and tracks LLM usage per-episode and per-run.
- Track usage in generate_structured() via instructor's _raw_response
- Fix log file handle leak in ServerManager (store the handle and close it in stop())
- Remove duplicate agent_config computation in runner._create_agent()
…port Replace the tightly-coupled agent/prompt design with a memory-based architecture where AgentMemory holds shared state, PromptBuilder reads from memory to construct prompts and parse responses, and the agent is a thin orchestrator.
Key changes:
- AgentMemory + StepRecord dataclasses as shared agent state
- New PromptBuilderProtocol: build_messages(memory) + parse_response(response, memory)
- Simplified BaseAgent (removed _chat_history, abstract stubs, default act())
- ReActAgent rewritten as a thin orchestrator delegating to the builder
- EBAlfredPromptBuilder gains a chat_history=True mode with VLMPlanner parity
- json_repair moved to easi/utils/ (old location re-exports)
- Removed the stateless flag from agent config (the builder controls the mode)
Builder-owned schema enforcement: prompt builders can now optionally implement get_response_format() to provide a JSON schema dict that is passed through to litellm. ReActAgent handles fallback automatically when the backend doesn't support response_format.
- LLMClient.generate() accepts an optional response_format param
- ReActAgent._generate_with_fallback() tries the schema, caches on failure
- EBAlfredPromptBuilder.get_response_format() returns vlm_generation_guide
- Remove dead code: instructor dep, Pydantic schemas, monkey-patching
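The try-schema-then-cache-failure pattern can be sketched as below. Names (NoSchemaSupport, FallbackAgentSketch) are illustrative, not the real classes; the point is that one failed structured call disables the schema for the rest of the run instead of retrying it every step.

```python
class NoSchemaSupport(Exception):
    """Stand-in for the backend error raised when response_format is unsupported."""

class FallbackAgentSketch:
    """Tries structured output first; on failure, caches the fact that the
    backend lacks response_format support and skips the schema thereafter."""

    def __init__(self, generate, schema):
        self._generate = generate       # callable(messages, response_format)
        self._schema = schema
        self._schema_supported = True   # optimistic until proven otherwise

    def act(self, messages):
        if self._schema_supported:
            try:
                return self._generate(messages, response_format=self._schema)
            except NoSchemaSupport:
                self._schema_supported = False  # cache the failure, fall through
        return self._generate(messages, response_format=None)

attempts = []
def backend(messages, response_format):
    attempts.append(response_format is not None)
    if response_format is not None:
        raise NoSchemaSupport
    return '{"action": "move_forward"}'

agent = FallbackAgentSketch(backend, schema={"type": "json_object"})
first = agent.act([])
second = agent.act([])
```

After the first failed attempt, every later call goes straight to the schemaless path.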
Add BaseTask.on_episode_reset() hook for task-specific post-reset setup. EBAlfredTask overrides it to update agent action space from bridge metadata, removing EB-Alfred-specific logic from the general EvaluationRunner.
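A minimal sketch of the hook, assuming a no-op default so the runner can call it unconditionally (class and argument names here are illustrative, not the exact easi signatures):

```python
class BaseTaskSketch:
    """Base task exposing a post-reset hook; the default is a no-op."""

    def on_episode_reset(self, agent, bridge_metadata):
        pass  # generic runner calls this blindly; most tasks need nothing

class EBAlfredTaskSketch(BaseTaskSketch):
    def on_episode_reset(self, agent, bridge_metadata):
        # Task-specific: refresh the agent's action space from whatever
        # the bridge reported after the episode reset.
        agent.action_space = bridge_metadata.get("action_space", [])

class DummyAgent:
    action_space = []

agent = DummyAgent()
task = EBAlfredTaskSketch()
task.on_episode_reset(agent, {"action_space": ["pick_up", "put_down"]})
```

This keeps the EB-Alfred-specific coupling out of EvaluationRunner: the runner only knows the hook exists.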
…onfig
- trajectory.jsonl: add llm_response field to each step entry
- result.json: add instruction field for each episode
- config.json: include all CLI options and the full task YAML config
Retry: LLMClient passes num_retries to litellm.completion() for automatic exponential backoff on transient errors (timeouts, rate limits). Configurable via --max-retries (default 3).
Resume: --resume <run_dir> loads config.json from a previous run, skips completed episodes, clears and re-runs the last episode (which may have been interrupted), then continues with the remaining episodes. All CLI options are restored from config.json, so only --resume is needed.
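The resume selection logic can be sketched as a pure function, assuming the runner has a list of all episode IDs and the completed ones from the previous run (the function name and shape are hypothetical):

```python
def episodes_to_run(all_episodes, completed):
    """Resume semantics sketch: skip fully completed episodes, re-run the
    last completed one (it may have been interrupted mid-write), then
    continue with everything not yet done."""
    if not completed:
        return list(all_episodes)
    last = completed[-1]              # always re-run the final episode
    keep_done = set(completed[:-1])   # everything before it is trusted
    return [ep for ep in all_episodes if ep == last or ep not in keep_done]

todo = episodes_to_run(["ep0", "ep1", "ep2", "ep3"], completed=["ep0", "ep1"])
```

Here ep1 is re-run because a crash could have left its artifacts half-written, while ep0 is skipped.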
…ps bug Fix max_steps mismatch where YAML configured 50 but the vendored EBAlfEnv hardcoded 30. Now max_steps flows from YAML through simulator_kwargs to the bridge and vendored env.
Add per-episode retry in EvaluationRunner: on crash (e.g. an AI2-THOR Unity segfault), the episode dir is cleared, the simulator is re-launched, and the episode is retried up to max_retries times. If all retries are exhausted, the episode is recorded as failed and the runner continues to the next episode.
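The per-episode retry loop can be sketched as follows; the callables (run_once, clear_dir, relaunch_sim) and the SimulatorCrash exception are stand-ins for the runner's internals, not real easi names.

```python
class SimulatorCrash(RuntimeError):
    """Stand-in for a bridge/subprocess crash (e.g. a Unity segfault)."""

def run_episode_with_retry(run_once, clear_dir, relaunch_sim, max_retries=3):
    """On crash: clear the episode dir, relaunch the simulator, retry.
    If all attempts fail, record the episode as failed and move on."""
    for _attempt in range(max_retries):
        try:
            return run_once()
        except SimulatorCrash:
            clear_dir()      # half-written artifacts must not pollute results
            relaunch_sim()   # fresh simulator process for the next attempt
    return {"success": False, "error": "max retries exhausted"}

log = []
state = {"crashes_left": 2}
def flaky_episode():
    if state["crashes_left"] > 0:
        state["crashes_left"] -= 1
        raise SimulatorCrash
    return {"success": True}

result = run_episode_with_retry(
    flaky_episode,
    clear_dir=lambda: log.append("clear"),
    relaunch_sim=lambda: log.append("relaunch"),
)
```

Two crashes followed by a success fits inside the default three attempts, so the episode still completes.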
Integrate EmbodiedBench EB-Navigation into EASI with vendored env, task bridge, prompt builder, and 5 split configs (ai2thor v5.0.0).
Remove action_space field from all YAML configs and TaskEntry. Tasks now define their action space via _build_action_space() override with caching, eliminating the confusing pattern of empty YAML fields.
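The override-with-caching pattern can be sketched like this (a simplified illustration; the real BaseTask/_build_action_space signatures may differ):

```python
class TaskWithActionSpace:
    """Action space defined in code, built lazily and cached,
    instead of living in an empty YAML field."""

    def __init__(self):
        self._action_space = None
        self.build_calls = 0

    def _build_action_space(self):
        # Subclasses override this; the base raises so an empty
        # action space can never slip through silently.
        raise NotImplementedError

    @property
    def action_space(self):
        if self._action_space is None:  # build once, then cache
            self._action_space = self._build_action_space()
        return self._action_space

class NavTask(TaskWithActionSpace):
    def _build_action_space(self):
        self.build_calls += 1
        return ["move_forward", "turn_left", "turn_right", "stop"]

task = NavTask()
first = task.action_space
second = task.action_space
```

Repeated reads return the cached list; the builder runs exactly once per task instance.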
…atform Replace stub bridge with working AI2ThorV5Bridge class that starts a real controller, handles scene reset and discrete navigation actions. Switch platform from CloudRendering to Linux64. Increase sim test timeout default from 30s to 200s for THOR startup.
Fix OUTPUT_TEMPLATE trailing spaces on 3 lines and regenerate navigation_examples.json from source to fix line continuation artifact and curly quote mismatch. Verified character-level parity.
- Add habitat_sim:v0_3_0 simulator registration (conda env + manifest)
- Vendor EBHabEnv from EmbodiedBench with fixed imports
- Add EBHabitatTask with dynamic action space via the on_episode_reset hook
- Add EBHabitatPromptBuilder matching VLMPlanner prompt construction
- Add 6 per-split YAML configs (base, common_sense, complex_instruction, spatial_relationship, visual_appearance, long_horizon)
- Add 26 offline tests for actions, task, prompts, and registry
- Move EB-Habitat-specific deps (gym, hydra-core, omegaconf, imageio, habitat-lab) from the simulator requirements.txt to task YAML additional_deps
Add --redownload to 'easi task download' and 'easi run' to force re-download of cached HuggingFace datasets. Useful when a previous download was interrupted or incomplete.
…ridge Add trajectory visualization hooks to the EB-Habitat bridge using habitat-sim 0.3.0 API (articulated_agent.base_pos). Includes topdown map rendering via pathfinder, start position persistence, and per-step agent position tracking in trajectory info. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…prompt fix
- react_agent: force stop after N consecutive fallbacks (configurable via max_consecutive_fallbacks in YAML, default 0 = disabled)
- react_agent: the reprompt correction message now comes from the prompt builder (fixes the SFT builder getting the JSON-format correction prompt)
- metrics: add early_stop_rate to generic_aggregate
- runner: record forced_early_stop in per-episode results
- llm/client: support skip_special_tokens via extra_body for the vLLM backend
- llm/models/internvl3: bypass model.chat() when skip_special_tokens=False to preserve SFT action tokens that get stripped by the hardcoded decode
- llm/models/qwen3_vl: read skip_special_tokens from kwargs
- sft prompt builder: add get_reprompt_message() for the action token format
- _base.yaml: add max_consecutive_fallbacks: 3
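The consecutive-fallback guard reduces to a small piece of counter logic, sketched here (function name is illustrative): a successful parse resets the streak, and a limit of 0 disables the guard entirely.

```python
def step_with_fallback_guard(parse_ok, streak, max_consecutive_fallbacks):
    """Return (forced_stop, new_streak). A good parse resets the streak;
    max_consecutive_fallbacks == 0 disables the guard."""
    streak = 0 if parse_ok else streak + 1
    forced = max_consecutive_fallbacks > 0 and streak >= max_consecutive_fallbacks
    return forced, streak

# One good parse, then three bad ones, with the limit at 3 -> forced stop.
streak = 0
history = []
for ok in [True, False, False, False]:
    forced, streak = step_with_fallback_guard(ok, streak, max_consecutive_fallbacks=3)
    history.append(forced)
```

This is what stops an episode from burning its full step budget on repeated default-action fallbacks.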
- prompt_builder: validate PNG with PIL before base64 encoding, retry with exponential backoff (0.1s, 0.2s, 0.4s) if truncated, send anyway after retries to trigger an episode restart
- sft prompt builder: switch to interleaved content blocks (image_url at the correct positions instead of all-first with <image> in the text), matching the InternVL chat template convention
- test: use a real PIL image instead of fake PNG header bytes
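The retry-with-backoff guard can be sketched generically. To keep the sketch runnable offline, the validator and sleep are injected; in the real builder, PIL's verification plays the validator role and the delays are real sleeps.

```python
def read_png_with_retry(read_bytes, validate, sleep, delays=(0.1, 0.2, 0.4)):
    """Truncated-PNG guard sketch: validate the bytes, retrying with
    exponential backoff; after the last retry, return the bytes anyway
    so a broken frame surfaces downstream (as an episode restart)
    rather than hanging the builder."""
    data = read_bytes()
    for delay in delays:
        if validate(data):
            return data
        sleep(delay)          # backoff, then re-read the file
        data = read_bytes()
    return data               # send anyway after exhausting retries

reads = {"n": 0}
slept = []
def flaky_read():
    # First two reads see a truncated file, the third sees a complete one.
    reads["n"] += 1
    return b"good" if reads["n"] >= 3 else b"trunc"

png = read_png_with_retry(flaky_read, lambda d: d == b"good", slept.append)
```

The backoff sequence (0.1s, 0.2s) gives the simulator's PNG writer time to finish flushing the frame.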
…s.py (./datasets)
ReActAgent captures the flattened prompt preview (`--- {role} ---` blocks with `[img_N]` markers) alongside llm_response each time it queries the LLM. The runner threads it into trajectory.jsonl under a new "prompt" field next to "llm_response". Buffered / fallback-only steps keep prompt = null.
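A sketch of the flattening, assuming OpenAI-style messages whose content is either a string or a list of text/image_url blocks (the function name is illustrative):

```python
def flatten_prompt(messages):
    """Render chat messages as '--- role ---' blocks, replacing image
    content parts with [img_N] markers so the preview stays text-only."""
    lines, img_n = [], 0
    for msg in messages:
        lines.append(f"--- {msg['role']} ---")
        content = msg["content"]
        if isinstance(content, str):
            lines.append(content)
            continue
        for part in content:  # list-of-content-blocks form
            if part.get("type") == "image_url":
                img_n += 1
                lines.append(f"[img_{img_n}]")
            else:
                lines.append(part.get("text", ""))
    return "\n".join(lines)

preview = flatten_prompt([
    {"role": "system", "content": "You are a navigator."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "What next?"},
    ]},
])
```

The numbered markers make it possible to line the preview up against the saved rgb_*.png frames.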
When MIRROR_DEBUG_DIR is set, the mirror prompt builder writes every flipped PNG it emits (current 3 slots + full history buffer) under that directory with step+slot naming. Lets us A/B against sim-rendered originals to verify the mirror transform and slot swap. Paired with debug_mirror_one_episode.sh which runs one episode + stages the dumps into the episode dir next to the originals.
SFT checkpoints emit action tokens <|forward|>/<|left|>/etc. vLLM's OpenAI chat endpoint strips them at decode by default, which causes _parse_sft_response to return empty and every step falls back to move_forward. Set skip_special_tokens=false in _base.yaml generation_kwargs so the tokens survive.
Adds LHPRVLNEnhancedSFTPromptBuilder that applies fantasy-vln's display_env preprocessing (contrast 1.5 + resize 366) at prompt-build time. Subclass of LHPRVLNSFTPromptBuilder; kwargs enhance_contrast and resize_to are yaml-tunable. Ships new tasks for val_filtered and test_filtered splits plus 9 unit tests covering contrast math, resize, RGBA passthrough, and structural parity with the parent builder.
Adds save_rgba + enhance_images toggles to the LHPR-VLN bridge so the enhancement (contrast 1.5 + resize 366) happens BEFORE the PNG hits disk, matching fantasy-vln's display_env pipeline exactly. Paired with the baseline SFT prompt builder (no in-prompt transform), the on-disk image and the VLM input are both the 366x366 RGBA-enhanced frame. Defaults off — existing yamls unaffected.
LHPRVLNBridge.reset() called self._scene_sim.actor('move_forward') to fetch an initial observation. That call applied a second move_forward (on top of the one already inside SceneSimulator.__init__), shifting the agent's starting pose by 0.25 m along its facing direction. Because actor() on step=-1 returns early without refreshing self.info, the reset's logged pose stayed stale while the agent had actually moved, so step 0 -> step 1 in trajectory.jsonl showed a phantom 0.25 m translation on a pure turn_left.
SceneSimulator.__init__ already captures the initial observation; use it directly. This matches fantasy-vln's agent loop, which starts with task_sim.observations without calling actor() first.
Impact: EASI episodes now start at the same pose as fantasy for a given scene/seed, enabling apples-to-apples SR comparison. Prior EASI eval numbers were computed with the +0.25 m bias and are not directly comparable.
Duplicates the bridge-enhanced val task under a short name so autoeval treats it as a fresh task key and reruns checkpoints against the post-bridge-fix code without colliding with prior bridge_enhanced summaries. Temporary — safe to delete when done.
SceneSimulator.actor() had an early return on self.step == -1 that absorbed the first call without incrementing bookkeeping or running the max_step check. Combined with b6eb3c6 (the bridge no longer calls actor on reset), the runner's for step in range(max_steps) loop left self.step one short of max_steps, so the timeout branch never fired and nav_errors / nav_steps were never appended for episodes that ran to the wall clock. aggregate_results then died with an IndexError in NavigationMetrics.TAR() because gt_path outlasted error_length.
This also silently masked a pre-existing bug: a first-action 'stop' was dropped (it skipped sim.step and took the early return), never advancing the stage or recording a nav_error.
Fix: initialise self.step = 0 at the end of __init__ and remove the early-return block from actor(). Every call now goes through the full step-increment + info-refresh + oracle/stop/max pipeline.
Semantic shift: nav_steps[0] is now +1 vs pre-fix runs because it includes the init move_forward; fantasy-vln's nav_steps already carry the same +1 offset (their while-not-episode-over loop makes one extra actor call past max_steps), so post-fix EASI is more comparable to fantasy.
Adds tests/test_lhpr_vln_scene_step.py covering timeout length invariants, all-subtasks-success accounting, first-action-stop being honoured, and mixed partial success + timeout, all via a habitat_sim stub so they run offline in the main venv.
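The post-fix invariant can be sketched with a toy step counter. This is not the vendored SceneSimulator: the class, field names, and return convention here are illustrative; the point is that with step starting at 0 and no early return, every actor() call increments the counter, a first-action 'stop' is honoured, and the timeout branch is reachable.

```python
class SceneStepSketch:
    """Step bookkeeping after the fix: no early-return, every actor()
    call goes through the increment + stop/timeout check."""

    def __init__(self, max_steps):
        self.step = 0            # fix: start at 0 (counts the init move_forward)
        self.max_steps = max_steps
        self.nav_errors = []

    def actor(self, action):
        self.step += 1           # every call increments, including the first
        if action == "stop":
            self.nav_errors.append("stop")
            return True          # episode over: explicit stop is honoured
        if self.step >= self.max_steps:
            self.nav_errors.append("timeout")
            return True          # episode over: timeout branch now fires
        return False

# Episode that runs to the wall clock: the third call hits the timeout.
sim = SceneStepSketch(max_steps=3)
done = [sim.actor("turn_left") for _ in range(3)]

# First-action 'stop' is no longer swallowed by the early return.
s2 = SceneStepSketch(max_steps=3)
stopped = s2.actor("stop")
```

With the pre-fix early return, the first call would have been absorbed and neither branch recorded an outcome.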
Summary
This PR adds the entire easi-er cli — a unified evaluation framework for embodied AI agents. It introduces subprocess-isolated simulators, multi-split task definitions, LLM-powered agents, and a CLI for running evaluations across multiple benchmarks.
Core Framework
Simulators (6)
Benchmarks (10)
LLM Infrastructure
Standardized Prompt Format
CLI
easi task list / info / download / scaffold
easi env list / install / check
easi sim test
easi start --agent react --backend openai --model gpt-4o
easi start --resume ./logs//<run_id>
Testing