OSWorld benchmark #255
Merged
Changes from 5 commits
Commits
58 commits
adbaf2d boilerplate (ollmer)
f29f048 fix import (ollmer)
30742cf default args in the dataclass (ollmer)
a8b29cc script to test that osworld works in the ubuntu with docker (ollmer)
9e07d8f update makefile to setup os world development. (amanjaiswal73892)
1f5d1da install osworld through make command instead of requirements as we tw… (ollmer)
41e5298 ignore osworld tmp folders (ollmer)
5448b48 osworld bench tasks loading (ollmer)
77c1d1c fmt (ollmer)
04505ec osworld eval entrypoint and fixes (ollmer)
1b4fccf osworld action set boilerplate (ollmer)
20524da boilerplate for obs conversion (ollmer)
3728a3b add convert_obs to env reset function (amanjaiswal73892)
475a3f3 hardcoded os-world env in tool use agent (Refactor later) (amanjaiswal73892)
11edee5 add timing decorator to step function for action execution metrics (amanjaiswal73892)
42b38ca pre-alpha initial working agent on os-world with desktop_env action s… (amanjaiswal73892)
928fbaf Add TODO's (amanjaiswal73892)
3bf6c0a Update TODO's (amanjaiswal73892)
725546c claude and oai config for osworld agent (ollmer)
c1ec395 enforce format and stricter type checks (ollmer)
3608c43 pass action set through the agent config (ollmer)
585d9f7 Add set_benchmark for tool-use agent to use os_world obs preprocessor. (amanjaiswal73892)
bdb7ab1 Merge remote branch 'main' into osworld (amanjaiswal73892)
a40aa42 Update Claude agent config to include axtree and obs history (amanjaiswal73892)
400b947 Add osworld axtree preprocessing (amanjaiswal73892)
2d7d5a2 Add max_steps parameter to OsworldGym and OsworldEnvArgs for step lim… (amanjaiswal73892)
0dbb9dd Add env.evaluate for episode evaluation (amanjaiswal73892)
7f6b6c9 Refactor observation conversion, add axtree and remove Todos. (amanjaiswal73892)
ce0b2d0 Add computer_13 action space tools definitions in OsworldGym and remo… (amanjaiswal73892)
024935e update run_osworld to use small test set and one task (amanjaiswal73892)
1a0c483 Fix: Update tool call identifier key in Xray [for debugging only] (amanjaiswal73892)
912932d more progress logging (ollmer)
8ce45b8 debug parallel task (ollmer)
2911da5 Add method to fix settings file path in task configuration (amanjaiswal73892)
ddf1d00 7 simple osworld tasks for debug (ollmer)
49fac6c use subset of simple tasks during debug run (ollmer)
2b79b50 Temp commit for xray [Update toolagent config to be primitive types] (amanjaiswal73892)
815893c record task video, wait 60 sec after reset just as osworld own agent (ollmer)
7449033 put video recording under flag, lint (ollmer)
d7401bf lint (ollmer)
7387922 Merge branch 'main' into osworld (ollmer)
cf4b277 refactor: rename use_osworld_obs_preprocessor to skip_preprocessing f… (amanjaiswal73892)
bb38053 Remove 'action_set' from index_black_list in load_result_df and make … (amanjaiswal73892)
63d141b fix: rename COMPUTER_13_ACTIONS_OAI_RESPONSE_TOOLS to COMPUTER_13_ACT… (amanjaiswal73892)
d36709a update run_osworld.py with study relaunch capability and setup readme (amanjaiswal73892)
9748ec3 update TODO and black refactor (amanjaiswal73892)
725854b Rename tool conversion function (amanjaiswal73892)
a22eaed bug fix to_tool_desc and refactor (amanjaiswal73892)
8c2d469 Add tests (amanjaiswal73892)
f740812 Black and darglint (amanjaiswal73892)
532047a Merge remote-tracking branch 'origin/main' into osworld (amanjaiswal73892)
d2d59bc more black (amanjaiswal73892)
4f14015 Update osworld to be skipped if desktop_env not available (amanjaiswal73892)
896e89a add selective import for osworld module and tests. (amanjaiswal73892)
8bee45f black formatting again (amanjaiswal73892)
60d7ce2 Add OSWorld benchmark to README (amanjaiswal73892)
2afb28b Remove commented code. (amanjaiswal73892)
b0d4a99 Merge remote-tracking branch 'origin' into osworld (amanjaiswal73892)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,45 @@ | ||
| import logging | ||
| from desktop_env.desktop_env import DesktopEnv | ||
| | ||
| logging.basicConfig( | ||
|     level=logging.INFO, | ||
|     format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", | ||
|     handlers=[logging.StreamHandler()], | ||
| ) | ||
| | ||
| example = { | ||
|     "id": "94d95f96-9699-4208-98ba-3c3119edf9c2", | ||
|     "instruction": "I want to install Spotify on my current system. Could you please help me?", | ||
|     "config": [ | ||
|         { | ||
|             "type": "execute", | ||
|             "parameters": { | ||
|                 "command": [ | ||
|                     "python", | ||
|                     "-c", | ||
|                     "import pyautogui; import time; pyautogui.click(960, 540); time.sleep(0.5);" | ||
|                 ] | ||
|             } | ||
|         } | ||
|     ], | ||
|     "evaluator": { | ||
|         "func": "check_include_exclude", | ||
|         "result": { | ||
|             "type": "vm_command_line", | ||
|             "command": "which spotify" | ||
|         }, | ||
|         "expected": { | ||
|             "type": "rule", | ||
|             "rules": { | ||
|                 "include": ["spotify"], | ||
|                 "exclude": ["not found"] | ||
|             } | ||
|         } | ||
|     } | ||
| } | ||
| | ||
| env = DesktopEnv(action_space="pyautogui", provider_name="docker", os_type="Ubuntu") | ||
| | ||
| obs = env.reset(task_config=example) | ||
| obs, reward, done, info = env.step("pyautogui.rightClick()") | ||
| print(obs) | ||
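
The `evaluator` entry in the task above pairs a shell command (`which spotify`) with `include`/`exclude` rules. As a rough illustration of how such a rule could be scored, here is a minimal sketch. The function name `include_exclude_score` and its exact semantics are assumptions made for this example, not the OSWorld implementation: it treats the rule as passing only when every `include` substring is present in the command output and no `exclude` substring is.

```python
from __future__ import annotations


def include_exclude_score(output: str | None, rules: dict) -> float:
    """Hypothetical scorer: 1.0 if all `include` substrings occur in `output`
    and none of the `exclude` substrings do, else 0.0."""
    if output is None:
        # No command output at all counts as a failure in this sketch.
        return 0.0
    include = rules.get("include", [])
    exclude = rules.get("exclude", [])
    passed = all(s in output for s in include) and not any(s in output for s in exclude)
    return 1.0 if passed else 0.0


# Rules copied from the task config in the test script above.
rules = {"include": ["spotify"], "exclude": ["not found"]}

print(include_exclude_score("/snap/bin/spotify\n", rules))
print(include_exclude_score("spotify not found", rules))
print(include_exclude_score("", rules))
```

With these rules, an output that contains a path to `spotify` passes, while an empty output or one containing "not found" fails.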
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -26,4 +26,5 @@ matplotlib | ||
| ray[default] | ||
| python-slugify | ||
| pillow | ||
| gymnasium>=0.27 | ||
| gymnasium>=0.27 | ||
| desktop-env~=0.1.22 | ||
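
The new requirement uses the `~=` (compatible release) operator, which for `desktop-env~=0.1.22` is equivalent to `>=0.1.22, ==0.1.*`. The sketch below is a simplified illustration for purely numeric versions only; the helper name `satisfies_compatible_release` is invented for this example, and real tooling (pip and the `packaging` library) handles pre-releases, epochs, and other cases this sketch ignores.

```python
def _parse(version: str) -> tuple[int, ...]:
    # Only plain dotted integers such as "0.1.22" are supported in this sketch.
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible_release(candidate: str, spec: str) -> bool:
    """Simplified `candidate ~= spec` check for numeric dotted versions."""
    base = _parse(spec)
    if len(base) < 2:
        raise ValueError("~= needs at least two version components")
    cand = _parse(candidate)
    prefix = base[:-1]  # every component except the last must match exactly
    return cand >= base and cand[: len(prefix)] == prefix


for version in ["0.1.21", "0.1.22", "0.1.30", "0.2.0"]:
    print(version, satisfies_compatible_release(version, "0.1.22"))
```

In other words, patch releases at or above 0.1.22 are accepted, while 0.2.0 is not.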
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,110 @@ | ||
| import logging | ||
| from dataclasses import dataclass | ||
| from typing import Any | ||
| | ||
| from desktop_env.desktop_env import DesktopEnv | ||
| | ||
| from agentlab.benchmarks.abstract_env import AbstractBenchmark, AbstractEnv, AbstractEnvArgs | ||
| | ||
| logger = logging.getLogger(__name__) | ||
| | ||
| | ||
| class OsworldGym(AbstractEnv): | ||
|     def __init__( | ||
|         self, | ||
|         task: dict, | ||
|         provider_name: str, | ||
|         region: str | None, | ||
|         path_to_vm: str | None, | ||
|         snapshot_name: str, | ||
|         action_space: str, | ||
|         cache_dir: str, | ||
|         screen_size: tuple[int, int], | ||
|         headless: bool, | ||
|         require_a11y_tree: bool, | ||
|         require_terminal: bool, | ||
|         os_type: str, | ||
|         enable_proxy: bool, | ||
|     ): | ||
|         self.task = task | ||
|         self.env_info = { | ||
|             "provider_name": provider_name, | ||
|             "region": region, | ||
|             "path_to_vm": path_to_vm, | ||
|             "snapshot_name": snapshot_name, | ||
|             "action_space": action_space, | ||
|             "cache_dir": cache_dir, | ||
|             "screen_size": screen_size, | ||
|             "headless": headless, | ||
|             "require_a11y_tree": require_a11y_tree, | ||
|             "require_terminal": require_terminal, | ||
|             "os_type": os_type, | ||
|             "enable_proxy": enable_proxy, | ||
|         } | ||
|         # Note: enable_proxy is recorded in env_info but not forwarded to DesktopEnv here. | ||
|         self.env = DesktopEnv( | ||
|             action_space=action_space, | ||
|             provider_name=provider_name, | ||
|             region=region,  # type: ignore | ||
|             path_to_vm=path_to_vm,  # type: ignore | ||
|             snapshot_name=snapshot_name, | ||
|             cache_dir=cache_dir, | ||
|             screen_size=screen_size,  # type: ignore | ||
|             headless=headless, | ||
|             require_a11y_tree=require_a11y_tree, | ||
|             require_terminal=require_terminal, | ||
|             os_type=os_type, | ||
|         ) | ||
| | ||
|     def reset(self, seed: int | None = None) -> tuple[dict[str, Any], dict[str, Any]]: | ||
|         obs = self.env.reset(task_config=self.task, seed=seed) | ||
|         return obs, self.env_info | ||
| | ||
|     def step(self, action: str): | ||
|         # DesktopEnv.step returns a 4-tuple; add a constant truncated flag for the 5-tuple API. | ||
|         obs, reward, done, info = self.env.step(action) | ||
|         truncated = False | ||
|         return obs, reward, done, truncated, info | ||
| | ||
|     def close(self): | ||
|         return self.env.close() | ||
| | ||
| | ||
| @dataclass | ||
| class OsworldEnvArgs(AbstractEnvArgs): | ||
|     task: dict[str, Any] | ||
|     path_to_vm: str | None = None  # path to .vmx file | ||
|     provider_name: str = "vmware" | ||
|     region: str = "us-east-1"  # AWS specific, does not apply to all providers | ||
|     snapshot_name: str = "init_state"  # snapshot name to revert to | ||
|     action_space: str = "computer_13"  # "computer_13" | "pyautogui" | ||
|     cache_dir: str = "cache" | ||
|     screen_size: tuple[int, int] = (1920, 1080) | ||
|     headless: bool = False | ||
|     require_a11y_tree: bool = True | ||
|     require_terminal: bool = False | ||
|     os_type: str = "Ubuntu" | ||
|     enable_proxy: bool = False | ||
| | ||
|     def make_env(self) -> OsworldGym: | ||
|         logger.info(f"Creating OSWorld Gym with task: {self.task}") | ||
|         gym = OsworldGym( | ||
|             task=self.task, | ||
|             provider_name=self.provider_name, | ||
|             region=self.region, | ||
|             path_to_vm=self.path_to_vm, | ||
|             snapshot_name=self.snapshot_name, | ||
|             action_space=self.action_space, | ||
|             cache_dir=self.cache_dir, | ||
|             screen_size=self.screen_size, | ||
|             headless=self.headless, | ||
|             require_a11y_tree=self.require_a11y_tree, | ||
|             require_terminal=self.require_terminal, | ||
|             os_type=self.os_type, | ||
|             enable_proxy=self.enable_proxy, | ||
|         ) | ||
|         return gym | ||
| | ||
| | ||
| class OsworldBenchmark(AbstractBenchmark): | ||
|     name: str = "osworld" | ||
|     env_args_list: list[OsworldEnvArgs] | ||
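
`OsworldGym.step` in the diff above adapts a 4-tuple `(obs, reward, done, info)` into the 5-tuple `(obs, reward, done, truncated, info)` with `truncated` fixed to `False`. The sketch below shows the same adapter pattern against a stub so it can run without a VM. `FakeDesktopEnv` and `FiveTupleWrapper` are hypothetical names made up for this example, and the `max_steps` truncation logic is an assumption inspired by the "Add max_steps parameter" commit title, not code taken from this PR.

```python
from __future__ import annotations

from typing import Any


class FakeDesktopEnv:
    """Hypothetical stand-in for DesktopEnv that returns a 4-tuple from step()."""

    def reset(self, task_config: dict | None = None, seed: int | None = None) -> dict[str, Any]:
        return {"instruction": (task_config or {}).get("instruction", "")}

    def step(self, action: str) -> tuple[dict[str, Any], float, bool, dict[str, Any]]:
        done = action == "DONE"  # this stub ends the episode on a literal "DONE" action
        return {"last_action": action}, 0.0, done, {}

    def close(self) -> None:
        pass


class FiveTupleWrapper:
    """Adapts a 4-tuple env to the 5-tuple API, with optional step-limit truncation."""

    def __init__(self, env: Any, task: dict, max_steps: int | None = None):
        self.env = env
        self.task = task
        self.max_steps = max_steps  # assumption: None means no step limit
        self._steps = 0

    def reset(self, seed: int | None = None) -> tuple[dict[str, Any], dict[str, Any]]:
        self._steps = 0
        obs = self.env.reset(task_config=self.task, seed=seed)
        return obs, {}

    def step(self, action: str) -> tuple[dict[str, Any], float, bool, bool, dict[str, Any]]:
        obs, reward, done, info = self.env.step(action)
        self._steps += 1
        # Truncate only when the step budget is exhausted and the task has not finished.
        truncated = self.max_steps is not None and self._steps >= self.max_steps and not done
        return obs, reward, done, truncated, info

    def close(self) -> None:
        self.env.close()


env = FiveTupleWrapper(FakeDesktopEnv(), task={"instruction": "demo"}, max_steps=2)
obs, info = env.reset()
first = env.step("pyautogui.rightClick()")
second = env.step("pyautogui.rightClick()")
print(first[3], second[3])
env.close()
```

Keeping `done` and `truncated` separate lets the caller tell a task that reported completion apart from an episode that merely ran out of steps.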
neat trick!