AgentNetBench provides an offline evaluator for UI interaction trajectories. It compares model-predicted low-level actions (click, moveTo, write, press, scroll, terminate, etc.) against ground-truth actions and reports detailed metrics.
- qwen25vl (Qwen2.5-VL family)
- aguvis
Qwen matching substrings (case-insensitive): qwen2.5-vl, qwen-vl, qwen25vl, qwen2.5vl. Any model name containing aguvis selects the Aguvis agent.
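The name-matching rule above can be sketched as follows. This is a minimal illustration, not the evaluator's actual code: the function name `select_agent`, the order of the checks (Qwen first), and treating the `aguvis` match as case-insensitive are all assumptions.

```python
from typing import Optional

# Substrings listed in the text above; matching is case-insensitive.
QWEN_SUBSTRINGS = ("qwen2.5-vl", "qwen-vl", "qwen25vl", "qwen2.5vl")


def select_agent(model_name: str) -> Optional[str]:
    """Return "qwen25vl", "aguvis", or None for an unrecognized name.

    Hypothetical helper: the check order and the None fallback are assumptions.
    """
    name = model_name.lower()
    if any(substring in name for substring in QWEN_SUBSTRINGS):
        return "qwen25vl"
    if "aguvis" in name:
        return "aguvis"
    return None
```

With this sketch, a name such as `Qwen2.5-VL-7B-Instruct` resolves to the Qwen agent, while a name that matches neither rule returns `None` so the caller can report an unsupported model.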
- Python 3.9+
- Packages:
  - openai (>=1.0.0)
  - pillow
  - editdistance (auto-installed by the evaluator if missing)
Install:
pip install "openai>=1.0.0" pillow editdistance

Place trajectory JSON files under a directory, and their corresponding screenshots under an images/ subfolder.
Example structure:
AgentNetBench/
  single_data/
    20240917_xxx.json
    images/
      20240917_xxx_0.png
      20240917_xxx_1.png
Minimal trajectory schema (example):
{
  "task_id": "20240917_xxx",
  "high_level_task_description": "Open the browser and search for ...",
  "steps": [
    {
      "image": "20240917_xxx_0.png",
      "ground_truth_actions": [
        { "type": "moveTo", "params": { "position": {"x": 0.42, "y": 0.63} }, "metadata": {"bboxes": []} },
        { "type": "click", "params": { "position": {"x": 0.42, "y": 0.63} }, "metadata": {"bboxes": []} }
      ]
    }
  ]
}

Notes:
- Coordinates are relative (0–1). When metadata bounding boxes are available, evaluation accepts any click/move that falls inside the bbox.
- For write+enter sequences, the evaluator merges them for robust scoring.
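To make the coordinate notes concrete, here is a minimal sketch of a bounding-box hit test and a relative-to-pixel conversion. The bbox layout `(x_min, y_min, x_max, y_max)` in relative units and both function names are assumptions; the schema above only shows an empty `bboxes` list, so check the real data before relying on this.

```python
from typing import Iterable, Sequence, Tuple

# Assumed bbox layout (not confirmed by the source):
# (x_min, y_min, x_max, y_max), all relative to the screenshot size (0-1).
BBox = Sequence[float]


def hits_any_bbox(x: float, y: float, bboxes: Iterable[BBox]) -> bool:
    """Return True if the relative point (x, y) lies inside any bbox."""
    for x_min, y_min, x_max, y_max in bboxes:
        if x_min <= x <= x_max and y_min <= y <= y_max:
            return True
    return False


def to_pixels(x: float, y: float, width: int, height: int) -> Tuple[int, int]:
    """Convert relative coordinates to pixel coordinates for a given image size."""
    return round(x * width), round(y * height)
```

An empty `bboxes` list never matches in this sketch; how the evaluator scores a click when no bounding box is available is not covered here.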
python run.py \
--data single_data \
--image_dir single_data/images \
--output output \
--model qwen2.5-vl-7b \
--base_url http://YOUR_OPENAI_COMPATIBLE_SERVER/v1 \
--api_key YOUR_API_KEY \
--num_cores 10

Tips:
- Any of qwen2.5-vl, qwen-vl, qwen25vl, qwen2.5vl in --model selects Qwen2.5-VL. Include aguvis to select Aguvis.
- Absolute paths are recommended.
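Since absolute paths are recommended, one option is to build the command from a small script that resolves the paths first. The sketch below only assembles the argument list using the flags shown in the command above; the helper name `build_eval_command` is hypothetical and is not part of the repository.

```python
import sys
from pathlib import Path
from typing import List


def build_eval_command(
    data: str,
    image_dir: str,
    output: str,
    model: str,
    base_url: str,
    api_key: str,
    num_cores: int = 10,
) -> List[str]:
    """Assemble the run.py invocation with path arguments made absolute."""
    return [
        sys.executable, "run.py",
        "--data", str(Path(data).resolve()),
        "--image_dir", str(Path(image_dir).resolve()),
        "--output", str(Path(output).resolve()),
        "--model", model,
        "--base_url", base_url,
        "--api_key", api_key,
        "--num_cores", str(num_cores),
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains run.py.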
Results are written to:
<output>/eval_YYYYMMDD_HHMMSS_<sanitized_model>/
├─ <task_id>.json # per-trajectory step-level results
├─ metric.json # summary metrics
└─ hyperparams.json # run configuration
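Given the directory layout above, a short script can locate the most recent run and load its summary files. This is a sketch under assumptions: the helper names are hypothetical, and nothing is assumed about the keys inside metric.json or hyperparams.json, which are returned as parsed JSON.

```python
import json
import re
from pathlib import Path
from typing import Any, Optional, Tuple

# Matches the eval_YYYYMMDD_HHMMSS_ prefix of a run directory name.
EVAL_DIR_PATTERN = re.compile(r"^eval_(\d{8}_\d{6})_")


def latest_eval_dir(output_root: str) -> Optional[Path]:
    """Return the run directory with the newest timestamp, or None if there is none."""
    candidates = []
    for child in Path(output_root).iterdir():
        match = EVAL_DIR_PATTERN.match(child.name)
        if child.is_dir() and match:
            # YYYYMMDD_HHMMSS strings sort chronologically as plain text.
            candidates.append((match.group(1), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


def load_summary(eval_dir: Path) -> Tuple[Any, Any]:
    """Load metric.json and hyperparams.json from one run directory."""
    with open(eval_dir / "metric.json", encoding="utf-8") as f:
        metrics = json.load(f)
    with open(eval_dir / "hyperparams.json", encoding="utf-8") as f:
        hyperparams = json.load(f)
    return metrics, hyperparams
```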
Per-step result fields include:
- raw_response, parsed_action, predicted_actions
- evaluation.total and evaluation.actions per type
- used_actions and alternative_matched when an alternative ground truth option scores better
You can re-parse and re-score previously saved outputs (useful after evaluator updates):
python reeval.py \
--input_dir output/<eval_dir>

- openai import warnings in your IDE: install openai>=1.0.0.
- editdistance missing: it is installed on-the-fly by the evaluator; you can also pip install editdistance manually.
- No results or low scores: ensure --image_dir points to the correct images and your model name triggers the intended agent.
This repository is intended for research and benchmarking. Please review the project’s root-level license for terms.