Commit e8b2e04

chore: release 0.18.0

Author: semantic-release
1 parent f60df10 commit e8b2e04

2 files changed

Lines changed: 63 additions & 1 deletion

CHANGELOG.md

Lines changed: 62 additions & 0 deletions
@@ -1,6 +1,68 @@
 # CHANGELOG


+## v0.18.0 (2026-03-02)
+
+### Features
+
+- Migrate annotation pipeline from openadapt-ml to openadapt-evals
+([#64](https://github.com/OpenAdaptAI/openadapt-evals/pull/64),
+[`7896051`](https://github.com/OpenAdaptAI/openadapt-evals/commit/7896051e514aedd647faeba0383e4acba9bea5ab))
+
+* feat: migrate annotation pipeline from openadapt-ml to openadapt-evals
+
+Move annotation data classes, prompts, and utilities into openadapt_evals.annotation and consolidate
+three separate VLM call implementations into a shared openadapt_evals.vlm module.
+
+- New openadapt_evals/vlm.py: unified vlm_call() supporting consilium council, OpenAI, and
+Anthropic; extract_json() for LLM output parsing; image_bytes_from_path() helper - New
+openadapt_evals/annotation.py: AnnotatedStep/AnnotatedDemo data classes,
+ANNOTATION_SYSTEM_PROMPT/ANNOTATION_STEP_PROMPT constants, parse_annotation_response(),
+validate_annotations(), format_annotated_demo() - Updated scripts/record_waa_demos.py
+cmd_annotate_waa() to import from openadapt_evals instead of openadapt_ml - Updated
+scripts/refine_demo.py to use shared vlm_call/extract_json, refactored message builders to
+prompt+images interface - Updated scripts/convert_recording_to_demo.py to use shared vlm_call - 16
+new tests in tests/test_annotation.py, all existing tests pass
+
+Closes #59
+
+Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
+
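The changelog entry above mentions an `extract_json()` helper for parsing LLM output but does not show it. The following is a minimal sketch of how such a helper might work, assuming replies may wrap JSON in a Markdown code fence or surround it with prose. It is an illustration only, not the actual `openadapt_evals.vlm.extract_json` implementation, and its signature and return convention are assumptions.

```python
import json
import re
from typing import Any

# Hypothetical sketch; NOT the real openadapt_evals.vlm.extract_json.
FENCE = "`" * 3  # Markdown code-fence delimiter
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str) -> Any:
    """Return the first JSON object or array found in an LLM reply, else None."""
    # Prefer the contents of a fenced block if the model used one,
    # then fall back to scanning the whole reply.
    match = _FENCE_RE.search(text)
    candidates = [match.group(1)] if match else []
    candidates.append(text)

    decoder = json.JSONDecoder()
    for candidate in candidates:
        # Try decoding at every position where an object or array could start.
        for index, char in enumerate(candidate):
            if char in "{[":
                try:
                    value, _end = decoder.raw_decode(candidate, index)
                    return value
                except json.JSONDecodeError:
                    continue
    return None
```

Scanning with `raw_decode` tolerates trailing prose after the JSON value, at the cost of occasionally matching a bracketed fragment in ordinary text; a stricter helper might require a fence.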
+* fix: remove unused import and hoist model resolution in convert_recording_to_demo
+
+- Remove unused `import os` from openadapt_evals/vlm.py - Move `resolved_model` computation before
+the for-loop in convert_vlm() so it's computed once instead of redundantly inside each step's try
+block
+
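The fix above hoists a loop-invariant computation out of a per-step `try` block. A minimal sketch of that pattern follows; `resolve_model`, `convert_steps`, and the `describe` callback are hypothetical stand-ins and are not taken from `convert_recording_to_demo.py`.

```python
from typing import Callable, Optional

# Counter used only to show how often resolution runs.
calls = {"resolve": 0}


def resolve_model(model: Optional[str]) -> str:
    """Hypothetical stand-in for model-name resolution."""
    calls["resolve"] += 1
    return model or "default-model"


def convert_steps(
    steps: list[str],
    describe: Callable[[str, str], str],
    model: Optional[str] = None,
) -> list[str]:
    # Loop-invariant: resolve once, before iterating over the steps.
    resolved_model = resolve_model(model)
    results = []
    for step in steps:
        try:
            results.append(describe(step, resolved_model))
        except ValueError as exc:
            # A failure in one step does not abort the remaining steps.
            results.append(f"[failed: {exc}]")
    return results
```

With two steps, `resolve_model` runs once rather than twice, and the `try` block now covers only the work that can fail per step.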
+* fix: add timeouts, fix temperature regression, remove dead api_key param
+
+- vlm.py: add timeout=120s to OpenAI/Anthropic SDK clients to prevent indefinite hangs (old code had
+explicit timeouts via requests) - vlm.py: pass system prompt separately to consilium
+council_query() instead of concatenating into user prompt - refine_demo.py: explicitly pass
+temperature=1.0 to vlm_call() in holistic and per-step review to match old behavior (vlm_call
+defaults to 0.1 which would be an unintended behavioral change) - refine_demo.py: remove dead
+api_key parameter from run_holistic_review, run_per_step_review, refine_recording, and main() —
+vlm_call() reads API keys from environment via the SDK
+
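The temperature fix above comes down to how keyword defaults behave when call sites move to a shared helper: a caller that used to run at 1.0 silently inherits the helper's 0.1 unless it passes the value explicitly. The sketch below illustrates this with a hypothetical `vlm_call_settings` function; only the default values (0.1 and 120 seconds) come from the changelog, and the real `vlm_call()` signature may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallSettings:
    temperature: float
    timeout_s: float


def vlm_call_settings(temperature: float = 0.1, timeout_s: float = 120.0) -> CallSettings:
    """Hypothetical stand-in showing only the defaults named in the changelog."""
    return CallSettings(temperature=temperature, timeout_s=timeout_s)


# A caller that relies on the defaults gets the low-temperature behavior.
default_settings = vlm_call_settings()

# The review paths pass temperature explicitly to keep their previous behavior.
review_settings = vlm_call_settings(temperature=1.0)
```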
+---------
+
+Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
+
+### Refactoring
+
+- Deduplicate recording artifacts and use JPEG thumbnails
+([#65](https://github.com/OpenAdaptAI/openadapt-evals/pull/65),
+[`f60df10`](https://github.com/OpenAdaptAI/openadapt-evals/commit/f60df10a56cea3031e254a4f573a8487dc73b5e3))
+
+- Remove docs/artifacts/full/ (was a copy of waa_recordings/ PNGs) - Thumbnails now link to
+originals in waa_recordings/ for full-res - Switch thumbnails from PNG to JPEG (1.5 MB vs 3.0 MB
+for same images) - Un-gitignore waa_recordings/ (research data, should be tracked) - Gitignore
+docs/artifacts/full/ instead (regenerable) - Untrack benchmark_results/ (mock test output, already
+gitignored) - Move os import to module level in generate_demo_review.py
+
+Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
+
+
 ## v0.17.1 (2026-03-02)

 ### Bug Fixes

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

 [project]
 name = "openadapt-evals"
-version = "0.17.1"
+version = "0.18.0"
 description = "Evaluation infrastructure for GUI agent benchmarks"
 readme = "README.md"
 requires-python = ">=3.11"
