Commit 36ac839

semantic-release committed
chore: release 0.81.4
1 parent e71ed9f commit 36ac839

2 files changed

Lines changed: 61 additions & 1 deletion

CHANGELOG.md

Lines changed: 60 additions & 0 deletions
@@ -1,6 +1,66 @@
 # CHANGELOG


+## v0.81.4 (2026-03-29)
+
+### Bug Fixes
+
+- Add truncation warning to TRL generate paths
+([#242](https://github.com/OpenAdaptAI/openadapt-evals/pull/242),
+[`e71ed9f`](https://github.com/OpenAdaptAI/openadapt-evals/commit/e71ed9fe17168524963b564aa050bb4d4d4d305e))
+
+Add a truncation check after both generation paths (Outlines constrained and HF unconstrained) in
+generate_fn. When the output length reaches max_new_tokens - 1, a warning is logged suggesting to
+increase max_new_tokens or enable constrained_decoding. This helps diagnose cases where the model
+generates excessively long reasoning that gets cut off before producing a parseable action.
+
+Also replaced the tautological truncation tests in test_trl_robustness.py (which reimplemented the
+check logic inline) with tests that exercise the actual generate_fn code path by calling it
+through the rollout function with mocked torch and model.generate.
+
+Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
+
+- Use training-appropriate evaluate timeouts instead of reordering eval
+([#246](https://github.com/OpenAdaptAI/openadapt-evals/pull/246),
+[`114ad0e`](https://github.com/OpenAdaptAI/openadapt-evals/commit/114ad0e8bdc33c35a966ba820ad958fba4269550))
+
+Reverts the evaluate_dense reordering from #245 (local-first was too aggressive — skipped binary
+eval entirely, losing the signal when 5050 IS available).
+
+The actual fix: set evaluate_timeout=15s and evaluate_retries=1 on the WAALiveAdapter in the TRL
+wrapper. The evaluate_dense logic stays correct (try binary first, local fallback, take max).
+Training speed comes from fast failure, not from skipping evaluation paths.
+
+- Benchmarking: 180s timeout, 3 retries (thorough, one-shot) - Training: 15s timeout, 1 retry (fast
+feedback, thousands of evals)
+
+Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
+
+### Testing
+
+- Add 10 TRL parity tests for deprecation readiness
+([#241](https://github.com/OpenAdaptAI/openadapt-evals/pull/241),
+[`6a38956`](https://github.com/OpenAdaptAI/openadapt-evals/commit/6a38956f3da2776701b0b92b94134609e83f4d4d))
+
+Adds tests/test_trl_parity.py with 25 test cases covering the 10 areas identified in
+docs/STANDALONE_VS_TRL_COMPARISON.md as needed before the standalone GRPO trainer can be
+deprecated:
+
+1. Constrained decoding — Outlines generator build + ACTION_REGEX 2. Constrained decoding
+ImportError — returns None, not silent success 3. Prompt format identity — TRL imports
+SYSTEM_PROMPT from standalone 4. DSL round-trip parsing — CLICK, TYPE, WAIT, DONE via
+parse_action_json 5. Thought-prefix parsing — "Thought: ...\nAction: DSL" format 6. Unsloth
+loading — FastVisionModel.from_pretrained + get_peft_model 7. LoRA checkpoint resume —
+lora_checkpoint passed through config 8. HookBridge on_step_complete — callback fires with correct
+args 9. HookBridge unused hooks — on_before_collect/on_rollout_complete stored 10. _AgentOutput
+schema — Pydantic validation, JSON schema, roundtrip
+
+All tests are light (no torch/transformers/trl imports), use unittest.mock, and pass with [dev] deps
+only.
+
+Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
+
+
 ## v0.81.3 (2026-03-29)

 ### Bug Fixes
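
Note on the #242 entry in the CHANGELOG.md hunk above: the truncation check it describes can be sketched roughly as below. This is an illustration only, not the code in generate_fn. The helper name `warn_if_truncated`, its signature, the logger name, and the choice of `>=` for the comparison are assumptions; only the threshold (max_new_tokens - 1) and the two suggested remedies come from the changelog text.

```python
import logging

# Hypothetical logger name; the real module's logger is not shown in this commit.
logger = logging.getLogger("trl_truncation_sketch")


def warn_if_truncated(num_new_tokens: int, max_new_tokens: int, path: str) -> bool:
    """Log a warning and return True when generation likely hit the token limit.

    `path` is a free-form label for the generation path, for example
    "outlines_constrained" or "hf_unconstrained" (labels are assumptions).
    """
    # Threshold taken from the changelog: output length reaches max_new_tokens - 1.
    truncated = num_new_tokens >= max_new_tokens - 1
    if truncated:
        logger.warning(
            "%s output used %d of max_new_tokens=%d and may be truncated; "
            "consider increasing max_new_tokens or enabling constrained_decoding.",
            path,
            num_new_tokens,
            max_new_tokens,
        )
    return truncated


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    warn_if_truncated(511, 512, "hf_unconstrained")     # logs a warning
    warn_if_truncated(40, 512, "outlines_constrained")  # stays silent
```

Returning a boolean in addition to logging is a choice made for this sketch so the check can be asserted on directly. The tests described in the entry instead exercise generate_fn through the rollout function.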

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

 [project]
 name = "openadapt-evals"
-version = "0.81.3"
+version = "0.81.4"
 description = "Evaluation infrastructure for GUI agent benchmarks"
 readme = "README.md"
 requires-python = ">=3.10"
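
Note on the #246 entry in the CHANGELOG.md hunk: the two timeout profiles and the "try binary first, local fallback, take max" ordering can be sketched as below. This is one reading of the changelog text under stated assumptions, not the WAALiveAdapter or evaluate_dense implementation. `EvalProfile`, `evaluate_dense_sketch`, and the callable parameters are hypothetical names, and `evaluate_retries` is treated here as the total number of attempts, which the changelog does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class EvalProfile:
    """Hypothetical container for the two settings named in the changelog."""
    evaluate_timeout: float  # seconds
    evaluate_retries: int    # assumed to mean total attempts


# Values taken from the changelog entry.
BENCHMARK = EvalProfile(evaluate_timeout=180.0, evaluate_retries=3)  # thorough, one-shot
TRAINING = EvalProfile(evaluate_timeout=15.0, evaluate_retries=1)    # fast feedback


def evaluate_dense_sketch(
    binary_eval: Callable[[float], float],
    local_eval: Callable[[], float],
    profile: EvalProfile,
) -> float:
    """Try the binary evaluator first, fall back to local, and take the max."""
    binary_score: Optional[float] = None
    for _ in range(profile.evaluate_retries):
        try:
            # The evaluator receives the timeout and raises TimeoutError on expiry.
            binary_score = binary_eval(profile.evaluate_timeout)
            break
        except TimeoutError:
            continue  # fail fast: a short timeout and few attempts bound the wait
    local_score = local_eval()
    if binary_score is None:
        return local_score
    return max(binary_score, local_score)
```

Under this reading, the worst-case wait on an unreachable binary evaluator is bounded by 15 s per evaluation in the training profile versus 3 × 180 s in the benchmarking profile, which matches the entry's point that speed comes from fast failure rather than from skipping an evaluation path.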
