|
1 | 1 | # CHANGELOG |
2 | 2 |
|
3 | 3 |
|
| 4 | +## v0.81.4 (2026-03-29) |
| 5 | + |
| 6 | +### Bug Fixes |
| 7 | + |
| 8 | +- Add truncation warning to TRL generate paths |
| 9 | + ([#242](https://github.com/OpenAdaptAI/openadapt-evals/pull/242), |
| 10 | + [`e71ed9f`](https://github.com/OpenAdaptAI/openadapt-evals/commit/e71ed9fe17168524963b564aa050bb4d4d4d305e)) |
| 11 | + |
| 12 | +Add a truncation check after both generation paths (Outlines constrained and HF unconstrained) in |
| 13 | + generate_fn. When the output length reaches max_new_tokens - 1, a warning is logged that suggests |
| 14 | + increasing max_new_tokens or enabling constrained_decoding. This helps diagnose cases where the model |
| 15 | + generates excessively long reasoning that gets cut off before producing a parseable action. |
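
The truncation check described in this entry can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name `warn_if_truncated` and the logger name are hypothetical, and only the `max_new_tokens - 1` threshold and the two suggested remedies come from the entry itself.

```python
import logging

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("trl_truncation_sketch")


def warn_if_truncated(num_generated_tokens: int, max_new_tokens: int) -> bool:
    """Log a warning and return True when generation likely hit the token cap."""
    # Threshold taken from the changelog entry: output length reaches max_new_tokens - 1.
    truncated = num_generated_tokens >= max_new_tokens - 1
    if truncated:
        logger.warning(
            "Output length %d reached the max_new_tokens=%d limit; the action may be "
            "cut off. Consider increasing max_new_tokens or enabling constrained_decoding.",
            num_generated_tokens,
            max_new_tokens,
        )
    return truncated
```

A caller would run this once after each generation path, passing the number of newly generated tokens rather than the full sequence length.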
| 16 | + |
| 17 | +Also replaced the tautological truncation tests in test_trl_robustness.py (which reimplemented the |
| 18 | + check logic inline) with tests that exercise the actual generate_fn code path by calling it |
| 19 | + through the rollout function with mocked torch and model.generate. |
| 20 | + |
| 21 | +Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
| 22 | + |
| 23 | +- Use training-appropriate evaluate timeouts instead of reordering eval |
| 24 | + ([#246](https://github.com/OpenAdaptAI/openadapt-evals/pull/246), |
| 25 | + [`114ad0e`](https://github.com/OpenAdaptAI/openadapt-evals/commit/114ad0e8bdc33c35a966ba820ad958fba4269550)) |
| 26 | + |
| 27 | +Reverts the evaluate_dense reordering from #245 (local-first was too aggressive — skipped binary |
| 28 | + eval entirely, losing the signal when 5050 IS available). |
| 29 | + |
| 30 | +The actual fix: set evaluate_timeout=15s and evaluate_retries=1 on the WAALiveAdapter in the TRL |
| 31 | + wrapper. The evaluate_dense logic stays correct (try binary first, local fallback, take max). |
| 32 | + Training speed comes from fast failure, not from skipping evaluation paths. |
| 33 | + |
| 34 | +- Benchmarking: 180s timeout, 3 retries (thorough, one-shot) |
| 35 | +- Training: 15s timeout, 1 retry (fast feedback, thousands of evals) |
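
One way to picture the two timeout profiles and the "binary first, local fallback, take max" behaviour is the sketch below. It is an interpretation under stated assumptions: `EvalTimeouts`, `BENCHMARK`, `TRAINING`, and `evaluate_dense_sketch` are hypothetical names, the evaluators are plain callables, and treating a failed binary eval as a `TimeoutError` or `ConnectionError` is a guess. Only the numeric values (180s with 3 retries, 15s with 1 retry) come from the entry.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalTimeouts:
    """Hypothetical container for the two settings named in the entry."""
    evaluate_timeout: float  # seconds
    evaluate_retries: int


BENCHMARK = EvalTimeouts(evaluate_timeout=180.0, evaluate_retries=3)  # thorough, one-shot
TRAINING = EvalTimeouts(evaluate_timeout=15.0, evaluate_retries=1)    # fast feedback


def evaluate_dense_sketch(
    binary_eval: Callable[[EvalTimeouts], float],
    local_eval: Callable[[], float],
    timeouts: EvalTimeouts,
) -> float:
    """Try the binary evaluator first, fall back to local, and keep the best score."""
    scores = [local_eval()]  # the local score is always available
    try:
        scores.append(binary_eval(timeouts))
    except (TimeoutError, ConnectionError):
        # Assumption: a short timeout makes this branch cheap during training.
        pass
    return max(scores)
```

With the training profile, an unreachable binary evaluator costs at most one short attempt before the local score is used, which is the "fast failure" the entry refers to.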
| 36 | + |
| 37 | +Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
| 38 | + |
| 39 | +### Testing |
| 40 | + |
| 41 | +- Add 10 TRL parity tests for deprecation readiness |
| 42 | + ([#241](https://github.com/OpenAdaptAI/openadapt-evals/pull/241), |
| 43 | + [`6a38956`](https://github.com/OpenAdaptAI/openadapt-evals/commit/6a38956f3da2776701b0b92b94134609e83f4d4d)) |
| 44 | + |
| 45 | +Adds tests/test_trl_parity.py with 25 test cases covering the 10 areas identified in |
| 46 | + docs/STANDALONE_VS_TRL_COMPARISON.md as needed before the standalone GRPO trainer can be |
| 47 | + deprecated: |
| 48 | + |
| 49 | +1. Constrained decoding — Outlines generator build + ACTION_REGEX 2. Constrained decoding |
| 50 | + ImportError — returns None, not silent success 3. Prompt format identity — TRL imports |
| 51 | + SYSTEM_PROMPT from standalone 4. DSL round-trip parsing — CLICK, TYPE, WAIT, DONE via |
| 52 | + parse_action_json 5. Thought-prefix parsing — "Thought: ...\nAction: DSL" format 6. Unsloth |
| 53 | + loading — FastVisionModel.from_pretrained + get_peft_model 7. LoRA checkpoint resume — |
| 54 | + lora_checkpoint passed through config 8. HookBridge on_step_complete — callback fires with correct |
| 55 | + args 9. HookBridge unused hooks — on_before_collect/on_rollout_complete stored 10. _AgentOutput |
| 56 | + schema — Pydantic validation, JSON schema, roundtrip |
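
As an illustration of the thought-prefix format in item 5, the sketch below splits a "Thought: ...\nAction: ..." completion into its two parts. It is not the project's parser: `split_thought_action` is a hypothetical helper, and the action strings in the usage example are placeholders rather than the real DSL syntax.

```python
from typing import Optional, Tuple


def split_thought_action(text: str) -> Tuple[Optional[str], str]:
    """Split a completion into (thought, action); thought is None when no prefix exists."""
    thought, sep, action = text.partition("\nAction:")
    if not sep:
        # No "Action:" line: treat the whole completion as the action string.
        return None, text.strip()
    thought = thought.strip()
    if thought.startswith("Thought:"):
        thought = thought[len("Thought:"):].strip()
    return thought, action.strip()


# Usage with placeholder action strings:
print(split_thought_action("Thought: the dialog is open\nAction: DONE"))
print(split_thought_action("DONE"))
```

The action part would then be handed to the DSL parser, so the two concerns can be tested separately.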
| 57 | + |
| 58 | +All tests are light (no torch/transformers/trl imports), use unittest.mock, and pass with [dev] deps |
| 59 | + only. |
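
A light test of the kind described here, one that checks a missing optional dependency yields `None` without importing any heavy package, might look like the sketch below. The builder `build_constrained_generator` is a hypothetical stand-in for the real code. The technique shown is standard `unittest.mock`: setting a `sys.modules` entry to `None` makes the import raise `ImportError`.

```python
import importlib
import sys
from unittest import mock


def build_constrained_generator():
    """Hypothetical builder: returns None when the optional dependency is missing."""
    try:
        outlines = importlib.import_module("outlines")
    except ImportError:
        return None
    # A real builder would construct and return a generator here.
    return outlines


def test_missing_dependency_returns_none():
    # A None entry in sys.modules forces the import to fail, even if the
    # package happens to be installed in the test environment.
    with mock.patch.dict(sys.modules, {"outlines": None}):
        assert build_constrained_generator() is None
```

Because `mock.patch.dict` restores `sys.modules` on exit, the test leaves no trace for later tests in the same session.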
| 60 | + |
| 61 | +Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
| 62 | + |
| 63 | + |
4 | 64 | ## v0.81.3 (2026-03-29) |
5 | 65 |
|
6 | 66 | ### Bug Fixes |
|