Commit ee9cb79

semantic-release committed
chore: release 0.12.0

1 parent 339e5d3 commit ee9cb79

File tree

2 files changed: +144 −1 lines changed

CHANGELOG.md

Lines changed: 143 additions & 0 deletions
# CHANGELOG


## v0.12.0 (2026-03-03)

### Features

- Add GRPO training module with minimal TRL bridge
  ([#34](https://github.com/OpenAdaptAI/openadapt-ml/pull/34),
  [`339e5d3`](https://github.com/OpenAdaptAI/openadapt-ml/commit/339e5d35f8c7d0c9880ad3bed9cc748ee7e77945))
* docs: add experimental roadmap and evidence context to vision

  - Add 2x2 experimental matrix (retrieval × fine-tuning) to Core Thesis
  - Add evidence context to the benchmark table: note that it is an internal synthetic benchmark (~3 UI elements) that validates the pipeline, not real-world performance. Link to openadapt-evals for ongoing WAA/OSWorld evaluation.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: use 46.7% consistently in 2x2 matrix

  Was showing a 33-47% range, which conflated the preliminary (n=3) and full (n=45) results. The validated number is 46.7%.
* feat: add GRPO training module for online RL

  Add the openadapt_ml/training/grpo/ package with:
  - GRPOConfig for training hyperparameters
  - GRPORolloutCollector connecting to the openadapt-evals RLEnvironment
  - GRPOTrainer implementing a custom GRPO loop for multimodal VLMs
  - Binary reward function and group-relative advantage computation
  - Chain-of-thought warm-up pipeline for SFT pre-training
  - 20 unit tests passing without a GPU
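The group-relative advantage computation mentioned above can be sketched as follows. This is a minimal, dependency-free illustration of the standard GRPO normalization over a group of binary rewards (the real module operates on PyTorch tensors); the function name and `eps` parameter are illustrative, not the package's actual API.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style group-relative advantages: each rollout's binary reward
    minus the group mean, divided by the group standard deviation. A
    success in a mostly-failing group gets a large positive advantage,
    and a failure in a mostly-succeeding group a large negative one."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Binary rewards for a group of 4 rollouts: two task successes, two failures.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Note that when every rollout in a group gets the same reward, all advantages are ~0 and the group contributes no learning signal, which is why group composition matters for GRPO.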
* fix: address review findings in GRPO module

  - Replace copy.deepcopy(model) with a LoRA state-dict snapshot (prevents OOM)
  - Mark _compute_rollout_loss as a scaffold with a dummy forward pass for gradient flow
  - Fix the collect_rollout call to match the RLEnvironment API (task_id in the signature)
  - Add model.eval()/model.train() toggling around the rollout/training phases
  - Remove the unused gradient_accumulation_steps config field
  - Use the actual screen_size from RLEnvironment instead of hardcoded 1920x1200
  - Clamp CLICK coordinates to [0.0, 1.0] to prevent invalid pixel values
  - Validate that task_ids is non-empty at the start of train()
  - Export CoT warm-up functions from the package __init__
  - Add a BenchmarkAction fallback for when openadapt-evals is not installed
  - Add 9 new tests: action parser (8) + empty task_ids validation (1)
  - All 29 tests passing
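The coordinate-clamping fix above can be illustrated with a small sketch. The helper names and the pixel-conversion step are hypothetical (the real code clamps inside its action parser); the point is that VLM-emitted normalized coordinates can drift slightly outside the unit square, and clamping before scaling prevents invalid pixel values.

```python
def clamp_click_coords(x: float, y: float) -> tuple[float, float]:
    """Clamp normalized CLICK coordinates into [0.0, 1.0]."""
    clamp = lambda v: max(0.0, min(1.0, v))
    return clamp(x), clamp(y)

def to_pixels(x: float, y: float,
              screen_size: tuple[int, int] = (1920, 1080)) -> tuple[int, int]:
    """Convert clamped normalized coordinates to integer pixel positions.

    Without the clamp, a model output like y = 1.2 would map to pixel
    row 1296 on a 1080-row screen, an out-of-bounds click target."""
    cx, cy = clamp_click_coords(x, y)
    return round(cx * screen_size[0]), round(cy * screen_size[1])
```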
* feat: implement GRPO loss computation and fix cot_warmup dependency

  Implement the core _compute_rollout_loss method that was previously a NotImplementedError scaffold. The implementation:

  - Reconstructs VLM prompts from rollout observations
  - Formats actions back to DSL text via the new _format_action_as_text helper
  - Computes log-probabilities of action tokens under the current policy
  - Computes reference-policy log-probs via PEFT disable_adapter(), with a fallback to manual LoRA weight swapping
  - Returns the GRPO loss: -advantage * log_prob + kl_coef * KL penalty

  Also adds a get_api_adapter() factory function to api_adapter.py, fixing the broken import in cot_warmup.py's generate_cot_annotations().

  Additional review fixes from the prior session:
  - Initialize _is_unsloth and _ref_lora_state in __init__
  - Remove the dead else branch for task_id selection
  - Fix total_loss device placement
  - LoRA-only fallback save in checkpoint
  - TYPE regex accepts single quotes
  - Coordinate clamping in _parse_vlm_output_to_action

  40 tests passing (10 new: 8 format_action + 1 roundtrip + 1 api_adapter).
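The loss shape named in this commit (-advantage * log_prob + kl_coef * KL penalty) can be sketched per step as below. This is a dependency-free scalar illustration, not the module's tensor implementation, and the choice of the k3 KL estimator is an assumption on my part; the commit message does not specify which estimator is used.

```python
import math

def grpo_step_loss(log_prob: float, ref_log_prob: float,
                   advantage: float, kl_coef: float = 0.01) -> float:
    """Per-step GRPO-style loss: policy-gradient term plus KL penalty.

    The KL penalty here uses the k3 estimator exp(d) - d - 1 with
    d = ref_log_prob - log_prob (an assumption); it is non-negative
    and exactly zero when the current policy matches the reference.
    """
    pg = -advantage * log_prob          # reinforce the action if advantage > 0
    d = ref_log_prob - log_prob
    kl = math.exp(d) - d - 1.0          # penalize drift from the reference policy
    return pg + kl_coef * kl
```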
* refactor: deduplicate GRPO prompts via shared _build_agent_messages

  Extract prompt construction into _build_agent_messages(), which imports SYSTEM_PROMPT from next_action.py (the SFT training prompt). This ensures the GRPO agent uses the same prompt distribution the model was warm-started on, and guarantees that _make_agent_fn and _compute_rollout_loss use identical prompts (critical for correct log-prob computation).
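A shared prompt builder of the kind this commit describes might look like the sketch below. The message structure and field names follow the common HuggingFace multimodal chat-template shape but are assumptions; the placeholder SYSTEM_PROMPT stands in for the real constant imported from next_action.py.

```python
SYSTEM_PROMPT = "You are a GUI automation agent."  # placeholder for the real SFT prompt

def build_agent_messages(instruction: str, screenshot: object) -> list[dict]:
    """Single source of truth for agent prompts.

    Both the rollout agent and the loss computation call this function,
    so the tokens whose log-probs are optimized are guaranteed to sit
    behind exactly the prompt the model saw when acting."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "image", "image": screenshot},
            {"type": "text", "text": f"Goal: {instruction}"},
        ]},
    ]
```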
* fix(grpo): address critical review findings in GRPO loss computation

  - C-01: Store the raw model output on action._grpo_raw_text for accurate loss computation
  - C-02: Tokenize the prompt and action separately and concatenate, fixing BPE boundary alignment
  - I-01: Prefer LoRA weight swapping over disable_adapter() for the reference policy (captures the initial LoRA state after the SFT warm-start)
  - I-03: Per-step gradient accumulation via an immediate backward() to prevent OOM from building a computation graph over all rollout steps
  - I-04: Fix the unescape order in the TYPE parser (backslashes before quotes)
  - M-03: Pass model_name through get_api_adapter to ApiVLMAdapter
  - M-07: Case-insensitive CLICK/TYPE regex in _parse_vlm_output_to_action
  - L-01: Extract a DEFAULT_SCREEN_SIZE constant and replace all hardcoded values
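The C-02 fix deserves a concrete illustration. Real BPE tokenizers do not guarantee that tokenize(prompt + action) equals tokenize(prompt) + tokenize(action), so slicing action log-probs by prompt-token count can misalign. The toy tokenizer below (an invented stand-in, not any real vocabulary) reproduces the failure mode: a token fuses across the prompt/action boundary.

```python
import re

def toy_tokenize(text: str) -> list[str]:
    """Toy subword-style tokenizer: a token is an optional leading space
    plus a run of non-space characters (roughly how BPE vocabularies
    store tokens such as ' CLICK')."""
    return re.findall(r" ?\S+", text)

prompt = "Goal: open settings\nAction:"
action = "CLICK(0.50, 0.25)"

# Joint tokenization fuses "Action:" and "CLICK(..." into one token,
# so len(toy_tokenize(prompt)) no longer marks where the action starts
# and the action log-prob slice is misaligned.
joint = toy_tokenize(prompt + action)

# Fix (C-02): tokenize prompt and action separately, then concatenate,
# so the action token span is known exactly.
prompt_ids = toy_tokenize(prompt)
action_ids = toy_tokenize(action)
combined = prompt_ids + action_ids
action_span = slice(len(prompt_ids), len(combined))
```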
* fix(grpo): fix instruction propagation, screen size, weight swap safety

  - CR-01: The task instruction was never populated during GRPO rollouts. WAALiveAdapter._get_observation() does not populate raw_observation, so the agent prompt said "Goal: " with nothing after it. Fix: store the instruction on the Rollout dataclass (populated from env._current_task in the collector) and use it in both agent_fn and _compute_rollout_loss.
  - IM-01: Change DEFAULT_SCREEN_SIZE from 1920x1200 to 1920x1080 for consistency with the baselines module and standard VM configurations. Add a screen_size field to GRPOConfig so it is configurable.
  - IM-02: Add try/finally around the LoRA weight swap in _compute_ref_log_probs. Without this, an exception during the reference forward pass permanently corrupts the model state.
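The IM-02 pattern (restore swapped weights even when the forward pass raises) can be sketched as a context manager. This is a simplified illustration over plain dicts; the real code swaps LoRA tensors in a model state dict, and the helper name is hypothetical.

```python
from contextlib import contextmanager

@contextmanager
def swapped_weights(params: dict, snapshot: dict):
    """Temporarily swap parameter values for a snapshot (e.g. the
    post-SFT LoRA state used as the reference policy).

    The try/finally guarantees the current weights are restored even
    if the reference forward pass raises, so an exception cannot leave
    the model permanently running on reference weights."""
    current = {k: params[k] for k in snapshot}
    params.update(snapshot)
    try:
        yield
    finally:
        params.update(current)
```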
* fix(grpo): remove unused torch import in _setup_model

  The import of torch at line 121 was flagged by ruff (F401) as unused. The surrounding code only calls .detach().clone() on tensor objects, which does not require the torch module directly.
* style(grpo): apply ruff formatting to GRPO module files

  Run ruff format on cot_warmup.py, rollout_collector.py, and trainer.py to satisfy the CI ruff formatter check.
* refactor(grpo): replace custom trainer with minimal TRL bridge

  Replace the 809-line custom GRPO trainer with ~280 lines that:
  - Use standard HuggingFace AutoModelForVision2Seq + AutoProcessor + PEFT LoraConfig instead of Unsloth monkey-patching
  - Implement a standalone GRPO loss in ~15 lines of PyTorch (clipped surrogate) instead of a custom policy gradient + KL penalty
  - Use beta=0.0 (no KL penalty, no reference model) per the DAPO/Open-Reasoner-Zero literature, eliminating weight-swap complexity
  - Keep per-step backward to avoid OOM on long trajectories
  - Use standard model.save_pretrained() for checkpointing
  - Document WHY standalone GRPO math is used instead of TRL's GRPOTrainer (VLM multi-turn image pixel_values are not stored in token IDs) and WHEN to switch

  Preserves the entire public API: GRPOTrainer, _parse_vlm_output_to_action, _format_action_as_text, _build_agent_messages, DEFAULT_SCREEN_SIZE. All 50 tests pass (44 existing + 6 new for grpo_loss and trainer internals).
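The clipped-surrogate loss with beta=0 described above can be sketched dependency-free over per-step scalars (the real implementation uses PyTorch tensors). With single-epoch on-policy data the ratio is exactly 1.0 and the clip never fires, which is the observation behind the later rename to policy_gradient_loss.

```python
import math

def policy_gradient_loss(log_probs: list[float], old_log_probs: list[float],
                         advantages: list[float], clip_eps: float = 0.2) -> float:
    """Clipped-surrogate policy-gradient loss, beta=0 (no KL term).

    On single-epoch on-policy data log_probs == old_log_probs, so every
    ratio is 1.0, clipping never fires, and this reduces to REINFORCE
    with the supplied (group-relative) advantages."""
    total = 0.0
    for lp, old_lp, adv in zip(log_probs, old_log_probs, advantages):
        ratio = math.exp(lp - old_lp)
        unclipped = ratio * adv
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio)) * adv
        total += -min(unclipped, clipped)   # pessimistic (PPO-style) bound
    return total / len(advantages)
```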
* feat(grpo): add E2E tests with artifact generation and architecture docs

  - tests/test_grpo_e2e.py: 5 E2E tests (training loop, rollout collection, loss convergence, weight diff, mathematical properties) using a tiny mock VLM. Produces 65+ artifacts (JSON traces, PNGs, checkpoints, summaries).
  - scripts/grpo_e2e_report.py: CLI report generator for test artifacts (text + optional HTML output)
  - docs/grpo_e2e_test_design.md: design rationale for the E2E test approach
  - docs/grpo_architecture_analysis.md: analysis of custom vs TRL-based GRPO
  - docs/grpo_trl_rewrite_draft.py: TRL v0.29.0 integration research
  - docs/strategic_analysis_evals_ml_synergy.md: business/economics analysis
* fix(grpo): address self-review findings (BUG-01, CLEAN-01 through -05)

  - Rename grpo_loss to policy_gradient_loss with an honest docstring: single-epoch on-policy training means ratio=1.0 and clipping never fires, so this is REINFORCE with group-relative advantages. Keep grpo_loss as a backwards-compatible alias.
  - Add public aliases parse_vlm_output_to_action and format_action_as_text (dropping the underscore prefix for the public API)
  - Export policy_gradient_loss and the public functions from __init__.py
  - Remove unused config fields: kl_coef (was 0.01 but never used with beta=0) and max_seq_length (never referenced)
  - Fix the model_name default: Qwen/Qwen2.5-VL-7B-Instruct (not the unsloth variant)
  - Fix a trivial test assertion: grad_norm > 0 (was >= 0, which is always true)
  - Update the loss tests to verify gradient direction, not just loss sign
  - Add test_public_api_exports for the new public names

  56 tests pass (51 unit + 5 E2E).
---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>


## v0.11.2 (2026-02-25)

### Bug Fixes

pyproject.toml

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 [project]
 name = "openadapt-ml"
-version = "0.11.2"
+version = "0.12.0"
 description = "Model-agnostic, domain-agnostic ML engine for GUI automation agents"
 readme = "README.md"
 requires-python = ">=3.10"
