# CHANGELOG

## v0.12.0 (2026-03-03)

### Features

- Add GRPO training module with minimal TRL bridge
  ([#34](https://github.com/OpenAdaptAI/openadapt-ml/pull/34),
  [`339e5d3`](https://github.com/OpenAdaptAI/openadapt-ml/commit/339e5d35f8c7d0c9880ad3bed9cc748ee7e77945))

* docs: add experimental roadmap and evidence context to vision

  - Add 2x2 experimental matrix (retrieval × fine-tuning) to Core Thesis
  - Add evidence context to the benchmark table: note that it is an internal
    synthetic benchmark (~3 UI elements) that validates the pipeline, not
    real-world performance. Link to openadapt-evals for ongoing WAA/OSWorld
    evaluation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use 46.7% consistently in 2x2 matrix

  The matrix was showing a 33-47% range, which conflated preliminary (n=3)
  and full (n=45) results. The validated number is 46.7%.

* feat: add GRPO training module for online RL

  Add the openadapt_ml/training/grpo/ package with:

  - GRPOConfig for training hyperparameters
  - GRPORolloutCollector connecting to the openadapt-evals RLEnvironment
  - GRPOTrainer implementing a custom GRPO loop for multimodal VLMs
  - Binary reward function and group-relative advantage computation
  - Chain-of-thought warm-up pipeline for SFT pre-training
  - 20 unit tests passing without a GPU

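The group-relative advantage computation named above can be sketched in a few lines (an illustrative sketch, not the module's actual code; the function name and epsilon are assumptions): each rollout's binary reward is normalized against the mean and standard deviation of its group of rollouts for the same task.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its group's statistics.

    With binary rewards, a successful rollout in a mostly-failing group
    receives a large positive advantage, and vice versa.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 rollouts of the same task, one success (binary reward).
advs = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
# The successful rollout gets a positive advantage; failures get negative,
# and the advantages sum to (approximately) zero within the group.
```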
* fix: address review findings in GRPO module

  - Replace copy.deepcopy(model) with a LoRA state dict snapshot (prevents OOM)
  - Mark _compute_rollout_loss as a scaffold with a dummy forward pass for
    gradient flow
  - Fix the collect_rollout call to match the RLEnvironment API (task_id in
    signature)
  - Add model.eval()/model.train() toggling around rollout/training phases
  - Remove the unused gradient_accumulation_steps config field
  - Use the actual screen_size from RLEnvironment instead of a hardcoded
    1920x1200
  - Clamp CLICK coordinates to [0.0, 1.0] to prevent invalid pixel values
  - Validate that task_ids is non-empty at the start of train()
  - Export CoT warmup functions from the package __init__
  - Add a BenchmarkAction fallback when openadapt-evals is not installed
  - Add 9 new tests: action parser (8) + empty task_ids validation (1)
  - All 29 tests passing

* feat: implement GRPO loss computation and fix cot_warmup dependency

  Implement the core _compute_rollout_loss method that was previously a
  NotImplementedError scaffold. The implementation:

  - Reconstructs VLM prompts from rollout observations
  - Formats actions back to DSL text via the new _format_action_as_text helper
  - Computes log-probabilities of action tokens under the current policy
  - Computes reference-policy log-probs via PEFT disable_adapter(), with a
    fallback to manual LoRA weight swapping
  - Returns the GRPO loss: -advantage * log_prob + kl_coef * KL penalty

  Also adds a get_api_adapter() factory function to api_adapter.py, fixing the
  broken import in cot_warmup.py's generate_cot_annotations().

  Additional review fixes from the prior session:

  - Initialize _is_unsloth and _ref_lora_state in __init__
  - Remove the dead else branch for task_id selection
  - Fix total_loss device placement
  - LoRA-only fallback save in checkpoint
  - TYPE regex accepts single quotes
  - Coordinate clamping in _parse_vlm_output_to_action

  40 tests passing (10 new: 8 format_action + 1 roundtrip + 1 api_adapter).

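The loss line above is concrete enough to sketch as a scalar example (illustrative only: the real method operates on token log-prob tensors, and the specific KL estimator here, the non-negative k3 form common in RLHF code, is an assumption, since the entry only states the loss shape).

```python
import math

def grpo_step_loss(log_prob, ref_log_prob, advantage, kl_coef=0.01):
    """GRPO per-step loss as described: -advantage * log_prob + kl_coef * KL.

    KL uses the non-negative k3 estimator exp(r) - r - 1, r = ref - current.
    """
    log_ratio = ref_log_prob - log_prob
    kl_penalty = math.exp(log_ratio) - log_ratio - 1.0
    return -advantage * log_prob + kl_coef * kl_penalty

# When the policy matches the reference, the KL term vanishes
# and the loss is just -advantage * log_prob.
loss = grpo_step_loss(log_prob=-1.0, ref_log_prob=-1.0, advantage=1.0)
```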
* refactor: deduplicate GRPO prompts via shared _build_agent_messages

  Extract prompt construction into _build_agent_messages(), which imports
  SYSTEM_PROMPT from next_action.py (the SFT training prompt). This ensures
  the GRPO agent uses the same prompt distribution the model was warm-started
  on, and guarantees that _make_agent_fn and _compute_rollout_loss use
  identical prompts (critical for correct log-prob computation).

* fix(grpo): address critical review findings in GRPO loss computation

  - C-01: Store raw model output on action._grpo_raw_text for accurate loss
  - C-02: Separate tokenization of prompt/action with concatenation to fix
    BPE boundary alignment
  - I-01: Prefer LoRA weight swapping over disable_adapter() for the reference
    policy (captures the initial LoRA state after SFT warm-start)
  - I-03: Per-step gradient accumulation via immediate backward() to prevent
    OOM from building a computation graph over all rollout steps
  - I-04: Fix unescape order in the TYPE parser (backslash before quotes)
  - M-03: Pass model_name through get_api_adapter to ApiVLMAdapter
  - M-07: Case-insensitive CLICK/TYPE regex in _parse_vlm_output_to_action
  - L-01: Extract a DEFAULT_SCREEN_SIZE constant, replace all hardcoded values

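The C-02 fix is worth a sketch: tokenizing prompt + action as one string lets BPE merge a token across the boundary, so the positions of the action tokens become ambiguous. Tokenizing each part separately and concatenating the ids keeps the boundary exact. A minimal illustration with a toy whitespace tokenizer; the helper name is hypothetical, not the module's API.

```python
def split_tokenize(tokenize, prompt, action):
    """Tokenize prompt and action separately, then concatenate.

    Joint tokenization can merge a token across the prompt/action
    boundary, making it impossible to know which positions the action
    log-probs should be read from. Separate tokenization keeps the
    boundary exact: ids[action_slice] covers exactly the action tokens.
    """
    prompt_ids = tokenize(prompt)
    action_ids = tokenize(action)
    input_ids = prompt_ids + action_ids
    action_slice = slice(len(prompt_ids), len(input_ids))
    return input_ids, action_slice

# Toy whitespace "tokenizer" for illustration only.
toy = lambda s: s.split()
ids, sl = split_tokenize(toy, "Goal: open settings", "CLICK(0.5, 0.5)")
```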
* fix(grpo): fix instruction propagation, screen size, weight swap safety

  - CR-01: The task instruction was never populated during GRPO rollouts.
    WAALiveAdapter._get_observation() does not populate raw_observation, so
    the agent prompt said "Goal: " with nothing after it. Fix: store the
    instruction on the Rollout dataclass (populated from env._current_task in
    the collector), and use it in both agent_fn and _compute_rollout_loss.
  - IM-01: Change DEFAULT_SCREEN_SIZE from 1920x1200 to 1920x1080 for
    consistency with the baselines module and standard VM configurations. Add
    a screen_size field to GRPOConfig so it is configurable.
  - IM-02: Add try/finally around the LoRA weight swap in
    _compute_ref_log_probs. Without this, an exception during the reference
    forward pass permanently corrupts the model state.

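The IM-02 pattern generalizes: any temporary weight swap must restore state on the error path. A minimal sketch with plain dicts standing in for module parameters (names are illustrative, not the module's API):

```python
def with_swapped_weights(params, ref_state, forward):
    """Run forward() under reference weights, restoring the current
    weights even if the forward pass raises (the IM-02 fix: without
    try/finally, an exception would leave the model corrupted)."""
    saved = {name: params[name] for name in ref_state}
    params.update(ref_state)
    try:
        return forward(params)
    finally:
        params.update(saved)

params = {"lora_A": 1.0}
out = with_swapped_weights(params, {"lora_A": 0.0}, lambda p: p["lora_A"])
# out is the reference value; params are restored afterwards,
# including when forward() raises.
```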
* fix(grpo): remove unused torch import in _setup_model

  The import of torch at line 121 was flagged by ruff (F401) as unused. The
  surrounding code only calls .detach().clone() on tensor objects, which does
  not require the torch module directly.

* style(grpo): apply ruff formatting to GRPO module files

  Run ruff format on cot_warmup.py, rollout_collector.py, and trainer.py to
  satisfy the CI ruff formatter check.

* refactor(grpo): replace custom trainer with minimal TRL bridge

  Replace the 809-line custom GRPO trainer with ~280 lines that:

  - Use standard HuggingFace AutoModelForVision2Seq + AutoProcessor + PEFT
    LoraConfig instead of Unsloth monkey-patching
  - Implement a standalone GRPO loss in ~15 lines of PyTorch (clipped
    surrogate) instead of a custom policy gradient + KL penalty
  - Use beta=0.0 (no KL penalty, no reference model) per the
    DAPO/Open-Reasoner-Zero literature, eliminating weight-swap complexity
  - Keep per-step backward to avoid OOM on long trajectories
  - Use standard model.save_pretrained() for checkpointing
  - Document WHY standalone GRPO math vs TRL GRPOTrainer (VLM multi-turn
    image pixel_values are not stored in token IDs) and WHEN to switch

  Preserves the full public API: GRPOTrainer, _parse_vlm_output_to_action,
  _format_action_as_text, _build_agent_messages, DEFAULT_SCREEN_SIZE. All 50
  tests pass (44 existing + 6 new for grpo_loss and trainer internals).

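The "~15 lines of PyTorch" are not shown in the changelog, but the clipped-surrogate shape it names can be sketched per action as a scalar (illustrative only; the module's version operates on token log-prob tensors, and eps=0.2 is an assumed clip range, not a value stated here):

```python
import math

def clipped_surrogate_loss(log_prob, old_log_prob, advantage, eps=0.2):
    """PPO-style clipped surrogate; the loss is the negated objective.

    The probability ratio new/old is clipped to [1 - eps, 1 + eps] so a
    single update cannot move the policy too far from the sampling policy.
    """
    ratio = math.exp(log_prob - old_log_prob)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return -min(unclipped, clipped)

# On-policy (ratio = 1) the clip is inactive and the loss is -advantage.
loss = clipped_surrogate_loss(-1.0, -1.0, advantage=2.0)
```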
* feat(grpo): add E2E tests with artifact generation and architecture docs

  - tests/test_grpo_e2e.py: 5 E2E tests (training loop, rollout collection,
    loss convergence, weight diff, mathematical properties) using a tiny mock
    VLM. Produces 65+ artifacts (JSON traces, PNGs, checkpoints, summaries).
  - scripts/grpo_e2e_report.py: CLI report generator for test artifacts
    (text + optional HTML output)
  - docs/grpo_e2e_test_design.md: design rationale for the E2E test approach
  - docs/grpo_architecture_analysis.md: analysis of custom vs TRL-based GRPO
  - docs/grpo_trl_rewrite_draft.py: TRL v0.29.0 integration research
  - docs/strategic_analysis_evals_ml_synergy.md: business/economics analysis

* fix(grpo): address self-review findings (BUG-01, CLEAN-01 through -05)

  - Rename grpo_loss to policy_gradient_loss with an honest docstring:
    single-epoch on-policy training means ratio=1.0 and the clipping never
    fires, so this is REINFORCE with group-relative advantages. Keep
    grpo_loss as a backwards-compatible alias.
  - Add public aliases: parse_vlm_output_to_action, format_action_as_text
    (drop the underscore prefix for the public API)
  - Export policy_gradient_loss and the public functions from __init__.py
  - Remove unused config fields: kl_coef (was 0.01 but never used with
    beta=0) and max_seq_length (never referenced)
  - Fix the model_name default: Qwen/Qwen2.5-VL-7B-Instruct (not the unsloth
    variant)
  - Fix a trivial test assertion: grad_norm > 0 (was >= 0, always true)
  - Update the loss tests to verify gradient direction, not just loss sign
  - Add test_public_api_exports for the new public names

  56 tests pass (51 unit + 5 E2E).

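The renaming above reflects an identity worth making concrete: with a single on-policy epoch the importance ratio is exactly 1, clipping never fires, and the objective reduces to REINFORCE with group-relative advantages. A scalar sketch of that reduced form (the body is illustrative, not the module's exact implementation):

```python
def policy_gradient_loss(log_probs, advantages):
    """REINFORCE with advantages: mean over steps of -advantage * log_prob.

    Raising the log-prob of positive-advantage actions lowers the loss;
    raising it for negative-advantage actions raises the loss.
    """
    per_step = [-a * lp for lp, a in zip(log_probs, advantages)]
    return sum(per_step) / len(per_step)

# Two steps: one positive-advantage action, one negative.
loss = policy_gradient_loss([-1.0, -2.0], [1.0, -1.0])
```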
---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

## v0.11.2 (2026-02-25)

|
### Bug Fixes
|