Skip to content

fix: use build_agent_messages for TRL prompt + fix 4x over-generation#247

Merged
abrichr merged 1 commit into
mainfrom
fix/trl-user-message-construction
Mar 29, 2026
Merged

fix: use build_agent_messages for TRL prompt + fix 4x over-generation#247
abrichr merged 1 commit into
mainfrom
fix/trl-user-message-construction

Conversation

@abrichr
Copy link
Copy Markdown
Member

@abrichr abrichr commented Mar 29, 2026

Summary

Two critical fixes for the client's ongoing TRL testing:

1. Garbage output persists after #236

The system prompt was fixed (#236) but the user message construction was still different. The standalone trainer wraps the instruction with:

  • Goal: prefix
  • Format guidance ("Look at the screenshot...\nThought: [...]\nAction: [...]")
  • {"type": "image"} placeholder (Qwen's expected format)

The TRL path was passing just the raw instruction string. Now imports and uses build_agent_messages() from the standalone trainer so both paths produce identical messages.

Also adds a one-time log of the first 300 chars of the rendered prompt so operators can verify the correct format.

2. 4x over-generation

per_device_train_batch_size=num_gen with dataset padding caused 4 identical prompts × 4 generations = 16 rollouts per step. The standalone trainer does 4.

Fix: per_device_train_batch_size=1, generation_batch_size=num_gen. One unique prompt per step, num_gen rollouts. No dataset padding. Matches standalone behavior exactly.

Test plan

  • 32 TRL tests pass
  • Client re-test — should see DSL output and 4 rollouts per step (not 16)

🤖 Generated with Claude Code

Two critical fixes:

1. Garbage output root cause: TRL constructed user messages differently
   from the standalone trainer. Standalone wraps instruction with
   "Goal:" prefix, format guidance, and {"type": "image"} placeholder.
   TRL passed raw instruction text. Now imports build_agent_messages
   from standalone.prompt so both paths produce identical messages.

2. 4x over-generation: batch_size=num_gen with padded dataset caused
   4 identical prompts × 4 generations = 16 rollouts (standalone does 4).
   Now: batch_size=1, generation_batch_size=num_gen. One unique prompt
   per step with num_gen rollouts. No dataset padding needed.

Also adds one-time prompt logging for operator verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abrichr abrichr merged commit 1d94899 into main Mar 29, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant