fix: comprehensive prompt diagnostics for debugging garbage output #248

Merged
abrichr merged 1 commit into main from fix/trl-comprehensive-prompt-debug
Mar 29, 2026
@abrichr abrichr commented Mar 29, 2026

Summary

The garbage output persists after #247, so more diagnostic data is needed to isolate the root cause. This PR adds comprehensive one-time logging:

  1. Raw messages — role, content types, text preview (before chat template)
  2. Full rendered prompt — 2000 chars (was 300)
  3. Image metadata — mode, size, format
  4. Generation config — max_new_tokens, temperature, constrained, model type, device
  5. First generation output — 500 chars + token count
  6. Input tensor shapes — input_ids, attention_mask, pixel_values, image_grid_thw
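
The "one-time" aspect of the logging listed above can be sketched as a small guard that emits each diagnostic only the first time it is requested. This is a minimal illustration, not the code merged in this PR: the logger name, `log_once`, and `preview` are hypothetical names introduced here, and only the 2000-character limit comes from the list.

```python
import logging

# Hypothetical logger name; the real module's logger may differ.
logger = logging.getLogger("trl_prompt_debug")

# Keys of diagnostics that have already been emitted in this process.
_already_logged: set[str] = set()


def log_once(key: str, message: str, *args: object) -> bool:
    """Log `message` the first time `key` is seen.

    Returns True if the message was logged, False if it was suppressed.
    """
    if key in _already_logged:
        return False
    _already_logged.add(key)
    logger.info(message, *args)
    return True


def preview(text: str, limit: int = 2000) -> str:
    """Truncate a rendered prompt to `limit` characters for logging."""
    return text[:limit]


# Example: only the first call for a given key produces a log line.
first = log_once("text_input", "TRL prompt text_input (%d chars): %s", 5, "hello")
second = log_once("text_input", "TRL prompt text_input (%d chars): %s", 5, "hello")
```

Keying the guard by diagnostic name keeps each of the six items independent, so a later item still logs even if an earlier one has already fired.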

Key hypothesis: if pixel_values is MISSING from the inputs, the model never receives the screenshot. That would explain degenerate output regardless of prompt correctness, because the model is effectively blind.
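
A rough sketch of how such a shape check might look, assuming the processor output behaves like a mapping from key names to objects with a `.shape` attribute (as tensors do). `describe_input_shapes` and `EXPECTED_KEYS` are hypothetical names, and the stand-in objects below replace real tensors so the example runs without any ML dependencies.

```python
from types import SimpleNamespace
from typing import Any, Mapping

# The four keys named in the PR description.
EXPECTED_KEYS = ("input_ids", "attention_mask", "pixel_values", "image_grid_thw")


def describe_input_shapes(inputs: Mapping[str, Any]) -> str:
    """Return 'key=shape' pairs, using MISSING for absent keys."""
    parts = []
    for key in EXPECTED_KEYS:
        value = inputs.get(key)
        # A key counts as missing if it is absent or has no shape attribute.
        shape = getattr(value, "shape", None)
        parts.append(f"{key}={shape if shape is not None else 'MISSING'}")
    return " ".join(parts)


# Stand-ins for tensors: any object with a `.shape` attribute works here.
text_only_inputs = {
    "input_ids": SimpleNamespace(shape=(1, 12)),
    "attention_mask": SimpleNamespace(shape=(1, 12)),
}
summary = describe_input_shapes(text_only_inputs)
```

With the text-only inputs above, both image-related keys are reported as MISSING, which is the condition the hypothesis is looking for.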

What to look for in the logs

TRL prompt msg[0] role=system content=You are a GUI automation agent...
TRL prompt msg[1] role=user content_types=['image', 'text'] text=Goal: create-desktop-folder...
TRL prompt text_input (N chars): <|im_start|>system\nYou are a GUI...
TRL prompt image: mode=RGB size=(1920, 1080) format=PNG
TRL generation config: max_new_tokens=512 temperature=0.7...
TRL first generation output (N tokens): Thought: # # # # #...
TRL input shapes: input_ids=torch.Size([1, N]) pixel_values=torch.Size([1, ...]) ...

If pixel_values shows MISSING, that's the bug.
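
To check a captured log for this condition, a small filter along the following lines could be used. It assumes the shape line contains the literal text `pixel_values=MISSING`, which is inferred from the description above rather than confirmed against the actual log output; `find_blind_runs` is a hypothetical helper name.

```python
from typing import Iterable, List


def find_blind_runs(log_lines: Iterable[str]) -> List[str]:
    """Return the 'TRL input shapes' lines that report pixel_values as MISSING."""
    hits = []
    for line in log_lines:
        # Assumed format: the shape line carries 'pixel_values=MISSING' verbatim.
        if "TRL input shapes" in line and "pixel_values=MISSING" in line:
            hits.append(line.rstrip("\n"))
    return hits


# Illustrative lines only; these are not real output from the project.
sample_log = [
    "TRL prompt image: mode=RGB size=(1920, 1080) format=PNG\n",
    "TRL input shapes: input_ids=torch.Size([1, 40]) pixel_values=MISSING\n",
]
blind = find_blind_runs(sample_log)
```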

🤖 Generated with Claude Code

Adds detailed one-time logging to help debug the persistent garbage
output issue:

1. Raw messages (role, content types, text preview) before chat template
2. Full rendered text_input (2000 chars, not 300)
3. Image metadata (mode, size, format)
4. Generation config (max_new_tokens, temperature, constrained, model type)
5. First generation output (500 chars + token count)
6. Input tensor shapes (input_ids, attention_mask, pixel_values, image_grid_thw)

The tensor shape logging is critical: if pixel_values is MISSING, the
model isn't seeing the screenshot — which would explain degenerate output
regardless of prompt correctness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abrichr abrichr merged commit 8e3bc45 into main Mar 29, 2026
1 check passed