fix: comprehensive prompt diagnostics for debugging garbage output by abrichr · Pull Request #248 · OpenAdaptAI/openadapt-evals

abrichr · 2026-03-29T21:38:11Z

Summary

Garbage output persists after #247, so more diagnostic data is needed to isolate the root cause. This PR adds comprehensive one-time logging:

Raw messages — role, content types, text preview (before chat template)
Full rendered prompt — 2000 chars (was 300)
Image metadata — mode, size, format
Generation config — max_new_tokens, temperature, constrained, model type, device
First generation output — 500 chars + token count
Input tensor shapes — input_ids, attention_mask, pixel_values, image_grid_thw
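The first three items (raw messages, rendered prompt, image metadata) can be illustrated with a minimal sketch of "log once" helpers. This is not the PR's code. The logger name `trl_prompt_debug`, the helper names `log_once`, `describe_messages` and `describe_prompt_and_image`, and the 200-character message preview are all hypothetical. The sketch assumes messages use the common multimodal chat format, where `content` is either a string or a list of parts such as `{"type": "image"}` and `{"type": "text", "text": ...}`, and that the image exposes PIL-style `mode`, `size` and `format` attributes.

```python
import logging
from typing import Any

# Hypothetical logger name; the PR does not say which logger it uses.
logger = logging.getLogger("trl_prompt_debug")

# Keys already logged in this process, so each diagnostic is emitted only once.
_logged_keys: set[str] = set()


def log_once(key: str, lines: list[str]) -> bool:
    """Log `lines` at INFO the first time `key` is seen; return True if logged."""
    if key in _logged_keys:
        return False
    _logged_keys.add(key)
    for line in lines:
        logger.info(line)
    return True


def describe_messages(messages: list[dict[str, Any]], preview: int = 200) -> list[str]:
    """Summarize raw chat messages (role, content types, text preview) before templating."""
    lines = []
    for i, msg in enumerate(messages):
        role = msg.get("role", "?")
        content = msg.get("content", "")
        if isinstance(content, str):
            lines.append(f"TRL prompt msg[{i}] role={role} content={content[:preview]}")
            continue
        # Multimodal content: a list of typed parts (image, text, ...).
        types = [part.get("type", "?") for part in content]
        text = " ".join(p.get("text", "") for p in content if p.get("type") == "text")
        lines.append(
            f"TRL prompt msg[{i}] role={role} content_types={types} text={text[:preview]}"
        )
    return lines


def describe_prompt_and_image(text_input: str, image: Any, limit: int = 2000) -> list[str]:
    """Preview the rendered prompt (first `limit` chars) and PIL-style image attributes."""
    return [
        f"TRL prompt text_input ({len(text_input)} chars): {text_input[:limit]}",
        f"TRL prompt image: mode={getattr(image, 'mode', None)} "
        f"size={getattr(image, 'size', None)} format={getattr(image, 'format', None)}",
    ]
```

A call site would look like `log_once("prompt", describe_messages(messages) + describe_prompt_and_image(text_input, image))`. Keeping the formatting separate from the once-only guard makes the summaries easy to test without configuring logging.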

Key hypothesis: if pixel_values is MISSING from the model inputs, the model never sees the screenshot. That would explain degenerate output regardless of whether the prompt is correct: the model is effectively blind.
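The hypothesis can be checked mechanically. The following is a hedged sketch, not the PR's implementation: `describe_input_shapes` and `vision_inputs_present` are hypothetical names, and the sketch assumes the processor output behaves like a mapping with `.get` (a plain `dict`, or a Hugging Face `BatchFeature`, which is `UserDict`-based). Any value with a `.shape` attribute, such as a `torch.Tensor`, prints its shape; absent keys print `MISSING`.

```python
from collections.abc import Mapping, Sequence
from typing import Any

# Keys listed in the PR; image_grid_thw is what Qwen2-VL-style processors emit.
EXPECTED_KEYS = ("input_ids", "attention_mask", "pixel_values", "image_grid_thw")


def describe_input_shapes(inputs: Mapping[str, Any], keys: Sequence[str] = EXPECTED_KEYS) -> str:
    """Render one log line with each tensor's shape, or MISSING if the key is absent."""
    parts = []
    for key in keys:
        value = inputs.get(key)
        if value is None:
            parts.append(f"{key}=MISSING")
        else:
            # Fall back to the type name for values without a .shape attribute.
            parts.append(f"{key}={getattr(value, 'shape', type(value).__name__)}")
    return "TRL input shapes: " + " ".join(parts)


def vision_inputs_present(inputs: Mapping[str, Any]) -> bool:
    """True only if both vision tensors reached the model inputs."""
    return all(inputs.get(k) is not None for k in ("pixel_values", "image_grid_thw"))
```

Checking both vision keys rather than only `pixel_values` is a design choice in this sketch: a model that consumes `image_grid_thw` cannot place image tokens correctly without it either.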

What to look for in the logs

```
TRL prompt msg[0] role=system content=You are a GUI automation agent...
TRL prompt msg[1] role=user content_types=['image', 'text'] text=Goal: create-desktop-folder...
TRL prompt text_input (N chars): <|im_start|>system\nYou are a GUI...
TRL prompt image: mode=RGB size=(1920, 1080) format=PNG
TRL generation config: max_new_tokens=512 temperature=0.7...
TRL first generation output (N tokens): Thought: # # # # #...
TRL input shapes: input_ids=torch.Size([1, N]) pixel_values=torch.Size([1, ...]) ...
```

If pixel_values shows MISSING, that's the bug.
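The generation-config and first-output lines in the sample above could be produced by formatters like the following. This is a sketch under assumptions: the function names and the config keys `model_type` and `device` are hypothetical stand-ins for the "model type" and "device" fields the PR mentions, and the config is assumed to be a plain mapping. Keys that are not present print as `None` rather than raising, so a partially populated config still yields one readable line.

```python
from collections.abc import Mapping, Sequence
from typing import Any

# Hypothetical key names mirroring the fields listed in the PR.
CONFIG_KEYS = ("max_new_tokens", "temperature", "constrained", "model_type", "device")


def describe_generation_config(config: Mapping[str, Any]) -> str:
    """One log line with the generation settings; absent keys show as None."""
    return "TRL generation config: " + " ".join(f"{k}={config.get(k)}" for k in CONFIG_KEYS)


def describe_first_output(token_ids: Sequence[int], text: str, limit: int = 500) -> str:
    """Token count of the first completion plus its first `limit` characters."""
    return f"TRL first generation output ({len(token_ids)} tokens): {text[:limit]}"
```

Logging the token count next to the truncated text helps distinguish a completion that ran to `max_new_tokens` (often the case with repetitive output like `# # # #`) from one that stopped early.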

🤖 Generated with Claude Code

Adds detailed one-time logging to help debug the persistent garbage output issue:

1. Raw messages (role, content types, text preview) before chat template
2. Full rendered text_input (2000 chars, not 300)
3. Image metadata (mode, size, format)
4. Generation config (max_new_tokens, temperature, constrained, model type)
5. First generation output (500 chars + token count)
6. Input tensor shapes (input_ids, attention_mask, pixel_values, image_grid_thw)

The tensor shape logging is critical: if pixel_values is MISSING, the model isn't seeing the screenshot, which would explain degenerate output regardless of prompt correctness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

abrichr merged commit 8e3bc45 into main Mar 29, 2026
1 check passed

abrichr merged 1 commit into main from fix/trl-comprehensive-prompt-debug

abrichr commented Mar 29, 2026

