Skip to content

Add time/per_group_wall_time metric#1656

Open
finbarrtimbers wants to merge 8 commits into
mainfrom
finbarr/better-inference-timing
Open

Add time/per_group_wall_time metric#1656
finbarrtimbers wants to merge 8 commits into
mainfrom
finbarr/better-inference-timing

Conversation

@finbarrtimbers
Copy link
Copy Markdown
Collaborator

Summary

  • Adds a new time/per_group_wall_time metric: the mean of per-group generation wall-clock times across the batch.
  • Distinct from existing time/getting_response, which is the max of per-group times and gets dominated by tail-inflated groups when many requests are in flight (high --num_async_steps, many engines, etc.).
  • Underlying timestamps are already captured (metadata["start_time"] in add_request; current_time in finalize_completed_request); per-group generation_time is already on TokenStatistics. This change just plumbs the mean through make_batch_from_groups and emits it as a new metric.

Applies to both grpo_fast.py and grpo.py since they share DataPreparationActor.

Test plan

  • scripts/train/debug/single_gpu_grpo.sh succeeds; time/per_group_wall_time appears in wandb.
  • scripts/train/qwen/qwen3_4b_dapo_math_oc.sh (50 steps): inspect time/per_group_wall_time vs time/getting_response; mean should be <= max.

🤖 Generated with Claude Code

…ime). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the time/per_group_wall_time metric, which calculates the mean generation time across a batch to provide a more consistent timing signal. The implementation updates the TokenStatistics data structure, the data loader's batch aggregation logic, and the training loop's metric logging. Review feedback suggests moving the changelog entry to the 'Added' section for better organization and optimizing the mean calculation by using a running sum instead of a list to improve memory efficiency and simplify the code.

Comment thread CHANGELOG.md Outdated
Comment thread open_instruct/data_loader.py Outdated
Comment thread open_instruct/data_loader.py Outdated
…ntry to Added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tchData. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@finbarrtimbers finbarrtimbers enabled auto-merge May 5, 2026 18:08
@finbarrtimbers finbarrtimbers requested a review from hamishivi May 6, 2026 18:33
…llatedBatchData. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>"

This reverts commit e22337b.
…nvariant. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>"

This reverts commit d94a564.
total_tokens = result.token_statistics.num_prompt_tokens + result.token_statistics.num_response_tokens
step_metrics["val/actor_tokens_per_second"] = total_tokens / result.token_statistics.generation_time
step_metrics["time/getting_response"] = result.token_statistics.generation_time
step_metrics["time/per_group_wall_time"] = result.token_statistics.mean_per_group_wall_time
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would median or mean be better? I worry about weird outlier group times?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like mean for exactly that reason, it'll capture the aggregate times. I worry with the median we won't see the worst times.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants