Track consumed samples in DataLoader; skip per-step total-token reduction #1652

Merged
jayhenry merged 14 commits into InternLM:main from jayhenry:skip_reduce_tokens
Apr 7, 2026
@jayhenry jayhenry commented Apr 2, 2026

This PR refactors how total consumed samples are tracked and resumed:

  • Move ownership of total consumed samples to DataLoader (with updated save/restore paths); remove the older resume helpers from resume.py.
  • Fix token accounting: previously, total consumed tokens were summed with a global all-reduce, which over-counted tokens when sequence parallelism (SP) is enabled. Totals are now reduced on the dp_mesh only, matching DP-local semantics.
  • Stop reducing total tokens on every training step, which speeds up end-to-end TGS (tokens per GPU per second).
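
The last two points can be illustrated with a small single-process simulation. This is only a sketch, not the code from this PR: the 2 x 2 rank layout, the token counts, and the names `STEP_TOKENS`, `global_sum`, `dp_mesh_sum`, `LocalTokenCounter`, and `run` are all hypothetical, and plain Python sums stand in for the real collective calls. It also assumes that SP ranks within one DP group report the same DP-local token count, which is one way a global sum can over-count.

```python
from dataclasses import dataclass

# Hypothetical 2 x 2 layout: STEP_TOKENS[dp_rank][sp_rank] is the token count
# that rank reports for one step. Assumption: SP ranks in the same DP group
# report the same DP-local count, because they cooperate on the same samples.
STEP_TOKENS = [
    [100, 100],  # DP group 0, SP ranks 0 and 1
    [80, 80],    # DP group 1, SP ranks 0 and 1
]


def global_sum(grid):
    """Stand-in for an all-reduce (sum) over every rank in the world."""
    return sum(sum(row) for row in grid)


def dp_mesh_sum(grid, sp_rank=0):
    """Stand-in for an all-reduce (sum) restricted to the DP dimension."""
    return sum(row[sp_rank] for row in grid)


@dataclass
class LocalTokenCounter:
    """Accumulates tokens locally; nothing is communicated per step."""

    local_tokens: int = 0

    def step(self, num_tokens):
        self.local_tokens += num_tokens  # no collective call here


def run(num_steps):
    """Simulate num_steps steps, then reduce once at the end."""
    counters = [[LocalTokenCounter() for _ in row] for row in STEP_TOKENS]
    for _ in range(num_steps):
        for dp, row in enumerate(STEP_TOKENS):
            for sp, num_tokens in enumerate(row):
                counters[dp][sp].step(num_tokens)
    totals = [[c.local_tokens for c in row] for row in counters]
    return dp_mesh_sum(totals), global_sum(totals)


dp_total, world_total = run(num_steps=3)
print(dp_total, world_total)
```

Under these assumptions the world-wide sum is larger than the DP-only sum by a factor equal to the SP size, and the single reduction at the end gives the same DP total that a per-step reduction would have accumulated.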

@jayhenry jayhenry changed the title Skip reduce total tokens in every step Skip reduce total tokens in every step & Fix total samples for sp resume Apr 3, 2026
@jayhenry jayhenry force-pushed the skip_reduce_tokens branch from f049f4b to 9a81d9c Compare April 7, 2026 06:17
@jayhenry jayhenry changed the title Skip reduce total tokens in every step & Fix total samples for sp resume Track consumed samples in DataLoader; skip per-step total-token reduction Apr 7, 2026
@jayhenry jayhenry merged commit 147cb2e into InternLM:main Apr 7, 2026
5 checks passed
@jayhenry jayhenry deleted the skip_reduce_tokens branch April 8, 2026 04:34
2 participants