Skip to content

fix: handle empty colocated weight buckets#3

Draft
EazyReal wants to merge 1 commit into
upstream-main-20260626from
codex/empty-colocated-weight-bucket-20260626
Draft

fix: handle empty colocated weight buckets#3
EazyReal wants to merge 1 commit into
upstream-main-20260626from
codex/empty-colocated-weight-bucket-20260626

Conversation

@EazyReal

@EazyReal EazyReal commented Jun 26, 2026

Copy link
Copy Markdown
Owner

Summary

Fixes colocated raw weight sync when a tensor-parallel rank has no HF tensors for a given bucket. The previous path still constructed FlattenedTensorBucket(named_tensors=[]), which can fail before the rank participates in the gather.

The fix keeps the gather collective intact, skips bucket construction for local empty inputs, and has the source rank pad missing per-rank bucket entries with a serialized empty flattened tensor before calling the rollout engine.

Why this matters

This is a small weight-sync correctness bug that becomes visible with PP/EP/MoE or model converters that produce uneven raw HF tensor buckets. It is independent of Gemma4, but the Gemma4 MoE proof exercises this path.

Validation

  • uv run --with pytest python -m pytest tests/test_empty_colocated_weight_bucket.py -q
  • uv run --with ruff ruff check slime/backends/megatron_utils/update_weight/update_weight_from_tensor.py tests/test_empty_colocated_weight_bucket.py
  • uv run --with black black --check slime/backends/megatron_utils/update_weight/update_weight_from_tensor.py tests/test_empty_colocated_weight_bucket.py
  • python3 -m py_compile slime/backends/megatron_utils/update_weight/update_weight_from_tensor.py tests/test_empty_colocated_weight_bucket.py && git diff --check

Notes

This PR is prepared in the EazyReal fork for review before any upstream THUDM/slime PR is opened.

Summary by CodeRabbit

  • Bug Fixes

    • Improved handling of empty weight buckets during distributed updates, making synchronized updates more reliable when some ranks have no data.
    • Added safer fallback behavior so missing bucket entries are padded consistently instead of causing failures.
  • Tests

    • Added coverage for empty-bucket gathering and mixed bucket scenarios to prevent regressions.
    • Updated the CPU test suite to include the new empty-bucket test.

@EazyReal EazyReal changed the base branch from main to upstream-main-20260626 June 26, 2026 17:08
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

The weight update path now skips empty multi-dtype payloads and pads missing gathered buckets with an empty flattened tensor before remote updates. New tests cover empty-bucket gather and padding behavior, and the CPU unittest workflow includes the new test file.

Changes

Empty colocated bucket handling

Layer / File(s) Summary
Empty bucket padding in weight updates
slime/backends/megatron_utils/.../update_weight_from_tensor.py
The multi-dtype branch leaves converted_named_tensors_by_dtypes empty when no tensors exist, and colocated IPC updates pad missing bucket entries with an empty flattened tensor before each remote call.
Empty bucket test harness
tests/test_empty_colocated_weight_bucket.py
The test module installs stub torch, ray, and megatron modules, defines a fake FlattenedTensorBucket, and dynamically loads the weight-update module under test.
Empty bucket test coverage
tests/test_empty_colocated_weight_bucket.py, .github/workflows/pr-test.yml.j2, .github/workflows/pr-test.yml
Two tests cover empty-bucket gather participation and padding before the remote update call, and the CPU unittest workflow matrix adds the new test file.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I hopped by the bucket, so empty and light,
Then padded the gaps till the buckets were right.
One gather, one update, no fluff left behind,
CPU tests now bink in a neat little line.
🐇🥕 Hooray for tensors that land where they should!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main fix: handling empty colocated weight buckets.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/empty-colocated-weight-bucket-20260626

Comment @coderabbitai help to get the list of available commands.

@EazyReal EazyReal force-pushed the codex/empty-colocated-weight-bucket-20260626 branch 2 times, most recently from 3309275 to c30e877 Compare June 26, 2026 17:30
@EazyReal

Copy link
Copy Markdown
Owner Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_empty_colocated_weight_bucket.py`:
- Around line 153-198: The new tests in test_empty_colocated_weight_bucket.py
are only being defined, so when cpu-unittest runs the file directly with python
they never execute. Add a module entrypoint that invokes pytest for this file
when run as a script, or otherwise make the test functions run under direct
execution, and keep the existing test names like
test_empty_colocated_bucket_still_participates_in_gather and
test_source_rank_pads_empty_colocated_bucket_entries unchanged so CI actually
exercises them.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: ef0a6e64-5602-4e68-b865-ccc31512edd8

📥 Commits

Reviewing files that changed from the base of the PR and between a897e1f and c30e877.

📒 Files selected for processing (4)
  • .github/workflows/pr-test.yml
  • .github/workflows/pr-test.yml.j2
  • slime/backends/megatron_utils/update_weight/update_weight_from_tensor.py
  • tests/test_empty_colocated_weight_bucket.py

Comment thread tests/test_empty_colocated_weight_bucket.py
@EazyReal EazyReal force-pushed the codex/empty-colocated-weight-bucket-20260626 branch from c30e877 to 7117b7a Compare June 26, 2026 17:56
@EazyReal

Copy link
Copy Markdown
Owner Author

@coderabbitai full review

1 similar comment
@EazyReal

Copy link
Copy Markdown
Owner Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/test_empty_colocated_weight_bucket.py (1)

15-18: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Make the fake bucket mode overridable.

With supports_multi_dtypes pinned to True, this file never exercises the multi-bucket max(len(tensors)) padding path. A follow-up test that flips it to False and feeds mismatched dtype buckets would cover the later-index padding this PR now relies on.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_empty_colocated_weight_bucket.py` around lines 15 - 18, The fake
bucket in _FakeFlattenedTensorBucket currently hardcodes supports_multi_dtypes
to True, so the test never covers the single-bucket padding branch. Make this
mode overridable on the fake class or via its constructor, then add a test path
that sets it to False and uses mismatched dtype buckets so the max(len(tensors))
padding behavior in the colocated weight bucket logic is exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/test_empty_colocated_weight_bucket.py`:
- Around line 15-18: The fake bucket in _FakeFlattenedTensorBucket currently
hardcodes supports_multi_dtypes to True, so the test never covers the
single-bucket padding branch. Make this mode overridable on the fake class or
via its constructor, then add a test path that sets it to False and uses
mismatched dtype buckets so the max(len(tensors)) padding behavior in the
colocated weight bucket logic is exercised.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 2581ea15-d37e-42b7-b276-e59b74624d4e

📥 Commits

Reviewing files that changed from the base of the PR and between a897e1f and 7117b7a.

📒 Files selected for processing (4)
  • .github/workflows/pr-test.yml
  • .github/workflows/pr-test.yml.j2
  • slime/backends/megatron_utils/update_weight/update_weight_from_tensor.py
  • tests/test_empty_colocated_weight_bucket.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant