fix: handle empty colocated weight buckets by EazyReal · Pull Request #3 · EazyReal/slime

EazyReal · 2026-06-26T17:06:37Z

Summary

Fixes colocated raw weight sync when a tensor-parallel rank has no HF tensors for a given bucket. The previous path still constructed FlattenedTensorBucket(named_tensors=[]), which can fail before the rank participates in the gather.

The fix keeps the gather collective intact, skips bucket construction for local empty inputs, and has the source rank pad missing per-rank bucket entries with a serialized empty flattened tensor before calling the rollout engine.

Why this matters

This is a small weight-sync correctness bug that becomes visible with PP/EP/MoE or model converters that produce uneven raw HF tensor buckets. It is independent of Gemma4, but the Gemma4 MoE proof exercises this path.

Validation

uv run --with pytest python -m pytest tests/test_empty_colocated_weight_bucket.py -q
uv run --with ruff ruff check slime/backends/megatron_utils/update_weight/update_weight_from_tensor.py tests/test_empty_colocated_weight_bucket.py
uv run --with black black --check slime/backends/megatron_utils/update_weight/update_weight_from_tensor.py tests/test_empty_colocated_weight_bucket.py
python3 -m py_compile slime/backends/megatron_utils/update_weight/update_weight_from_tensor.py tests/test_empty_colocated_weight_bucket.py && git diff --check

Notes

This PR is prepared in the EazyReal fork for review before any upstream THUDM/slime PR is opened.

Summary by CodeRabbit

Bug Fixes
- Improved handling of empty weight buckets during distributed updates, making synchronized updates more reliable when some ranks have no data.
- Added safer fallback behavior so missing bucket entries are padded consistently instead of causing failures.
Tests
- Added coverage for empty-bucket gathering and mixed bucket scenarios to prevent regressions.
- Updated the CPU test suite to include the new empty-bucket test.

coderabbitai · 2026-06-26T17:08:08Z

📝 Walkthrough

Walkthrough

The weight update path now skips empty multi-dtype payloads and pads missing gathered buckets with an empty flattened tensor before remote updates. New tests cover empty-bucket gather and padding behavior, and the CPU unittest workflow includes the new test file.

Changes

Empty colocated bucket handling

Layer / File(s)	Summary
Empty bucket padding in weight updates `slime/backends/megatron_utils/.../update_weight_from_tensor.py`	The multi-dtype branch leaves `converted_named_tensors_by_dtypes` empty when no tensors exist, and colocated IPC updates pad missing bucket entries with an empty flattened tensor before each remote call.
Empty bucket test harness `tests/test_empty_colocated_weight_bucket.py`	The test module installs stub `torch`, `ray`, and `megatron` modules, defines a fake `FlattenedTensorBucket`, and dynamically loads the weight-update module under test.
Empty bucket test coverage `tests/test_empty_colocated_weight_bucket.py`, `.github/workflows/pr-test.yml.j2`, `.github/workflows/pr-test.yml`	Two tests cover empty-bucket gather participation and padding before the remote update call, and the CPU unittest workflow matrix adds the new test file.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I hopped by the bucket, so empty and light,
Then padded the gaps till the buckets were right.
One gather, one update, no fluff left behind,
CPU tests now bink in a neat little line.
🐇🥕 Hooray for tensors that land where they should!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly matches the main fix: handling empty colocated weight buckets.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch codex/empty-colocated-weight-bucket-20260626

_{Comment @coderabbitai help to get the list of available commands.}

EazyReal · 2026-06-26T17:32:08Z

@coderabbitai full review

coderabbitai · 2026-06-26T17:32:13Z

✅ Action performed

Full review finished.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_empty_colocated_weight_bucket.py`:
- Around line 153-198: The new tests in test_empty_colocated_weight_bucket.py
are only being defined, so when cpu-unittest runs the file directly with python
they never execute. Add a module entrypoint that invokes pytest for this file
when run as a script, or otherwise make the test functions run under direct
execution, and keep the existing test names like
test_empty_colocated_bucket_still_participates_in_gather and
test_source_rank_pads_empty_colocated_bucket_entries unchanged so CI actually
exercises them.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: ef0a6e64-5602-4e68-b865-ccc31512edd8

📥 Commits

Reviewing files that changed from the base of the PR and between a897e1f and c30e877.

📒 Files selected for processing (4)

.github/workflows/pr-test.yml
.github/workflows/pr-test.yml.j2
slime/backends/megatron_utils/update_weight/update_weight_from_tensor.py
tests/test_empty_colocated_weight_bucket.py

EazyReal · 2026-06-26T17:57:44Z

@coderabbitai full review

EazyReal · 2026-06-26T18:16:43Z

@coderabbitai full review

coderabbitai · 2026-06-26T18:16:49Z

✅ Action performed

Full review finished.

coderabbitai

🧹 Nitpick comments (1)

tests/test_empty_colocated_weight_bucket.py (1)
15-18: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Make the fake bucket mode overridable.

With supports_multi_dtypes pinned to True, this file never exercises the multi-bucket max(len(tensors)) padding path. A follow-up test that flips it to False and feeds mismatched dtype buckets would cover the later-index padding this PR now relies on.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_empty_colocated_weight_bucket.py` around lines 15 - 18, The fake
bucket in _FakeFlattenedTensorBucket currently hardcodes supports_multi_dtypes
to True, so the test never covers the single-bucket padding branch. Make this
mode overridable on the fake class or via its constructor, then add a test path
that sets it to False and uses mismatched dtype buckets so the max(len(tensors))
padding behavior in the colocated weight bucket logic is exercised.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/test_empty_colocated_weight_bucket.py`:
- Around line 15-18: The fake bucket in _FakeFlattenedTensorBucket currently
hardcodes supports_multi_dtypes to True, so the test never covers the
single-bucket padding branch. Make this mode overridable on the fake class or
via its constructor, then add a test path that sets it to False and uses
mismatched dtype buckets so the max(len(tensors)) padding behavior in the
colocated weight bucket logic is exercised.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 2581ea15-d37e-42b7-b276-e59b74624d4e

📥 Commits

Reviewing files that changed from the base of the PR and between a897e1f and 7117b7a.

📒 Files selected for processing (4)

.github/workflows/pr-test.yml
.github/workflows/pr-test.yml.j2
slime/backends/megatron_utils/update_weight/update_weight_from_tensor.py
tests/test_empty_colocated_weight_bucket.py

EazyReal changed the base branch from main to upstream-main-20260626 June 26, 2026 17:08

EazyReal force-pushed the codex/empty-colocated-weight-bucket-20260626 branch 2 times, most recently from 3309275 to c30e877 Compare June 26, 2026 17:30

coderabbitai Bot reviewed Jun 26, 2026

View reviewed changes

Comment thread tests/test_empty_colocated_weight_bucket.py

fix: handle empty colocated weight buckets

7117b7a

EazyReal force-pushed the codex/empty-colocated-weight-bucket-20260626 branch from c30e877 to 7117b7a Compare June 26, 2026 17:56

coderabbitai Bot reviewed Jun 26, 2026

View reviewed changes

EazyReal mentioned this pull request Jun 26, 2026

fix: handle empty colocated weight buckets THUDM/slime#2134

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: handle empty colocated weight buckets#3

fix: handle empty colocated weight buckets#3
EazyReal wants to merge 1 commit into
upstream-main-20260626from
codex/empty-colocated-weight-bucket-20260626

EazyReal commented Jun 26, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

EazyReal commented Jun 26, 2026

Uh oh!

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

EazyReal commented Jun 26, 2026

Uh oh!

EazyReal commented Jun 26, 2026

Uh oh!

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

EazyReal commented Jun 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why this matters

Validation

Notes

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

EazyReal commented Jun 26, 2026

Uh oh!

coderabbitai Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

EazyReal commented Jun 26, 2026

Uh oh!

EazyReal commented Jun 26, 2026

Uh oh!

coderabbitai Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

EazyReal commented Jun 26, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading