feat: data plane transfer queue integration #2439
Merged
160 commits
8411f6f
plan
ZhiyuLi-Nvidia 85acfdb
plan: align Stage 4 with rl-arena/verl 1-hop pattern
ZhiyuLi-Nvidia 9a46c43
feat(data-plane): TransferQueue integration for GRPO with driver-side…
ZhiyuLi-Nvidia bcb451a
refactor(data-plane): extract driver-side balanced packing into presh…
ZhiyuLi-Nvidia 196b6bb
feat(data-plane): AsyncTrajectoryCollector writes rollouts to TQ when…
ZhiyuLi-Nvidia 0c216f4
feat(data-plane): wire async-on-TQ end-to-end with driver-side balanc…
ZhiyuLi-Nvidia bf092f7
fix(data-plane): preserve sample order and FLOPs semantics on @dp_dis…
ZhiyuLi-Nvidia a28b46d
feat(data-plane): grpo_sync routes logprob/ref-logprob through @dp_di…
ZhiyuLi-Nvidia c1bb667
refactor(data-plane): replace @dp_dispatch with TQPolicy subclass; ad…
ZhiyuLi-Nvidia 67b242b
fix(data-plane): VLM extras, async fan-out, cleanup-on-failure
ZhiyuLi-Nvidia d05ad3f
docs(data-plane): add API lifecycle doc with verl comparison
ZhiyuLi-Nvidia 9da2ec9
feat(data-plane): sync 1-hop trajectory collector + per-sample key li…
ZhiyuLi-Nvidia a7f4bcc
refactor(data-plane): extract make_actor_runtime_env, fix N² list copy
ZhiyuLi-Nvidia fc6ceea
feat(data-plane): jagged tensors on TQ wire + naming/factory cleanup
ZhiyuLi-Nvidia 520bfef
refactor(data-plane): KVBatchMeta.subset/slice/concat methods
ZhiyuLi-Nvidia b732afe
Mooncake cpu backend
ZhiyuLi-Nvidia dcd62d8
Readability Refactor
ZhiyuLi-Nvidia fba1f32
wip test mooncake
ZhiyuLi-Nvidia 69b09c1
refactor(data-plane): drop dead set_wire_format/_PACK_JAGGED + adapte…
ZhiyuLi-Nvidia a55ad5c
refactor(ray.sub): drop NETWORK_INIT_CMDS — MC_TCP_BIND_ADDRESS suffices
ZhiyuLi-Nvidia 703bd36
docs(data-plane): consolidate README; drop stale plan/verl refs
ZhiyuLi-Nvidia d42e7b2
feat(data-plane): non-tensor object support on TQ wire
ZhiyuLi-Nvidia a8ff04e
feat(grpo-sync): equivalency fixes + content via TQ object column
ZhiyuLi-Nvidia 77b0f6a
style: fix ruff lint errors and apply ruff format
ZhiyuLi-Nvidia d86aed2
style: apply pre-commit auto-fixes (ruff)
ZhiyuLi-Nvidia 41258a4
chore(pyrefly): whitelist all new data_plane files + fix type errors
ZhiyuLi-Nvidia 2b58c02
remove unnecessary script
ZhiyuLi-Nvidia 1347c88
feat(data-plane): decompose message_log at wire boundary
ZhiyuLi-Nvidia 1903125
refactor(data-plane): rename DataPlaneClient.get_meta → claim_meta
ZhiyuLi-Nvidia f527f77
docs(data-plane): tighten DataPlaneClient boundary docstring
ZhiyuLi-Nvidia 0dea433
fix(data-plane): treat DataPlaneConfig.enabled as required field
ZhiyuLi-Nvidia de28a19
docs(data-plane): make build_data_plane_client docstring backend-agno…
ZhiyuLi-Nvidia 0f710a4
refactor(data-plane): promote codec imports to module top-level
ZhiyuLi-Nvidia fe2aa71
refactor(data-plane): rename driver_io → column_io
ZhiyuLi-Nvidia e02d3c7
refactor(data-plane): validate dp_world at TQPolicy config time
ZhiyuLi-Nvidia 0c985d4
refactor(data-plane): centralize packing-meta keys in schema.py
ZhiyuLi-Nvidia c5cf807
refactor(data-plane): drop redundant dp_world assert in shard_meta_fo…
ZhiyuLi-Nvidia 734a01a
refactor(data-plane): move DP_SEED_FIELDS to schema.py as DP_TRAIN_FI…
ZhiyuLi-Nvidia 5a6d53d
fix(data-plane): reject empty meta in shard_meta_for_dp
ZhiyuLi-Nvidia 44c82aa
refactor(data-plane): print_event → log_event via stdlib logging
ZhiyuLi-Nvidia 379dae1
style(data-plane): match repo logger naming convention
ZhiyuLi-Nvidia 475b703
refactor(data-plane): convert DataPlaneStats to @dataclass
ZhiyuLi-Nvidia 27f1d77
refactor(data-plane): type DataPlaneEvent as TypedDict
ZhiyuLi-Nvidia 5a8e8a7
refactor(data-plane): drop placeholder 0s from _run; make sizes kw-only
ZhiyuLi-Nvidia e93fe5f
fix(data-plane): route check_consumption_status through _run
ZhiyuLi-Nvidia 5d12647
fix(data-plane): route close() through _run
ZhiyuLi-Nvidia 6ca3b47
perf(data-plane): single sync in to_nested_by_length
ZhiyuLi-Nvidia 0d690f9
docs(data-plane): convert codec.py docstrings to Google style
ZhiyuLi-Nvidia d22709f
refactor(data-plane): centralize Layout type alias in schema.py
ZhiyuLi-Nvidia e23c400
fix(data-plane): validate pad_to_multiple >= 1 in materialize
ZhiyuLi-Nvidia e491025
fix(data-plane): fail fast on empty local IP at Mooncake bootstrap
ZhiyuLi-Nvidia f3dc3ee
fix(data-plane): surface chmod failure when mooncake_master is not exec
ZhiyuLi-Nvidia a8de1df
refactor(data-plane): scope mooncake_cpu 1D workaround to TQDataPlane…
ZhiyuLi-Nvidia 8758b3a
docs(data-plane): clarify TQ module vs client access convention
ZhiyuLi-Nvidia 245f04c
docs(data-plane): note trust boundary at pack_object_array pickle site
ZhiyuLi-Nvidia 739c837
refactor(data-plane): drop codec pickle, use TQ-native NonTensorStack
ZhiyuLi-Nvidia f4b647f
refactor(data-plane): drop dead object-array codec helpers
ZhiyuLi-Nvidia 4fa8a11
refactor(data-plane): centralize _meta_idx sentinel in schema.py
ZhiyuLi-Nvidia 38921eb
docs(data-plane): convert interfaces.py docstrings to Google style
ZhiyuLi-Nvidia 48802b0
refactor(data-plane): align schema constant names with their values
ZhiyuLi-Nvidia a65abaf
docs(data-plane): tighten preshard.py docstring to Google style
ZhiyuLi-Nvidia 2aeb292
docs(data-plane): convert column_io.py docstrings to Google style
ZhiyuLi-Nvidia b44c0f4
docs(data-plane): convert factory.py docstring to Google style
ZhiyuLi-Nvidia 0455e2e
docs(data-plane): add Args/Returns blocks to observability.py docstrings
ZhiyuLi-Nvidia c39e313
docs(data-plane): tighten transfer_queue.py docstrings, add Args/Retu…
ZhiyuLi-Nvidia 2c12afd
docs(data-plane): add Args/Returns to worker_mixin.py docstrings
ZhiyuLi-Nvidia 650e142
docs(data-plane): add Args/Returns blocks to tq_policy.py docstrings
ZhiyuLi-Nvidia db312b6
docs(data-plane): convert sync_rollout_actor.py docstrings to Google …
ZhiyuLi-Nvidia dbef790
docs(data-plane): add Args/Returns to grpo_sync.py dynamic-sampling h…
ZhiyuLi-Nvidia cb1dc34
refactor(data-plane): drop _to_wire's redundant promote_1d kwarg
ZhiyuLi-Nvidia b9c15ed
fix(data-plane): survive TQ simple-backend NonTensorData wire-strip
ZhiyuLi-Nvidia 47d2f7f
build(data-plane): pin mooncake-transfer-engine-cuda13 wheel for cu13…
ZhiyuLi-Nvidia de22c5c
chore: ruff auto-fix and ruff-format pass
ZhiyuLi-Nvidia 908ed7f
chore(pyrefly): rename driver_io → column_io in whitelist
ZhiyuLi-Nvidia 356d166
chore(pyrefly): silence 5 latent type errors with targeted ignore com…
ZhiyuLi-Nvidia 6666a89
chore(pyrefly): whitelist nemo_rl/data_plane/schema.py
ZhiyuLi-Nvidia 5dbc600
fix(data-plane): preserve object-column identity through TQ wire
ZhiyuLi-Nvidia b9154bc
fix(data-plane): gate TQ write-back on TP×CP×PP leader to avoid dupli…
ZhiyuLi-Nvidia cab4bc0
chore: ruff auto-fix and D205 docstring fixes
ZhiyuLi-Nvidia db31b12
refactor(data-plane): drop async-grpo TQ scaffolding from sync PR
ZhiyuLi-Nvidia 351916b
refactor(data-plane): consolidate producer codec, caller mints keys
ZhiyuLi-Nvidia 53be031
test(data-plane): align codec tests with current contract
ZhiyuLi-Nvidia 09099f0
refactor(grpo_sync): drop dead batch_cache; make TQPolicy attrs public
ZhiyuLi-Nvidia 660dd89
refactor(data-plane): extract calibration field filter into named sch…
ZhiyuLi-Nvidia dabe37b
refactor(data-plane): make kv_batch_get(select_fields) required
ZhiyuLi-Nvidia d9258cd
refactor(sync-rollout-actor): remove unused wrappers; document full l…
ZhiyuLi-Nvidia 1a937aa
test(data-plane): move data_plane unit tests under tests/unit/ for CI…
ZhiyuLi-Nvidia 4cfd120
test(data-plane): apply ruff --fix and import-sort to data_plane unit…
ZhiyuLi-Nvidia 534fb07
docs: fix broken nemo-gym Core Components link
ZhiyuLi-Nvidia e49b1ca
chore(grpo): drop stale mypy comments; rename TQPolicy ctor->actor
ZhiyuLi-Nvidia 5d8de41
fix(data-plane): reject loopback IP; resolve TQ runtime_env pin from …
ZhiyuLi-Nvidia b512927
docs(data-plane): rewrite README around sync flow + async proposal
ZhiyuLi-Nvidia 791671e
docs(data-plane): clarify partition scope and TQ mental model
ZhiyuLi-Nvidia 30d6ccc
refactor(data-plane): per-row tags on KVBatchMeta; rename slice → dri…
ZhiyuLi-Nvidia 0f01865
perf(sync-rollout-actor): subset driver_carry via carry_keys
ZhiyuLi-Nvidia 1bbaa17
refactor(grpo-sync): apply overlong filter post-dynamic-sampling
ZhiyuLi-Nvidia 52c1394
refactor(grpo-sync): isolate TQ ops behind TQPolicy/KVBatchMeta façades
ZhiyuLi-Nvidia 63ea762
refactor(data-plane): YAML-only defaults for TQ config (terryk §9)
ZhiyuLi-Nvidia 1d025f4
docs(data-plane): refresh README around encapsulated TQ path
ZhiyuLi-Nvidia c6d0d30
chore: ruff format + pyrefly ignore + underscore-md rename
ZhiyuLi-Nvidia 1f637ea
docs(data-plane): drop api-lifecycle doc; realistic concrete examples
ZhiyuLi-Nvidia b4497f0
docs: align nemo-gym Core Components link with main
ZhiyuLi-Nvidia 0d0d36b
fix(data-plane): close grad_norm collapse + NCCL desync in DP fsdp2 path
ZhiyuLi-Nvidia fb6ccef
refactor(data-plane): drop _tq() lazy wrapper; fail-fast in check_con…
ZhiyuLi-Nvidia 28e634b
refactor(grpo-sync): mint uids in rollout actor (verl-style per-promp…
ZhiyuLi-Nvidia c3c2866
refactor(data-plane): rename KVBatchMeta.keys -> sample_ids (Phase A)
ZhiyuLi-Nvidia 935c1b5
refactor(data-plane): rename DataPlaneClient kwarg keys -> sample_ids…
ZhiyuLi-Nvidia 14e75cf
test(data-plane): update KVBatchMeta schema-pin to sample_ids
ZhiyuLi-Nvidia 23d4353
refactor(data-plane): rename DataPlaneClient verbs kv_batch_* -> {put…
ZhiyuLi-Nvidia 9474196
refactor(data-plane): tighten clear_samples(None) contract; warn on s…
ZhiyuLi-Nvidia fdfade3
chore(data-plane): apply ruff format
ZhiyuLi-Nvidia be54ac6
feat(data-plane): align seq-dim across DP ranks via meta-stamped glob…
ZhiyuLi-Nvidia 2c6c022
test(data-plane): add missing DataPlaneConfig keys to test_seqpack_eq…
ZhiyuLi-Nvidia a6b4ab8
refactor(data-plane): remove _PartitionRecord from TQ adapter
ZhiyuLi-Nvidia f3a4a04
test(data-plane): remove empty tests/unit/data_plane/conftest.py
ZhiyuLi-Nvidia 1c8a470
revert(test): restore NUM_MINUTES=150 in prorlv2 recipe sh
ZhiyuLi-Nvidia 04f410a
test(data-plane): drop test_tq_multinode.py
ZhiyuLi-Nvidia 9c6d0de
docs(data-plane): document DP-aligned forward pad seqlen in README
ZhiyuLi-Nvidia 450f8d9
test(data-plane): drop stale import-isolation tests; merge codec_obje…
ZhiyuLi-Nvidia 0d5bb92
refactor(data-plane): drop drive-by edits from PR scope
ZhiyuLi-Nvidia 4b866cd
test(data-plane): accept attribute-style data_plane access in invariant
ZhiyuLi-Nvidia 4c252c6
refactor(data-plane): use attribute-style access on MasterConfig
ZhiyuLi-Nvidia d4d9c7c
refactor(data-plane): replace run_grpo dispatch grep with behavioral …
ZhiyuLi-Nvidia a775aee
fix(data-plane): use attribute access for loss_fn KL penalty assert
ZhiyuLi-Nvidia cd45f8f
fix(data-plane): pre-register fields to dodge TQ controller race
ZhiyuLi-Nvidia 1e1f0f2
fix(configs): set truncated_importance_sampling_type=tis on recipes t…
ZhiyuLi-Nvidia 5980c8e
refactor(data-plane): close four cross-boundary leaks
ZhiyuLi-Nvidia f1bc4fa
chore(data-plane): apply ruff format to discard_samples
ZhiyuLi-Nvidia c34ba36
test(data-plane): consolidate suite under tests/unit/data_plane
ZhiyuLi-Nvidia 80b5760
fix(data-plane): shrink mooncake_cpu segment defaults to fit CI runners
ZhiyuLi-Nvidia 90d32a4
test(data-plane): update _apply_dynamic_sampling tests for policy= param
ZhiyuLi-Nvidia f6477a4
fix(data-plane): apply pad_to_seqlen to ALL 2D+ tensors in materialize
ZhiyuLi-Nvidia 2d8115c
test(data-plane): add missing DataPlaneConfig keys to _TQ_CFG in chao…
ZhiyuLi-Nvidia 3e3e3be
test(data-plane): remove storage-actor-kill chaos test
ZhiyuLi-Nvidia 1c7d246
fix(data-plane): exclude MESSAGE_LOG_BULK_FIELDS from FP8 calib request
ZhiyuLi-Nvidia 32be65a
test(data-plane): pin MESSAGE_LOG_BULK_FIELDS in DP_CALIB_EXCLUDED_FI…
ZhiyuLi-Nvidia 56b78cd
test(data-plane): add missing DataPlaneConfig keys to tq_lifecycle fi…
ZhiyuLi-Nvidia 42606b6
feat(data-plane): route FP8 KV scales through TQ (sync first cut)
ZhiyuLi-Nvidia 45233e6
Revert "feat(data-plane): route FP8 KV scales through TQ (sync first …
ZhiyuLi-Nvidia 0fe15b1
refactor(data-plane): flip calib filter to positive include-list
ZhiyuLi-Nvidia ccf5eb8
test(data-plane): add realistic-shape rollout fixtures + cross-file d…
ZhiyuLi-Nvidia c958c2a
chore(test): apply ruff isort + blank-line fixes
ZhiyuLi-Nvidia 68206ef
fix(data-plane): override _is_writeback_leader in DTensor V1 worker
ZhiyuLi-Nvidia fb54dc7
test(data-plane): sync grpo_math_1B reference config buffer sizes
ZhiyuLi-Nvidia e84b25d
test(data-plane): slim test_architecture_invariants to 2 behavioral t…
ZhiyuLi-Nvidia 6afdc98
undo unnecessary change
ZhiyuLi-Nvidia 1a38153
build: resolve mooncake-transfer-engine-cuda13 from PyPI instead of G…
ZhiyuLi-Nvidia 4183e63
perf(data-plane): skip Ray return of per-token logprob tensors
ZhiyuLi-Nvidia ed45e8c
perf(data-plane): worker-side suppress per-token logprob Ray return
ZhiyuLi-Nvidia 35bb085
refactor(data-plane): drop aggregator path now that logprob workers r…
ZhiyuLi-Nvidia e908738
refactor(data-plane): make Ray worker_coords the single source of tru…
ZhiyuLi-Nvidia 98bf3be
Revert "refactor(data-plane): make Ray worker_coords the single sourc…
ZhiyuLi-Nvidia 079979a
fix(data-plane): unify leader-gate on NamedSharding.is_axis_zero; fix…
ZhiyuLi-Nvidia 2b504b5
chore: ruff auto-fix and ruff-format pass post-rebase
ZhiyuLi-Nvidia bfb261f
undo unnecessary change
ZhiyuLi-Nvidia 3dedfd9
build: remove unnecessary setuptools packages.find filter
ZhiyuLi-Nvidia ed24395
fix(data-plane): preserve non-tensor leaves in mooncake_cpu 1D wire-p…
ZhiyuLi-Nvidia 26179fd
chore: ruff-format pass on test_leader_broadcast.py
ZhiyuLi-Nvidia 7341341
chore: ruff-format test_leader_broadcast.py
ZhiyuLi-Nvidia b63c18f
fix(deps): include aarch64 in mooncake-cuda13 marker
ZhiyuLi-Nvidia