fix: correct per-PP-stage CPU/SSD layer offset and add transfer correctness test suite #171

Draft
zhjc1124 wants to merge 1 commit into taco-project:feat/layerwise_rebase from zhjc1124:fix/single-node-pp-cpu-pool
@zhjc1124 (Contributor)

Summary

Fix the per-PP-stage layer offset logic so that each worker advances its CPU/SSD base pointer by start_layer_id * layer_stride, ensuring the C++ kernel always addresses GPU tensors with layer_id=0 (PP-stage-local) while writing to the correct slice of the shared per-node CPU/SSD pool.
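A minimal sketch of the offset rule described above, using a plain `bytearray` as a stand-in for the shared per-node CPU/SSD pool. The helper names (`stage_pool_view`, `write_layer`) and the byte-level layout are assumptions for illustration only and are not taken from the patch; the real workers operate on tensors and raw pointers.

```python
def stage_pool_view(pool, start_layer_id, num_local_layers, layer_stride):
    """Return the slice of the shared pool owned by one PP stage.

    Hypothetical helper: the base is advanced by
    start_layer_id * layer_stride, so callers can keep using
    stage-local layer ids that start at 0.
    """
    base = start_layer_id * layer_stride
    end = base + num_local_layers * layer_stride
    if end > len(pool):
        raise ValueError("stage slice exceeds the shared pool")
    return memoryview(pool)[base:end]


def write_layer(stage_view, local_layer_id, layer_stride, payload):
    """Write one layer using a PP-stage-local layer id (0-based)."""
    if len(payload) != layer_stride:
        raise ValueError("payload must be exactly one layer_stride long")
    offset = local_layer_id * layer_stride
    stage_view[offset:offset + layer_stride] = payload


# Assumed toy layout: 8 layers in total, 4 bytes per layer, 2 PP stages.
LAYER_STRIDE = 4
pool = bytearray(8 * LAYER_STRIDE)

# Stage 1 owns global layers 4..7, so its start_layer_id is 4.
stage1 = stage_pool_view(pool, start_layer_id=4, num_local_layers=4,
                         layer_stride=LAYER_STRIDE)

# The "kernel" addresses local layer 0; the data lands in global layer 4.
write_layer(stage1, local_layer_id=0, layer_stride=LAYER_STRIDE,
            payload=b"\x01" * LAYER_STRIDE)
```

In this sketch the stage-local write at layer 0 ends up at byte offset `4 * LAYER_STRIDE` of the shared pool, and the slice owned by stage 0 is left untouched.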

Unify the offset implementation and comment style across all five worker types (GPUCPUTransferWorker, tpGPUCPUTransferWorker, GDSTransferWorker, tpGDSTransferWorker, LayerwiseTransferWorker) so the same design principle is consistently expressed regardless of whether the offset is applied via tensor slicing, raw pointer arithmetic, layer_id_list, or a scalar kernel argument.
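The four mechanisms listed above can be seen as different spellings of the same address computation. The sketch below is an assumption about how they relate, not the workers' actual code; every function name and parameter here is hypothetical, and addresses are modeled as plain integers.

```python
def via_slicing(layer_offsets, start_layer_id, local_layer_id):
    # Slice the per-layer view first, then index with the local id.
    stage_slice = layer_offsets[start_layer_id:]
    return stage_slice[local_layer_id]


def via_pointer(base_ptr, start_layer_id, layer_stride, local_layer_id):
    # Advance the raw base pointer, then index with the local id.
    stage_base = base_ptr + start_layer_id * layer_stride
    return stage_base + local_layer_id * layer_stride


def via_layer_id_list(base_ptr, start_layer_id, num_local_layers,
                      layer_stride, local_layer_id):
    # Pass a list of pool-side layer ids that already include the offset.
    pool_layer_ids = [start_layer_id + i for i in range(num_local_layers)]
    return base_ptr + pool_layer_ids[local_layer_id] * layer_stride


def via_scalar_arg(base_ptr, layer_offset, layer_stride, local_layer_id):
    # Hand the offset to the kernel as a scalar and add it there.
    return base_ptr + (layer_offset + local_layer_id) * layer_stride


# Assumed toy parameters: 8 layers, 4096 bytes per layer, stage starts at 4.
NUM_LAYERS, STRIDE, START, LOCAL = 8, 4096, 4, 1
layer_offsets = list(range(0, NUM_LAYERS * STRIDE, STRIDE))

addresses = {
    via_slicing(layer_offsets, START, LOCAL),
    via_pointer(0, START, STRIDE, LOCAL),
    via_layer_id_list(0, START, 4, STRIDE, LOCAL),
    via_scalar_arg(0, START, STRIDE, LOCAL),
}
```

Under these assumptions all four variants resolve to the same pool address, which is the invariant the unified comments are meant to express.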

Add tests/test_transfer_correctness.py, a parametrized correctness test suite covering the CPU-only, CPU+SSD, GDS, layerwise-CPU-only, and layerwise-CPU+SSD backends across six parallel configurations: same-node PP, simulated cross-node PP, same-node TP, simulated cross-node TP, PP+TP, and cross-node PP+TP. Tests automatically fall back to single-GPU simulation when too few physical GPUs are available, and skip backends that are unavailable.
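A rough sketch of how such a backend-by-configuration matrix and its fallback rule could be expressed. This is not the contents of tests/test_transfer_correctness.py: the identifiers, the `(pp, tp)` sizes, and the `plan_case` helper are all assumptions made for illustration.

```python
from itertools import product

# Backend and configuration labels follow the summary; the spellings
# and the (pp_size, tp_size) values are assumed.
BACKENDS = [
    "cpu_only",
    "cpu_ssd",
    "gds",
    "layerwise_cpu_only",
    "layerwise_cpu_ssd",
]

PARALLEL_CONFIGS = {
    "same_node_pp": (2, 1),
    "cross_node_pp_simulated": (2, 1),
    "same_node_tp": (1, 2),
    "cross_node_tp_simulated": (1, 2),
    "pp_tp": (2, 2),
    "cross_node_pp_tp": (2, 2),
}

# Full test matrix: every backend against every parallel configuration.
CASES = list(product(BACKENDS, PARALLEL_CONFIGS))


def plan_case(backend, config, physical_gpus, available_backends):
    """Decide how one case runs (hypothetical helper).

    Returns "skip" for an unavailable backend, "multi_gpu" when enough
    physical GPUs exist, and "single_gpu_simulation" otherwise.
    """
    if backend not in available_backends:
        return "skip"
    pp_size, tp_size = PARALLEL_CONFIGS[config]
    if physical_gpus >= pp_size * tp_size:
        return "multi_gpu"
    return "single_gpu_simulation"
```

For example, on a machine with one GPU and only the CPU-only backend available, `plan_case("cpu_only", "pp_tp", 1, {"cpu_only"})` selects the single-GPU simulation path, while any GDS case is skipped.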

@zhjc1124 force-pushed the fix/single-node-pp-cpu-pool branch from 22222ed to 92b4c49 on May 21, 2026 03:58
@zhjc1124 marked this pull request as draft on May 22, 2026 01:26