fix: correct per-PP-stage CPU/SSD layer offset and add transfer correctness test suite by zhjc1124 · Pull Request #171 · taco-project/FlexKV

zhjc1124 · 2026-05-21T03:55:44Z

Summary

Fix the per-PP-stage layer offset logic so that each worker advances its CPU/SSD base pointer by start_layer_id * layer_stride, ensuring the C++ kernel always addresses GPU tensors with layer_id=0 (PP-stage-local) while writing to the correct slice of the shared per-node CPU/SSD pool.
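To make the arithmetic concrete, here is a minimal sketch of the offset rule described above. It is not the FlexKV implementation: `StageLayout`, `stage_base_offset`, and `pool_offset` are hypothetical names, and the sketch assumes that every layer occupies exactly `layer_stride` units in the shared pool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageLayout:
    """Hypothetical description of one PP stage's share of the node-wide pool."""
    start_layer_id: int  # first global layer owned by this PP stage
    num_layers: int      # number of layers owned by this PP stage
    layer_stride: int    # size of one layer in the shared CPU/SSD pool (assumed uniform)


def stage_base_offset(layout: StageLayout) -> int:
    # The worker advances its CPU/SSD base pointer by this amount once,
    # so the kernel can keep using stage-local layer ids starting at 0.
    return layout.start_layer_id * layout.layer_stride


def pool_offset(layout: StageLayout, local_layer_id: int) -> int:
    # Offset into the shared pool for a stage-local layer id.
    if not 0 <= local_layer_id < layout.num_layers:
        raise IndexError(f"local_layer_id {local_layer_id} is outside this stage")
    return stage_base_offset(layout) + local_layer_id * layout.layer_stride


# Two PP stages with 4 layers each, sharing one pool of 8 layers.
stage0 = StageLayout(start_layer_id=0, num_layers=4, layer_stride=1024)
stage1 = StageLayout(start_layer_id=4, num_layers=4, layer_stride=1024)
```

With this layout, stage 1's local layer 0 starts right after stage 0's last layer, so the two stages never write to the same slice of the pool.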

Unify the offset implementation and comment style across all five worker types (GPUCPUTransferWorker, tpGPUCPUTransferWorker, GDSTransferWorker, tpGDSTransferWorker, LayerwiseTransferWorker) so the same design principle is consistently expressed regardless of whether the offset is applied via tensor slicing, raw pointer arithmetic, layer_id_list, or a scalar kernel argument.
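As an illustration of why the different mechanisms can express the same principle, the sketch below writes one layer into a toy pool in two ways: through a pre-sliced view (analogous to tensor slicing) and through an explicit offset (analogous to raw pointer arithmetic). The pool is a plain `bytearray`, and `LAYER_STRIDE`, `write_via_slice`, and `write_via_offset` are hypothetical names that do not come from the PR.

```python
LAYER_STRIDE = 8       # assumed bytes per layer in the toy pool
TOTAL_LAYERS = 4       # layers across all PP stages on this node
START_LAYER_ID = 2     # first global layer owned by this PP stage


def write_via_slice(pool: bytearray, local_layer_id: int, payload: bytes) -> None:
    # Slice the pool once at the stage base, then index with stage-local ids.
    stage_view = memoryview(pool)[START_LAYER_ID * LAYER_STRIDE:]
    begin = local_layer_id * LAYER_STRIDE
    stage_view[begin:begin + len(payload)] = payload


def write_via_offset(pool: bytearray, local_layer_id: int, payload: bytes) -> None:
    # Compute the global position explicitly, as pointer arithmetic would.
    begin = (START_LAYER_ID + local_layer_id) * LAYER_STRIDE
    pool[begin:begin + len(payload)] = payload


payload = b"\x01" * LAYER_STRIDE
pool_a = bytearray(TOTAL_LAYERS * LAYER_STRIDE)
pool_b = bytearray(TOTAL_LAYERS * LAYER_STRIDE)
write_via_slice(pool_a, 0, payload)
write_via_offset(pool_b, 0, payload)
```

Both pools end up identical: the stage-local layer 0 lands at global layer 2, and the bytes belonging to earlier layers stay untouched.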

Add tests/test_transfer_correctness.py, a parametrized correctness test suite covering CPU-only, CPU+SSD, GDS, layerwise-CPU-only and layerwise-CPU+SSD backends across same-node PP, simulated cross-node PP, same-node TP, simulated cross-node TP, PP+TP and cross-node PP+TP parallel configurations. Tests automatically downgrade to single-GPU simulation when physical GPUs are insufficient and skip unavailable backends.
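The fallback behaviour can be pictured with a small planning helper. This is only a sketch under stated assumptions: `plan_devices` and `runnable_backends` are hypothetical and are not taken from `tests/test_transfer_correctness.py`, and the sketch assumes that simulation means placing every worker on device 0.

```python
def plan_devices(pp_size: int, tp_size: int, available_gpus: int):
    """Map each (pp_rank, tp_rank) worker to a GPU index.

    Returns (plan, simulated). When fewer GPUs are available than
    pp_size * tp_size, every worker is placed on device 0 (assumed
    meaning of "single-GPU simulation").
    """
    if available_gpus < 1:
        raise ValueError("at least one GPU is needed, even for simulation")
    simulated = available_gpus < pp_size * tp_size
    plan = {}
    for pp_rank in range(pp_size):
        for tp_rank in range(tp_size):
            rank = pp_rank * tp_size + tp_rank
            plan[(pp_rank, tp_rank)] = 0 if simulated else rank
    return plan, simulated


def runnable_backends(requested, available):
    # Keep the requested order and drop backends that are not available
    # on this machine, mirroring the "skip unavailable backends" behaviour.
    return [name for name in requested if name in available]


full_plan, full_simulated = plan_devices(pp_size=2, tp_size=2, available_gpus=4)
small_plan, small_simulated = plan_devices(pp_size=2, tp_size=2, available_gpus=1)
backends = runnable_backends(["cpu", "cpu+ssd", "gds"], {"cpu", "cpu+ssd"})
```

In a pytest suite, the same decision would typically feed `pytest.mark.parametrize` for the configurations and `pytest.skip` for the unavailable backends.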


zhjc1124 force-pushed the fix/single-node-pp-cpu-pool branch from 22222ed to 92b4c49 Compare May 21, 2026 03:58

zhjc1124 marked this pull request as draft May 22, 2026 01:26

zhjc1124 wants to merge 1 commit into taco-project:feat/layerwise_rebase from zhjc1124:fix/single-node-pp-cpu-pool
