Add caller-side byte-range selection (sub-file reads) + EP-slice demo…#81
Add caller-side byte-range selection (sub-file reads) + EP-slice demo…#81gitbisector wants to merge 2 commits into
Conversation
…nstrator Prototype for foundation-model-stack#71: let a caller restrict which byte ranges of a shard are actually read, so an expert-parallel rank can read only the experts it owns instead of reading the whole file and discarding the rest. - SafeTensorsMetadata.select_byte_ranges(keep_tensor): merged [start,end) file byte-ranges covering exactly the kept tensors (adjacent kept tensors are coalesced within a page to limit the number of reads). - CopierInterface.set_byte_ranges(): read only those runs; the device buffer is still allocated for the full data section so tensor offsets are unchanged (byte_ranges=None reproduces the original full-file read). Implemented for the nogds and unified copiers; gds/threefs inherit the base no-op (full read). - SafeTensorsFileLoader.set_tensor_filter(keep_tensor): wire a keep(name) predicate through copy_files_to_device to the copier. - ParallelLoader/PipelineParallel: tensor_filter= passthrough, and all_local= on ParallelLoader (loads via a single-process group so a per-rank filter is not broken by get_tensor's cross-rank broadcast). - fastsafetensors.ep_slice: owned_expert_range / expert_parallel_filter (contiguous-block "linear" expert assignment; no external dependency) plus expert_parallel_filter_from_env. - tests/unit/test_ep_slice.py: range math, select_byte_ranges coverage, and byte-identical partial-read + full-read-unchanged regressions for both the nogds and unified copiers (the unified pair is GPU-gated). Motivation/measurements: on 2x DGX Spark (GB10), reading only owned experts cut DeepSeek-V4-Flash weight load ~17.6s -> ~13.0s under EP=2 (~half the bytes per rank). This is the I/O-side counterpart of the memory-side sub-file batching in this issue; both want the same byte-range read primitive. Verified byte-identical on stock 0.3.2 (GB10): copier partial read, a 2-rank gloo ParallelLoader all-local run, and a real DeepSeek-V4-Flash shard (256 experts, EP=2) where both the nogds and unified copiers load the rank's owned experts byte-identical while reading ~52% of the bytes (no broadcast). Signed-off-by: git bisector <gitbisector@gmail.com>
77a7ebf to
eb9d8af
Compare
|
@gitbisector Right now So, please update the code to pass a test like this: def test_tensor_filter_hides_skipped_tensors(fstcpp_log, input_files, framework):
device, _ = get_and_check_device(framework)
meta = SafeTensorsMetadata.from_file(input_files[0], framework)
kept = set(sorted(meta.tensors.keys())[::2])
keep = lambda name: name in kept
skipped = next(name for name in meta.tensors if name not in kept)
loader = SafeTensorsFileLoader(
pg=SingleGroup(),
device=device.as_str(),
framework=framework.get_name(),
nogds=True,
)
loader.set_tensor_filter(keep)
loader.add_filenames({0: [input_files[0]]})
bufs = loader.copy_files_to_device()
assert set(loader.get_keys()) == kept
assert skipped not in bufs.key_to_rank_lidx
with pytest.raises(ValueError):
bufs.get_tensor(skipped) |
set_tensor_filter() previously only narrowed which bytes the nogds/unified copiers read; the public API still advertised filtered-out tensors and get_tensor / get_sharded returned views onto uninitialized device memory rather than raising. Reported by @takeshi-yoshimura in foundation-model-stack#81 review. * SafeTensorsFileLoader.get_keys() honors _tensor_filter at read time so set_tensor_filter() may still be called before or after add_filenames(). * FilesBufferOnDevice gains an optional keep_tensor kwarg and excludes filtered keys from key_to_rank_lidx at construction time. SafeTensorsFileLoader.copy_files_to_device() threads its own _tensor_filter through. get_tensor / get_sharded therefore hit the existing _get_rank_lidx ValueError instead of returning uninitialized data; fastsafe_open.keys() inherits this because it iterates key_to_rank_lidx. * ParallelLoader.iterate_weights() already iterates list(fb.key_to_rank_lidx.keys()), so it skips filtered-out tensors with no further change. Adds two tests to tests/unit/test_ep_slice.py covering the new contract: the exact test from the PR review (get_keys / key_to_rank_lidx / get_tensor) plus a ParallelLoader.iterate_weights variant.
|
@takeshi-yoshimura Thanks for the careful review.
I added your exact test as Let me know if you'd like the iterate-weights test trimmed or any other adjustments. (this comment mostly from claude) |
set_tensor_filter() previously only narrowed which bytes the nogds/unified copiers read; the public API still advertised filtered tensors and get_tensor / get_sharded returned views onto uninitialized device memory. Reported by @takeshi-yoshimura in foundation-model-stack#81 review. * BaseSafeTensorsFileLoader.get_keys() and get_shape() consult _tensor_filter at read time (set_tensor_filter can be called before or after add_filenames). * FilesBufferOnDevice takes an optional keep_tensor kwarg that excludes filtered keys from key_to_rank_lidx. copy_files_to_device() threads self._tensor_filter through. get_tensor / get_sharded then hit the existing _get_rank_lidx ValueError; fastsafe_open.keys() inherits the same narrowing. * FilesBufferOnDevice.get_filename now raises ValueError via _get_rank_lidx for unreachable tensors (previously returned ""). The behavior change also affects unknown names; the tgis_weight example is updated to probe key_to_rank_lidx directly. * ParallelLoader.iterate_weights() iterates fb.key_to_rank_lidx.keys(), so it skips filtered tensors automatically. Adds two tests in test_fastsafetensors.py covering the API-surface contract (so they run under the existing test-torch CI job). Signed-off-by: git bisector <gitbisector@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>
260af04 to
4a8b4a4
Compare
Prototype for #71, from the I/O angle rather than the memory angle.
Background
This came out of optimizing DeepSeek-V4-Flash cold-start load on 2× DGX Spark (GB10) under expert parallelism. It needs exactly the sub-file / byte-range read primitive proposed in #71, so I'm sharing it as @takeshi-yoshimura suggested — a second use case and a concrete data point from the I/O side.
The I/O problem (sibling of the memory problem in this issue)
With file-granular loading, every expert-parallel rank reads the whole shard and then discards the experts it doesn't own.
BaseSafeTensorsFileLoaderregisters and copies at file granularity, so there's no way to read less. On a 2-way EP split that's ~2× the necessary bytes per rank. Same root cause as the memory overhead here (the file is the minimum unit) — it just costs redundant I/O instead of peak VRAM.What this adds (minimal primitive + a demonstrator)
SafeTensorsMetadata.select_byte_ranges(keep_tensor)— merged[start, end)file byte-ranges covering exactly the tensors a predicate keeps (adjacent kept tensors coalesced within a page to limit read count).CopierInterface.set_byte_ranges(...)— read only those runs. The device buffer is still allocated for the full data section, so tensor offsets are unchanged and unread regions just stay uninitialized (the caller must not request skipped tensors).Nonereproduces the original full-file read. Implemented for thenogdsandunifiedcopiers;gds/threefsinherit the base no-op (full read).SafeTensorsFileLoader.set_tensor_filter(keep_tensor)— wire akeep(name)predicate throughcopy_files_to_deviceto the copier.fastsafetensors.ep_slice—expert_parallel_filter(num_experts, ep_size, ep_rank)(keeps non-expert tensors + this rank's owned experts; owned range matches vLLM'sdetermine_expert_map("linear"), no vLLM import), plusexpert_parallel_filter_from_env().ParallelLoader/PipelineParallel—tensor_filter=andall_local=passthroughs so EP-slice works through the high-level pipelined loader too (the path vLLM is moving toward in [Core] Use fastsafetensors ParallelLoader for weight loading vllm-project/vllm#40183), not only the low-level loader.Using it
Low-level loader:
Or via
ParallelLoader, which now takes the filter directly:all_local=Truemakes every rank read its own shards without any cross-rank broadcast. Required, because under the default file-shard+broadcast assignment a per-rank filter would broadcast unread bytes to the rank that owns those tensors. vLLM already constructs these loaders, so enabling EP-slice is just passing these two extra args (it already hasep_size/ep_rank/num_experts).Out of scope
max_batch_bytes/ batch loop, that's the chunking design in [Discussion] Sub-file batching — Decouple GPU memory usage from model file size #71. This is only the byte-range read primitive both features need;select_byte_ranges/set_byte_rangescan re-shape onto the API the chunking work settles on.nogdsandunifiedcopiers (unifiedis the default on GB10 — it's where the benchmarks below run).gds/threefsinherit the baseset_byte_rangesno-op and read the full file; the sameselect_byte_rangesoutput can drive them too as a follow-up.GB10 benchmarks
DeepSeek-V4-Flash (~159.6 GB / 46 shards), 2× DGX Spark, TP=2 + EP=2, cold page cache. (Times below were measured with the separate O_DIRECT reader noted under the table; the bytes read / rank column is what this PR changes.)
98% of V4-Flash's keys are per-expert tensors. Under linear EP each rank's owned set is a contiguous block and
select_byte_rangescoalesces it into a handful of large runs.unifiedpath as-is. The EP-slice mechanism this PR adds — reading only the owned byte ranges, ~half the bytes per rank — runs on the stockunified/nogdscopiers and is the portable win. On the stockpin_memorypath this PR ships, the same EP=2 load is ~19.7 s (the separate O_DIRECT reader brings it to ~13.2 s).Validation
tests/unit/test_ep_slice.py): EP-range math,select_byte_rangescoverage, and byte-identical partial-read + full-read-unchanged regressions for thenogdsandunifiedcopiers (the unified pair is GPU-gated). All 10 pass on a GB10.ParallelLoaderrun (all_local=True): each rank reads only its owned experts, no broadcast, byte-identical to a full load.set_byte_ranges(None)reproduces the full load exactly.