Skip to content

Add caller-side byte-range selection (sub-file reads) + EP-slice demo…#81

Draft
gitbisector wants to merge 2 commits into
foundation-model-stack:mainfrom
gitbisector:ep-slice-prototype
Draft

Add caller-side byte-range selection (sub-file reads) + EP-slice demo…#81
gitbisector wants to merge 2 commits into
foundation-model-stack:mainfrom
gitbisector:ep-slice-prototype

Conversation

@gitbisector

@gitbisector gitbisector commented May 25, 2026

Copy link
Copy Markdown
Contributor

Prototype for #71, from the I/O angle rather than the memory angle.

Background

This came out of optimizing DeepSeek-V4-Flash cold-start load on 2× DGX Spark (GB10) under expert parallelism. It needs exactly the sub-file / byte-range read primitive proposed in #71, so I'm sharing it as @takeshi-yoshimura suggested — a second use case and a concrete data point from the I/O side.

The I/O problem (sibling of the memory problem in this issue)

With file-granular loading, every expert-parallel rank reads the whole shard and then discards the experts it doesn't own. BaseSafeTensorsFileLoader registers and copies at file granularity, so there's no way to read less. On a 2-way EP split that's ~2× the necessary bytes per rank. Same root cause as the memory overhead here (the file is the minimum unit) — it just costs redundant I/O instead of peak VRAM.

What this adds (minimal primitive + a demonstrator)

  • SafeTensorsMetadata.select_byte_ranges(keep_tensor) — merged [start, end) file byte-ranges covering exactly the tensors a predicate keeps (adjacent kept tensors coalesced within a page to limit read count).
  • CopierInterface.set_byte_ranges(...) — read only those runs. The device buffer is still allocated for the full data section, so tensor offsets are unchanged and unread regions just stay uninitialized (the caller must not request skipped tensors). None reproduces the original full-file read. Implemented for the nogds and unified copiers; gds/threefs inherit the base no-op (full read).
  • SafeTensorsFileLoader.set_tensor_filter(keep_tensor) — wire a keep(name) predicate through copy_files_to_device to the copier.
  • fastsafetensors.ep_sliceexpert_parallel_filter(num_experts, ep_size, ep_rank) (keeps non-expert tensors + this rank's owned experts; owned range matches vLLM's determine_expert_map("linear"), no vLLM import), plus expert_parallel_filter_from_env().
  • ParallelLoader / PipelineParalleltensor_filter= and all_local= passthroughs so EP-slice works through the high-level pipelined loader too (the path vLLM is moving toward in [Core] Use fastsafetensors ParallelLoader for weight loading vllm-project/vllm#40183), not only the low-level loader.

Using it

Low-level loader:

loader = SafeTensorsFileLoader(pg, device, nogds=True)
loader.set_tensor_filter(expert_parallel_filter(num_experts=256, ep_size=2, ep_rank=rank))
loader.add_filenames({rank: shards})
fb = loader.copy_files_to_device()

Or via ParallelLoader, which now takes the filter directly:

pl = ParallelLoader(pg, shards, device=device, nogds=True,
                    all_local=True,
                    tensor_filter=expert_parallel_filter(256, ep_size=2, ep_rank=rank))
for name, tensor in pl.iterate_weights():
    ...

all_local=True makes every rank read its own shards without any cross-rank broadcast. Required, because under the default file-shard+broadcast assignment a per-rank filter would broadcast unread bytes to the rank that owns those tensors. vLLM already constructs these loaders, so enabling EP-slice is just passing these two extra args (it already has ep_size/ep_rank/num_experts).

Out of scope

  • No max_batch_bytes / batch loop, that's the chunking design in [Discussion] Sub-file batching — Decouple GPU memory usage from model file size #71. This is only the byte-range read primitive both features need; select_byte_ranges / set_byte_ranges can re-shape onto the API the chunking work settles on.
  • Range reads are implemented for the nogds and unified copiers (unified is the default on GB10 — it's where the benchmarks below run). gds / threefs inherit the base set_byte_ranges no-op and read the full file; the same select_byte_ranges output can drive them too as a follow-up.

GB10 benchmarks

DeepSeek-V4-Flash (~159.6 GB / 46 shards), 2× DGX Spark, TP=2 + EP=2, cold page cache. (Times below were measured with the separate O_DIRECT reader noted under the table; the bytes read / rank column is what this PR changes.)

config weight load bytes read / rank
full-file read (baseline) 17.6 s 159 GB
EP-slice (owned experts only) 13.0 s ~84 GB
EP-slice + skip views for unowned experts 12.2 s ~84 GB

98% of V4-Flash's keys are per-expert tensors. Under linear EP each rank's owned set is a contiguous block and select_byte_ranges coalesces it into a handful of large runs.

  • The absolute times include an extra multithreaded O_DIRECT throughput optimization layered on the unified copier (a separate change I can send on its own), so they're not from the stock unified path as-is. The EP-slice mechanism this PR adds — reading only the owned byte ranges, ~half the bytes per rank — runs on the stock unified/nogds copiers and is the portable win. On the stock pin_memory path this PR ships, the same EP=2 load is ~19.7 s (the separate O_DIRECT reader brings it to ~13.2 s).
  • The extra 13.0 → 12.2 s comes from also skipping dlpack view creation for unowned experts; I left that out here to keep the diff focused.

Validation

  • New unit tests (tests/unit/test_ep_slice.py): EP-range math, select_byte_ranges coverage, and byte-identical partial-read + full-read-unchanged regressions for the nogds and unified copiers (the unified pair is GPU-gated). All 10 pass on a GB10.
  • 2-rank gloo ParallelLoader run (all_local=True): each rank reads only its owned experts, no broadcast, byte-identical to a full load.
  • Verified on a real DeepSeek-V4-Flash shard on GB10 (256 experts, EP=2): rank 0's filter coalesces into 29 runs reading 52% of the bytes (its 128 owned experts + the non-expert tensors); set_byte_ranges(None) reproduces the full load exactly.

@gitbisector gitbisector marked this pull request as draft May 25, 2026 17:12
…nstrator

Prototype for foundation-model-stack#71: let a caller restrict which byte ranges of a shard are
actually read, so an expert-parallel rank can read only the experts it owns
instead of reading the whole file and discarding the rest.

- SafeTensorsMetadata.select_byte_ranges(keep_tensor): merged [start,end) file
  byte-ranges covering exactly the kept tensors (adjacent kept tensors are
  coalesced within a page to limit the number of reads).
- CopierInterface.set_byte_ranges(): read only those runs; the device buffer is
  still allocated for the full data section so tensor offsets are unchanged
  (byte_ranges=None reproduces the original full-file read). Implemented for the
  nogds and unified copiers; gds/threefs inherit the base no-op (full read).
- SafeTensorsFileLoader.set_tensor_filter(keep_tensor): wire a keep(name)
  predicate through copy_files_to_device to the copier.
- ParallelLoader/PipelineParallel: tensor_filter= passthrough, and all_local=
  on ParallelLoader (loads via a single-process group so a per-rank filter is
  not broken by get_tensor's cross-rank broadcast).
- fastsafetensors.ep_slice: owned_expert_range / expert_parallel_filter
  (contiguous-block "linear" expert assignment; no external dependency) plus
  expert_parallel_filter_from_env.
- tests/unit/test_ep_slice.py: range math, select_byte_ranges coverage, and
  byte-identical partial-read + full-read-unchanged regressions for both the
  nogds and unified copiers (the unified pair is GPU-gated).

Motivation/measurements: on 2x DGX Spark (GB10), reading only owned experts cut
DeepSeek-V4-Flash weight load ~17.6s -> ~13.0s under EP=2 (~half the bytes per
rank). This is the I/O-side counterpart of the memory-side sub-file batching in
this issue; both want the same byte-range read primitive. Verified byte-identical
on stock 0.3.2 (GB10): copier partial read, a 2-rank gloo ParallelLoader
all-local run, and a real DeepSeek-V4-Flash shard (256 experts, EP=2) where both
the nogds and unified copiers load the rank's owned experts byte-identical while
reading ~52% of the bytes (no broadcast).

Signed-off-by: git bisector <gitbisector@gmail.com>
@gitbisector gitbisector force-pushed the ep-slice-prototype branch from 77a7ebf to eb9d8af Compare May 25, 2026 17:25
@takeshi-yoshimura

Copy link
Copy Markdown
Collaborator

@gitbisector
Thank you for your great contribution! The code overall looks nice and set_tensor_filter() reduces I/O, but skipped tensors are still exposed through the APIs.

Right now loader.get_keys() / FilesBufferOnDevice / ParallelLoader.iterate_weights() do not seem filter-aware, so a skipped tensor can still be accessed and may read back uninitialized bytes.

So, please update the code to pass a test like this:

def test_tensor_filter_hides_skipped_tensors(fstcpp_log, input_files, framework):
    device, _ = get_and_check_device(framework)
    meta = SafeTensorsMetadata.from_file(input_files[0], framework)

    kept = set(sorted(meta.tensors.keys())[::2])
    keep = lambda name: name in kept
    skipped = next(name for name in meta.tensors if name not in kept)

    loader = SafeTensorsFileLoader(
        pg=SingleGroup(),
        device=device.as_str(),
        framework=framework.get_name(),
        nogds=True,
    )
    loader.set_tensor_filter(keep)
    loader.add_filenames({0: [input_files[0]]})
    bufs = loader.copy_files_to_device()

    assert set(loader.get_keys()) == kept
    assert skipped not in bufs.key_to_rank_lidx
    with pytest.raises(ValueError):
        bufs.get_tensor(skipped)

gitbisector added a commit to gitbisector/fastsafetensors that referenced this pull request May 31, 2026
set_tensor_filter() previously only narrowed which bytes the nogds/unified
copiers read; the public API still advertised filtered-out tensors and
get_tensor / get_sharded returned views onto uninitialized device memory
rather than raising. Reported by @takeshi-yoshimura in foundation-model-stack#81 review.

  * SafeTensorsFileLoader.get_keys() honors _tensor_filter at read time
    so set_tensor_filter() may still be called before or after
    add_filenames().
  * FilesBufferOnDevice gains an optional keep_tensor kwarg and excludes
    filtered keys from key_to_rank_lidx at construction time.
    SafeTensorsFileLoader.copy_files_to_device() threads its own
    _tensor_filter through. get_tensor / get_sharded therefore hit the
    existing _get_rank_lidx ValueError instead of returning uninitialized
    data; fastsafe_open.keys() inherits this because it iterates
    key_to_rank_lidx.
  * ParallelLoader.iterate_weights() already iterates
    list(fb.key_to_rank_lidx.keys()), so it skips filtered-out tensors
    with no further change.

Adds two tests to tests/unit/test_ep_slice.py covering the new contract:
the exact test from the PR review (get_keys / key_to_rank_lidx /
get_tensor) plus a ParallelLoader.iterate_weights variant.
@gitbisector

gitbisector commented May 31, 2026

Copy link
Copy Markdown
Contributor Author

@takeshi-yoshimura Thanks for the careful review.

  • SafeTensorsFileLoader.get_keys() now honors _tensor_filter (read-time, so set_tensor_filter() can still be called before or after add_filenames).
  • FilesBufferOnDevice.__init__ gains an optional keep_tensor kwarg; copy_files_to_device() threads self._tensor_filter through, so filtered tensors no longer enter key_to_rank_lidx.
  • bufs.get_tensor(skipped) / bufs.get_sharded(skipped) therefore hit the existing _get_rank_lidx ValueError rather than returning uninitialized device memory — no new guard needed.
  • ParallelLoader.iterate_weights() already iterates list(fb.key_to_rank_lidx.keys()), so it skips filtered tensors automatically.
  • fastsafe_open.keys() also iterates key_to_rank_lidx, so it inherits the same behavior.

I added your exact test as test_tensor_filter_hides_skipped_tensors plus a sibling test_tensor_filter_iterate_weights_hides_skipped that drives ParallelLoader over the same gpt2 fixture and asserts the yielded set equals kept. Both pass against this branch on CPU and on a GB10 and the full test_ep_slice.py + test_fastsafetensors.py + test_config.py + test_auto_loader.py suite stays green (127 passed, 2 skipped) on CPU-only. set_tensor_filter() docstring also updated.

Let me know if you'd like the iterate-weights test trimmed or any other adjustments.

(this comment mostly from claude)

set_tensor_filter() previously only narrowed which bytes the nogds/unified
copiers read; the public API still advertised filtered tensors and
get_tensor / get_sharded returned views onto uninitialized device memory.
Reported by @takeshi-yoshimura in foundation-model-stack#81 review.

  * BaseSafeTensorsFileLoader.get_keys() and get_shape() consult
    _tensor_filter at read time (set_tensor_filter can be called before
    or after add_filenames).
  * FilesBufferOnDevice takes an optional keep_tensor kwarg that excludes
    filtered keys from key_to_rank_lidx. copy_files_to_device() threads
    self._tensor_filter through. get_tensor / get_sharded then hit the
    existing _get_rank_lidx ValueError; fastsafe_open.keys() inherits the
    same narrowing.
  * FilesBufferOnDevice.get_filename now raises ValueError via
    _get_rank_lidx for unreachable tensors (previously returned ""). The
    behavior change also affects unknown names; the tgis_weight example
    is updated to probe key_to_rank_lidx directly.
  * ParallelLoader.iterate_weights() iterates fb.key_to_rank_lidx.keys(),
    so it skips filtered tensors automatically.

Adds two tests in test_fastsafetensors.py covering the API-surface
contract (so they run under the existing test-torch CI job).

Signed-off-by: git bisector <gitbisector@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
@gitbisector gitbisector force-pushed the ep-slice-prototype branch from 260af04 to 4a8b4a4 Compare May 31, 2026 03:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants