Fall back to nogds at runtime when cuFile handle registration fails #87

Open
gitbisector wants to merge 1 commit into foundation-model-stack:main from gitbisector:gds-runtime-fallback

@gitbisector

Contributor

cuFile can probe as available (is_gds_supported() / is_cufile_found() pass) yet fail cuFileHandleRegister at I/O time. This happens on compat-mode hosts without the nvidia-fs kernel module and when checkpoints sit on filesystems cuFile can't register (e.g. overlayfs on CI runners), among other cases. Today that surfaces as a hard RuntimeError: raw_gds_file_handle: cuFileHandleRegister returned an error = 5027 from GdsFileCopier.submit_io, and every consumer ends up carrying its own gds→nogds retry wrapper. For example, vllm-project/vllm#40183 needed exactly that fix when its CI runner hit 5027 (weights on overlayfs, probe says GDS is fine).
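For illustration, the kind of consumer-side retry wrapper described above might look roughly like the following. This is a minimal sketch, not code from vLLM or this repository: `load_with_gds_retry` and the `load(path, nogds=...)` callable are hypothetical names, and matching on the error message is an assumption.

```python
def load_with_gds_retry(load, path):
    """Try the GDS path first; retry with nogds only on a registration failure.

    `load` is a hypothetical callable taking (path, nogds=...); it stands in
    for whatever loader entry point the consumer actually uses.
    """
    try:
        return load(path, nogds=False)
    except RuntimeError as exc:
        # Assumption: the failure is recognised by the message quoted above.
        # Any other RuntimeError is re-raised unchanged.
        if "cuFileHandleRegister" not in str(exc):
            raise
        return load(path, nogds=True)
```

Every consumer that wants to survive a 5027 has to carry some variant of this, which is the duplication the change removes.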

This catches the registration failure inside GdsFileCopier.submit_io, warns once, and transparently delegates the copier to the nogds bounce path. The fallback copier (and its bounce-buffer reader) lives only for that file's submit/wait cycle, so no pinned memory outlives the load.
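The catch, warn-once, and delegate pattern described here can be sketched with stand-in classes. This is not the PR's implementation: `RegistrationError`, `BounceCopier`, `DirectCopier`, and the `register_handle` callable are all hypothetical, and the real copier reads into device memory rather than returning bytes.

```python
import warnings


class RegistrationError(RuntimeError):
    """Hypothetical stand-in for the error raised when handle registration fails."""


class BounceCopier:
    """Hypothetical stand-in for the nogds bounce-buffer copier."""

    def __init__(self, path):
        self.path = path
        self.buffer = None

    def submit_io(self):
        # Read through host memory; a real copier would stage into a pinned buffer.
        with open(self.path, "rb") as f:
            self.buffer = f.read()

    def wait_io(self):
        # Hand the data back and release the bounce buffer in one step.
        data, self.buffer = self.buffer, None
        return data


class DirectCopier:
    """Hypothetical stand-in for the GDS copier with a runtime fallback."""

    _warned = False  # class-level flag so the warning fires only once

    def __init__(self, path, register_handle):
        self.path = path
        self.register_handle = register_handle  # assumed callable; may raise
        self.handle = None
        self.fallback = None

    def submit_io(self):
        try:
            self.handle = self.register_handle(self.path)
        except RegistrationError as exc:
            if not DirectCopier._warned:
                warnings.warn(f"GDS registration failed ({exc}); falling back to nogds")
                DirectCopier._warned = True
            # Delegate this file's I/O to the bounce path.
            self.fallback = BounceCopier(self.path)
            self.fallback.submit_io()
            return
        # The direct cuFile read would be issued here (omitted in this sketch).

    def wait_io(self):
        if self.fallback is not None:
            # Drop the fallback so it lives only for this submit/wait cycle.
            fallback, self.fallback = self.fallback, None
            return fallback.wait_io()
        raise NotImplementedError("direct GDS read is outside this sketch")
```

Clearing `self.fallback` in `wait_io` mirrors the stated lifetime: the fallback copier and its buffer exist only between submit and wait for that file.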

Test: monkeypatched gds_file_handle raising the 5027 error → load completes via the fallback, tensors byte-identical to safetensors.load_file, and the bounce buffer is released afterwards.

make lint clean; unit suite green on CPU and on a CUDA host where cuFile genuinely fails registration (the fallback turns that previously-failing environment green).

@takeshi-yoshimura

Collaborator

Thanks for the contribution. The change looks good to me, but please add a Signed-off-by line to your commit. I will then merge this after change #81.

@gitbisector force-pushed the gds-runtime-fallback branch from e5b8794 to 96ff25f on July 3, 2026 04:46
cuFile can probe as available yet fail cuFileHandleRegister at I/O time
(compat-mode hosts without nvidia-fs, checkpoints on overlayfs, CI runners).
Catch the failure in submit_io, warn once, and transparently delegate the
copier to the nogds bounce path -- so every consumer stops carrying its own
gds->nogds retry wrapper. The fallback (and its bounce-buffer reader) lives
only for the file's submit/wait cycle.

Signed-off-by: git bisector <gitbisector@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
@gitbisector force-pushed the gds-runtime-fallback branch from 96ff25f to 7775b5e on July 3, 2026 05:32

2 participants