[for 26.04_linux-nvidia-bos]: NVIDIA: SAUCE: vfio/nvgrace-gpu: Add Blackwell-Next GPU readiness check via CXL DVSEC #382

Closed
nirmoy wants to merge 1 commit into NVIDIA:26.04_linux-nvidia-bos from nirmoy:nvgrace_readiness_for_7.0_bos

nirmoy (Collaborator) commented Apr 20, 2026

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-7.0/+bug/2148701

Summary

Backport of the vfio/nvgrace-gpu Blackwell-Next (VR) GPU readiness check (v3) from LKML to 26.04_linux-nvidia-bos.

Patch: https://lore.kernel.org/all/20260416014504.63067-1-ankita@nvidia.com/

Blackwell-Next (VR) GPUs report device readiness via the CXL DVSEC Range 1 Low register instead of the BAR0 HBM training register used by GB200. The patch adds runtime detection: it checks for the presence of the DVSEC register and uses the new method if it is present, otherwise it falls back to the legacy approach.

Jira: https://jirasw.nvidia.com/browse/DGX-16091

Tested with: GPU passthrough test on Blackwell-Next (VR) hardware.

…ck via CXL DVSEC

Add a CXL DVSEC-based readiness check for Blackwell-Next GPUs alongside
the existing legacy BAR0 polling path. On probe and after reset, the
driver reads the CXL Device DVSEC capability to determine whether the
GPU memory is valid. This is checked by polling on the Memory_Active bit
based on the Memory_Active_Timeout. Also check if MEM_INFO_VALID is set
within 1 second per CXL spec 4.0 Tables 8-13. If not, return error.
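
The two-stage wait described above can be sketched in user-space C as follows. This is a simulation under stated assumptions, not the patch's code: the bit positions (MEM_INFO_VALID at bit 0, Memory_Active at bit 1), the order of the two checks, the 10 ms poll period, and every function and type name here are illustrative. Only the 1 second MEM_INFO_VALID limit comes from the commit message. The `fake_gpu` type stands in for the real config-space read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of the CXL DVSEC range register (verify against the spec). */
#define MEM_INFO_VALID            (1u << 0) /* assumption: bit 0 */
#define MEM_ACTIVE                (1u << 1) /* assumption: bit 1 */
#define MEM_INFO_VALID_TIMEOUT_MS 1000u     /* 1 second, per the commit message */
#define POLL_INTERVAL_MS          10u       /* hypothetical poll period */

/* Hypothetical accessors: read the register, sleep for a number of ms. */
typedef uint32_t (*read_reg_fn)(void *ctx);
typedef void (*sleep_ms_fn)(void *ctx, unsigned int ms);

/* Poll until `bit` is set or `timeout_ms` has elapsed. */
static bool poll_bit(read_reg_fn read_reg, sleep_ms_fn sleep_ms, void *ctx,
                     uint32_t bit, unsigned int timeout_ms)
{
    unsigned int waited = 0;

    assert(read_reg && sleep_ms);
    for (;;) {
        if (read_reg(ctx) & bit)
            return true;
        if (waited >= timeout_ms)
            return false;
        sleep_ms(ctx, POLL_INTERVAL_MS);
        waited += POLL_INTERVAL_MS;
    }
}

/* Returns 0 when memory is valid and active, -1 on either timeout. */
static int wait_gpu_mem_ready(read_reg_fn read_reg, sleep_ms_fn sleep_ms,
                              void *ctx, unsigned int active_timeout_ms)
{
    if (!poll_bit(read_reg, sleep_ms, ctx, MEM_INFO_VALID,
                  MEM_INFO_VALID_TIMEOUT_MS))
        return -1;
    if (!poll_bit(read_reg, sleep_ms, ctx, MEM_ACTIVE, active_timeout_ms))
        return -1;
    return 0;
}

/* Simulated device: each bit turns on once a virtual clock passes a mark. */
struct fake_gpu {
    unsigned int now_ms;
    unsigned int valid_at_ms;
    unsigned int active_at_ms;
};

static uint32_t fake_read(void *ctx)
{
    struct fake_gpu *g = ctx;
    uint32_t v = 0;

    if (g->now_ms >= g->valid_at_ms)
        v |= MEM_INFO_VALID;
    if (g->now_ms >= g->active_at_ms)
        v |= MEM_ACTIVE;
    return v;
}

static void fake_sleep(void *ctx, unsigned int ms)
{
    struct fake_gpu *g = ctx;

    g->now_ms += ms; /* advance the virtual clock instead of sleeping */
}
```

Passing the accessors as function pointers keeps the polling logic testable without hardware; a kernel implementation would read PCI config space and use the kernel's own sleep helpers instead.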

A static inline wrapper dispatches to the appropriate readiness check
based on whether the CXL DVSEC capability is present.
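
A minimal sketch of such a dispatch wrapper, written as stand-alone C rather than kernel code. The struct, the function names, and the `last_path` field are hypothetical; only the `cxl_dvsec` field name is taken from the backport note. The sketch assumes the field holds the capability's config-space offset and is 0 when the capability is absent, which is how the kernel's `pci_find_dvsec_capability()` reports "not found". Whether the patch stores the offset this way is not confirmed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Which readiness path was taken; recorded only for demonstration. */
enum ready_path { PATH_LEGACY_BAR0, PATH_CXL_DVSEC };

/* Hypothetical stand-in for the driver's per-device state. */
struct gpu_dev {
    uint16_t cxl_dvsec;        /* assumed: DVSEC offset, 0 if absent */
    enum ready_path last_path; /* illustration only */
};

/* Stub for the CXL DVSEC readiness check (Blackwell-Next path). */
static int wait_ready_cxl(struct gpu_dev *dev)
{
    dev->last_path = PATH_CXL_DVSEC;
    return 0;
}

/* Stub for the legacy BAR0 polling check (GB200 path). */
static int wait_ready_legacy(struct gpu_dev *dev)
{
    dev->last_path = PATH_LEGACY_BAR0;
    return 0;
}

/* Dispatch on capability presence, falling back to the legacy check. */
static inline int wait_device_ready(struct gpu_dev *dev)
{
    assert(dev != NULL);
    return dev->cxl_dvsec ? wait_ready_cxl(dev) : wait_ready_legacy(dev);
}
```

Keying the choice on the discovered capability rather than on a device-ID table means callers on the probe and reset paths need no knowledge of the GPU generation.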

Add PCI_DVSEC_CXL_MEM_ACTIVE_TIMEOUT to pci_regs.h for the timeout
field encoding.
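
For illustration, decoding a timeout field of this kind might look like the sketch below. The field position (bits 15:13) and the encoding table (0 to 4 mapping to 1, 4, 16, 64 and 256 seconds, higher values reserved) are assumptions based on one reading of the CXL specification and should be checked against the spec and the actual patch. The macro and function names are local to this example and are not the definitions added to pci_regs.h.

```c
#include <stdint.h>

/* Assumed field position within the range register: bits 15:13. */
#define MEM_ACTIVE_TIMEOUT_SHIFT 13u
#define MEM_ACTIVE_TIMEOUT_MASK  (0x7u << MEM_ACTIVE_TIMEOUT_SHIFT)

/* Assumed encoding: index is the 3-bit field value, entry is seconds. */
static const unsigned int timeout_secs[] = { 1, 4, 16, 64, 256 };

/*
 * Decode the Memory_Active_Timeout field from a raw register value.
 * Returns the timeout in seconds, or 0 for an encoding this sketch
 * treats as reserved.
 */
static unsigned int mem_active_timeout_secs(uint32_t reg)
{
    unsigned int enc = (reg & MEM_ACTIVE_TIMEOUT_MASK) >>
                       MEM_ACTIVE_TIMEOUT_SHIFT;

    if (enc >= sizeof(timeout_secs) / sizeof(timeout_secs[0]))
        return 0;
    return timeout_secs[enc];
}
```

A caller would have to decide how to treat the reserved case, for example by failing the readiness check or by falling back to a conservative default.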

cc: Kevin Tian <kevin.tian@intel.com>
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
(backported from https://lore.kernel.org/all/20260416014504.63067-1-ankita@nvidia.com/)
[nirmoy: kept both egm_node (existing EGM SAUCE) and cxl_dvsec in struct to avoid conflict with EGM backport]
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>

github-actions bot (Contributor) commented Apr 20, 2026

✅ Patchscan: No Missing Fixes

All cherry-picked commits have been checked — no missing upstream fixes found.

nirmoy marked this pull request as ready for review April 20, 2026 13:12
nirmoy force-pushed the nvgrace_readiness_for_7.0_bos branch from 77eb99a to 0224667 April 20, 2026 13:12

nvmochs (Collaborator) commented Apr 20, 2026

Verified this matches v3 on LKML.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>

clsotog (Collaborator) commented Apr 20, 2026

Acked-by: Carol L Soto <csoto@nvidia.com>

nvmochs (Collaborator) commented Apr 20, 2026

Applied:

❯ git lgo 8279f2b6cb5a -1
8279f2b6cb5a (nresolute/nvidia-bos-next) NVIDIA: SAUCE: vfio/nvgrace-gpu: Add Blackwell-Next GPU readiness check via CXL DVSEC

Closing PR.

nvmochs closed this Apr 20, 2026