[for 26.04_linux-nvidia]: NVIDIA: SAUCE: vfio/nvgrace-gpu: Add Blackwell-Next GPU readiness check via CXL DVSEC #380

Closed
nirmoy wants to merge 1 commit into NVIDIA:26.04_linux-nvidia from nirmoy:nvgrace_readiness_for_7.0_hwe

@nirmoy
Collaborator

@nirmoy nirmoy commented Apr 17, 2026

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-7.0/+bug/2148701

Summary

Backport of the vfio/nvgrace-gpu Blackwell-Next GPU readiness check (v3) from LKML to 26.04_linux-nvidia.

Patch: https://lore.kernel.org/all/20260416014504.63067-1-ankita@nvidia.com/

Blackwell-Next GPUs report device readiness via the CXL DVSEC Range 1 Low register (offset 0x1C) instead of the BAR0 HBM training register used by GB200. Adds runtime detection by checking the presence of the DVSEC register and routes to the new method if present, otherwise falls back to the legacy approach.
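
The detection step described above can be sketched in user-space C against a byte image of PCIe config space. This is an illustration only, not the code from the patch: `cfg_read32`, `find_dvsec`, `pick_method`, and the `readiness_method` enum are hypothetical names, and the sketch assumes the DVSEC in question is the CXL device DVSEC (vendor ID 0x1E98, DVSEC ID 0). In the kernel, the equivalent lookup would normally go through `pci_find_dvsec_capability()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_SPACE_SIZE      4096u   /* PCIe extended config space */
#define EXT_CAP_START       0x100u  /* first extended capability header */
#define EXT_CAP_ID_DVSEC    0x23u   /* Designated Vendor-Specific Extended Capability */
#define CXL_DVSEC_VENDOR_ID 0x1E98u /* vendor ID used in CXL DVSEC headers */
#define CXL_DVSEC_ID_DEVICE 0x0u    /* assumption: the CXL device DVSEC is the one checked */

/* Hypothetical names for the two readiness paths. */
enum readiness_method { READINESS_LEGACY_BAR0, READINESS_CXL_DVSEC };

/* Read a little-endian 32-bit value from the config-space image. */
static uint32_t cfg_read32(const uint8_t *cfg, uint32_t off)
{
    assert(cfg != NULL && off + 4 <= CFG_SPACE_SIZE);
    return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

/* Walk the extended capability list; return the DVSEC offset, or 0 if absent. */
static uint32_t find_dvsec(const uint8_t *cfg, uint32_t vendor, uint32_t id)
{
    uint32_t pos = EXT_CAP_START;
    int guard = (CFG_SPACE_SIZE - EXT_CAP_START) / 8; /* bound malformed lists */

    while (pos >= EXT_CAP_START && pos + 12 <= CFG_SPACE_SIZE && guard-- > 0) {
        uint32_t hdr = cfg_read32(cfg, pos);

        if (hdr == 0 || hdr == 0xFFFFFFFFu)
            break; /* no capability here, or device not responding */

        if ((hdr & 0xFFFFu) == EXT_CAP_ID_DVSEC) {
            uint32_t hdr1 = cfg_read32(cfg, pos + 4); /* bits 15:0 = DVSEC vendor ID */
            uint32_t hdr2 = cfg_read32(cfg, pos + 8); /* bits 15:0 = DVSEC ID */

            if ((hdr1 & 0xFFFFu) == vendor && (hdr2 & 0xFFFFu) == id)
                return pos;
        }
        pos = (hdr >> 20) & 0xFFCu; /* bits 31:20 = next capability offset */
    }
    return 0;
}

/* Route to the DVSEC method when the capability exists, else the legacy path. */
static enum readiness_method pick_method(const uint8_t *cfg)
{
    return find_dvsec(cfg, CXL_DVSEC_VENDOR_ID, CXL_DVSEC_ID_DEVICE)
               ? READINESS_CXL_DVSEC
               : READINESS_LEGACY_BAR0;
}
```

With an all-zero image, `pick_method()` selects the legacy BAR0 path. Once a DVSEC header with the CXL vendor ID is placed at 0x100, it selects the DVSEC path.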

One minor conflict resolved: the nvgrace_gpu_pci_core_device struct already had int egm_node from the existing EGM SAUCE in this branch; both fields are retained.

Test

Patch applies cleanly on top of 26.04_linux-nvidia. Boot test on Vera pending — this requires Blackwell-Next hardware for functional validation.
Todo:
[ ] VM passthrough test

…ck via CXL DVSEC

Blackwell-Next GPUs report device readiness via the CXL DVSEC Range 1 Low
register (offset 0x1C) instead of the BAR0 HBM training register used by
GB200. The GPU memory readiness is checked by polling for the Memory_Active
bit (bit 1) for the Memory_Active_Timeout (bits 15:13).

Add runtime detection by checking the presence of the DVSEC register.
Route to the new method if present, otherwise continue using the legacy
approach.

(backported from https://lore.kernel.org/all/20260416014504.63067-1-ankita@nvidia.com/)
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
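
For readers following the commit message, the Memory_Active poll can be sketched as below. This is a minimal user-space illustration, not the code from the patch: `read_reg_fn`, `wait_memory_active`, `fake_dev`, and `fake_read` are hypothetical, the 0x1C offset is assumed to be relative to the start of the DVSEC capability, and `max_polls` stands in for whatever duration the Memory_Active_Timeout encoding maps to. That mapping is left to the CXL specification and is not decoded here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DVSEC_RANGE1_LOW_OFFSET     0x1Cu     /* Range 1 Low, per the commit message */
#define MEMORY_ACTIVE_BIT           (1u << 1) /* bit 1 */
#define MEMORY_ACTIVE_TIMEOUT_SHIFT 13        /* bits 15:13 */
#define MEMORY_ACTIVE_TIMEOUT_MASK  0x7u

/* Hypothetical accessor: reads a 32-bit config register at the given offset. */
typedef uint32_t (*read_reg_fn)(void *ctx, uint16_t offset);

static bool memory_active(uint32_t range1_low)
{
    return (range1_low & MEMORY_ACTIVE_BIT) != 0;
}

/* Raw 3-bit timeout encoding; converting it to a duration is not done here. */
static unsigned timeout_field(uint32_t range1_low)
{
    return (range1_low >> MEMORY_ACTIVE_TIMEOUT_SHIFT) & MEMORY_ACTIVE_TIMEOUT_MASK;
}

/* Poll at most max_polls times. Returns 0 once Memory_Active is set, -1 otherwise. */
static int wait_memory_active(read_reg_fn rd, void *ctx, uint16_t dvsec_base,
                              unsigned max_polls)
{
    assert(rd != NULL);
    for (unsigned i = 0; i < max_polls; i++) {
        uint32_t v = rd(ctx, (uint16_t)(dvsec_base + DVSEC_RANGE1_LOW_OFFSET));

        if (memory_active(v))
            return 0;
        /* A real driver would sleep between polls instead of spinning. */
    }
    return -1;
}

/* Fake device for demonstration: reports Memory_Active on the Nth read. */
struct fake_dev {
    unsigned reads;
    unsigned ready_after;
};

static uint32_t fake_read(void *ctx, uint16_t offset)
{
    struct fake_dev *d = ctx;
    uint32_t v = 2u << MEMORY_ACTIVE_TIMEOUT_SHIFT; /* arbitrary timeout encoding */

    (void)offset;
    d->reads++;
    if (d->reads >= d->ready_after)
        v |= MEMORY_ACTIVE_BIT;
    return v;
}
```

Passing the register accessor in as a callback keeps the polling logic testable without hardware, which matters here because functional validation needs Blackwell-Next hardware.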
@github-actions
Contributor

✅ Patchscan: No Missing Fixes

All cherry-picked commits have been checked — no missing upstream fixes found.

@nirmoy nirmoy marked this pull request as draft April 17, 2026 17:45
@nvmochs
Collaborator

nvmochs commented Apr 17, 2026

@nirmoy I know this is in draft state, but I took a look and have a few comments:

  • The PR targets the wrong branch; it should target 7.0-bos
  • The "backported from" tag (and your SOB) should come after the SOBs from the LKML posting
  • The patch is missing a context note about the collision that required fixup
    • The PR description is also a bit inconsistent as the "Test" section says that the patch applies cleanly when it required a context adjustment
  • The commit message from the patch does not match v3 on LKML

@nirmoy
Collaborator Author

nirmoy commented Apr 20, 2026

Closing — wrong target branch. Replaced by #382 targeting 26.04_linux-nvidia-bos.

@nirmoy nirmoy closed this Apr 20, 2026