[26.04_linux-nvidia-bos] VFIO: Add CXL Type-2 passthrough support for BOS #448

Closed
kobak2026 wants to merge 27 commits into NVIDIA:26.04_linux-nvidia-bos from kobak2026:dgx16138-vfio-cxl-26.04-bos

Conversation

@kobak2026
Collaborator

@kobak2026 kobak2026 commented Jun 2, 2026

Summary

Backport the VFIO CXL Type-2 passthrough stack onto 26.04_linux-nvidia-bos.

This series adds VFIO CXL device detection, CXL UAPI exposure, DVSEC/HDM emulation, DPA and component register region plumbing, reset handling, selftest coverage, and enables CONFIG_VFIO_CXL_CORE for BOS amd64 and arm64.

The BOS port also carries a conservative DPA mmap policy: DPA is exposed as read/write but not mmap-capable until CXL core can prove CPU-readable backing for the CXL HPA. This avoids a reproducible userspace SIGBUS in cxl_type2.dpa_mmap_fault while preserving the rest of VFIO CXL functionality.

Validation

  • Verified CONFIG_VFIO_CXL_CORE=y for:
    • arm64-nvidia-bos-64k
    • amd64-nvidia-bos
  • Built BOS arm64 raw kernel:
    • 7.0.0-2008-nvidia-bos-64k
  • Installed and booted on:
    • vr-nvl72-eba-l10-comp071
  • Confirmed CXL Type-2 endpoints are active:
    • FBSta: Cache+ IO+ Mem+
  • Ran vfio_cxl_type2_test 0002:81:00.0.

Validated selftest cases:

  • cxl_type2.device_info_has_cxl
  • cxl_type2.device_info_cap_cxl_payload
  • cxl_type2.region_info_cxl_capability
  • cxl_type2.component_bar_sparse_mmap
  • cxl_type2.dpa_region_info
  • cxl_type2.comp_regs_region_info
  • cxl_type2.dpa_mmap_fault
  • cxl_type2.comp_regs_no_mmap
  • cxl_type2.comp_reg_mmap_blocked
  • cxl_type2.hdm_cap_read
  • cxl_type2.hdm_ctrl_commit_to_committed
  • cxl_type2.dvsec_lock_semantics

Final result after the DPA mmap guard:

  • pass: 10
  • fail: 0
  • skip: 1
  • error: 0

cxl_type2.dpa_mmap_fault is skipped intentionally because the DPA region no longer advertises VFIO_REGION_INFO_FLAG_MMAP. Earlier testing showed that direct userspace CPU loads from the DPA PFNMAP can SIGBUS even when the CXL HPA is present in the host resource map. The BOS backport therefore exposes DPA as read/write only until CXL core can prove CPU-readable backing for DPA mmap.

Notes

The skipped DPA mmap case is intentional for this BOS backport. The platform advertises an active CXL HPA and VFIO can insert the PFN, but direct userspace CPU load from the DPA mapping can SIGBUS. DPA mmap is therefore withheld until CPU-readable backing can be proven by CXL core.
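
A consumer of this policy (a VMM or the selftest) would check the region flags before trying to map DPA. The following is a minimal userspace sketch under stated assumptions: the flag bit positions mirror VFIO_REGION_INFO_FLAG_READ/WRITE/MMAP from <linux/vfio.h> and are redefined under local names so the snippet compiles standalone, and `enum dpa_access_mode` and `pick_dpa_access()` are hypothetical names that are not part of this series.

```c
#include <stdint.h>

/* Bit values mirror VFIO_REGION_INFO_FLAG_* in <linux/vfio.h>; they are
 * redefined under local names so the sketch compiles standalone. */
#define REGION_FLAG_READ   (1u << 0)
#define REGION_FLAG_WRITE  (1u << 1)
#define REGION_FLAG_MMAP   (1u << 2)

/* Hypothetical result type: how the caller should access the DPA region. */
enum dpa_access_mode {
	DPA_ACCESS_NONE,   /* region is not usable */
	DPA_ACCESS_RW,     /* use pread()/pwrite() on the device fd */
	DPA_ACCESS_MMAP,   /* region may be mmap()ed */
};

/* Prefer mmap only when the kernel advertises it; otherwise fall back
 * to read/write access, which is what the BOS backport exposes for DPA. */
enum dpa_access_mode pick_dpa_access(uint32_t region_flags)
{
	if (region_flags & REGION_FLAG_MMAP)
		return DPA_ACCESS_MMAP;
	if (region_flags & (REGION_FLAG_READ | REGION_FLAG_WRITE))
		return DPA_ACCESS_RW;
	return DPA_ACCESS_NONE;
}
```

With the BOS policy the DPA region reports read and write but not mmap, so a caller written this way takes the read/write path instead of faulting on a mapping.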


LP: https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-bos/+bug/2152222

mmhonap added 5 commits June 2, 2026 21:22
cxl_probe_component_regs() finds the HDM decoder block during device probe
and caches its location, but does not record the decoder count and does
not expose the result outside drivers/cxl/.

vfio-cxl needs the decoder count and the byte offset and size of the HDM
block without re-running the probe sequence. Record decoder_cnt in
rmap->count when parsing the HDM capability in cxl_probe_component_regs(),
extend struct cxl_reg_map with a count member, and add cxl_get_hdm_info()
to return offset, size, and count from the cached map.

Export under the CXL namespace; stub to -EOPNOTSUPP when CONFIG_CXL_BUS
is off.

Co-developed-by: Zhi Wang <zhiw@nvidia.com>

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(backported from commit fd317b8 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[kobak: Added the target-local private drivers/cxl/cxl.h cxl_get_hdm_info() prototype because drivers/cxl/core/pci.c includes the private CXL header in addition to the public include/cxl/cxl.h declaration.]
Signed-off-by: Koba Ko <kobak@nvidia.com>
…ader

vfio-cxl lives outside drivers/cxl/ but still needs to locate the
component register block and fill cxl_component_reg_map. BOS already
has cxl_find_regblock() in include/cxl/pci.h, but
cxl_probe_component_regs() was still private to drivers/cxl/cxl.h.

Declare cxl_probe_component_regs() in include/cxl/pci.h next to the
existing register-block helpers so VFIO CXL can use the parsed component
register map.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(backported from commit e02c1b7 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[jan: Move cxl_probe_component_regs() to include/cxl/pci.h instead of include/cxl/cxl.h to align with existing Srirangan/Alejandro convention; skip cxl_find_regblock() move as it is already in include/cxl/pci.h; add struct cxl_component_reg_map forward declaration]
[kobak: Kept the target's private drivers/cxl/cxl.h declarations while adding the public include/cxl/pci.h header expected by VFIO CXL.]
Signed-off-by: Koba Ko <kobak@nvidia.com>
…xl/cxl_regs.h

VFIO and other code outside the CXL core needs the same offset/mask
constants the core uses for the component register block and HDM
decoders.

Pull them into a new include/uapi/cxl/cxl_regs.h
(GPL-2.0 WITH Linux-syscall-note) and include it from
include/cxl/cxl.h. Use uapi-friendly __GENMASK helpers for masks and _BITUL() for
single-bit flags because UAPI headers cannot depend on kernel-internal BIT().
Section comments in the new file reference CXL spec r4.0 numbering.

For UAPI change, replaced the SZ_64K with actual size as the macro
will not be available for userspace programs.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(backported from commit 52ead24 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[jan: Remove defines from include/cxl/cxl.h instead of drivers/cxl/cxl.h as they were already moved there by Srirangan's SAUCE commit, Add #include <asm/bitsperlong.h> needed by __GENMASK() in uapi header]
Signed-off-by: Koba Ko <kobak@nvidia.com>
…dy wait

Before accessing CXL device memory after reset/power-on, the driver
must ensure media is ready. Not every CXL device implements the CXL
Memory Device register group (many Type-2 devices do not).
cxl_await_media_ready() reads cxlds->regs.memdev. Access to the
memory device registers on a Type-2 device may result in kernel
panic.

Split the HDM DVSEC range-active poll out of cxl_await_media_ready()
into a new function, cxl_await_range_active(). Type-2 devices often
lack the CXLMDEV status register, so they need the range check
without the memdev read. cxl_await_media_ready() now calls
cxl_await_range_active() for the DVSEC poll, then reads the memory
device status as before.

Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(backported from commit 023bae3 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[jan: Add cxl_await_range_active() declaration to include/cxl/pci.h unconditionally instead of include/cxl/cxl.h with CONFIG_CXL_BUS guards, consistent with existing convention]
[kobak: Folded the private drivers/cxl/cxl.h cxl_await_range_active() prototype into this helper commit because drivers/cxl/core/pci.c includes the private CXL header.]
Signed-off-by: Koba Ko <kobak@nvidia.com>
The Register Locator DVSEC (CXL 4.0 8.1.9) describes register blocks
by BAR index (BIR) and offset within the BAR. CXL core currently only
stores the resolved HPA (resource + offset) in struct cxl_register_map,
so callers that need to use pci_iomap() or report the BAR to userspace
must reverse-engineer the BAR from the HPA.

Add bar_index and bar_offset to struct cxl_register_map and fill them
in cxl_decode_regblock() when the regblock is BAR-backed (BIR 0-5).
Add cxl_regblock_get_bar_info() so callers (e.g. vfio-cxl) can get BAR
index and offset directly and use pci_iomap() instead of ioremap(HPA).
Return -EINVAL if the map is not BAR-backed.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(backported from commit 947749b from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[jan: Add cxl_regblock_get_bar_info() declaration to include/cxl/pci.h unconditionally instead of include/cxl/cxl.h with CONFIG_CXL_BUS guards, consistent with existing convention, Add BIR range validation (reject BIR >= PCI_STD_NUM_BARS) plus a bar_index bounds check in cxl_regblock_get_bar_info()]
[kobak: Added the target-local private drivers/cxl/cxl.h cxl_regblock_get_bar_info() prototype; struct cxl_register_map carries bar_index/bar_offset in include/cxl/cxl.h.]
Signed-off-by: Koba Ko <kobak@nvidia.com>
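
For readers following the BIR/offset split described in the commit message above, here is a minimal userspace sketch. It assumes the Register Locator block-entry layout (BIR in bits [2:0] and offset low in bits [31:16] of the low dword, with the high dword carrying the upper 32 bits of the offset). `struct bar_info` and `decode_regblock_bar()` are hypothetical names; this does not reproduce the actual signature of cxl_regblock_get_bar_info().

```c
#include <errno.h>
#include <stdint.h>

/* Assumed field layout of a Register Locator block entry (low dword):
 * bits [2:0] BIR, bits [15:8] block identifier, bits [31:16] offset low. */
#define REGLOC_BIR_MASK   0x00000007u
#define REGLOC_ADDR_MASK  0xffff0000u
#define STD_NUM_BARS      6   /* standard BARs 0-5 */

/* Hypothetical container for the decoded location. */
struct bar_info {
	uint8_t  bar_index;
	uint64_t bar_offset;
};

/* Split a register-block entry into BAR index and offset within that BAR.
 * Returns 0 on success, -EINVAL when the BIR does not name a standard BAR. */
int decode_regblock_bar(uint32_t reg_lo, uint32_t reg_hi, struct bar_info *out)
{
	uint8_t bir = reg_lo & REGLOC_BIR_MASK;

	if (bir >= STD_NUM_BARS)
		return -EINVAL;

	out->bar_index = bir;
	out->bar_offset = ((uint64_t)reg_hi << 32) | (reg_lo & REGLOC_ADDR_MASK);
	return 0;
}
```

Keeping the BAR index and offset separate is what lets a caller use pci_iomap() on the BAR rather than reconstructing it from the resolved HPA.
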
@nirmoy nirmoy added the "help wanted" label Jun 2, 2026
@kobak2026 kobak2026 requested a review from nvmochs June 2, 2026 16:45
@github-actions
Contributor

github-actions Bot commented Jun 2, 2026

PR Validation Report

Patchscan ✅ No Missing Fixes

All cherry-picked commits checked — no missing upstream fixes found.

PR Lint ✅ All checks passed

Details
Checking 27 commits...

Cherry-pick digest:
┌──────────────┬──────────────────────────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local        │ Referenced upstream / Patch subject                              │ Patch-ID   │ Subject │ SoB chain                 │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 895c1ebe8f42 │ [SAUCE] config: enable config_vfio_cxl_core for cxl type-2 passt │ N/A        │ N/A     │ jan, kobak                │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 6210cdba4d52 │ [SAUCE] vfio/cxl: implement vfio_cxl_reset()                     │ N/A        │ N/A     │ mhonap, jan, kobak        │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ fa1a71daf72e │ [SAUCE] vfio/cxl: virtualize dvsec status2 register in vconfig s │ N/A        │ N/A     │ mhonap, jan, kobak        │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ e01a12635b1a │ [SAUCE] vfio/cxl: preserve hdm decoder base addresses across res │ N/A        │ N/A     │ mhonap, jan, kobak        │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ c9f21b2e8bb4 │ [SAUCE] vfio/cxl: ensure pci memory space is enabled before post │ N/A        │ N/A     │ mhonap, jan, kobak        │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ f343cb0e198a │ [SAUCE] vfio/pci: wire cxl dpa reset handling                    │ N/A        │ N/A     │ mhonap, jan, kobak        │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 1d20dc9fe4f4 │ [SAUCE] cxl: export the cxl reset helpers for vfio users         │ N/A        │ N/A     │ mhonap, jan, kobak        │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ f67d68e00f71 │ selftests/vfio: add cxl type-2 vfio assignment test              │ noted      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 9cb1e5fc156d │ docs: vfio-pci: document cxl type-2 device passthrough           │ noted      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 4e091e25b379 │ vfio/cxl: provide opt-out for cxl feature                        │ noted      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 15013024d194 │ vfio/pci: advertise cxl cap and sparse component bar to userspac │ noted      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 06bc89012fbc │ vfio/cxl: register regions with vfio layer                       │ noted      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ ece8f2aed035 │ vfio/cxl: virtualize cxl dvsec config writes                     │ noted      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ bc9c716e7c8c │ vfio/cxl: dpa vfio region with demand fault mmap and reset zap   │ noted      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ c65f8e0f1ada │ vfio/cxl: cxl region management support                          │ noted      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 9a5e95fce78d │ vfio/cxl: wait for hdm ranges and create memdev                  │ match      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ f03e73f99ac7 │ vfio/cxl: introduce hdm decoder register emulation framework     │ noted      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 580aa50e7e59 │ vfio/pci: export config access helpers                           │ match      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ d861172fa218 │ vfio/cxl: detect cxl dvsec and probe hdm block                   │ noted      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 5f26208e06bb │ vfio/pci: add config_vfio_cxl_core and stub cxl hooks            │ match      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 34c597dd1715 │ vfio/pci: add cxl state to vfio_pci_core_device                  │ noted      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ af092226d62f │ vfio: uapi for cxl-capable pci device assignment                 │ match      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ cf3237d4c9f3 │ cxl: record bir and bar offset in cxl_register_map               │ noted      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ c02525adf150 │ cxl: split cxl_await_range_active() from media-ready wait        │ noted      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 58bf39cdf749 │ cxl: move component/hdm register defines to uapi/cxl/cxl_regs.h  │ noted      │ found   │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ f368bb07b2ef │ cxl: declare cxl_probe_component_regs in public header           │ no-match   │ not fou │ ok, backporter: kobak     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 470d1aa9e566 │ cxl: add cxl_get_hdm_info() for hdm decoder metadata             │ noted      │ found   │ ok, backporter: kobak     │
└──────────────┴──────────────────────────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘

Lint: all checks passed.

@nirmoy
Collaborator

nirmoy commented Jun 2, 2026

Boro review

Summary

Reviewed 27 commit(s); 28 finding(s) recorded.

Findings: Critical: 0, High: 0, Medium: 15, Low: 13

Latest watcher review: open review

Kernel deb build: successful (download debs, 4 files)

Head: 895c1ebe8f42

This comment is maintained by nv-pr-bot. It is updated when the GitHub watcher publishes a newer review.


return 0;

failed:
Collaborator

Only one minor finding + a suggested fix:

Note - low severity, no shipping path triggers this today, but worth fixing.

The failed: block mishandles the region/decoder objects on two paths:

  1. Non-precommitted branch (currently dead code: precommitted is always true at
    probe, since capacity==0 bails at regs_failed before the helper is called).
    vfio_cxl_create_cxl_region() transfers ownership of all three objects to cxl
    via no_free_ptr(); if cxl_get_region_range() then fails, failed: only
    unregisters the region and NULLs cxled/cxlrd without cxl_dpa_free() /
    cxl_put_root_decoder() - leaking the DPA allocation and the root-decoder ref.

  2. Precommitted branch (the LIVE path). Here region/cxled are borrowed references
    owned by the cxl core, so they must not be unregistered or freed. But failed:
    calls cxl_unregister_region() unconditionally, so if it were ever reached it
    would tear down a borrowed region. It is not reached today only because
    cxl_get_region_range() fails solely when region->params.res == NULL, which does
    not happen for a firmware-committed region - so the live path is correct by
    luck of the data, not by construction.

Reusing the existing teardown helper fixes both: it already encodes the
borrowed-vs-owned distinction (frees all three only on !precommitted) and is the
canonical region teardown (also called from vfio_pci_cxl_cleanup()). It is defined
just above at line 407.

    failed:
    	vfio_cxl_destroy_cxl_region(cxl);
    	return ret;

This is safe: the failure cleanup in detect_and_init (regs_failed -> clean_virt_regs
-> dev_state_free) does not free these objects, and vdev->cxl is unset on failure so
vfio_pci_cxl_cleanup() early-returns - so destroy_cxl_region() runs exactly once on
the failure path (no double-free). It is also idempotent.

@clsotog
Collaborator

clsotog commented Jun 2, 2026

Some findings from Codex:

  1. drivers/vfio/pci/cxl/vfio_cxl_config.c:24-53 does unaligned 16/32-bit loads and stores into vdev->vconfig by casting u8 * to u16 * / u32 *. DVSEC capabilities are only guaranteed to be 4-byte aligned at the capability base; the register-relative offsets here include 2-byte fields and byte-granular accesses, so dvsec + off is not guaranteed to be naturally aligned for every helper call. On architectures with strict alignment this can trap in the guest config path. The file already includes <linux/unaligned.h> and even uses get_unaligned_le16() later, so these helpers should be converted to the unaligned accessors as well.

Suggested fix: Replace all direct casts on vdev->vconfig with unaligned helpers.
Specifically:
- dvsec_virt_read16() -> get_unaligned_le16(...)
- dvsec_virt_write16() -> put_unaligned_le16(...)
- dvsec_virt_read32() -> get_unaligned_le32(...)
- dvsec_virt_write32() -> put_unaligned_le32(...)
That is the correct kernel-safe pattern and matches the existing include/use of <linux/unaligned.h>.

  2. drivers/vfio/pci/cxl/vfio_cxl_config.c:515-527 drops sub-dword writes to RANGE{1,2}BASE{LOW,HIGH} completely. The dispatcher only updates those virtualized 32-bit fields when count == 4, but VFIO config-space accesses are byte/word/dword and guests are allowed to program PCI config registers with 8- or 16-bit writes. With the current code, any VMM or guest firmware that touches those DVSEC base registers in smaller chunks will observe stale shadow state and the emulated range base will silently diverge from what was written.
    Practical fix:

    • Read current 32-bit shadow with dvsec_virt_read32(vdev, reg_start)
    • Merge incoming 1/2/4-byte payload at byte_in_reg
    • For *_BASE_LOW, apply the reserved-bit mask after merge
    • Write back with dvsec_virt_write32(...)

    That makes the region behave like normal PCI config space, where sub-dword writes are legal.
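
Both findings can be illustrated together in a small userspace sketch. It is not the driver code: `shadow_read32()`, `shadow_write32()`, and `shadow_merge_write()` are hypothetical stand-ins for the dvsec_virt_* helpers, the byte-wise accessors play the role that get_unaligned_le32()/put_unaligned_le32() would play in the kernel, and `writable_mask` is an assumed parameter standing in for the reserved-bit mask of the *_BASE_LOW registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-wise little-endian accessors: safe at any alignment. */
uint32_t shadow_read32(const uint8_t *vconfig, size_t pos)
{
	const uint8_t *p = vconfig + pos;

	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

void shadow_write32(uint8_t *vconfig, size_t pos, uint32_t val)
{
	uint8_t *p = vconfig + pos;

	p[0] = (uint8_t)val;
	p[1] = (uint8_t)(val >> 8);
	p[2] = (uint8_t)(val >> 16);
	p[3] = (uint8_t)(val >> 24);
}

/*
 * Merge a 1-, 2- or 4-byte config write into the 32-bit shadow register at
 * reg_start. byte_in_reg is the offset of the write within the register,
 * data holds the payload in its low bytes, and writable_mask clears
 * reserved bits after the merge (pass 0xffffffff for no masking).
 */
void shadow_merge_write(uint8_t *vconfig, size_t reg_start,
			unsigned int byte_in_reg, unsigned int count,
			uint32_t data, uint32_t writable_mask)
{
	uint32_t field_mask, cur;

	assert(count == 1 || count == 2 || count == 4);
	assert(byte_in_reg + count <= 4);

	/* Mask covering only the bytes touched by this write. */
	field_mask = (count == 4) ? 0xffffffffu : ((1u << (count * 8)) - 1);
	field_mask <<= byte_in_reg * 8;

	cur = shadow_read32(vconfig, reg_start);
	cur = (cur & ~field_mask) | ((data << (byte_in_reg * 8)) & field_mask);
	shadow_write32(vconfig, reg_start, cur & writable_mask);
}
```

A 16-bit write to the upper half of a base register then updates only those two bytes of the shadow, which is the read-modify-write behavior the practical fix describes.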

@nvmochs
Collaborator

nvmochs commented Jun 2, 2026

@kobak2026

Carol and Jamie already covered the primary findings with codex (I found the same during my review). But I had a few more comments about the patch provenance and notes...

General nit: For patches picked from Jiandi’s branch, can you preserve Jiandi’s
existing provenance/trailers as-is, then append your pick/backport trailer,
annotation notes, and Signed-off-by? Also, for patches that are not exact picks,
please use “backported from” instead of “cherry-picked from” (I think I pointed
out the patches below where this is needed). The current back-to-back
provenance tags are hard to follow.


95fee0c NVIDIA: VR: SAUCE: cxl: Export the CXL reset helpers for VFIO users

This backport note looks too broad/inaccurate. The target already had the
PCI save/restore prototypes and CXL reset/cache DVSEC definitions, so this
commit should not add duplicate pci_dev_save_and_disable()/pci_dev_restore()
declarations. The real BOS adaptation appears to be handling the target
cxl_pci_functions_reset_prepare() error-return flow and adding any needed
target-local CXL reset helper prototypes. Please update the note accordingly
and drop the duplicate PCI prototypes.


ab234d6 selftests/vfio: Add CXL Type-2 VFIO assignment test

Nit: Use “backported from” instead of “cherry-picked from”

Please also include an annotation note describing that the test was changed to
treat DPA mmap as optional because the BOS backport intentionally withholds
VFIO_REGION_INFO_FLAG_MMAP for the DPA region.


557da55 NVIDIA: VR: SAUCE: docs: vfio-pci: Document CXL Type-2 device passthrough

Please update this doc for the BOS DPA mmap policy introduced later in the
series. It still says the DPA region advertises READ | WRITE | MMAP and shows
userspace mmaping it unconditionally, but the BOS backport intentionally
withholds VFIO_REGION_INFO_FLAG_MMAP.


8248c3d NVIDIA: VR: SAUCE: vfio/cxl: Register regions with VFIO layer

Nit: Use “backported from” instead of “cherry-picked from”


b1ab4d1 NVIDIA: VR: SAUCE: vfio/cxl: CXL region management support

Nit: Use “backported from” instead of “cherry-picked from”


1414625 NVIDIA: VR: SAUCE: cxl: Record BIR and BAR offset in cxl_register_map

This annotation looks slightly inaccurate:

[kobak: Folded the private drivers/cxl/cxl.h BAR index/offset fields ...]

The BAR index/offset fields are added to struct cxl_register_map in
include/cxl/cxl.h, not drivers/cxl/cxl.h. The private-header delta is only the
cxl_regblock_get_bar_info() prototype. Please update the note.


f307a44 NVIDIA: VR: SAUCE: cxl: Add cxl_get_hdm_info() for HDM decoder metadata

This annotation looks inaccurate:

[kobak: Kept the source branch's public include/cxl/cxl.h representation
because the target base does not carry that header path.]

The target does carry include/cxl/cxl.h. The actual delta appears to be adding
the private drivers/cxl/cxl.h prototype in addition to the public header. Please
clarify or drop this note.

@kobak2026 kobak2026 force-pushed the dgx16138-vfio-cxl-26.04-bos branch from d05d751 to 573f955 Compare June 3, 2026 02:39
@kobak2026
Collaborator Author

Thanks @jamie, @Carol, @nvmochs, and @nirmoy. I reviewed the feedback and folded the accepted fixes into the related commits instead of adding standalone fixups.

Summary of updates:

  • @jamie: accepted. Replaced the open-coded CXL region failure cleanup with vfio_cxl_destroy_cxl_region(), so owned vs borrowed CXL objects follow the existing teardown helper.
  • @Carol / Codex: accepted. Converted vconfig helpers to unaligned accessors and fixed sub-dword writes to DVSEC RANGE base registers by merging byte/word/dword writes into the 32-bit shadow.
  • @nvmochs: accepted the provenance/note/doc comments. Cleaned up backported-vs-cherry-picked wording, corrected inaccurate [kobak] notes, removed duplicate PCI prototype handling, and updated docs for the BOS DPA mmap policy.
  • @nirmoy / Boro: accepted the concrete low-risk wording/comment/doc findings. Rejected only items that match expected internal NVIDIA/BOS patch style, such as internal tags and required provenance trailers.

Validation after the review fixes:

  • git diff --check passed.
  • checkpatch --strict passed on the touched review-fix diffs.
  • Required pre-push whole-kernel arm64 build passed for nvidia-bos.
  • PR branch updated to 573f955.

Functional validation from earlier still stands:

  • Built, installed, and booted 7.0.0-2008-nvidia-bos-64k on vr-nvl72-eba-l10-comp071.
  • CONFIG_VFIO_CXL_CORE=y.
  • CXL Type-2 endpoints active with FBSta: Cache+ IO+ Mem+.
  • vfio_cxl_type2_test result after the DPA mmap guard: pass:10 fail:0 skip:1 error:0.
  • The one skip is cxl_type2.dpa_mmap_fault. This is intentional for the BOS backport because DPA no longer advertises VFIO_REGION_INFO_FLAG_MMAP after reproducing SIGBUS from userspace CPU loads on the DPA PFNMAP path.

@jamieNguyenNVIDIA
Collaborator

The trailers on commits 3f7b52cbd22d and ea0f506d77be look incorrect:

# 3f7b52cbd22d
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Rename vfio_cxl_zap_region_locked to vfio_cxl_prepare_reset and vfio_cxl_reactivate_region to vfio_cxl_finish_reset in docs]

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
...

# ea0f506d77be
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Check HDM COMMITTED bit before activating DPA region on precommitted decoders, add pm_runtime/memory-enabled gate in fault and rw paths, split vfio_cxl_zap_dpa() from prepare_reset(), add DPA zap in vfio_pci_zap_and_down_write_memory_lock(), add hot-reset CXL prepare/finish passes]

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>

Somehow, the "backported from" and Jiandi's backport notes got rearranged above the SOBs.

@jamieNguyenNVIDIA
Collaborator

FYI: This lint error is nit-picky:

E: 5a27d6728ff5 ("NVIDIA: VR: SAUCE: vfio: UAPI for CXL-ca"): diff MISMATCH with lore patch (add [Author: reason] annotation if intentional)

It looks like a typo fix is what triggers it:

lore:  * To find the HDM decoder count, pread the HDM Decoder Capability register
PR:    * To find the HDM decoder count, read  the HDM Decoder Capability register

mmhonap added 13 commits June 3, 2026 11:26
Vendor GPUs and accelerators can expose CXL.mem (HDM-D or HDM-DB)
without using PCI class code 0x0502. VMMs need a stable way to learn
DPA sizing, firmware commit state, and where the extra VFIO regions live.

Add VFIO_DEVICE_FLAGS_CXL (bit 9) and VFIO_DEVICE_INFO_CAP_CXL (cap ID 6).
The capability struct carries:

  hdm_regs_bar_index       PCI BAR containing the component register block
  hdm_regs_offset          byte offset within that BAR to the CXL.mem area
                           (comp_reg_offset + CXL_CM_OFFSET)
  dpa_region_index         VFIO region index for the DPA window
  comp_regs_region_index   VFIO region index for the emulated COMP_REGS

HDM decoder count and the HDM block offset within COMP_REGS are
intentionally absent; both are derivable from the CXL Capability Array at
COMP_REGS offset 0. Locate cap ID 0x5 (HDM) and read bits[31:20] of its
entry for the byte offset. Then read bits[3:0] of the HDM Decoder Capability
register for the count: count = (field == 0) ? 1 : field * 2.
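
The derivation above can be sketched in userspace C as follows. This is an illustration, not code from the series: `find_hdm_offset()` and `hdm_decoder_count()` are hypothetical helper names, the COMP_REGS region is modeled as an array of 32-bit words already read by the caller, and the array-size field in bits [31:24] of the first word and the capability ID in bits [15:0] of each entry are assumptions based on the CXL Capability Array layout.

```c
#include <stddef.h>
#include <stdint.h>

#define CAP_ID_HDM 0x5u

/*
 * Walk the CXL Capability Array at COMP_REGS offset 0 and return the byte
 * offset of the HDM decoder block, or 0 if no HDM entry is present.
 * regs is the COMP_REGS region viewed as 32-bit words, nregs its length.
 */
uint32_t find_hdm_offset(const uint32_t *regs, size_t nregs)
{
	uint32_t array_size;
	size_t i;

	if (nregs == 0)
		return 0;

	array_size = regs[0] >> 24;            /* assumed: bits [31:24] */

	for (i = 1; i <= array_size && i < nregs; i++) {
		if ((regs[i] & 0xffffu) == CAP_ID_HDM)
			return regs[i] >> 20;  /* bits [31:20]: byte offset */
	}
	return 0;
}

/* Decoder count from bits [3:0] of the HDM Decoder Capability register. */
unsigned int hdm_decoder_count(uint32_t hdm_cap)
{
	uint32_t field = hdm_cap & 0xfu;

	return field == 0 ? 1 : field * 2;
}
```

A VMM would pread the COMP_REGS region, call the first helper to locate the HDM block, then pass the first register of that block to the second helper.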

Two flags accompany the capability:

  VFIO_CXL_CAP_FIRMWARE_COMMITTED
    A decoder covering @dpa_size bytes was programmed and committed by
    platform firmware before device open. The VMM can use the DPA region
    immediately without re-committing.

  VFIO_CXL_CAP_CACHE_CAPABLE
    The device is HDM-DB (CXL.mem + CXL.cache). HDM-DB requires a
    Write-Back Invalidation sequence before FLR to flush dirty cache
    lines; HDM-D (CXL.mem only) does not. QEMU uses this flag to
    schedule WBI and to report Back-Invalidation capability accurately
    in the virtual CXL topology. Mirrors the Cache_Capable bit from
    the CXL DVSEC Capability register.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(cherry-picked from commit c0f4d24 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
Signed-off-by: Koba Ko <kobak@nvidia.com>
Add struct vfio_pci_cxl_state and hang a pointer to it off
vfio_pci_core_device.  vdev->cxl stays NULL for non-CXL devices, so
existing vfio-pci-core paths just pay a NULL check.

The new struct embeds struct cxl_dev_state by value (CXL core uses
container_of() against this field) and stores pointers to the
cxl_memdev, root decoder, and endpoint decoder that the CXL core
owns.  cxl_region is not introduced here; it is added later when
region management lands.

The series builds the CXL Type-2 passthrough path inside
vfio-pci-core rather than in a separate variant driver.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(backported from commit 87b80cc from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[jan: Resolve context mismatch in vfio_pci_core.h; add #include <cxl/pci.h> to vfio_cxl_priv.h for cxl_find_regblock/cxl_probe_component_regs declarations]
[kobak: Preserved existing VFIO PCI DMABUF forward declarations while adding the CXL state forward declaration.]
Signed-off-by: Koba Ko <kobak@nvidia.com>
Introduce the Kconfig option CONFIG_VFIO_CXL_CORE and the necessary
build rules to compile CXL.mem passthrough infrastructure for
vendor-specific CXL devices into the vfio-pci-core module.  The new
option depends on VFIO_PCI_CORE, CXL_BUS and CXL_MEM.

Wire up the detection and cleanup entry-point stubs in
vfio_pci_core_register_device() and vfio_pci_core_unregister_device()
so that subsequent patches can fill in the CXL-specific logic without
touching the vfio-pci-core flow again.

The vfio_cxl_core.c file added here is an empty skeleton; the actual
CXL detection and initialisation code is introduced in the following
patch to keep this build-system patch reviewable on its own.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(backported from commit 336a144 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[jan: Resolve context mismatches in Kconfig, Makefile, and vfio_pci_priv.h due to missing upstream xe/dmabuf support in NV-Kernels base]
[kobak: Preserved existing VFIO PCI DMABUF declarations while adding VFIO CXL stubs.]
Signed-off-by: Koba Ko <kobak@nvidia.com>
Detect a vendor-specific CXL device at vfio-pci bind time and probe
its HDM decoder register block.

vfio_cxl_create_device_state() allocates per-device state via devm,
reads the DVSEC length from PCI_DVSEC_HEADER1, and records MEM_CAPABLE
and CACHE_CAPABLE from the CXL DVSEC.

vfio_cxl_setup_regs() locates the component register block, claims and
maps that BAR window, calls cxl_probe_component_regs() to find the HDM
block, then unmaps and releases the window on all paths.

vfio_pci_cxl_detect_and_init() enables PCI memory decoding for the probe,
chains these setup steps, disables the device again, and leaves vdev->cxl
NULL on failure so the device falls back to plain vfio-pci.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(cherry-picked from commit 939ebb7 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[jan: Use pci_get_dsn() instead of pdev->dev.id for cxlds serial; expand comment explaining why]
Signed-off-by: Koba Ko <kobak@nvidia.com>
Promote vfio_raw_config_write() and vfio_raw_config_read() to non-static so
that the CXL DVSEC write handler in the next patch can call them.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(cherry-picked from commit 07d7141 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
Signed-off-by: Koba Ko <kobak@nvidia.com>
… framework

Add HDM decoder register emulation for CXL devices assigned to a guest.

New file vfio_cxl_emu.c allocates comp_reg_virt[] covering the full
component register block (CXL_COMPONENT_REG_BLOCK_SIZE), snapshots it
from MMIO after probe, and registers a VFIO device region
(VFIO_REGION_SUBTYPE_CXL_COMP_REGS) with read/write ops but no mmap,
so every access hits the emulated buffer and write dispatchers.

vfio_cxl_setup_virt_regs() is called from the tail of
vfio_cxl_setup_regs(); vfio_cxl_clean_virt_regs() runs on cleanup.

HDM decoder register defines come from include/uapi/cxl/cxl_regs.h.
Bits with no hardware equivalent stay in vfio_cxl_priv.h.

hdm_decoder_n_ctrl_write() allows the guest to clear the LOCK bit.
A firmware-committed decoder arrives with LOCK=1; the guest driver
must clear it before reprogramming BASE and SIZE with the VM's GPA.
Such a write clears the bit in the shadow while preserving all other
fields.
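
The LOCK-clearing rule can be illustrated with a small userspace sketch. It is not the series' hdm_decoder_n_ctrl_write(): the helper name is hypothetical, the bit position (bit 8, Lock On Commit) is assumed from the HDM Decoder Control layout used by the Linux CXL core, and the handling of writes that keep LOCK set is an assumption made only to give the example a complete shape.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit position: Lock On Commit in HDM Decoder n Control. */
#define HDM_CTRL_LOCK (UINT32_C(1) << 8)

/*
 * Hypothetical helper: apply a guest write to the shadow CTRL value of a
 * locked decoder. Only the "guest clears LOCK" case from the text is
 * modelled; any other write leaves the shadow unchanged in this sketch.
 */
static uint32_t shadow_ctrl_write_locked(uint32_t shadow, uint32_t guest_val)
{
	assert(shadow & HDM_CTRL_LOCK);	/* caller handles the unlocked case */

	if (!(guest_val & HDM_CTRL_LOCK))
		return shadow & ~HDM_CTRL_LOCK;	/* drop LOCK, keep other fields */

	return shadow;
}
```

With a shadow value of 0x533, a guest write of 0 yields 0x433: only bit 8 changes.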

Co-developed-by: Zhi Wang <zhiw@nvidia.com>

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(backported from commit 4ab4955 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[jan: Resolve Makefile context mismatch due to missing upstream dmabuf support in NV-Kernels base, Add CTRL LOCK enforcement in BASE_LO/SIZE_LO writes, BI bit masking for non-cache-capable devices, pass max_size to vfio_cxl_setup_virt_regs() for bounds check, add vfio_pci_cxl_cleanup() in registration error path]
Signed-off-by: Koba Ko <kobak@nvidia.com>
After HDM registers are mapped, call cxl_await_range_active() so we
only proceed when DVSEC ranges report active, avoiding access to the
memdev register group that Type-2 devices may lack.

This wait is required before re-snapshotting component registers: firmware
commits final HDM decoder values such as SIZE_HIGH only after MEM_ACTIVE.
Once cxl_await_range_active() confirms that state, re-read component regs
with vfio_cxl_reinit_comp_regs() so those committed values land in
comp_reg_virt.

Read the committed decoder size from hardware, set capacity via
cxl_set_capacity(), and create the memdev via devm_cxl_add_memdev().

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(cherry-picked from commit 537d8a2 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[jan: Line offset adjustments only (cascading from 0011 changes)]
Signed-off-by: Koba Ko <kobak@nvidia.com>
Region management uses the following APIs provided by the CXL core:

CREATE_REGION flow:
1. Validate request (size, decoder availability)
2. Allocate HPA via cxl_get_hpa_freespace()
3. Allocate DPA via cxl_request_dpa()
4. Create region via cxl_create_region() - commits HDM decoder
5. Get HPA range via cxl_get_region_range()

DESTROY_REGION flow:
1. Detach decoder via cxl_decoder_detach()
2. Free DPA via cxl_dpa_free()
3. Release root decoder via cxl_put_root_decoder()

Use DEFINE_FREE scope helpers so error paths unwind cleanly.
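
The kernel's DEFINE_FREE()/__free() helpers are built on the compiler's cleanup attribute. The userspace sketch below shows the same unwind pattern for the CREATE_REGION flow. All names (struct res, res_get, take, create_region) are hypothetical stand-ins, not CXL core APIs, and the sketch needs GCC or Clang for __attribute__((cleanup)).

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the HPA and DPA allocations; not real CXL core objects. */
struct res { int held; };

static struct res hpa, dpa;
static struct res *region_hpa, *region_dpa;	/* owned by the "region" */
static int releases;

static struct res *res_get(struct res *r) { r->held = 1; return r; }

static void res_put(struct res *r)
{
	assert(r->held);	/* catch double release */
	r->held = 0;
	releases++;
}

/* Cleanup callback: runs when the annotated variable leaves scope. */
static void res_cleanup(struct res **rp)
{
	if (*rp)
		res_put(*rp);
}
#define auto_put __attribute__((cleanup(res_cleanup)))

/* Hand ownership over and disarm the cleanup, like no_free_ptr(). */
static struct res *take(struct res **rp)
{
	struct res *r = *rp;

	*rp = NULL;
	return r;
}

/* Returns 0 on success; on failure everything acquired so far is released. */
static int create_region(int fail_at_commit)
{
	struct res *h auto_put = res_get(&hpa);	/* step 2: allocate HPA */
	struct res *d auto_put = res_get(&dpa);	/* step 3: allocate DPA */

	if (fail_at_commit)			/* step 4 failed */
		return -1;

	region_hpa = take(&h);
	region_dpa = take(&d);
	return 0;
}
```

The error return needs no explicit unwind code: both resources are released automatically, in reverse order of declaration.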

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(backported from commit 799c46d from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[jan: Add borrowed-reference comment for precommitted decoders, init region to NULL, do not unregister precommitted regions in teardown]
[kobak: Restored BOS CXL helper providers/exports and vfio-pci-core CXL namespace import so the region-management backport builds against BOS CXL core.]
Signed-off-by: Koba Ko <kobak@nvidia.com>
…nd reset zap

Wire the CXL DPA range up as a VFIO demand-paged region so QEMU can
mmap guest device memory directly. Faults call vmf_insert_pfn() to
insert one PFN at a time rather than mapping the full range upfront.

CXL region lifecycle:
- The CXL memory region is registered with the VFIO layer during
  vfio_pci_open_device()
- mmap() establishes the VMA with vm_ops but inserts no PTEs
- Each guest page fault calls vfio_cxl_region_page_fault() which
  inserts a single PFN under the memory_lock read side
- On device reset, vfio_cxl_zap_region_locked() sets region_active=false
  and calls unmap_mapping_range() to invalidate all DPA PTEs atomically
  while holding memory_lock for writing
- Faults racing with reset see region_active==false and return
  VM_FAULT_SIGBUS
- vfio_cxl_reactivate_region() restores region_active after successful
  hardware reset

Also integrate the zap/reactivate calls into vfio_pci_ioctl_reset() so
that FLR correctly invalidates DPA mappings and restores them on success.
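
The lifecycle above can be modelled as a small state machine. This is a single-threaded userspace sketch under stated assumptions: the struct and function names are hypothetical, memory_lock is not modelled, and the `mapped` counter merely stands in for installed PTEs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum fault_result { FAULT_MAPPED, FAULT_SIGBUS };

struct dpa_region {
	bool region_active;	/* cleared by reset, restored on success */
	uint64_t base_pfn;	/* first PFN backing the region */
	uint64_t nr_pages;
	uint64_t mapped;	/* stand-in for installed PTEs */
};

/* Fault path: install exactly one PFN, or fail while reset is in flight. */
static enum fault_result region_fault(struct dpa_region *r, uint64_t pgoff,
				      uint64_t *pfn)
{
	assert(pgoff < r->nr_pages);	/* the VMA bounds the offset */

	if (!r->region_active)
		return FAULT_SIGBUS;

	*pfn = r->base_pfn + pgoff;
	r->mapped++;
	return FAULT_MAPPED;
}

/* Reset entry: mark the region inactive and drop every mapping. */
static void region_zap(struct dpa_region *r)
{
	r->region_active = false;
	r->mapped = 0;
}

/* Reset success: allow faults to repopulate mappings on demand. */
static void region_reactivate(struct dpa_region *r)
{
	r->region_active = true;
}
```

Reactivation does not restore any mapping by itself; mappings reappear one page at a time as the guest faults them in again.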

Co-developed-by: Zhi Wang <zhiw@nvidia.com>

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(backported from commit f5e4191 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[jan: Resolve context mismatches in vfio_pci_core.c and vfio_pci_priv.h due to missing upstream dmabuf support in NV-Kernels base, Add vdev back-pointer in cxl_state, hold memory_lock read-side in fault/rw paths, advance *ppos in region rw, add vfio_direct_config_read export and use it instead of vfio_raw_config_read in DVSEC fallback]
[kobak: Preserved existing VFIO PCI DMABUF reset movement while adding CXL DPA zap/reactivation.]
Signed-off-by: Koba Ko <kobak@nvidia.com>
CXL devices expose DVSEC registers in PCI configuration space.  Several
of them affect device behavior (CXL.io/CXL.mem/CXL.cache enables, lock
state, range bases) and must be virtualised so the guest cannot disturb
host-owned policy.

Add CXL-aware read and write handlers that operate on vdev->vconfig:

  - DVSEC reads come back from the vconfig shadow that vfio_config_init()
    already populates via vfio_ecap_init().
  - DVSEC writes go through per-register handlers (cxl_dvsec_*_write)
    which apply the spec-defined reserved-bit and lock-bit masking
    before updating the shadow.
  - The handlers are wired in via vdev->dvsec_readfn / dvsec_writefn,
    which the global ecap_perms[PCI_EXT_CAP_ID_DVSEC] dispatcher routes
    to when the device is a CXL device.  Non-CXL devices with a DVSEC
    capability fall through to direct hardware access.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(backported from commit 3ff6c19 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[jan: Resolve context mismatches in Makefile and vfio_pci_core.h due to missing upstream dmabuf/p2pdma forward declarations in NV-Kernels base, Carry Disable_Caching into Cache WBI hardware write, use vfio_direct_config_read fallback, add byte-aligned read/write routing for DVSEC registers, handle partial-byte W1C writes for STATUS/STATUS2, add PM_INIT_COMPLETION RW1CS handling]
Signed-off-by: Koba Ko <kobak@nvidia.com>
Register the DPA and component register regions with the VFIO layer.
The region indices of both are cached for quick lookup.

vfio_cxl_register_cxl_region()
- Maps the region HPA with memremap(WB) (CXL.mem is treated as RAM, not MMIO)
- Registers VFIO_REGION_SUBTYPE_CXL
- Records dpa_region_idx.

vfio_cxl_register_comp_regs_region()
- Registers VFIO_REGION_SUBTYPE_CXL_COMP_REGS with size
  hdm_reg_offset + hdm_reg_size
- Records comp_reg_region_idx.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(backported from commit 6e2d9e5 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[jan: Check HDM COMMITTED bit before activating DPA region on precommitted decoders, add pm_runtime/memory-enabled gate in fault and rw paths, split vfio_cxl_zap_dpa() from prepare_reset(), add DPA zap in vfio_pci_zap_and_down_write_memory_lock(), add hot-reset CXL prepare/finish passes]
[kobak: Withheld DPA mmap advertisement on BOS until CPU-readable backing for CXL DPA PFNMAP can be proven; DPA fd read/write remains advertised.]
Signed-off-by: Koba Ko <kobak@nvidia.com>
…AR to userspace

Expose CXL device capability through the VFIO device info ioctl and give
userspace mmap access to the GPU/accelerator register windows in the
component BAR while keeping the CXL component register block off-limits
to user mappings.

vfio_cxl_get_info() fills VFIO_DEVICE_INFO_CAP_CXL with the HDM register
BAR index and byte offset, commit flags, and VFIO region indices for the
DPA and COMP_REGS regions.  HDM decoder count and the HDM block offset
within COMP_REGS are not populated; both are derivable from the CXL
Capability Array in the COMP_REGS region itself.

vfio_cxl_get_region_info() handles VFIO_DEVICE_GET_REGION_INFO for the
component register BAR.  It builds a sparse-mmap capability that
advertises only the GPU/accelerator register windows, carving out the
CXL component register block.  Three physical layouts are handled:

  Topology A  comp block at BAR end:    one area [0, comp_reg_offset)
  Topology B  comp block at BAR start:  one area [comp_end, bar_len)
  Topology C  comp block in the middle: two areas, one on each side
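
A minimal sketch of how the three layouts reduce to at most two mmap-able windows. The struct and function names are hypothetical and this is not the code in vfio_cxl_get_region_info(); the zero-area result for a block spanning the whole BAR is an assumption about how that case would be represented.

```c
#include <assert.h>
#include <stdint.h>

struct sparse_area { uint64_t offset, size; };

/*
 * Fill 'areas' (room for two) with the mmap-able windows of a BAR of length
 * bar_len whose component register block occupies
 * [comp_off, comp_off + comp_size). Returns the number of areas: 0, 1 or 2.
 */
static int sparse_areas(uint64_t bar_len, uint64_t comp_off,
			uint64_t comp_size, struct sparse_area areas[2])
{
	uint64_t comp_end = comp_off + comp_size;
	int n = 0;

	/* The block must be non-empty, must not wrap, and must fit the BAR. */
	assert(comp_size > 0 && comp_end > comp_off && comp_end <= bar_len);

	if (comp_off > 0)		/* window below the block: A or C */
		areas[n++] = (struct sparse_area){ 0, comp_off };
	if (comp_end < bar_len)		/* window above the block: B or C */
		areas[n++] = (struct sparse_area){ comp_end, bar_len - comp_end };

	return n;	/* 0 means the block spans the whole BAR */
}
```

For a 1 MiB BAR with a 64 KiB block at offset 0x40000, this yields [0, 0x40000) and [0x50000, 0x100000).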

vfio_cxl_mmap_overlaps_comp_regs() checks whether an mmap request overlaps
[comp_reg_offset, comp_reg_offset + comp_reg_size).  vfio_pci_core_mmap()
calls it to reject mmap of the component register block while allowing
mmap of the GPU register windows in the sparse capability.  This replaces
the earlier blanket rejection of any mmap on the component BAR index.

vfio_pci_bar_rw() applies the same overlap check, so fd pread()/pwrite()
on the component BAR is also rejected when it would touch the component
register subrange.  All access to those registers goes through the
dedicated COMP_REGS region, where the emulated HDM shadow lives.
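
The overlap test shared by the mmap and read/write paths is a half-open interval intersection. The sketch below uses a hypothetical function name and signature; the real vfio_cxl_mmap_overlaps_comp_regs() takes different arguments, and treating zero-length requests as non-overlapping is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if [req_off, req_off + req_len) intersects
 * [comp_off, comp_off + comp_size).
 */
static bool overlaps_comp_regs(uint64_t req_off, uint64_t req_len,
			       uint64_t comp_off, uint64_t comp_size)
{
	/* This sketch assumes neither range wraps around 2^64. */
	assert(req_off + req_len >= req_off);
	assert(comp_off + comp_size >= comp_off);

	if (req_len == 0 || comp_size == 0)
		return false;

	return req_off < comp_off + comp_size && comp_off < req_off + req_len;
}
```

A request that ends exactly at comp_off, or starts exactly at the end of the block, does not overlap and would be allowed.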

Hook both helpers into vfio_pci_ioctl_get_info() and
vfio_pci_ioctl_get_region_info() in vfio_pci_core.c.

The component BAR cannot be claimed exclusively since the CXL subsystem
holds persistent sub-range iomem claims during HDM decoder setup.
pci_request_selected_regions() returns EBUSY; pass bars=0 to skip the
request and map directly via pci_iomap().  Physical ownership is assured
by driver binding.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(cherry-picked from commit 9cd9248 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[jan: Add BAR bounds check for component block, handle full-BAR component reg case, add bar_mmap_supported gate, block BAR fd read/write and ioeventfd in component reg subrange]
Signed-off-by: Koba Ko <kobak@nvidia.com>
Provide an opt-out mechanism for CXL support in the vfio module,
available at both build time and module load time.

At build time, CONFIG_VFIO_CXL_CORE enables or disables CXL support in
the vfio-pci module.

At run time, set the module parameter disable_cxl. The bare vfio-pci
driver copies that parameter into the per-device core state before
registration. Variant drivers own their probe policy and must set
vdev->disable_cxl explicitly before registering the core device.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(backported from commit 595c1ad from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[jan: Resolve context mismatch in vfio_pci.c probe function due to missing upstream pci_ops assignment in NV-Kernels base, Wrap disable_cxl field in #if IS_ENABLED(CONFIG_VFIO_CXL_CORE), update MODULE_PARM_DESC wording]
[kobak: Preserved existing vfio-pci pci_ops assignment while wiring the CXL opt-out parameter.]
Signed-off-by: Koba Ko <kobak@nvidia.com>
mmhonap and others added 9 commits June 3, 2026 11:26
…ough

Add Documentation/driver-api/vfio-pci-cxl.rst describing the architecture,
VFIO interfaces, and operational constraints for CXL Type-2 (cache-coherent
accelerator) passthrough via vfio-pci-core, and link it from the driver-api
index.

The document covers:
- VFIO_DEVICE_FLAGS_CXL and VFIO_DEVICE_INFO_CAP_CXL: what the capability
  struct contains and what the FIRMWARE_COMMITTED and CACHE_CAPABLE flags mean
- How to derive hdm_decoder_offset and hdm_count from the COMP_REGS region
  by traversing the CXL Capability Array to find cap ID 0x5 and reading the
  HDM Decoder Capability register
- Topology-aware sparse mmap on the component BAR (topologies A, B, C
  covering comp block at end, start, or middle of the BAR)
- Two extra VFIO device regions: COMP_REGS for the emulated HDM register
  state and the DPA memory window
- DVSEC config write virtualization: what the guest sees vs. hardware
- FLR coordination: DPA PTEs zapped before reset, restored after
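
The capability-array walk mentioned above can be sketched in userspace as follows. Several details are assumptions to check against the CXL specification and the document itself: the header layout (ID in bits 15:0, Array_Size in bits 31:24, entry pointer in bits 31:20), the decoder-count encoding (modelled on the Linux CXL core's cxl_hdm_decoder_count()), and the `cm_base` parameter, since the offset of the capability array inside the COMP_REGS region is not stated here. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_ID_ARRAY	0x1	/* assumed: CXL Capability Header ID */
#define CAP_ID_HDM	0x5	/* HDM Decoder capability ID, as in the text */

/* Read a little-endian 32-bit register from a byte snapshot. */
static uint32_t rd32(const uint8_t *buf, size_t off)
{
	return (uint32_t)buf[off] | (uint32_t)buf[off + 1] << 8 |
	       (uint32_t)buf[off + 2] << 16 | (uint32_t)buf[off + 3] << 24;
}

/*
 * Walk the capability array that starts at cm_base inside 'regs'. On success
 * store the HDM block offset (relative to cm_base) and the decoder count and
 * return 0; return -1 if the array or the HDM entry is missing or malformed.
 */
static int find_hdm(const uint8_t *regs, size_t len, size_t cm_base,
		    uint32_t *hdm_off, int *hdm_count)
{
	uint32_t hdr, entry, ptr, field;
	unsigned int i, nr;

	assert(cm_base + 4 <= len);

	hdr = rd32(regs, cm_base);
	if ((hdr & 0xffff) != CAP_ID_ARRAY)
		return -1;
	nr = hdr >> 24;					/* Array_Size */

	for (i = 1; i <= nr && cm_base + 4 * i + 4 <= len; i++) {
		entry = rd32(regs, cm_base + 4 * i);
		if ((entry & 0xffff) != CAP_ID_HDM)
			continue;

		ptr = entry >> 20;			/* capability pointer */
		if (cm_base + ptr + 4 > len)
			return -1;

		field = rd32(regs, cm_base + ptr) & 0xf;	/* Decoder Count */
		if (field > 12)
			return -1;			/* reserved encoding */

		*hdm_off = ptr;
		if (field == 0)
			*hdm_count = 1;
		else if (field <= 8)
			*hdm_count = (int)field * 2;
		else
			*hdm_count = ((int)field - 4) * 4;
		return 0;
	}
	return -1;
}
```

In practice `regs` would be filled by reading the COMP_REGS region through the VFIO device fd; the sketch only parses a snapshot and performs no I/O.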

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
(backported from commit 696f0b1 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[jan: Rename vfio_cxl_zap_region_locked to vfio_cxl_prepare_reset and vfio_cxl_reactivate_region to vfio_cxl_finish_reset in docs]
[kobak: Document BOS DPA policy as READ|WRITE without MMAP while preserving fd read/write support.]
Signed-off-by: Koba Ko <kobak@nvidia.com>
Add vfio_cxl_type2_test and build it from the vfio selftest Makefile. The
binary expects a PCI BDF (argv or VFIO_SELFTESTS_BDF) with the device
already on vfio-pci and CONFIG_VFIO_CXL_CORE enabled.

It exercises:
- VFIO_DEVICE_GET_INFO,
- GET_REGION_INFO,
- VFIO_DEVICE_INFO_CAP_CXL capability list,
- sparse component-BAR vs DPA/COMP_REG regions,
- HDM decoder emulation (masks, commit, lock),
- DVSEC-backed config where the driver exposes it.

Large region read/write loops and FLR-heavy test cases are still
pending and will be revisited in the next version of the patches.

vfio_pci_device_setup() skips auto-mmap for BARs that carry
sparse-mmap capabilities; those require the caller to mmap only the
windows advertised by the capability.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/r/20260401143917.108413-21-mhonap@nvidia.com)
[kobak: Treat DPA mmap as optional because BOS intentionally withholds VFIO_REGION_INFO_FLAG_MMAP for the DPA region.]
Signed-off-by: Koba Ko <kobak@nvidia.com>
Export CXL reset helper entry points for VFIO CXL users so vfio-pci can
coordinate CXL reset and memory/cache state safely.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from commit 2d40efb from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[kobak: Kept the BOS CXL core tail and placed the exported reset helpers after cxl_port_get_possible_dports().]
[kobak: Adapted to the BOS cxl_pci_functions_reset_prepare() error-return flow and added the target-local CXL reset helper prototypes required by public and private CXL headers.]
Signed-off-by: Koba Ko <kobak@nvidia.com>
Wire the VFIO CXL reset prepare/finish paths into VFIO PCI reset flows so
DPA mappings are zapped before reset and restored after successful reset.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from commit 0bd9c4c from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[kobak: Preserved existing VFIO PCI DMABUF reset movement while adding CXL reset prepare/finish handling.]
Signed-off-by: Koba Ko <kobak@nvidia.com>
…e post-reset BAR access

A reset caller may disable Memory Space to quiesce device DMA before
issuing the reset. The reset path saves and restores PCI_COMMAND via
pci_dev_save_and_disable() and pci_dev_restore(). If Memory Space was
disabled before FLR, it will be restored in the disabled state.

vfio_cxl_finish_reset() reads HDM decoder registers through the
component register BAR immediately after reset. Accessing a BAR with
Memory Space disabled produces an Unsupported Request completion; on
platforms that promote UR to a fatal error this triggers DPC.

Add vfio_cxl_enable_memory_space() and call it at the start of
vfio_cxl_finish_reset() before touching any BAR.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from commit 5071d3b from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
Signed-off-by: Koba Ko <kobak@nvidia.com>
…ss reset

After FLR, reinit_comp_regs() re-reads HDM decoder registers from
hardware into comp_reg_virt[].  Hardware is not all-zeros at this
point: pci_dev_restore() ran first and re-committed the pre-reset
host-physical decoder bases into the registers.  reinit_comp_regs()
therefore overwrites the emulated guest-physical bases that the device
manager programmed with the host-physical bases used by the host CXL
core.  The kernel provides no notification that BASE was overwritten,
so the emulated GPA bases are silently lost.

The same issue affects the CTRL LOCK bit: FLR clears it in hardware
and pci_dev_restore() does not re-apply it, so a decoder that the
guest had locked re-emerges from reset with LOCK clear in shadow.

Add vfio_cxl_reinit_hdm_shadow() which snapshots BASE_LOW, BASE_HIGH,
and the CTRL LOCK bit from the shadow before calling
reinit_comp_regs(), then writes them back after, keeping the emulated
decoder consistent with what the guest programmed.
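
A minimal sketch of the snapshot-and-restore step for one decoder. The struct layout and function name are hypothetical, the LOCK bit position (bit 8) is assumed from the HDM Decoder Control layout used by the Linux CXL core, and the struct copy only stands in for reinit_comp_regs().

```c
#include <assert.h>
#include <stdint.h>

#define HDM_CTRL_LOCK (UINT32_C(1) << 8)	/* assumed: Lock On Commit bit */

struct hdm_decoder_regs {
	uint32_t base_lo, base_hi, size_lo, size_hi, ctrl;
};

/*
 * Refresh one shadow decoder from a fresh hardware read while keeping the
 * guest-programmed BASE and the guest-visible LOCK bit.
 */
static void reinit_decoder_shadow(struct hdm_decoder_regs *shadow,
				  const struct hdm_decoder_regs *hw)
{
	uint32_t base_lo = shadow->base_lo;
	uint32_t base_hi = shadow->base_hi;
	uint32_t lock = shadow->ctrl & HDM_CTRL_LOCK;

	assert(shadow != hw);

	*shadow = *hw;			/* stands in for reinit_comp_regs() */
	shadow->base_lo = base_lo;	/* restore the guest-physical base */
	shadow->base_hi = base_hi;
	shadow->ctrl = (shadow->ctrl & ~HDM_CTRL_LOCK) | lock;
}
```

Fields other than BASE and LOCK, such as SIZE, still take the post-reset hardware values.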

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(cherry-picked from commit 9e0e291 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
Signed-off-by: Koba Ko <kobak@nvidia.com>
…nfig shadow

STATUS2 was read directly from hardware while all other DVSEC registers
were served from the vconfig shadow. This created two problems:

1. VOLATILE_HDM_PRES_ERROR (RW1CS, bit 3): guest writes cleared the
   hardware bit but the shadow was not updated, so subsequent reads still
   returned the set bit from hardware (which the hardware had cleared).

2. CXL_RESET_COMPLETE and CXL_RESET_ERROR (bits 1-2): these outcome bits
   will be written by vfio_cxl_reset() into the shadow after a protocol
   reset. Hardware does not update them on its own; serving reads from
   hardware would hide the outcome from the guest.

Add STATUS2 to the read switch so reads come from the shadow, and update
cxl_dvsec_status2_write() to mirror VOLATILE_HDM_PRES_ERROR clears into
the shadow after forwarding to hardware.
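
The mirrored write-1-to-clear behaviour can be sketched as below. The bit position is the one given above; the struct and function names are hypothetical, and the `hw` field merely models the forwarded config write. Only VOLATILE_HDM_PRES_ERROR is treated as clearable here, which is an assumption of this sketch rather than a statement about every STATUS2 bit.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 3 (RW1CS), as given in the commit message. */
#define STATUS2_VOLATILE_HDM_PRES_ERR	(UINT16_C(1) << 3)

struct status2 {
	uint16_t hw;		/* models the hardware register */
	uint16_t shadow;	/* models the vconfig shadow */
};

/* Guest write: forward the clear to hardware, then mirror it in the shadow. */
static void status2_write(struct status2 *s, uint16_t guest_val)
{
	uint16_t clr = guest_val & STATUS2_VOLATILE_HDM_PRES_ERR;

	assert(s);
	s->hw &= (uint16_t)~clr;	/* forwarded config write */
	s->shadow &= (uint16_t)~clr;	/* keep later reads consistent */
}

/* Guest read: always served from the shadow. */
static uint16_t status2_read(const struct status2 *s)
{
	return s->shadow;
}
```

Because reads come from the shadow, an outcome bit stamped there after a protocol reset stays visible to the guest even though hardware never sets it.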

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from commit 14fbdcb from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
Signed-off-by: Koba Ko <kobak@nvidia.com>
Add vfio_cxl_reset() to drive a CXL protocol reset on behalf of a guest.

Unlike cxl_do_reset(), this path skips host memory offlining since the
DPA region is guest memory.  The function takes memory_lock for the full
sequence, calls vfio_cxl_prepare_reset() to zap DPA region PTEs, drives
the hardware via cxl_dev_reset_locked(), which performs
pci_dev_save_and_disable(), cxl_dev_reset(), sibling CXL.cachemem
coordination, and pci_dev_restore() under the CXL reset mutex, then calls
vfio_cxl_finish_reset() to reinitialise emulated state.

STATUS2 outcome bits (CXL_RESET_COMPLETE / CXL_RESET_ERROR) are written
back to vconfig after the reset so the guest can poll for the result
without reading hardware.  cxl_save_dvsec() / cxl_restore_dvsec() cover
CTRL, CTRL2, range_base_*, and LOCK; STATUS2 is not saved or restored
across the reset, so the hardware value is re-read after restore (it
will have both outcome bits clear) and the outcome is stamped on top.

When the guest writes INIT_CXL_RST into DVSEC CONTROL2, invoke
vfio_cxl_reset() to perform a CXL protocol reset.  The bit is not
forwarded to hardware; cxl_dev_reset() drives the reset sequence
directly.  Silently drop writes on devices that do not advertise
RST_CAPABLE to avoid log noise for the reserved-bit case.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
(cherry-picked from commit 67c66e7 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
Signed-off-by: Koba Ko <kobak@nvidia.com>
… passthrough

Enable VFIO CXL core support on amd64 and arm64 to allow CXL Type-2
device passthrough via vfio-pci.

Signed-off-by: Jiandi An <jan@nvidia.com>
(backported from commit 74b6b99 from https://github.com/JiandiAnNVIDIA/NV-Kernels.git cxl-vfio_2026-04-23)
[kobak: Applied the equivalent annotation to debian.master/config/annotations because the 7.0 HWE target does not carry debian.nvidia-6.17/config/annotations.]
Signed-off-by: Koba Ko <kobak@nvidia.com>
@kobak2026 kobak2026 force-pushed the dgx16138-vfio-cxl-26.04-bos branch from 573f955 to 895c1eb Compare June 3, 2026 04:07
@kobak2026
Collaborator Author

@jamieNguyenNVIDIA thanks, fixed.

@jamieNguyenNVIDIA
Collaborator

Acked-by: Jamie Nguyen <jamien@nvidia.com>

@nvmochs
Collaborator

nvmochs commented Jun 3, 2026

No further issues from me.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>


@kobak2026 One side note...

In the future, when picking / backporting commits, please preserve trailers/provenance ordering as-is.

Here is an example from this PR where this could have been improved...

The clean trailer flow should be:

  <original author trailers>
  (backported from lore...)
  [jan: Jiandi’s annotation...]
  Signed-off-by: Jiandi An <jan@nvidia.com>
  (backported/cherry-picked from commit <Jiandi commit> ...)
  [kobak: Koba’s annotation...]
  Signed-off-by: Koba Ko <kobak@nvidia.com>

Instead, in the updated PR, Jiandi’s Signed-off-by was moved above the lore backport tag and [jan:] annotation, e.g. current 58bf39c has:

  Signed-off-by: Manish Honap <mhonap@nvidia.com>
  Signed-off-by: Jiandi An <jan@nvidia.com>
  (backported from https://lore...)
  (backported from commit 52ead...)
  [jan: ...]
  Signed-off-by: Koba Ko <kobak@nvidia.com>

But Jiandi’s source commit had:

  Signed-off-by: Manish Honap <mhonap@nvidia.com>
  (backported from https://lore...)
  [jan: ...]
  Signed-off-by: Jiandi An <jan@nvidia.com>

@nirmoy nirmoy added has_2_acks and removed help wanted Extra attention is needed has_1_ack labels Jun 3, 2026
@clsotog
Collaborator

clsotog commented Jun 3, 2026

Acked-by: Carol L Soto <csoto@nvidia.com>

@nvmochs nvmochs changed the title VFIO: Add CXL Type-2 passthrough support for BOS [26.04_linux-nvidia-bos] VFIO: Add CXL Type-2 passthrough support for BOS Jun 3, 2026
@nvmochs
Collaborator

nvmochs commented Jun 3, 2026

Merged, closing PR.

ed07d769390c (nresolute/nvidia-bos-next) NVIDIA: VR: SAUCE: config: Enable CONFIG_VFIO_CXL_CORE for CXL Type-2 passthrough
07fa68eb2575 NVIDIA: VR: SAUCE: vfio/cxl: Implement vfio_cxl_reset()
e67d4b5ee2f8 NVIDIA: VR: SAUCE: vfio/cxl: virtualize DVSEC STATUS2 register in vconfig shadow
2f2b852af8c7 NVIDIA: VR: SAUCE: vfio/cxl: preserve HDM decoder base addresses across reset
1c13c68be7eb NVIDIA: VR: SAUCE: vfio/cxl: Ensure PCI Memory Space is enabled before post-reset BAR access
501d8e0558ee NVIDIA: VR: SAUCE: vfio/pci: Wire CXL DPA reset handling
2f0be2ad310d NVIDIA: VR: SAUCE: cxl: Export the CXL reset helpers for VFIO users
0c309a50cac0 selftests/vfio: Add CXL Type-2 VFIO assignment test
0afa25344b76 NVIDIA: VR: SAUCE: docs: vfio-pci: Document CXL Type-2 device passthrough
bcf5e2450012 NVIDIA: VR: SAUCE: vfio/cxl: Provide opt-out for CXL feature
ad41104242ed NVIDIA: VR: SAUCE: vfio/pci: Advertise CXL cap and sparse component BAR to userspace
27b91da534d2 NVIDIA: VR: SAUCE: vfio/cxl: Register regions with VFIO layer
57793778b0f2 NVIDIA: VR: SAUCE: vfio/cxl: Virtualize CXL DVSEC config writes
45487704ca6f NVIDIA: VR: SAUCE: vfio/cxl: DPA VFIO region with demand fault mmap and reset zap
4db36bb59e25 NVIDIA: VR: SAUCE: vfio/cxl: CXL region management support
b49ee763c7e1 NVIDIA: VR: SAUCE: vfio/cxl: Wait for HDM ranges and create memdev
0271cc9cd058 NVIDIA: VR: SAUCE: vfio/cxl: Introduce HDM decoder register emulation framework
572bbbf98242 NVIDIA: VR: SAUCE: vfio/pci: Export config access helpers
6fb1bbed2686 NVIDIA: VR: SAUCE: vfio/cxl: Detect CXL DVSEC and probe HDM block
043fbd60f4ee NVIDIA: VR: SAUCE: vfio/pci: Add CONFIG_VFIO_CXL_CORE and stub CXL hooks
f9eefcda2091 NVIDIA: VR: SAUCE: vfio/pci: Add CXL state to vfio_pci_core_device
fa037ef5a580 NVIDIA: VR: SAUCE: vfio: UAPI for CXL-capable PCI device assignment
454103096668 NVIDIA: VR: SAUCE: cxl: Record BIR and BAR offset in cxl_register_map
7ff0c7b6bdf4 NVIDIA: VR: SAUCE: cxl: Split cxl_await_range_active() from media-ready wait
b5855acb0ef5 NVIDIA: VR: SAUCE: cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
8814f7ee4c63 NVIDIA: VR: SAUCE: cxl: Declare cxl_probe_component_regs in public header
b45c19779a2f NVIDIA: VR: SAUCE: cxl: Add cxl_get_hdm_info() for HDM decoder metadata

@nvmochs nvmochs closed this Jun 3, 2026

7 participants