feat(core): Program.compile(cache=...) convenience wrapper

cpcloud · cpcloud · commit e08e89ed5d7c · 2026-04-29T08:17:36.000-04:00
Adds a ``cache=`` keyword to :meth:`cuda.core.Program.compile` that
threads the persistent cache machinery into the high-level compile
path. With ``cache=None`` (the default) the call goes straight to the
existing uncached compile path: no key derivation, no extra import, no
behavior change.

When a cache is provided, the wrapper derives a key via
:func:`~cuda.core.utils.make_program_cache_key` from the program's
source, code type, options, and target type, and looks it up with
``cache.get(key)``. On a hit it returns a fresh
``ObjectCode._init(hit_bytes, target_type, name=self._options.name)``
without recompiling; on a miss it runs the underlying compile, stores
``cache[key] = compiled`` (the cache extracts ``bytes(obj.code)``),
and returns the compiled ``ObjectCode``.
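
The hit/miss flow above can be modelled in plain Python. This is a minimal sketch, not the cuda.core implementation: ``FakeObjectCode``, ``BytesCache``, ``derive_key``, and ``compile_with_cache`` are hypothetical stand-ins for ``ObjectCode``, a ``ProgramCacheResource``, ``make_program_cache_key``, and ``Program.compile``, and options are modelled as a plain dict.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeObjectCode:
    """Hypothetical stand-in for cuda.core.ObjectCode: raw bytes plus metadata."""
    code: bytes
    target_type: str
    name: Optional[str] = None


class BytesCache:
    """Hypothetical duck-typed cache; the wrapper only uses get() and item assignment."""

    def __init__(self):
        self._store = {}

    def get(self, key, default=None):
        return self._store.get(key, default)

    def __setitem__(self, key, compiled):
        # Keep only the raw binary, as the commit describes for real caches.
        self._store[key] = bytes(compiled.code)


def derive_key(code, code_type, options, target_type):
    """Hypothetical stand-in for make_program_cache_key: hash every keyed input."""
    h = hashlib.sha256()
    for part in (code_type, repr(sorted(options.items())), target_type):
        h.update(part.encode())
        h.update(b"\0")  # separator between fields
    h.update(code.encode() if isinstance(code, str) else code)
    return h.hexdigest()


def compile_with_cache(code, code_type, options, target_type, compile_uncached, cache=None):
    if cache is None:
        # Default path: no key derivation, just the uncached compile.
        return compile_uncached(code, target_type)
    key = derive_key(code, code_type, options, target_type)
    hit = cache.get(key)
    if hit is not None:
        # Hit: wrap the stored bytes in a fresh object; nothing is recompiled.
        return FakeObjectCode(hit, target_type, name=options.get("name"))
    compiled = compile_uncached(code, target_type)
    cache[key] = compiled  # the cache extracts bytes(compiled.code)
    return compiled
```

Because the wrapper only calls ``get`` and item assignment, any object with those two methods can act as the cache in this model.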

Two guards, both checked before anything is compiled, close obvious footguns:

* A non-empty ``name_expressions`` combined with ``cache=`` raises
  ``ValueError``. NVRTC populates ``ObjectCode.symbol_mapping`` from
  name-expression mangling at compile time, and that mapping isn't
  carried in the binary the cache stores. Without this guard the
  first call (a miss) would return an ObjectCode with the mapping
  populated, while every later call (a hit) would return one without
  it, silently breaking ``get_kernel(name_expression)`` lookups that
  work on the uncached path. Compiles that need ``name_expressions``
  should run without ``cache=``, or look up mangled symbols by hand
  in the cached ``ObjectCode``.

* Options whose effect on compilation isn't captured by the key
  (``include_path``, ``pre_include``, ``pch``, ``use_pch``,
  ``pch_dir``, NVVM ``use_libdevice=True``, NVRTC ``options.name``
  with a directory component) and side-effect options
  (``create_pch``, ``time``, ``fdevice_time_trace``) make
  ``make_program_cache_key`` raise ``ValueError``, which the wrapper
  propagates unchanged. Callers that depend on external inputs should
  call ``make_program_cache_key`` directly with an ``extra_digest``
  covering the external content.
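
A toy version of both guards, plus one way to build an ``extra_digest`` for the include-file case, might look like this. ``check_cacheable`` and ``digest_files`` are hypothetical helpers; the option names come from the list above, treating side-effect options as never cacheable is an assumption, and the type the real ``extra_digest`` parameter expects is not shown in this commit.

```python
import hashlib
import os

# Option names taken from the commit message; the grouping is an assumption.
NEEDS_EXTRA_DIGEST = ("include_path", "pre_include", "pch", "use_pch", "pch_dir")
SIDE_EFFECT = ("create_pch", "time", "fdevice_time_trace")


def check_cacheable(options, name_expressions=(), extra_digest=None):
    """Reject inputs a content-derived key cannot describe (toy version)."""
    if name_expressions:
        # symbol_mapping would not survive a cache round-trip.
        raise ValueError("cache= does not support name_expressions")
    for opt in SIDE_EFFECT:
        if options.get(opt):
            # Assumption: a hit would skip the side effect, so never cache these.
            raise ValueError(f"{opt!r} has side effects and cannot be cached")
    if extra_digest is None:
        for opt in NEEDS_EXTRA_DIGEST:
            if options.get(opt):
                raise ValueError(f"{opt!r} reads external files; pass an extra_digest")


def digest_files(paths):
    """Hash file paths and contents in a stable order, e.g. for an extra_digest."""
    h = hashlib.sha256()
    for path in sorted(os.fspath(p) for p in paths):
        with open(path, "rb") as f:
            data = f.read()
        for field in (path.encode(), data):
            # Length-prefix each field so different splits cannot collide.
            h.update(len(field).to_bytes(8, "little"))
            h.update(field)
    return h.digest()
```

Hashing the path as well as the contents means renaming or moving a header also changes the digest; hashing contents alone would be a looser alternative.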

Supporting refactors:

* Unify ``Program``'s source retention into a single ``_code``
  field, replacing the NVRTC-only ``_nvrtc_code`` and populated for
  every backend (c++, ptx, nvvm), plus a ``_code_type`` field holding
  the normalised code type. ``_code`` is always bytes and also feeds
  the NVRTC PCH auto-resize retry; the cache wrapper decodes it back
  to ``str`` for c++/ptx before passing it to
  ``make_program_cache_key``, which accepts bytes only for NVVM.

* Move the actual compile call into a module-level
  ``_program_compile_uncached`` so tests can monkeypatch the seam
  without going through NVRTC. ``Program`` is a ``cdef class``, so
  its methods cannot be reassigned from Python -- the seam has to
  live outside the class.

* The unified ``_code`` field also exposed a pre-existing bug on
  the NVVM path: the C pointer was taken from the caller's original
  ``code`` argument through a ``<bytes>`` cast, which failed at
  runtime for ``bytearray`` inputs even though the type check accepts
  them. The pointer and length now come from ``self._code``, which is
  coerced to ``bytes``. Regression test added in ``test_program.py``.
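
Both refactors can be illustrated in isolation. The sketch assumes nothing from cuda.core: ``code_for_key`` mirrors the decode step, ``seam`` is a hypothetical module whose attribute is resolved at call time (so a test can patch it), and ``can_reassign`` uses the built-in ``int`` as a stand-in for a statically defined extension type such as a Cython ``cdef class``.

```python
import types


def code_for_key(code, code_type):
    """Mirror of the decode step: NVVM keys use bytes, c++/ptx keys use str."""
    return code if code_type == "nvvm" else code.decode("utf-8")


# Hypothetical module hosting the seam, standing in for cuda.core._program.
seam = types.ModuleType("seam")


def _default_compile_uncached(code, target_type):
    raise RuntimeError("would invoke NVRTC")


seam._compile_uncached = _default_compile_uncached


def compile_program(code, target_type):
    # The attribute is looked up on every call, so a patched seam is picked up.
    return seam._compile_uncached(code, target_type)


def can_reassign(tp, name):
    """Return True if attribute `name`, defined directly on `tp`, can be reassigned."""
    original = vars(tp)[name]
    try:
        setattr(tp, name, original)  # write back the same object
    except TypeError:
        return False
    return True
```

A regular Python class accepts the reassignment, while ``int`` raises ``TypeError``; that is the same restriction the commit cites for ``Program``.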

Tests in ``test_program_compile_cache.py`` cover the wrapper-level
miss/hit/error paths against a recording stub (verifying that the
cache is duck-typed and need not subclass ``ProgramCacheResource``),
the rejection paths (name_expressions, extra_digest-required options,
side-effect options, NVRTC ``options.name`` with a directory
component), and a real NVRTC end-to-end round-trip through
``FileStreamProgramCache`` that reopens the cache and checks that the
stored bytes match, standing in for reuse across processes.
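
The duck-typed cache contract exercised by the tests can be shown with a toy directory-backed store that also records its calls, like a test stub. ``DirCache`` is hypothetical and is not how ``FileStreamProgramCache`` is implemented; it assumes keys are hex strings that are safe to use as file names. Opening the same directory with a second instance stands in for a second process.

```python
import os


class DirCache:
    """Toy directory-backed cache: one file per key. Not FileStreamProgramCache."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)
        self.calls = []  # records ("get" | "set", key), like a test stub

    def _path(self, key):
        return os.path.join(self.root, key)

    def get(self, key, default=None):
        self.calls.append(("get", key))
        try:
            with open(self._path(key), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return default

    def __setitem__(self, key, compiled):
        self.calls.append(("set", key))
        tmp = self._path(key) + ".tmp"
        with open(tmp, "wb") as f:
            f.write(bytes(compiled.code))
        # os.replace is atomic on POSIX, so readers never see a half-written entry.
        os.replace(tmp, self._path(key))
```
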
diff --git a/cuda_core/cuda/core/_program.pxd b/cuda_core/cuda/core/_program.pxd
@@ -17,5 +17,6 @@ cdef class Program:
         object _compile_lock  # Per-instance lock for compile-time mutation
         bint _use_libdevice      # Flag for libdevice loading
         bint _libdevice_added
-        bytes _nvrtc_code       # Source code for NVRTC retry (PCH auto-resize)
+        bytes _code             # Source code as bytes: used for key derivation and NVRTC PCH retry
+        str _code_type          # Normalised code_type ("c++", "ptx", "nvvm")
         str _pch_status         # PCH creation outcome after compile
diff --git a/cuda_core/cuda/core/_program.pyx b/cuda_core/cuda/core/_program.pyx
@@ -85,7 +85,12 @@ cdef class Program:
         self._h_nvvm.reset()
 
     def compile(
-        self, target_type: str, name_expressions: tuple | list = (), logs = None
+        self,
+        target_type: str,
+        name_expressions: tuple | list = (),
+        logs=None,
+        *,
+        cache: "ProgramCacheResource | None" = None,
     ) -> ObjectCode:
         """Compile the program to the specified target type.
 
@@ -98,13 +103,82 @@ cdef class Program:
             Used for template instantiation and similar cases.
         logs : object, optional
             Object with a ``write`` method to receive compilation logs.
+        cache : :class:`~cuda.core.utils.ProgramCacheResource`, optional
+            If provided, the compiled binary is looked up in ``cache`` via a
+            key derived from the program's code, options, and ``target_type``.
+            On a hit the cached bytes are wrapped in a fresh
+            :class:`~cuda.core.ObjectCode` (with the same ``target_type``
+            and ``ProgramOptions.name``) and returned without re-compiling;
+            on a miss the compile output is stored as raw bytes (the cache
+            extracts ``bytes(object_code.code)``). Passing a non-empty
+            ``name_expressions`` together with ``cache=`` raises
+            ``ValueError``: NVRTC populates
+            ``ObjectCode.symbol_mapping`` at compile time and that mapping
+            is not carried in the binary the cache stores, so cache hits
+            would silently miss ``get_kernel(name_expression)`` lookups.
+            Options that require an ``extra_digest`` (``include_path``,
+            ``pre_include``, ``pch``, ``use_pch``, ``pch_dir``, NVVM
+            ``use_libdevice=True``, or NVRTC ``options.name`` with a
+            directory component) raise ``ValueError`` via
+            :func:`~cuda.core.utils.make_program_cache_key`; for those
+            compiles, use the manual ``make_program_cache_key(...)``
+            pattern directly.
 
         Returns
         -------
         :class:`~cuda.core.ObjectCode`
             The compiled object code.
         """
-        return Program_compile(self, target_type, name_expressions, logs)
+        if cache is None:
+            return _program_compile_uncached(self, target_type, name_expressions, logs)
+
+        # ``name_expressions`` is incompatible with the cache: NVRTC
+        # populates ``ObjectCode.symbol_mapping`` from name-expression
+        # mangling at compile time, and that mapping isn't carried in
+        # the binary bytes the cache stores. Without this guard the
+        # first call (cache miss) would return an ObjectCode with
+        # symbol_mapping populated, while every subsequent call (hit)
+        # would return one without -- silently breaking later
+        # ``get_kernel(name_expression)`` lookups that work on the
+        # uncached path. Fail loud here instead.
+        if name_expressions:
+            raise ValueError(
+                "Program.compile(cache=...) does not support name_expressions: "
+                "ObjectCode.symbol_mapping is populated by NVRTC at compile "
+                "time and is not preserved across a cache round-trip, so cache "
+                "hits would silently break get_kernel(name_expression) lookups "
+                "that the uncached path supports. Compile without cache= when "
+                "name_expressions are needed, or look up mangled symbols by "
+                "hand from the cached ObjectCode."
+            )
+
+        # Deferred import to avoid a circular import between _program and
+        # cuda.core.utils._program_cache (the cache module already imports
+        # ProgramOptions from this module). Import from the leaf module so
+        # tests that monkeypatch make_program_cache_key via that path
+        # intercept reliably.
+        from cuda.core.utils._program_cache import make_program_cache_key
+
+        # ``self._code`` is always stored as bytes (see ``Program_init``),
+        # but ``make_program_cache_key`` only accepts bytes when
+        # ``code_type == "nvvm"`` -- c++/ptx must be ``str``. Decode back
+        # to the original str for the NVRTC/linker paths so the generated
+        # key matches keys callers build by passing the str source
+        # directly.
+        code_for_key = self._code if self._code_type == "nvvm" else self._code.decode("utf-8")
+
+        key = make_program_cache_key(
+            code=code_for_key,
+            code_type=self._code_type,
+            options=self._options,
+            target_type=target_type,
+        )
+        hit_bytes = cache.get(key)
+        if hit_bytes is not None:
+            return ObjectCode._init(hit_bytes, target_type, name=self._options.name)
+        compiled = _program_compile_uncached(self, target_type, name_expressions, logs)
+        cache[key] = compiled
+        return compiled
 
     @property
     def pch_status(self) -> str | None:
@@ -503,6 +577,19 @@ class ProgramOptions:
 # Private Classes and Helper Functions
 # =============================================================================
 
+
+def _program_compile_uncached(program, target_type, name_expressions, logs):
+    """Run ``Program_compile`` without the cache wrapper.
+
+    Module-level Python function so tests can monkeypatch it from
+    ``cuda.core._program`` to avoid invoking NVRTC when exercising the cache
+    wrapper in :meth:`Program.compile`. ``Program`` itself is a ``cdef class``
+    and its methods cannot be reassigned from Python, so the seam must live
+    outside the class.
+    """
+    return Program_compile(program, target_type, name_expressions, logs)
+
+
 # Module-level state for NVVM lazy loading
 _nvvm_module = None
 _nvvm_import_attempted = False
@@ -618,6 +705,7 @@ cdef inline int Program_init(Program self, object code, str code_type, object op
 
     self._options = options = check_or_create_options(ProgramOptions, options, "Program options")
     code_type = code_type.lower()
+    self._code_type = code_type
     self._compile_lock = threading.Lock()
     self._use_libdevice = False
     self._libdevice_added = False
@@ -638,16 +726,18 @@ cdef inline int Program_init(Program self, object code, str code_type, object op
             HANDLE_RETURN_NVRTC(NULL, cynvrtc.nvrtcCreateProgram(
                 &nvrtc_prog, code_ptr, name_ptr, 0, NULL, NULL))
         self._h_nvrtc = create_nvrtc_program_handle(nvrtc_prog)
-        self._nvrtc_code = code_bytes
+        self._code = code_bytes
         self._backend = "NVRTC"
         self._linker = None
 
     elif code_type == "ptx":
         assert_type(code, str)
         if options.extra_sources is not None:
             raise ValueError("extra_sources is not supported by the PTX backend.")
+        code_bytes = code.encode()
+        self._code = code_bytes
         self._linker = Linker(
-            ObjectCode._init(code.encode(), code_type), options=_translate_program_options(options)
+            ObjectCode._init(code_bytes, code_type), options=_translate_program_options(options)
         )
         self._backend = self._linker.backend
 
@@ -657,10 +747,13 @@ cdef inline int Program_init(Program self, object code, str code_type, object op
             code = code.encode("utf-8")
         elif not isinstance(code, (bytes, bytearray)):
             raise TypeError("NVVM IR code must be provided as str, bytes, or bytearray")
+        self._code = bytes(code)  # Coerce bytearray -> bytes so retention type is stable
 
-        code_ptr = <const char*>(<bytes>code)
+        # Use self._code (strictly bytes) for the C pointer so a bytearray
+        # input doesn't trip the `<bytes>code` cast at runtime.
+        code_ptr = <const char*>self._code
         name_ptr = <const char*>options._name
-        code_len = len(code)
+        code_len = len(self._code)
 
         with nogil:
             HANDLE_RETURN_NVVM(NULL, cynvvm.nvvmCreateProgram(&nvvm_prog))
@@ -832,7 +925,7 @@ cdef object Program_compile_nvrtc(Program self, str target_type, object name_exp
         HANDLE_RETURN_NVRTC(NULL, cynvrtc.nvrtcSetPCHHeapSize(required))
 
     cdef cynvrtc.nvrtcProgram retry_prog
-    cdef const char* code_ptr = <const char*>self._nvrtc_code
+    cdef const char* code_ptr = <const char*>self._code
     cdef const char* name_ptr = <const char*>self._options._name
     with nogil:
         HANDLE_RETURN_NVRTC(NULL, cynvrtc.nvrtcCreateProgram(
diff --git a/cuda_core/tests/test_program.py b/cuda_core/tests/test_program.py
@@ -434,6 +434,23 @@ def test_nvvm_compile_invalid_target(nvvm_ir):
     program.close()
 
 
+@nvvm_available
+def test_nvvm_accepts_bytearray_input(nvvm_ir):
+    """Program(..., 'nvvm') must accept bytearray input.
+
+    Regression for a bug where the NVVM init branch retained the coerced
+    ``self._code`` as bytes but still cast the original ``code`` object to
+    ``<bytes>`` for the C pointer -- tripping a runtime type error for
+    bytearray inputs before nvvmAddModuleToProgram was called.
+    """
+    program = Program(bytearray(nvvm_ir, "utf-8"), "nvvm")
+    try:
+        assert program.backend == "NVVM"
+        assert program.handle is not None
+    finally:
+        program.close()
+
+
 @nvvm_available
 def test_nvvm_compile_invalid_ir():
     """Compiling invalid NVVM IR exercises the HANDLE_RETURN_NVVM error path."""
diff --git a/cuda_core/tests/test_program_compile_cache.py b/cuda_core/tests/test_program_compile_cache.py