Add program caches (in-memory, sqlite, filestream) #1912

Open
cpcloud wants to merge 6 commits into NVIDIA:main from cpcloud:persistent-program-cache-178

Conversation

@cpcloud
Contributor

@cpcloud cpcloud commented Apr 14, 2026

Summary

  • Convert cuda.core.utils from a module to a package; expose cache APIs lazily via __getattr__ so from cuda.core.utils import StridedMemoryView stays lightweight. _LAZY_CACHE_ATTRS is a single ordered tuple spliced into __all__ via *_LAZY_CACHE_ATTRS, and the module docstring notes that the laziness guarantee is for explicit imports only (star-import walks __all__ and therefore resolves every lazy attribute).
  • Add ProgramCacheResource ABC with bytes | str keys, context manager, pickle-safety warning, and rejection of path-backed ObjectCode at write time.
  • Add make_program_cache_key() — blake2b(32) digest with backend-specific gates that mirror Program/Linker:
    • Versions: cuda-core, NVRTC (c++), libNVVM lib+IR (nvvm), linker backend+version (ptx); driver only on the cuLink path.
    • Validates code_type/target_type against Program.compile's SUPPORTED_TARGETS; rejects bytes-like code for non-NVVM and extra_sources for non-NVVM.
    • NVRTC side-effect (create_pch, time, fdevice_time_trace) and external-content (include_path, pre_include, pch, use_pch, pch_dir) options require extra_digest; NVVM use_libdevice=True likewise.
    • NVRTC options.name with a directory component (e.g. /path/to/kernel.cu) also requires extra_digest because NVRTC searches that directory for #include "..." lookups; bare labels ("default_program", "kernel-a") fall back to CWD and stay accepted. no_source_include=True disables the search and the guard.
    • PTX (Linker) options pass through per-field gates that match _prepare_nvjitlink_options / _prepare_driver_options; ptxas_options canonicalised across str/list/tuple/empty shapes; driver-linker hard rejections (time, ptxas_options, split_compile) raise at key time; ftz/prec_div/prec_sqrt/fma collapse under driver linker.
    • Failed env probes mix the exception class name into a *_probe_failed label so broken environments never collide with working ones, while staying stable across processes and repeated calls.
  • Add InMemoryProgramCache — in-process dict-backed cache that stores ObjectCode by reference (no pickling). Optional max_entries and max_size_bytes caps with LRU eviction; threading.RLock serialises every method. __getitem__ promotes LRU order; __contains__ is read-only. Rejects non-ObjectCode values and path-backed ObjectCode the same way the persistent backends do.
  • Add SQLiteProgramCache — single-file sqlite3 (WAL + autocommit), LRU eviction, optional size cap, wal_checkpoint(TRUNCATE) + VACUUM after evictions so the cap bounds real on-disk usage. __contains__ is read-only; __len__ validates and prunes corrupt rows. threading.RLock serialises connection use. Schema-mismatch on open drops tables and rebuilds; corrupt / non-SQLite files reinitialise empty; OperationalError (lock/busy) propagates without nuking the file (and closes the partial connection).
  • Add FileStreamProgramCache — multi-process via tmp + os.replace. Hash-based filenames so arbitrary-length keys don't overflow filesystem limits. Reader pruning, clear(), and _enforce_size_cap are all stat-guarded (snapshot (ino, size, mtime_ns), refuse unlink on mismatch) so a concurrent writer's os.replace is preserved. _enforce_size_cap also decrements its running total when a concurrent deleter wins the unlink race, so over-eviction after a suppressed FileNotFoundError cannot delete a freshly-committed entry. Stale temp files swept on open; live temps count toward the size cap. Windows ERROR_SHARING_VIOLATION/ERROR_LOCK_VIOLATION on os.replace are retried with bounded backoff (~185ms) before being treated as a non-fatal cache miss; other PermissionError and all POSIX failures propagate. __len__ also rejects stored_key/path mismatch.

Program.compile(cache=...) integration is out of scope (tracked by #176).
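
As a rough stand-alone illustration of the key construction described above (this is NOT the real make_program_cache_key() — the field handling, canonicalisation, and version/probe mixing are all simplified assumptions), a blake2b(32) digest over canonicalised inputs looks like:

```python
import hashlib

def toy_program_cache_key(code, code_type, options):
    """Toy model of a stable 32-byte program-cache key (names illustrative)."""
    h = hashlib.blake2b(digest_size=32)  # 32-byte digest, as in the PR
    h.update(code.encode())
    h.update(code_type.lower().encode())  # mirror the lower-casing gate
    for name in sorted(options):          # stable, order-independent field walk
        h.update(f"{name}={options[name]!r}".encode())
    return h.digest()
```

The real key additionally folds in toolchain versions and per-backend gates; the sketch only shows why the digest stays stable across processes and repeated calls.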

Test plan

  • ~200 cache tests — single-process CRUD for all three backends; LRU/size-cap (logical and on-disk); InMemory combined caps, overwrite-updates-size, LRU-touch-on-read, contains-does-not-bump, degenerate caps (single entry > cap, max_entries=0); corruption + __len__ pruning; schema-mismatch table-DROP; threaded SQLite (4 writers + 4 readers × 200 ops); threaded InMemory stress; cross-process FileStream stress (writer/reader race exercising the stat-guard prune; clear/eviction race injection via generator cleanup); over-eviction race (monkeypatched Path.unlink simulates a concurrent deleter winning exactly once; asserts the fresh entry survives); Windows vs POSIX PermissionError narrowing (winerror 32/33 swallow + retry, others propagate; partial-conn close on OperationalError); NVRTC source-directory path-name guard with POSIX/Windows separators and both accept paths; lazy-import subprocess test; _SUPPORTED_TARGETS_BY_CODE_TYPE parity test that parses _program.pyx via tokenize + ast.literal_eval.
  • End-to-end: real CUDA C++ compile → store in cache → reopen → get_kernel on the deserialised ObjectCode, parametrized over the two persistent backends.
  • CI: clean across all platforms.

Closes #177
Closes #178
Closes #179

@cpcloud cpcloud added this to the cuda.core v1.0.0 milestone Apr 14, 2026
@cpcloud cpcloud added P0 High priority - Must do! feature New feature or request cuda.core Everything related to the cuda.core module labels Apr 14, 2026
@cpcloud cpcloud self-assigned this Apr 14, 2026
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from de57bd8 to ac38a68 Compare April 14, 2026 22:15

@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch 23 times, most recently from f1ae40e to b27ed2c Compare April 19, 2026 13:28
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch 3 times, most recently from 2dc5c8f to 5da111b Compare April 20, 2026 12:18
@cpcloud cpcloud requested review from leofang and rwgk April 20, 2026 13:21
@rwgk
Contributor

rwgk commented Apr 20, 2026

Generated with the help of Cursor GPT-5.4 Extra High Fast


High: make_program_cache_key() misses implicit source-directory header dependencies

make_program_cache_key() only forces extra_digest for explicit include/PCH options in cuda_core/cuda/core/utils/_program_cache.py:393 and cuda_core/cuda/core/utils/_program_cache.py:592, but NVRTC also implicitly searches the source file's directory unless no_source_include is set in cuda_core/cuda/core/_program.pyx:1001.

Program passes options.name straight to nvrtcCreateProgram() in cuda_core/cuda/core/_program.pyx:635, while the key builder only hashes that path string in cuda_core/cuda/core/utils/_program_cache.py:778. That means a workflow like options.name="/path/to/kernel.cu" plus #include "local.h" can reuse a stale cached ObjectCode after local.h changes.

The new tests cover explicit include/PCH knobs, but not this default source-directory include path (cuda_core/tests/test_program_cache.py:765).
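
A stand-alone illustration of the failure mode, using plain hashlib in place of the real key builder: a key that covers only the translation unit's text cannot observe edits to a header in the implicit source-directory search path.

```python
import hashlib
import tempfile
from pathlib import Path

src_dir = Path(tempfile.mkdtemp())
(src_dir / "local.h").write_text("#define N 16\n")
source = '#include "local.h"\nint n = N;\n'  # what options.name would point at

key_before = hashlib.blake2b(source.encode(), digest_size=32).digest()
(src_dir / "local.h").write_text("#define N 32\n")  # the header changes...
key_after = hashlib.blake2b(source.encode(), digest_size=32).digest()

assert key_before == key_after  # ...but the key is identical -> stale cache hit
```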

Medium: FileStreamProgramCache._enforce_size_cap() can over-evict under concurrent capped writers

After the re-stat at cuda_core/cuda/core/utils/_program_cache.py:1515, a concurrent deleter can remove the candidate before path.unlink(). That FileNotFoundError is suppressed at cuda_core/cuda/core/utils/_program_cache.py:1530, but total is not adjusted, so eviction continues and can delete newer entries unnecessarily.

For a backend explicitly documented for multi-process use, that turns ordinary contention into avoidable cache data loss. The current multiprocess coverage exercises concurrent writes and prune races, but not max_size_bytes under concurrency (cuda_core/tests/test_program_cache_multiprocess.py).

Reduced simulation I used locally:

import time
from pathlib import Path

from cuda.core._module import ObjectCode
from cuda.core.utils import FileStreamProgramCache

cache = FileStreamProgramCache("/tmp/cuda_cache_review_race", max_size_bytes=1000)
cache[b"old"] = ObjectCode._init(b"a" * 600, "cubin", name="old")
time.sleep(0.01)

old_path = cache._path_for_key(b"old")
orig_unlink = Path.unlink
state = {"done": False}


def flaky_unlink(self, *args, **kwargs):
    if self == old_path and not state["done"]:
        state["done"] = True
        # Simulate another process deleting the file after stat() but before
        # _enforce_size_cap() updates its bookkeeping.
        orig_unlink(self, *args, **kwargs)
        raise FileNotFoundError(self)
    return orig_unlink(self, *args, **kwargs)


Path.unlink = flaky_unlink
try:
    cache[b"new"] = ObjectCode._init(b"b" * 600, "cubin", name="new")
finally:
    Path.unlink = orig_unlink

remaining = [key for key in (b"old", b"new") if cache.get(key) is not None]
print(remaining)  # []

That produced [] for me: once the first deletion race is swallowed without decrementing total, the loop keeps evicting and drops the fresh entry too.

Low: from cuda.core.utils import * now eagerly imports the cache stack

The package conversion keeps explicit imports like from cuda.core.utils import StridedMemoryView lightweight, but from cuda.core.utils import * walks __all__, resolves the lazy cache symbols, and imports _program_cache (cuda_core/cuda/core/utils/__init__.py:10, cuda_core/cuda/core/utils/__init__.py:32).

I verified that star-import now loads cuda.core.utils._program_cache. That said, this only affects import *, which is already discouraged; a short comment explaining that the laziness guarantee is intended for explicit imports, not star-import, seems sufficient here.

Comment thread cuda_core/cuda/core/utils/__init__.py Outdated
Comment on lines +10 to +29
__all__ = [
    "FileStreamProgramCache",
    "ProgramCacheResource",
    "SQLiteProgramCache",
    "StridedMemoryView",
    "args_viewable_as_strided_memory",
    "make_program_cache_key",
]

# Lazily expose the program-cache APIs so ``from cuda.core.utils import
# StridedMemoryView`` stays lightweight -- the cache backends pull in driver,
# NVRTC, and module-load machinery that memoryview-only consumers do not need.
_LAZY_CACHE_ATTRS = frozenset(
    {
        "FileStreamProgramCache",
        "ProgramCacheResource",
        "SQLiteProgramCache",
        "make_program_cache_key",
    }
)
Contributor

Small readability/maintenance cleanup suggestion:

__all__ and _LAZY_CACHE_ATTRS currently duplicate the same cache-export names, so defining the ordered lazy-export list once and reusing it in __all__ seems a bit easier to scan and reduces the chance that the two drift apart later.

Something along these lines:

_LAZY_CACHE_ATTRS = (
    "FileStreamProgramCache",
    "ProgramCacheResource",
    "SQLiteProgramCache",
    "make_program_cache_key",
)

__all__ = [
    "StridedMemoryView",
    "args_viewable_as_strided_memory",
    *_LAZY_CACHE_ATTRS,
]

Mostly just a readability nit, but I think this makes the relationship between "lazy exports" and "public exports" a little clearer.

Contributor Author

Done in cad93d0. _LAZY_CACHE_ATTRS is now the single ordered source; __all__ splices it in via *_LAZY_CACHE_ATTRS so the two lists can't drift. Same commit adds a note that the laziness guarantee is for explicit imports only -- star-import still walks __all__ and therefore resolves every lazy attribute.

@leofang
Member

leofang commented Apr 22, 2026

Thanks, Phillip! I have this PR in my review backlog 🙏

The most important question: Are these cache implementations multithreading/multiprocessing safe? This is the key challenge that real-world apps will stress test. In CuPy, our on-disk cache has been stress-tested in DOE supercomputers.

cpcloud added 5 commits April 22, 2026 08:03
Convert cuda.core.utils to a package and add persistent, on-disk caches
for compiled ObjectCode produced by Program.compile.

Public API (cuda.core.utils):
  * ProgramCacheResource  -- abstract bytes|str -> ObjectCode mapping
    with context manager and pickle-safety warning. Path-backed
    ObjectCode is rejected at write time (would store only the path).
  * SQLiteProgramCache    -- single-file sqlite3 backend (WAL mode,
    autocommit) with LRU eviction against an optional size cap. A
    threading.RLock serialises connection use so one cache object is
    safe across threads. wal_checkpoint(TRUNCATE) + VACUUM run after
    evictions so the size cap bounds real on-disk usage. __contains__
    is read-only -- it does not bump LRU. __len__ counts only entries
    that survive validation and prunes corrupt rows. Schema-version
    mismatch on open drops the tables and rebuilds; corrupt /
    non-SQLite files are detected and the cache reinitialises empty.
    Transient OperationalError ("database is locked") propagates
    without nuking the file (and closes the partial connection).
  * FileStreamProgramCache -- directory of atomically-written entries
    (tmp + os.replace) safe across concurrent processes. On-disk
    filenames are blake2b(32) hashes of the key so arbitrary-length
    keys never overflow filesystem name limits. Reader pruning is
    stat-guarded: only delete a corrupt-looking file if its inode/
    size/mtime have not changed since the read, so a concurrent
    os.replace by a writer is preserved. clear() and _enforce_size_cap
    use the same stat guard. Stale temp files (older than 1 hour) are
    swept on open and during eviction; live temp files count toward
    the size cap. Windows ERROR_SHARING_VIOLATION (32) and
    ERROR_LOCK_VIOLATION (33) on os.replace are retried with bounded
    backoff (~185ms) before being treated as a non-fatal cache miss;
    other PermissionErrors and all POSIX failures propagate. __len__
    matches __getitem__ semantics (rejects schema/key/value mismatch).
  * make_program_cache_key -- stable 32-byte blake2b key over code,
    code_type, ProgramOptions, target_type, name expressions, cuda
    core/NVRTC versions, NVVM lib+IR version, linker backend+version
    for PTX inputs (driver version included only on the cuLink path).
    Backend-specific gates mirror Program/Linker:
      * code_type lower-cased to match Program_init.
      * code_type/target_type combination validated against Program's
        SUPPORTED_TARGETS matrix.
      * NVRTC side-effect options (create_pch, time, fdevice_time_trace)
        and external-content options (include_path, pre_include, pch,
        use_pch, pch_dir) require an extra_digest from the caller. The
        per-field set/unset predicate (_option_is_set) mirrors the
        compiler's emission gates; collections.abc.Sequence is the
        is_sequence check, matching _prepare_nvrtc_options_impl.
      * NVVM use_libdevice=True requires extra_digest because libdevice
        bitcode comes from the active toolkit. extra_sources is
        rejected for non-NVVM. Bytes-like ``code`` is rejected for
        non-NVVM (Program() requires str there).
      * PTX (Linker) input options are normalised through per-field
        gates that match _prepare_nvjitlink_options /
        _prepare_driver_options. ftz/prec_div/prec_sqrt/fma collapse
        to a sentinel under the driver linker (it ignores them).
        ptxas_options canonicalises across str/list/tuple/empty shapes.
        The driver linker's hard rejections (time, ptxas_options,
        split_compile) raise at key time.
      * name_expressions are gated on backend == "nvrtc"; PTX/NVVM
        ignore them, matching Program.compile.
  * Failed environment probes mix the exception class name into a
    *_probe_failed label so broken environments never collide with
    working ones, while staying stable across processes and across
    repeated calls within a process.

Lazy import: ``from cuda.core.utils import StridedMemoryView`` does
NOT pull in the cache backends. The cache classes are exposed via
module __getattr__. sqlite3 is imported lazily inside
SQLiteProgramCache.__init__ so the package is usable on interpreters
built without libsqlite3.

Tests: 177 cache tests covering single-process CRUD, LRU/size-cap
(logical and on-disk, including stat-guarded race scenarios),
corruption + __len__ pruning, schema-mismatch table-DROP, threaded
SQLite, cross-process FileStream stress (writer/reader race exercising
the stat-guard prune; clear/eviction race injection via generator
cleanup), Windows vs POSIX PermissionError narrowing (winerror 32/33
swallow + retry, others propagate; partial-conn close on
OperationalError), lazy-import subprocess test, an end-to-end test
that compiles a real CUDA C++ kernel, stores the ObjectCode, reopens
the cache, and calls get_kernel on the deserialised copy, and a test
that parses _program.pyx via tokenize + ast.literal_eval to assert
the cache's _SUPPORTED_TARGETS_BY_CODE_TYPE matches Program.compile's
matrix. Public API is documented in cuda_core/docs/source/api.rst.
…ent star-import caveat

Defines the ordered lazy-export tuple once and splices it into __all__ so
the two lists cannot drift. Notes that star-import still eagerly resolves
every lazy attribute (walking __all__ pulls _program_cache in), which is
expected given star-imports are already discouraged.
…ction

High: make_program_cache_key() now refuses to build an NVRTC key when
options.name carries a directory component and neither extra_digest nor
no_source_include=True is set. NVRTC searches the directory portion of
the name for #include "..." lookups, and the cache cannot fingerprint
those files -- a stale header would otherwise yield a stale cache hit.
Names without a directory separator fall back to CWD (the same search
root every NVRTC compile sees) and remain accepted unchanged.

Medium: FileStreamProgramCache._enforce_size_cap() now decrements the
running ``total`` whether it unlinks the candidate itself or a concurrent
pruner already removed the file. Previously the suppressed
FileNotFoundError left ``total`` inflated, so the loop kept evicting
newer entries to "hit" a cap that had already been met -- turning
ordinary contention into cache data loss.

Tests cover: path-like names on POSIX/Windows separators with and
without extra_digest / no_source_include; names without a directory
component (default_program, bare labels, filenames) must stay allowed;
non-NVRTC backends are exempt from the new guard. The eviction test
monkeypatches Path.unlink to simulate a concurrent deleter winning the
race and verifies the freshly-committed entry survives.
@cpcloud cpcloud force-pushed the persistent-program-cache-178 branch from 3a32786 to cad93d0 Compare April 22, 2026 12:04
@cpcloud
Contributor Author

cpcloud commented Apr 22, 2026

Addressed in ff886d3585 (fixes) and cad93d0 (refactor + star-import note).

High -- source-directory include. make_program_cache_key() now refuses to build an NVRTC key when options.name contains a directory separator and neither extra_digest nor no_source_include=True is set. Scoping the guard to names that actually introduce a new search directory (/abs/kernel.cu, rel/kernel.cu, C:\src\kernel.cu) keeps bare labels like "default_program" or "kernel-a" -- which fall back to CWD, the same search root every NVRTC compile sees -- accepted unchanged. Tests cover POSIX and Windows separators, the extra_digest and no_source_include=True accept paths, and confirm the guard is NVRTC-only (PTX and NVVM unaffected).
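
A self-contained sketch of the directory-component test described above (the real guard lives inside make_program_cache_key(); this predicate and its name are assumptions):

```python
import ntpath
import posixpath

def name_introduces_search_dir(name):
    # A directory separator (POSIX or Windows) means NVRTC will search that
    # directory for #include "..." lookups; bare labels fall back to CWD, the
    # search root every NVRTC compile already sees, so they stay accepted.
    return posixpath.dirname(name) != "" or ntpath.dirname(name) != ""

assert name_introduces_search_dir("/path/to/kernel.cu")
assert name_introduces_search_dir(r"C:\src\kernel.cu")
assert not name_introduces_search_dir("default_program")
assert not name_introduces_search_dir("kernel-a")
```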

Medium -- over-eviction race. FileStreamProgramCache._enforce_size_cap() now decrements total whether it unlinks the candidate itself or a concurrent pruner already removed the file. The FileNotFoundError is still suppressed, but the accounting now matches reality, so the loop stops as soon as the cap is met. Added a test that monkeypatches Path.unlink to simulate a concurrent deleter winning exactly once, then verifies the freshly-committed entry survives.
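
The corrected accounting can be modelled stand-alone (names are assumed; the real _enforce_size_cap() also stat-guards each candidate before unlinking):

```python
import contextlib
from pathlib import Path

def enforce_size_cap(entries, total, cap):
    """entries: (path, size) pairs oldest-first; returns the accounted total."""
    for path, size in entries:
        if total <= cap:
            break
        with contextlib.suppress(FileNotFoundError):
            path.unlink()
        # Decrement whether we unlinked the file or a concurrent deleter won
        # the race: either way those bytes no longer count against the cap,
        # so the loop stops instead of evicting freshly-committed entries.
        total -= size
    return total
```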

Low -- star-import. Added a note in cuda_core/cuda/core/utils/__init__.py that the laziness guarantee is for explicit imports only -- from cuda.core.utils import * walks __all__ and therefore resolves every lazy attribute. Star-imports are discouraged anyway, so treat that as expected.

@cpcloud
Contributor Author

cpcloud commented Apr 22, 2026

@leofang -- yes, both backends are designed and tested for concurrent access, with different scopes:

SQLiteProgramCache -- single-process, multi-threaded. check_same_thread=False on the connection plus a threading.RLock serialises every connection-touching method, so threads cannot interleave a read/update or a write/VACUUM pair. WAL + autocommit on open. Stressed by a 4 writers + 4 readers x 200 ops test in test_program_cache.py. Multiple processes can technically share the sqlite file (WAL serialises writes) but the recommended choice for multi-process workloads is FileStreamProgramCache.

FileStreamProgramCache -- multi-process. Every write lands on a per-write temp file and is promoted via os.replace, so a reader/writer race either sees the old entry or the new one -- never a half-written file. Reader pruning, clear(), and _enforce_size_cap are all stat-guarded: before unlinking, the code re-stats the candidate and refuses if (ino, size, mtime_ns) differs from the snapshot, so a concurrent writer's os.replace is preserved. Stale temp files are swept on open. On Windows, os.replace can surface ERROR_SHARING_VIOLATION (32) / ERROR_LOCK_VIOLATION (33) against a reader briefly holding the handle; the code retries with bounded backoff (~185ms total) before treating it as a non-fatal cache miss -- all other PermissionErrors and POSIX failures propagate.

Cross-process coverage in test_program_cache_multiprocess.py:

  • concurrent writers producing overlapping keys
  • a writer/reader race exercising the stat-guarded prune path
  • clear/eviction race injection via generator cleanup (the cleanup code after the last yield runs at StopIteration, which is exactly between _enforce_size_cap's scan and its eviction loop)
  • Windows PermissionError narrowing (winerror 32/33 swallow + retry, all others propagate)

One concurrency bug this review shook out (over-eviction after a suppressed FileNotFoundError in _enforce_size_cap) is fixed in ff886d3585 with its own test. If you see a DOE-style pattern from CuPy's cache that we don't cover yet, happy to add a test that reproduces it -- mapping that stress-testing onto this backend would be useful.
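
The tmp + os.replace commit protocol above reduces to a few lines (a sketch only — the real backend adds key hashing, the stat guards, and the Windows retry loop):

```python
import os
import tempfile
from pathlib import Path

def atomic_write(path, payload):
    # Write to a per-write temp file in the same directory, then promote it
    # atomically: a concurrent reader sees either the old entry or the new
    # one, never a half-written file.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp, path)  # atomic on POSIX; the PR retries on Windows
    except BaseException:
        os.unlink(tmp)  # do not leave the temp file behind on failure
        raise
```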

New in-process cache that stores ObjectCode instances by reference
inside an OrderedDict, suitable for workflows that compile kernels once
per process and look them up many times without wanting disk I/O.

Behaviour:
- LRU eviction on both ``max_entries`` and ``max_size_bytes`` (either or
  both can be set; ``None`` means unbounded on that axis).
- ``__getitem__`` promotes the entry; ``__contains__`` is read-only and
  does not shift LRU order -- matches the persistent backends.
- A ``threading.RLock`` serialises every method so the cache can be
  shared across threads without external locking.
- Entries are stored by reference: reads return the same Python object,
  so callers must treat the returned ObjectCode as read-only.
- Rejects non-ObjectCode values and path-backed ObjectCode (same
  ``_require_object_code`` guard the persistent backends use) to avoid
  silently caching content that lives elsewhere on disk.

Tests cover CRUD, key normalisation, cap validation, LRU touch/contains
semantics, combined caps, size accounting on overwrite, degenerate caps
(single entry > cap, max_entries=0), and a threaded stress smoke test.

Closes NVIDIA#177
@cpcloud cpcloud changed the title Add PersistentProgramCache (sqlite + filestream backends) Add program caches (in-memory, sqlite, filestream) Apr 22, 2026


Development

Successfully merging this pull request may close these issues.

  • Add cuda.core.utils.ProgramCacheResource
  • Add cuda.core.utils.PersistentProgramCache
  • Add cuda.core.utils.InMemoryProgramCache

3 participants