Commit c534df1

feat(core.utils): add persistent program caches (sqlite + filestream)
Convert cuda.core.utils to a package and add persistent, on-disk caches for compiled ObjectCode produced by Program.compile.

Public API (cuda.core.utils):

* ProgramCacheResource -- abstract bytes|str -> ObjectCode mapping with context manager and pickle-safety warning. Path-backed ObjectCode is rejected at write time (would store only the path).
* SQLiteProgramCache -- single-file sqlite3 backend (WAL mode, autocommit) with LRU eviction against an optional size cap. A threading.RLock serialises connection use so one cache object is safe across threads. wal_checkpoint(TRUNCATE) + VACUUM run after evictions so the size cap bounds real on-disk usage. __contains__ is read-only -- it does not bump LRU. __len__ counts only entries that survive validation and prunes corrupt rows. Schema-version mismatch on open drops the tables and rebuilds; corrupt / non-SQLite files are detected and the cache reinitialises empty. Transient OperationalError ("database is locked") propagates without nuking the file (and closes the partial connection).
* FileStreamProgramCache -- directory of atomically-written entries (tmp + os.replace) safe across concurrent processes. On-disk filenames are blake2b(32) hashes of the key so arbitrary-length keys never overflow filesystem name limits. Reader pruning is stat-guarded: only delete a corrupt-looking file if its inode/size/mtime have not changed since the read, so a concurrent os.replace by a writer is preserved. clear() and _enforce_size_cap use the same stat guard. Stale temp files (older than 1 hour) are swept on open and during eviction; live temp files count toward the size cap. Windows ERROR_SHARING_VIOLATION (32) and ERROR_LOCK_VIOLATION (33) on os.replace are retried with bounded backoff (~185ms) before being treated as a non-fatal cache miss; other PermissionErrors and all POSIX failures propagate. __len__ matches __getitem__ semantics (rejects schema/key/value mismatch).
* make_program_cache_key -- stable 32-byte blake2b key over code, code_type, ProgramOptions, target_type, name expressions, cuda core/NVRTC versions, NVVM lib+IR version, linker backend+version for PTX inputs (driver version included only on the cuLink path). Backend-specific gates mirror Program/Linker:
  * code_type lower-cased to match Program_init.
  * code_type/target_type combination validated against Program's SUPPORTED_TARGETS matrix.
  * NVRTC side-effect options (create_pch, time, fdevice_time_trace) and external-content options (include_path, pre_include, pch, use_pch, pch_dir) require an extra_digest from the caller. The per-field set/unset predicate (_option_is_set) mirrors the compiler's emission gates; collections.abc.Sequence is the is_sequence check, matching _prepare_nvrtc_options_impl.
  * NVVM use_libdevice=True requires extra_digest because libdevice bitcode comes from the active toolkit. extra_sources is rejected for non-NVVM. Bytes-like ``code`` is rejected for non-NVVM (Program() requires str there).
  * PTX (Linker) input options are normalised through per-field gates that match _prepare_nvjitlink_options / _prepare_driver_options. ftz/prec_div/prec_sqrt/fma collapse to a sentinel under the driver linker (it ignores them). ptxas_options canonicalises across str/list/tuple/empty shapes. The driver linker's hard rejections (time, ptxas_options, split_compile) raise at key time.
  * name_expressions are gated on backend == "nvrtc"; PTX/NVVM ignore them, matching Program.compile.
  * Failed environment probes mix the exception class name into a *_probe_failed label so broken environments never collide with working ones, while staying stable across processes and across repeated calls within a process.

Lazy import: ``from cuda.core.utils import StridedMemoryView`` does NOT pull in the cache backends. The cache classes are exposed via module __getattr__.
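
The FileStreamProgramCache entry above combines two ideas: fixed-length hashed filenames and atomic replacement through a temp file. The following is a minimal sketch of that pattern only, not the code from this commit. The helper names `entry_path` and `atomic_write` are hypothetical, and the real backend additionally validates entries, enforces a size cap, and retries on Windows sharing violations.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def entry_path(root: Path, key: bytes) -> Path:
    """Map an arbitrary-length key to a fixed-length name (64 hex characters)."""
    digest = hashlib.blake2b(key, digest_size=32).hexdigest()
    return root / digest


def atomic_write(root: Path, key: bytes, payload: bytes) -> Path:
    """Write payload to a temp file in the same directory, then os.replace it.

    Readers see either the old entry or the new one, never a partial write,
    because os.replace is atomic when source and target share a filesystem.
    """
    root.mkdir(parents=True, exist_ok=True)
    target = entry_path(root, key)
    fd, tmp_name = tempfile.mkstemp(dir=root, prefix="tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_name, target)
    except BaseException:
        # Best-effort cleanup so a failed write does not leave a temp file behind.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return target
```

Creating the temp file inside the cache directory (rather than the system temp directory) keeps both paths on one filesystem, which is what makes the final rename atomic.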
sqlite3 is imported lazily inside SQLiteProgramCache.__init__ so the package is usable on interpreters built without libsqlite3.

Tests: 177 cache tests covering single-process CRUD, LRU/size-cap (logical and on-disk, including stat-guarded race scenarios), corruption + __len__ pruning, schema-mismatch table-DROP, threaded SQLite, cross-process FileStream stress (writer/reader race exercising the stat-guard prune; clear/eviction race injection via generator cleanup), Windows vs POSIX PermissionError narrowing (winerror 32/33 swallow + retry, others propagate; partial-conn close on OperationalError), lazy-import subprocess test, an end-to-end test that compiles a real CUDA C++ kernel, stores the ObjectCode, reopens the cache, and calls get_kernel on the deserialised copy, and a test that parses _program.pyx via tokenize + ast.literal_eval to assert the cache's _SUPPORTED_TARGETS_BY_CODE_TYPE matches Program.compile's matrix.

Public API is documented in cuda_core/docs/source/api.rst.
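
To illustrate the SQLite design described in this message (deferred sqlite3 import, WAL mode with autocommit, an RLock around the connection, LRU eviction against a byte cap, and a read-only `__contains__`), here is a toy sketch. `TinySQLiteCache`, its table layout, and the `tick` counter are assumptions made for illustration; the actual SQLiteProgramCache schema, validation, and corruption handling are not shown in this excerpt.

```python
import threading


class TinySQLiteCache:
    """Hypothetical toy cache; not the SQLiteProgramCache implementation."""

    def __init__(self, path, max_size_bytes=None):
        import sqlite3  # deferred: importing this module never requires libsqlite3

        self._lock = threading.RLock()
        self._max = max_size_bytes
        # isolation_level=None selects autocommit; the RLock serialises use.
        self._conn = sqlite3.connect(path, isolation_level=None, check_same_thread=False)
        self._conn.execute("PRAGMA journal_mode=WAL")
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS entries ("
            "key BLOB PRIMARY KEY, value BLOB NOT NULL, tick INTEGER NOT NULL)"
        )

    def _next_tick(self):
        # Monotonic counter standing in for a last-access timestamp.
        return self._conn.execute("SELECT COALESCE(MAX(tick), 0) + 1 FROM entries").fetchone()[0]

    def _total_bytes(self):
        return self._conn.execute("SELECT COALESCE(SUM(LENGTH(value)), 0) FROM entries").fetchone()[0]

    def __setitem__(self, key, value):
        with self._lock:
            self._conn.execute(
                "INSERT OR REPLACE INTO entries (key, value, tick) VALUES (?, ?, ?)",
                (key, value, self._next_tick()),
            )
            self._evict()

    def __getitem__(self, key):
        with self._lock:
            row = self._conn.execute("SELECT value FROM entries WHERE key = ?", (key,)).fetchone()
            if row is None:
                raise KeyError(key)
            # A successful read marks the entry as most recently used.
            self._conn.execute("UPDATE entries SET tick = ? WHERE key = ?", (self._next_tick(), key))
            return row[0]

    def __contains__(self, key):
        # Read-only membership test: does not bump the LRU position.
        with self._lock:
            return self._conn.execute("SELECT 1 FROM entries WHERE key = ?", (key,)).fetchone() is not None

    def _evict(self):
        if self._max is None:
            return
        evicted = False
        while self._total_bytes() > self._max:
            self._conn.execute("DELETE FROM entries WHERE tick = (SELECT MIN(tick) FROM entries)")
            evicted = True
        if evicted:
            # Return freed pages to the filesystem so the cap reflects disk usage.
            self._conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
            self._conn.execute("VACUUM")

    def close(self):
        with self._lock:
            self._conn.close()
```

The sketch measures only the payload bytes, so the file on disk will be somewhat larger than the cap because of SQLite page overhead.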
1 parent 355fcaa commit c534df1

6 files changed (+3796 -8 lines changed)

cuda_core/cuda/core/utils.py

Lines changed: 0 additions & 8 deletions
This file was deleted.
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+
+from cuda.core._memoryview import (
+    StridedMemoryView,
+    args_viewable_as_strided_memory,
+)
+
+__all__ = [
+    "FileStreamProgramCache",
+    "ProgramCacheResource",
+    "SQLiteProgramCache",
+    "StridedMemoryView",
+    "args_viewable_as_strided_memory",
+    "make_program_cache_key",
+]
+
+# Lazily expose the program-cache APIs so ``from cuda.core.utils import
+# StridedMemoryView`` stays lightweight -- the cache backends pull in driver,
+# NVRTC, and module-load machinery that memoryview-only consumers do not need.
+_LAZY_CACHE_ATTRS = frozenset(
+    {
+        "FileStreamProgramCache",
+        "ProgramCacheResource",
+        "SQLiteProgramCache",
+        "make_program_cache_key",
+    }
+)
+
+
+def __getattr__(name):
+    if name in _LAZY_CACHE_ATTRS:
+        from cuda.core.utils import _program_cache
+
+        value = getattr(_program_cache, name)
+        globals()[name] = value  # cache for subsequent accesses
+        return value
+    raise AttributeError(f"module 'cuda.core.utils' has no attribute {name!r}")
+
+
+def __dir__():
+    # Merge the lazy public API with the real module namespace so REPL and
+    # introspection tools still surface ``__file__``, ``__spec__``, etc.
+    return sorted(set(globals()) | set(__all__))
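
The module-level `__getattr__` in this file relies on PEP 562: Python calls it only when normal attribute lookup on the module fails, and storing the result in the module namespace means later accesses bypass the hook. The sketch below reproduces that behaviour with a synthetic module so it runs without cuda.core installed. `demo_pkg`, `_load_heavy`, and `make_key` are hypothetical stand-ins, not names from the package.

```python
import types

loads = []  # records each time the "heavy" submodule is loaded


def _load_heavy():
    """Stand-in for ``from cuda.core.utils import _program_cache``."""
    loads.append("loaded")
    mod = types.ModuleType("demo_pkg._heavy")
    mod.make_key = lambda data: bytes(reversed(data))
    return mod


pkg = types.ModuleType("demo_pkg")
_LAZY_ATTRS = frozenset({"make_key"})


def _module_getattr(name):
    if name in _LAZY_ATTRS:
        value = getattr(_load_heavy(), name)
        pkg.__dict__[name] = value  # cache so the hook is not called again
        return value
    raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}")


# PEP 562 looks the hook up in the module's own namespace.
pkg.__getattr__ = _module_getattr

print(len(loads))              # nothing has been loaded yet
print(pkg.make_key(b"ab"))     # first access triggers the load
print(pkg.make_key is pkg.make_key, len(loads))  # cached; still one load
```

Names outside the lazy set still raise `AttributeError`, so `hasattr` checks and typo detection keep working as they would for an ordinary module.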
