Commit 1b24442
feat(core.utils): add persistent program caches (sqlite + filestream)

Convert cuda.core.utils to a package and add persistent, on-disk caches
for compiled ObjectCode produced by Program.compile.

Public API (cuda.core.utils):
* ProgramCacheResource -- abstract bytes|str -> ObjectCode mapping
with context manager and pickle-safety warning. Path-backed
ObjectCode is rejected at write time (would store only the path).
* SQLiteProgramCache -- single-file sqlite3 backend (WAL mode,
autocommit) with LRU eviction against an optional size cap. A
threading.RLock serialises connection use so one cache object is
safe across threads. wal_checkpoint(TRUNCATE) + VACUUM run after
evictions so the size cap bounds real on-disk usage. __contains__
is read-only -- it does not bump LRU. __len__ counts only entries
that survive validation and prunes corrupt rows. Schema-version
mismatch on open drops the tables and rebuilds; corrupt /
non-SQLite files are detected and the cache reinitialises empty.
Transient OperationalError ("database is locked") propagates
without nuking the file (and closes the partial connection).
* FileStreamProgramCache -- directory of atomically-written entries
(tmp + os.replace) safe across concurrent processes. On-disk
filenames are blake2b(32) hashes of the key so arbitrary-length
keys never overflow filesystem name limits. Reader pruning is
stat-guarded: only delete a corrupt-looking file if its inode/
size/mtime have not changed since the read, so a concurrent
os.replace by a writer is preserved. clear() and _enforce_size_cap
use the same stat guard. Stale temp files (older than 1 hour) are
swept on open and during eviction; live temp files count toward
the size cap. Windows ERROR_SHARING_VIOLATION (32) and
ERROR_LOCK_VIOLATION (33) on os.replace are retried with bounded
backoff (~185ms) before being treated as a non-fatal cache miss;
other PermissionErrors and all POSIX failures propagate. __len__
matches __getitem__ semantics (rejects schema/key/value mismatch).
* make_program_cache_key -- stable 32-byte blake2b key over code,
code_type, ProgramOptions, target_type, name expressions, cuda
core/NVRTC versions, NVVM lib+IR version, linker backend+version
for PTX inputs (driver version included only on the cuLink path).
Backend-specific gates mirror Program/Linker:
  * code_type lower-cased to match Program_init.
  * code_type/target_type combination validated against Program's
    SUPPORTED_TARGETS matrix.
  * NVRTC side-effect options (create_pch, time, fdevice_time_trace)
    and external-content options (include_path, pre_include, pch,
    use_pch, pch_dir) require an extra_digest from the caller. The
    per-field set/unset predicate (_option_is_set) mirrors the
    compiler's emission gates; collections.abc.Sequence is the
    is_sequence check, matching _prepare_nvrtc_options_impl.
  * NVVM use_libdevice=True requires extra_digest because libdevice
    bitcode comes from the active toolkit. extra_sources is
    rejected for non-NVVM. Bytes-like ``code`` is rejected for
    non-NVVM (Program() requires str there).
  * PTX (Linker) input options are normalised through per-field
    gates that match _prepare_nvjitlink_options /
    _prepare_driver_options. ftz/prec_div/prec_sqrt/fma collapse
    to a sentinel under the driver linker (it ignores them).
    ptxas_options canonicalises across str/list/tuple/empty shapes.
    The driver linker's hard rejections (time, ptxas_options,
    split_compile) raise at key time.
  * name_expressions are gated on backend == "nvrtc"; PTX/NVVM
    ignore them, matching Program.compile.
  * Failed environment probes mix the exception class name into a
    *_probe_failed label so broken environments never collide with
    working ones, while staying stable across processes and across
    repeated calls within a process.
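
The key derivation described above can be illustrated with a minimal
sketch. This is not the code in this commit: sketch_cache_key, probe_label
and _feed are hypothetical names, the field list is deliberately
shortened, and the "label:ClassName" format of the failure label is an
assumption. It only shows the general technique of length-prefixed
hashing into a 32-byte blake2b digest.

```python
import hashlib


def probe_label(name, probe):
    """Return the probed version, or a stable failure label if the probe raises."""
    try:
        return str(probe())
    except Exception as exc:
        # Only the exception class name goes into the label (assumed format),
        # so every process that fails the same way produces the same string.
        return f"{name}_probe_failed:{type(exc).__name__}"


def _feed(h, data):
    # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
    h.update(len(data).to_bytes(8, "little"))
    h.update(data)


def sketch_cache_key(code, code_type, target_type, options=(), versions=()):
    """Hypothetical stand-in for make_program_cache_key (32-byte digest)."""
    h = hashlib.blake2b(digest_size=32)
    _feed(h, code.encode("utf-8"))
    _feed(h, code_type.lower().encode("utf-8"))  # "C++" and "c++" share a key
    _feed(h, target_type.encode("utf-8"))
    for group in (options, versions):
        h.update(len(group).to_bytes(8, "little"))  # item count per group
        for item in group:
            _feed(h, str(item).encode("utf-8"))
    return h.digest()
```

A caller would pass probe_label("nvrtc", some_version_probe) as one of
the versions entries, so a broken toolkit yields a different key than a
working one without making the key vary between runs.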

Lazy import: ``from cuda.core.utils import StridedMemoryView`` does
NOT pull in the cache backends. The cache classes are exposed via
module __getattr__. sqlite3 is imported lazily inside
SQLiteProgramCache.__init__ so the package is usable on interpreters
built without libsqlite3.
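
The module __getattr__ pattern mentioned above can be sketched in a
self-contained way. The names make_lazy_module, _LAZY_ATTRS and lazy_demo
are hypothetical, and colorsys merely stands in for a heavy backend
module; in a real package the hook would be a top-level
``def __getattr__(name)`` in ``__init__.py`` rather than a function
attached to a synthetic module.

```python
import importlib
import types

# Hypothetical table: public name -> (defining module, attribute name).
_LAZY_ATTRS = {"rgb_to_hsv": ("colorsys", "rgb_to_hsv")}


def make_lazy_module(name, lazy_attrs):
    """Build a module whose listed attributes are imported on first access."""
    mod = types.ModuleType(name)

    def __getattr__(attr):
        # Called only when normal attribute lookup on the module fails.
        try:
            module_name, attr_name = lazy_attrs[attr]
        except KeyError:
            raise AttributeError(
                f"module {name!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(module_name), attr_name)
        setattr(mod, attr, value)  # cache so later lookups skip this hook
        return value

    mod.__getattr__ = __getattr__
    return mod


demo = make_lazy_module("lazy_demo", _LAZY_ATTRS)
```

Accessing demo.rgb_to_hsv triggers the import and caches the result on
the module; names that are not in the table still raise AttributeError,
so hasattr() and ``from ... import`` keep their usual behaviour.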

Tests: 177 cache tests covering single-process CRUD, LRU/size-cap
(logical and on-disk, including stat-guarded race scenarios),
corruption + __len__ pruning, schema-mismatch table-DROP, threaded
SQLite, cross-process FileStream stress (writer/reader race exercising
the stat-guard prune; clear/eviction race injection via generator
cleanup), Windows vs POSIX PermissionError narrowing (winerror 32/33
swallow + retry, others propagate; partial-conn close on
OperationalError), lazy-import subprocess test, an end-to-end test
that compiles a real CUDA C++ kernel, stores the ObjectCode, reopens
the cache, and calls get_kernel on the deserialised copy, and a test
that parses _program.pyx via tokenize + ast.literal_eval to assert
the cache's _SUPPORTED_TARGETS_BY_CODE_TYPE matches Program.compile's
matrix. Public API is documented in cuda_core/docs/source/api.rst.

1 parent a18022c commit 1b24442
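
The writer/reader race that the FileStream tests exercise comes down to
two steps: an atomic publish (tmp + os.replace) and a prune that only
deletes a file whose stat signature is unchanged since it was read. A
minimal sketch under stated assumptions follows; entry_path,
atomic_write, stat_signature and prune_if_unchanged are hypothetical
names, not the helpers added by this commit, and a small window between
the stat and the unlink remains.

```python
import contextlib
import hashlib
import os
import tempfile
from pathlib import Path


def entry_path(root, key):
    # 32-byte blake2b digest -> 64 hex chars, regardless of key length.
    return Path(root) / hashlib.blake2b(key, digest_size=32).hexdigest()


def atomic_write(root, key, payload):
    """Write payload to a temp file, then publish it with os.replace."""
    target = entry_path(root, key)
    fd, tmp = tempfile.mkstemp(dir=root, prefix="tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp, target)  # readers see the old or the new file, never half
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp)
        raise
    return target


def stat_signature(path):
    st = os.stat(path)
    return (st.st_ino, st.st_size, st.st_mtime_ns)


def prune_if_unchanged(path, signature_at_read):
    """Delete path only if it is still the file that was read earlier."""
    try:
        if stat_signature(path) != signature_at_read:
            return False  # a writer replaced it in the meantime; keep it
        os.unlink(path)
        return True
    except FileNotFoundError:
        return False
```

A reader that finds a corrupt-looking entry would record
stat_signature() before reading and pass it to prune_if_unchanged()
afterwards, so an entry republished by a concurrent writer survives.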
File tree
6 files changed, +3687 -8 lines changed
- cuda_core
  - cuda/core
    - utils
  - docs/source
  - tests