Add program caches (in-memory, sqlite, filestream) #1912
cpcloud wants to merge 6 commits into NVIDIA:main from
Conversation
Generated with the help of Cursor GPT-5.4 Extra High Fast High:
```python
__all__ = [
    "FileStreamProgramCache",
    "ProgramCacheResource",
    "SQLiteProgramCache",
    "StridedMemoryView",
    "args_viewable_as_strided_memory",
    "make_program_cache_key",
]


# Lazily expose the program-cache APIs so ``from cuda.core.utils import
# StridedMemoryView`` stays lightweight -- the cache backends pull in driver,
# NVRTC, and module-load machinery that memoryview-only consumers do not need.
_LAZY_CACHE_ATTRS = frozenset(
    {
        "FileStreamProgramCache",
        "ProgramCacheResource",
        "SQLiteProgramCache",
        "make_program_cache_key",
    }
)
```
Small readability/maintenance cleanup suggestion:
__all__ and _LAZY_CACHE_ATTRS currently duplicate the same cache-export names, so defining the ordered lazy-export list once and reusing it in __all__ seems a bit easier to scan and reduces the chance that the two drift apart later.
Something along these lines:

```python
_LAZY_CACHE_ATTRS = (
    "FileStreamProgramCache",
    "ProgramCacheResource",
    "SQLiteProgramCache",
    "make_program_cache_key",
)

__all__ = [
    "StridedMemoryView",
    "args_viewable_as_strided_memory",
    *_LAZY_CACHE_ATTRS,
]
```

Mostly just a readability nit, but I think this makes the relationship between "lazy exports" and "public exports" a little clearer.
Done in cad93d0. _LAZY_CACHE_ATTRS is now the single ordered source; __all__ splices it in via *_LAZY_CACHE_ATTRS so the two lists can't drift. Same commit adds a note that the laziness guarantee is for explicit imports only -- star-import still walks __all__ and therefore resolves every lazy attribute.
Thanks, Phillip! I have this PR in my review backlog 🙏 The most important question: are these cache implementations multithreading/multiprocessing safe? This is the key challenge that real-world apps will stress-test. In CuPy, our on-disk cache has been stress-tested on DOE supercomputers.
Convert cuda.core.utils to a package and add persistent, on-disk caches
for compiled ObjectCode produced by Program.compile.
Public API (cuda.core.utils):
* ProgramCacheResource -- abstract bytes|str -> ObjectCode mapping
with context manager and pickle-safety warning. Path-backed
ObjectCode is rejected at write time (would store only the path).
* SQLiteProgramCache -- single-file sqlite3 backend (WAL mode,
autocommit) with LRU eviction against an optional size cap. A
threading.RLock serialises connection use so one cache object is
safe across threads. wal_checkpoint(TRUNCATE) + VACUUM run after
evictions so the size cap bounds real on-disk usage. __contains__
is read-only -- it does not bump LRU. __len__ counts only entries
that survive validation and prunes corrupt rows. Schema-version
mismatch on open drops the tables and rebuilds; corrupt /
non-SQLite files are detected and the cache reinitialises empty.
Transient OperationalError ("database is locked") propagates
without nuking the file (and closes the partial connection).
* FileStreamProgramCache -- directory of atomically-written entries
(tmp + os.replace) safe across concurrent processes. On-disk
filenames are blake2b(32) hashes of the key so arbitrary-length
keys never overflow filesystem name limits. Reader pruning is
stat-guarded: only delete a corrupt-looking file if its inode/
size/mtime have not changed since the read, so a concurrent
os.replace by a writer is preserved. clear() and _enforce_size_cap
use the same stat guard. Stale temp files (older than 1 hour) are
swept on open and during eviction; live temp files count toward
the size cap. Windows ERROR_SHARING_VIOLATION (32) and
ERROR_LOCK_VIOLATION (33) on os.replace are retried with bounded
backoff (~185ms) before being treated as a non-fatal cache miss;
other PermissionErrors and all POSIX failures propagate. __len__
matches __getitem__ semantics (rejects schema/key/value mismatch).
* make_program_cache_key -- stable 32-byte blake2b key over code,
code_type, ProgramOptions, target_type, name expressions, cuda
core/NVRTC versions, NVVM lib+IR version, linker backend+version
for PTX inputs (driver version included only on the cuLink path).
Backend-specific gates mirror Program/Linker:
* code_type lower-cased to match Program_init.
* code_type/target_type combination validated against Program's
SUPPORTED_TARGETS matrix.
* NVRTC side-effect options (create_pch, time, fdevice_time_trace)
and external-content options (include_path, pre_include, pch,
use_pch, pch_dir) require an extra_digest from the caller. The
per-field set/unset predicate (_option_is_set) mirrors the
compiler's emission gates; collections.abc.Sequence is the
is_sequence check, matching _prepare_nvrtc_options_impl.
* NVVM use_libdevice=True requires extra_digest because libdevice
bitcode comes from the active toolkit. extra_sources is
rejected for non-NVVM. Bytes-like ``code`` is rejected for
non-NVVM (Program() requires str there).
* PTX (Linker) input options are normalised through per-field
gates that match _prepare_nvjitlink_options /
_prepare_driver_options. ftz/prec_div/prec_sqrt/fma collapse
to a sentinel under the driver linker (it ignores them).
ptxas_options canonicalises across str/list/tuple/empty shapes.
The driver linker's hard rejections (time, ptxas_options,
split_compile) raise at key time.
* name_expressions are gated on backend == "nvrtc"; PTX/NVVM
ignore them, matching Program.compile.
* Failed environment probes mix the exception class name into a
*_probe_failed label so broken environments never collide with
working ones, while staying stable across processes and across
repeated calls within a process.
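The key construction described above can be sketched as a length-prefixed blake2b fold over every compilation-relevant input. This is a simplified illustration, not the PR's actual field layout: the function name, parameters, and field list here are placeholders.

```python
import hashlib


def cache_key_sketch(code, code_type, options, environment, extra_digest=None):
    """Illustrative stable 32-byte key: any change in source, options, or
    toolchain versions yields a different digest.  ``options`` and
    ``environment`` are plain dicts here (e.g. compiler flags and probed
    NVRTC/driver versions); the real key covers more fields."""
    h = hashlib.blake2b(digest_size=32)
    fields = [
        ("code", code),
        ("code_type", code_type.lower()),  # mirror the lower-casing described above
        *sorted(options.items()),
        *sorted(environment.items()),
    ]
    for label, value in fields:
        data = value if isinstance(value, bytes) else str(value).encode()
        # length-prefix each field so adjacent fields can never collide
        h.update(str(label).encode() + b"\0" + len(data).to_bytes(8, "little") + data)
    if extra_digest is not None:
        # caller-supplied fingerprint for content the cache cannot see itself
        h.update(extra_digest)
    return h.digest()
```

The length prefix matters: without it, `("ab", "c")` and `("a", "bc")` would hash identically.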
Lazy import: ``from cuda.core.utils import StridedMemoryView`` does
NOT pull in the cache backends. The cache classes are exposed via
module __getattr__. sqlite3 is imported lazily inside
SQLiteProgramCache.__init__ so the package is usable on interpreters
built without libsqlite3.
Tests: 177 cache tests covering single-process CRUD, LRU/size-cap
(logical and on-disk, including stat-guarded race scenarios),
corruption + __len__ pruning, schema-mismatch table-DROP, threaded
SQLite, cross-process FileStream stress (writer/reader race exercising
the stat-guard prune; clear/eviction race injection via generator
cleanup), Windows vs POSIX PermissionError narrowing (winerror 32/33
swallow + retry, others propagate; partial-conn close on
OperationalError), lazy-import subprocess test, an end-to-end test
that compiles a real CUDA C++ kernel, stores the ObjectCode, reopens
the cache, and calls get_kernel on the deserialised copy, and a test
that parses _program.pyx via tokenize + ast.literal_eval to assert
the cache's _SUPPORTED_TARGETS_BY_CODE_TYPE matches Program.compile's
matrix. Public API is documented in cuda_core/docs/source/api.rst.
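The SQLite backend's shape described above (WAL journal, autocommit connection, per-object RLock, LRU eviction against a size cap followed by checkpoint + VACUUM) can be sketched as follows. This is a minimal illustration, not the PR's implementation: it omits schema versioning, corruption recovery, and value validation.

```python
import itertools
import sqlite3
import threading


class SQLiteCacheSketch:
    """Illustrative single-file cache: WAL mode, autocommit, LRU eviction."""

    def __init__(self, path, max_size_bytes=None):
        self._lock = threading.RLock()  # serialises all connection use
        self._max = max_size_bytes
        self._tick = itertools.count()  # monotonic LRU stamp
        # isolation_level=None puts sqlite3 in autocommit mode
        self._conn = sqlite3.connect(path, isolation_level=None, check_same_thread=False)
        self._conn.execute("PRAGMA journal_mode=WAL")
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key BLOB PRIMARY KEY, value BLOB NOT NULL, last_used INTEGER NOT NULL)"
        )

    def __setitem__(self, key, value):
        with self._lock:
            self._conn.execute(
                "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
                (key, value, next(self._tick)),
            )
            self._enforce_size_cap()

    def __getitem__(self, key):
        with self._lock:
            row = self._conn.execute("SELECT value FROM cache WHERE key = ?", (key,)).fetchone()
            if row is None:
                raise KeyError(key)
            # reads bump LRU; __contains__ deliberately does not
            self._conn.execute(
                "UPDATE cache SET last_used = ? WHERE key = ?", (next(self._tick), key)
            )
            return row[0]

    def __contains__(self, key):
        with self._lock:
            return self._conn.execute(
                "SELECT 1 FROM cache WHERE key = ?", (key,)
            ).fetchone() is not None

    def _enforce_size_cap(self):
        if self._max is None:
            return
        total = self._conn.execute(
            "SELECT COALESCE(SUM(LENGTH(value)), 0) FROM cache"
        ).fetchone()[0]
        evicted = False
        while total > self._max:
            key, size = self._conn.execute(
                "SELECT key, LENGTH(value) FROM cache ORDER BY last_used LIMIT 1"
            ).fetchone()
            self._conn.execute("DELETE FROM cache WHERE key = ?", (key,))
            total -= size
            evicted = True
        if evicted:
            # shrink the WAL and the main file so the cap bounds real on-disk usage
            self._conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
            self._conn.execute("VACUUM")
```

The checkpoint-then-VACUUM step after eviction is the detail that makes the size cap meaningful on disk: without it, deleted pages linger in the WAL and the main file.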
…ent pickle compat; add usage example
…ent star-import caveat

Defines the ordered lazy-export tuple once and splices it into __all__ so the two lists cannot drift. Notes that star-import still eagerly resolves every lazy attribute (walking __all__ pulls _program_cache in), which is expected given star-imports are already discouraged.
…ction

High: make_program_cache_key() now refuses to build an NVRTC key when options.name carries a directory component and neither extra_digest nor no_source_include=True is set. NVRTC searches the directory portion of the name for #include "..." lookups, and the cache cannot fingerprint those files -- a stale header would otherwise yield a stale cache hit. Names without a directory separator fall back to CWD (the same search root every NVRTC compile sees) and remain accepted unchanged.

Medium: FileStreamProgramCache._enforce_size_cap() now decrements the running ``total`` whether it unlinks the candidate itself or a concurrent pruner already removed the file. Previously the suppressed FileNotFoundError left ``total`` inflated, so the loop kept evicting newer entries to "hit" a cap that had already been met -- turning ordinary contention into cache data loss.

Tests cover: path-like names on POSIX/Windows separators with and without extra_digest / no_source_include; names without a directory component (default_program, bare labels, filenames) must stay allowed; non-NVRTC backends are exempt from the new guard. The eviction test monkeypatches Path.unlink to simulate a concurrent deleter winning the race and verifies the freshly-committed entry survives.
Addressed in ff886d3585 (fixes) and cad93d0 (refactor + star-import note). High -- source-directory include. Medium -- over-eviction race. Low -- star-import. Added a note in
@leofang -- yes, both backends are designed and tested for concurrent access, with different scopes:
Cross-process coverage in
One concurrency bug this review shook out: over-eviction after a suppressed FileNotFoundError.
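The publish/prune discipline the file-backed cache relies on (tmp + os.replace for atomic publish; a stat-guarded unlink so a pruner never deletes a concurrent writer's fresh entry) can be sketched as below. These are simplified stand-ins for the PR's internals, with illustrative names:

```python
import os
import tempfile


def atomic_write(path, data):
    """Publish an entry atomically: write a temp file in the same directory,
    then os.replace() it into place.  Readers see either the old complete
    file or the new one, never a torn write."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise


def stat_guarded_unlink(path, snapshot):
    """Delete ``path`` only if it is still the file we inspected.

    ``snapshot`` is the os.stat_result taken when the file looked corrupt.
    If a concurrent writer os.replace()d a fresh entry in since then, the
    inode/size/mtime will differ and we refuse to unlink it."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False  # a concurrent pruner already won the race
    if (st.st_ino, st.st_size, st.st_mtime_ns) != (
        snapshot.st_ino,
        snapshot.st_size,
        snapshot.st_mtime_ns,
    ):
        return False  # the file changed under us -- keep it
    os.unlink(path)
    return True
```

A small stat-then-unlink window remains inherent to this pattern on POSIX; the guard shrinks the race to that window rather than eliminating it, which is why the PR pairs it with retry/suppress handling on the writer side.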
New in-process cache that stores ObjectCode instances by reference inside an OrderedDict, suitable for workflows that compile kernels once per process and look them up many times without wanting disk I/O.

Behaviour:
- LRU eviction on both ``max_entries`` and ``max_size_bytes`` (either or both can be set; ``None`` means unbounded on that axis).
- ``__getitem__`` promotes the entry; ``__contains__`` is read-only and does not shift LRU order -- matches the persistent backends.
- A ``threading.RLock`` serialises every method so the cache can be shared across threads without external locking.
- Entries are stored by reference: reads return the same Python object, so callers must treat the returned ObjectCode as read-only.
- Rejects non-ObjectCode values and path-backed ObjectCode (same ``_require_object_code`` guard the persistent backends use) to avoid silently caching content that lives elsewhere on disk.

Tests cover CRUD, key normalisation, cap validation, LRU touch/contains semantics, combined caps, size accounting on overwrite, degenerate caps (single entry > cap, max_entries=0), and a threaded stress smoke test.

Closes NVIDIA#177
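The behaviour described above can be sketched with an OrderedDict under an RLock. The class name and the ``sizeof`` probe are illustrative (here it defaults to ``len`` so the sketch is self-contained; the real cache would size ObjectCode entries):

```python
import threading
from collections import OrderedDict


class InMemoryLRUSketch:
    """Illustrative in-process LRU cache: entries stored by reference,
    __getitem__ promotes, __contains__ does not, eviction honours both an
    entry-count cap and a byte-size cap."""

    def __init__(self, max_entries=None, max_size_bytes=None, sizeof=len):
        self._lock = threading.RLock()  # serialises every method
        self._entries = OrderedDict()  # insertion order == LRU order
        self._max_entries = max_entries
        self._max_size = max_size_bytes
        self._sizeof = sizeof
        self._total = 0

    def __setitem__(self, key, value):
        with self._lock:
            if key in self._entries:
                self._total -= self._sizeof(self._entries.pop(key))
            self._entries[key] = value
            self._total += self._sizeof(value)
            self._evict()

    def __getitem__(self, key):
        with self._lock:
            value = self._entries[key]
            self._entries.move_to_end(key)  # promote on read
            return value

    def __contains__(self, key):  # read-only: no LRU promotion
        with self._lock:
            return key in self._entries

    def __len__(self):
        with self._lock:
            return len(self._entries)

    def _evict(self):
        while self._entries and (
            (self._max_entries is not None and len(self._entries) > self._max_entries)
            or (self._max_size is not None and self._total > self._max_size)
        ):
            _, value = self._entries.popitem(last=False)  # drop least-recent
            self._total -= self._sizeof(value)
```

Note the degenerate-cap semantics this sketch shares with the tests described above: ``max_entries=0`` or a single entry larger than ``max_size_bytes`` evicts the just-inserted entry, leaving the cache empty rather than over-cap.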
Summary
- Convert `cuda.core.utils` from a module to a package; expose cache APIs lazily via `__getattr__` so `from cuda.core.utils import StridedMemoryView` stays lightweight. `_LAZY_CACHE_ATTRS` is a single ordered tuple spliced into `__all__` via `*_LAZY_CACHE_ATTRS`, and the module docstring notes that the laziness guarantee is for explicit imports only (star-import walks `__all__` and therefore resolves every lazy attribute).
- `ProgramCacheResource` ABC with `bytes | str` keys, context manager, pickle-safety warning, and rejection of path-backed `ObjectCode` at write time.
- `make_program_cache_key()` -- blake2b(32) digest with backend-specific gates that mirror `Program`/`Linker`:
  - validates `code_type`/`target_type` against `Program.compile`'s `SUPPORTED_TARGETS`; rejects bytes-like `code` for non-NVVM and `extra_sources` for non-NVVM.
  - NVRTC side-effect (`create_pch`, `time`, `fdevice_time_trace`) and external-content (`include_path`, `pre_include`, `pch`, `use_pch`, `pch_dir`) options require `extra_digest`; NVVM `use_libdevice=True` likewise.
  - `options.name` with a directory component (e.g. `/path/to/kernel.cu`) also requires `extra_digest` because NVRTC searches that directory for `#include "..."` lookups; bare labels (`"default_program"`, `"kernel-a"`) fall back to CWD and stay accepted. `no_source_include=True` disables the search and the guard.
  - PTX (Linker) options are normalised through per-field gates matching `_prepare_nvjitlink_options` / `_prepare_driver_options`; `ptxas_options` is canonicalised across str/list/tuple/empty shapes; driver-linker hard rejections (`time`, `ptxas_options`, `split_compile`) raise at key time; `ftz`/`prec_div`/`prec_sqrt`/`fma` collapse under the driver linker.
  - Failed environment probes mix the exception class name into a `*_probe_failed` label so broken environments never collide with working ones, while staying stable across processes and repeated calls.
- `InMemoryProgramCache` -- in-process dict-backed cache that stores `ObjectCode` by reference (no pickling). Optional `max_entries` and `max_size_bytes` caps with LRU eviction; a `threading.RLock` serialises every method. `__getitem__` promotes LRU order; `__contains__` is read-only. Rejects non-`ObjectCode` values and path-backed `ObjectCode` the same way the persistent backends do.
- `SQLiteProgramCache` -- single-file sqlite3 (WAL + autocommit), LRU eviction, optional size cap, `wal_checkpoint(TRUNCATE)` + `VACUUM` after evictions so the cap bounds real on-disk usage. `__contains__` is read-only; `__len__` validates and prunes corrupt rows. A `threading.RLock` serialises connection use. Schema mismatch on open drops tables and rebuilds; corrupt / non-SQLite files reinitialise empty; `OperationalError` (lock/busy) propagates without nuking the file (and closes the partial connection).
- `FileStreamProgramCache` -- multi-process via tmp + `os.replace`. Hash-based filenames so arbitrary-length keys don't overflow filesystem limits. Reader pruning, `clear()`, and `_enforce_size_cap` are all stat-guarded (snapshot `(ino, size, mtime_ns)`, refuse unlink on mismatch) so a concurrent writer's `os.replace` is preserved. `_enforce_size_cap` also decrements its running `total` when a concurrent deleter wins the unlink race, so over-eviction after a suppressed `FileNotFoundError` cannot delete a freshly-committed entry. Stale temp files are swept on open; live temps count toward the size cap. Windows `ERROR_SHARING_VIOLATION`/`ERROR_LOCK_VIOLATION` on `os.replace` are retried with bounded backoff (~185ms) before being treated as a non-fatal cache miss; other `PermissionError`s and all POSIX failures propagate. `__len__` also rejects `stored_key`/path mismatch.
- `Program.compile(cache=...)` integration is out of scope (tracked by #176).

Test plan
- `__len__` pruning; schema-mismatch table-DROP; threaded SQLite (4 writers + 4 readers × 200 ops); threaded InMemory stress; cross-process FileStream stress (writer/reader race exercising the stat-guard prune; clear/eviction race injection via generator cleanup); over-eviction race (monkeypatched `Path.unlink` simulates a concurrent deleter winning exactly once; asserts the fresh entry survives); Windows vs POSIX `PermissionError` narrowing (winerror 32/33 swallow + retry, others propagate; partial-conn close on `OperationalError`); NVRTC source-directory path-name guard with POSIX/Windows separators and both accept paths; lazy-import subprocess test; `_SUPPORTED_TARGETS_BY_CODE_TYPE` parity test that parses `_program.pyx` via `tokenize` + `ast.literal_eval`.
- End-to-end test that compiles a real CUDA C++ kernel, stores the `ObjectCode`, reopens the cache, and calls `get_kernel` on the deserialised `ObjectCode`, parametrized over the two persistent backends.

Closes #177
Closes #178
Closes #179