Commit 29a8be9
Pinning runtime: out-of-pool resident allocations with budget split
EXPERIMENTAL. Lands the runtime half of pinning. Commit 9b will
add the public partitioner kwarg, the public
``weight_offload_budget_mb`` runtime spec, and the
``--cuda_runtime_spec`` runner flag that consume what this
commit ships.
A pinned FQN now:
* Has a probe op in the schedule like any other FQN (no
change — pass behavior unchanged).
* Is allocated once at Session::create via out-of-pool
``cudaMalloc``, then a ``cudaMemcpyAsync`` from the pinned host
mirror followed by ``cudaStreamSynchronize`` (so the copy is
effectively synchronous). Lives for the Session lifetime in
``pinned_``; freed in the dtor between dummies cleanup and
host-mirror free.
* At serve(): the pinned fast path bypasses the pool, event
waits, and streaming stats — but STILL calls
``opportunistic_prefetch(probe_id)`` so a pinned→streaming
transition doesn't lose overlap.
* At prefetch lookup: short-circuited (pinned FQNs are
already resident; no work needed).
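The serve-time behavior described above can be modelled in a few lines of Python. This is only a toy sketch of the control flow, not the C++ Session: ``SessionModel``, ``streaming_serves`` and the string "buffers" are hypothetical names introduced for illustration, and the assumption that the streaming path also calls ``opportunistic_prefetch`` is inferred from the word "STILL" rather than stated outright.

```python
from dataclasses import dataclass, field


@dataclass
class SessionModel:
    """Toy stand-in for the runtime Session (hypothetical, illustration only)."""

    pinned: dict  # fqn -> resident buffer, filled once at create time
    streaming_serves: int = 0  # counts serves that went through the streaming path
    prefetch_calls: list = field(default_factory=list)

    def opportunistic_prefetch(self, probe_id: int) -> None:
        # Record the call so the overlap behavior can be observed.
        self.prefetch_calls.append(probe_id)

    def serve(self, probe_id: int, fqn: str) -> str:
        if fqn in self.pinned:
            # Pinned fast path: skips the pool, event waits and streaming
            # stats, but still gives the next streaming FQN a chance to
            # start its copy early.
            self.opportunistic_prefetch(probe_id)
            return self.pinned[fqn]
        # Streaming path (assumed to prefetch as well).
        self.streaming_serves += 1
        self.opportunistic_prefetch(probe_id)
        return f"streamed:{fqn}"
```

With ``w1`` pinned and ``w2`` streamed, both serves record a prefetch attempt, while only ``w2`` touches the streaming counter.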
Budget accounting splits cleanly:
* ``total_budget_bytes_`` — what the user configured (or the
default, which is now ``floor + pinned_bytes`` when no spec
is provided so a no-spec default never starves pinning).
* ``pinned_bytes_total_`` — sum of payload.pin_fqns logical
nbytes, computed from VALIDATED metadata (before any GPU
work).
* ``streaming_budget_bytes_ = total - pinned`` — the cap the
miss-path and prefetch-path eviction loops compare against.
The pool's release threshold (soft) also uses this.
Floor check at init becomes ``streaming_budget >=
payload.floor_bytes`` (the pass-computed floor already excludes
pinned). Below-floor budgets hard-fail with a descriptive
message naming pinned bytes, streaming floor, and required
total — "Weight offloading needs at least X bytes... pinned: Y
bytes, streaming pool floor: Z bytes, required total: X" — so
the user knows exactly what to set.
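As a rough illustration of the accounting, the split and the floor check can be written as a small Python function. ``split_budget`` and its parameter names are hypothetical, and the error text is a paraphrase: the exact runtime message is elided in the description above.

```python
def split_budget(floor_bytes, pinned_bytes, configured_total=None):
    """Return (total, streaming) or raise if the streaming share is below the floor.

    Hypothetical helper mirroring the description above, not the runtime code.
    """
    # With no spec, the default covers the streaming floor plus all pinned bytes.
    if configured_total is None:
        total = floor_bytes + pinned_bytes
    else:
        total = configured_total

    # The streaming pool only gets what is left after pinned allocations.
    streaming = total - pinned_bytes

    # The pass-computed floor already excludes pinned FQNs, so it is
    # compared against the streaming share, not the total.
    if streaming < floor_bytes:
        required = floor_bytes + pinned_bytes
        raise ValueError(
            f"budget too small: pinned: {pinned_bytes} bytes, "
            f"streaming pool floor: {floor_bytes} bytes, "
            f"required total: {required} bytes"
        )
    return total, streaming
```

For example, ``split_budget(16384, 16384)`` yields a total of 32768 bytes with 16384 left for streaming, while an explicit total of 20000 bytes fails because only 3616 bytes would remain for the streaming pool.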
Three-layer dedupe on payload.pin_fqns to prevent
double-accounting / overwrite:
1. Pass-side: ``_apply_weight_offload`` dedupes
``pin_fqns`` first-seen-order before payload serialization.
2. Runtime parse: cuda_backend.cpp hard-fails at parse if
``payload.pin_fqns`` contains duplicates (corrupted /
hand-rolled artifact protection).
3. Allocation: Session::create's ``pinned_.emplace(fqn, dev)``
asserts not-already-inserted as a third-layer guard.
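The three layers can be sketched in Python as follows. All three function names are hypothetical stand-ins: the first mirrors what ``_apply_weight_offload`` is described as doing, the second the parse-time check in cuda_backend.cpp, and the third the ``pinned_.emplace`` guard. The real code may differ in detail.

```python
def dedupe_first_seen(pin_fqns):
    """Layer 1 (pass side): drop repeats, keeping first-seen order."""
    # dict preserves insertion order, so this keeps the first occurrence.
    return list(dict.fromkeys(pin_fqns))


def parse_pin_fqns(pin_fqns):
    """Layer 2 (runtime parse): reject an artifact that contains duplicates."""
    seen = set()
    for fqn in pin_fqns:
        if fqn in seen:
            raise ValueError(f"duplicate pinned FQN in payload: {fqn}")
        seen.add(fqn)
    return list(pin_fqns)


def allocate_pinned(pin_fqns):
    """Layer 3 (allocation): assert each FQN is inserted exactly once."""
    pinned = {}
    for fqn in pin_fqns:
        assert fqn not in pinned, f"pinned FQN inserted twice: {fqn}"
        pinned[fqn] = f"dev:{fqn}"  # placeholder for a device pointer
    return pinned
```

A list that went through layer 1 always passes layers 2 and 3; the later layers only fire for artifacts that bypassed the pass.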
Stream/release-threshold contract: the pool's release threshold
is set to ``streaming_budget_bytes_`` (not total) so that
requested live offload bytes are capped at
``pinned + streaming = total``. The threshold is SOFT — driver-
reserved / pool cache memory may briefly exceed that — but the
accounting invariant ``peak_live_bytes <= streaming_budget``
stays self-consistent.
Refactoring + cleanup:
* Extracted ``Session::wrap_borrowed_tensor`` (used in 4
sites now: hit, miss, pinned, and the prefetched-then-
consumed hit path).
* Renamed ``budget_bytes_`` to ``total_budget_bytes_`` for
clarity; existing accessor renamed to
``total_budget_bytes()`` with new
``pinned_bytes_total()`` and ``streaming_budget_bytes()``
siblings.
* Stats log extended with ``pinned_bytes=Pn
streaming_budget=Sb`` fields. ``_STATS_RE`` in
test_weight_offload_pool.py extends to capture both.
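For readers unfamiliar with the stats-log parsing, a pattern capturing the two new fields might look like the sketch below. The sample log line and the name ``STATS_FIELDS_RE`` are assumptions for illustration; the actual ``_STATS_RE`` and the full log format are not shown here.

```python
import re

# Hypothetical pattern: matches only the two new fields, assuming they are
# emitted as decimal byte counts separated by whitespace.
STATS_FIELDS_RE = re.compile(
    r"pinned_bytes=(?P<pinned_bytes>\d+)\s+"
    r"streaming_budget=(?P<streaming_budget>\d+)"
)


def parse_pinning_stats(line):
    """Return the two fields as ints, or None if the line has no such fields."""
    match = STATS_FIELDS_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}
```

A test can then compare the parsed values against the expected split, for instance that the pinned and streaming figures add up to the configured total.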
Tests:
* Existing 6 pool tests + 9 transport tests + 4 catalog tests
still pass; their assertions for "pin_fqns hard-fails"
flip to "pin_fqns now succeeds".
* NEW ``test_pinning_default_budget_covers_pinned`` (pool):
no explicit budget + non-empty pin_fqns succeeds; stats show
``streaming + pinned == total`` and ``streaming >= floor``.
Validates the v3 default-budget-with-pins fix.
* NEW ``test_pinning_pinned_fqn_resident_no_streaming_h2d``
(pool): pin w1, run, assert ``bytes_h2d == 16384`` (w2 only),
``pinned_bytes == 16384``. Confirms pinned allocations
bypass the streaming pool entirely.
* NEW ``test_pinning_pinned_then_streaming_still_prefetches``
(pool): asserts ``prefetch_attempted >= 1`` even when one
of the two probes hits the pinned fast path — proves the
pinned fast path still calls opportunistic_prefetch.
* ``test_runtime_rejects_nonempty_pin_fqns`` → renamed
``test_runtime_accepts_nonempty_pin_fqns`` and updated to
assert the success summary is emitted.
* ``test_hard_fails_when_pin_fqns_set`` (catalog) removed;
coverage moved to the transport + pool side.
* ``test_pinning_below_floor_with_pinned_hard_fails`` is
deferred to 9b — needs a way to inject a sub-required
budget (the ``--cuda_runtime_spec`` runner flag or a C++
Module harness, both landing in 9b).
Banner updates:
* session.h: "POOL+LRU+DUMMIES+PREFETCH WIRED" entry under
"Resolved" gains a "Pinning (commit 9a)" subsection. The
deferred-items list drops the pinning bullet and adds the
9b public-knob bullet.
* weight_offload.h: "Resolved in commit 9a" block added
spelling out the pinning contract + the streaming-only
release-threshold rationale.
* weight_offload_pass.py: docstring flipped from "pinning is
hard-failed" to the new commit-9a behavior.
1 parent e35ef0c commit 29a8be9
8 files changed
Lines changed: 599 additions & 160 deletions
File tree
- backends/cuda
- passes
- runtime
- weight_offload
- tests