Add depth-1 opportunistic prefetch to the offload Session
EXPERIMENTAL. After every successful Session::serve(probe_id),
opportunistic_prefetch() issues an asynchronous H2D for
schedule_[(probe_id + 1) % size()]. With wraparound: the last
probe of a forward pass prefetches probe_id 0 of the next pass
— free warmup for autoregressive decode, one wasted prefetch
for one-shot inference.
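
The wraparound target selection can be sketched as below. This is a
minimal illustration, not the Session code: next_prefetch_index is a
hypothetical helper, and schedule_size stands in for size().

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of the commit): index of the schedule
// entry to prefetch after serving probe_id. Mirrors the expression
// schedule_[(probe_id + 1) % size()] from the description above.
std::size_t next_prefetch_index(std::size_t probe_id,
                                std::size_t schedule_size) {
    assert(schedule_size > 0 && probe_id < schedule_size);
    return (probe_id + 1) % schedule_size;
}
```

With a 3-entry schedule, probes 0 and 1 target entries 1 and 2, and
probe 2 wraps to entry 0. With a 1-entry schedule the target is the
probe that was just served, which is the already-live skip case.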
Per the v3 RFC microbenchmarks (Phase 4) and design header
choice 4, depth-1 is the depth that current measurements
support in both regimes. When compute hides PCIe, one-ahead
saturates the overlap budget; when PCIe dominates, the copy
stream is already serializing copies back-to-back, so deeper
queueing doesn't change throughput. Re-measure if the hardware
or workload shifts meaningfully.
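
The two-regime argument can be illustrated with a toy timing model.
Everything here is an assumption made for illustration: total_time is
a hypothetical function, copies and kernels have fixed unit costs, the
copy stream is strictly FIFO, and memory budget, eviction, and
wraparound are ignored. It is not the RFC microbenchmark and does not
replace measurement.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: time to finish n probes when each H2D copy takes `copy`
// units and each kernel takes `compute` units. The copy for probe i is
// issued at the serve of probe max(0, i - depth); depth == 0 means the
// copy is issued only when the probe itself is served (no prefetch).
long total_time(std::size_t n, long compute, long copy, std::size_t depth) {
    assert(n > 0);
    std::vector<long> serve(n), compute_done(n);
    long copy_done_prev = 0;  // FIFO copy stream: copies never overlap
    for (std::size_t i = 0; i < n; ++i) {
        // Probe i is served when the previous kernel has finished.
        serve[i] = (i == 0) ? 0 : compute_done[i - 1];
        const std::size_t issuer = (i > depth) ? i - depth : 0;
        const long issue = serve[issuer];
        const long copy_done = std::max(copy_done_prev, issue) + copy;
        // The kernel waits for both its serve and its weight copy.
        const long start = std::max(serve[i], copy_done);
        compute_done[i] = start + compute;
        copy_done_prev = copy_done;
    }
    return compute_done[n - 1];
}
```

In this model, with 4 probes and (compute, copy) = (3, 2) or (2, 3),
depth 0 takes 20 units while depth 1 and depth 2 both take 14, which
matches the shape of the claim that queueing deeper than one-ahead
does not change throughput in either regime.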
Pieces:
* SessionStats grows prefetch_attempted / prefetch_succeeded
counters. ``attempted`` bumps BEFORE the H2D is issued;
``succeeded`` bumps AFTER cudaMemcpyAsync is queued on
copy_stream_. ``attempted - succeeded`` = swallowed errors.
Stats log line extends with both fields; the
``_STATS_RE`` regex in test_weight_offload_pool.py captures
them.
* Session::opportunistic_prefetch() is the new private member.
It skips immediately if the target is already live (the
same-FQN case, including 1-FQN-schedule wraparound). A
defensive "never evict current_fqn" guard catches the narrow
case where pick_lru would target the single just-served FQN.
The guard protects only that case, NOT the general
below-floor scenario: a fused kernel with probes for A and B
sharing one launch could still have A evicted by a prefetch
after B if the floor invariant were violated. The floor
hard-fail at init remains the real general safety contract.
* Stream-ordering invariant extended: every
cudaFreeAsync(e.dev_ptr, compute_stream_) is now preceded
by cudaStreamWaitEvent(compute_stream_, e.ready_event, 0).
This was implicit pre-commit-8 because every live entry's
ready_event had been waited on by the prior serve's hit
path — commit 8 introduces prefetched entries whose
ready_event is NOT waited on until the NEXT serve consumes
them, so the wait must be made explicit. Applied to three
sites: prefetch eviction (new), miss-path eviction
(retrofit), and ~Session()'s live-cleanup loop (retrofit).
* Post-eviction event-batch failure path falls back to
cudaStreamSynchronize(compute_stream_). When the batch's
cudaEventCreate / cudaEventRecord /
cudaStreamWaitEvent(copy_stream_, evict_done) fails AFTER
live_/bytes_in_flight_ have been mutated to reflect the
evictions, returning Error::Internal alone would leave the
Session in a state where a subsequent
cudaMallocFromPoolAsync on copy_stream_ races the pending
cudaFreeAsyncs on compute_stream_. The sync guarantees the
frees physically complete before return. Cheap insurance
for a rare error path; applied to both miss-path and
prefetch-path eviction batches.
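
The counter semantics, the already-live skip, and the "never evict
current_fqn" guard can be sketched with a host-only mock. This is a
sketch under stated assumptions, not the Session implementation:
MockSession, MockStats, is_live, and the h2d_ok flag are hypothetical;
capacity is counted in entries rather than bytes; and the mock assumes
that both skips happen before ``attempted`` is bumped, which the
description above does not specify. No CUDA calls are made.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct MockStats {
    std::uint64_t prefetch_attempted = 0;
    std::uint64_t prefetch_succeeded = 0;
};

// Hypothetical host-only stand-in for the Session.
class MockSession {
public:
    MockSession(std::vector<std::string> schedule, std::size_t capacity)
        : schedule_(std::move(schedule)), capacity_(capacity) {
        assert(!schedule_.empty() && capacity_ > 0);
    }

    // Marks the probed FQN live and most recently used. The mock does
    // not model miss-path eviction.
    void serve(std::size_t probe_id) {
        assert(probe_id < schedule_.size());
        live_[schedule_[probe_id]] = ++tick_;
    }

    // h2d_ok simulates whether the async H2D was queued successfully.
    void opportunistic_prefetch(std::size_t probe_id, bool h2d_ok) {
        assert(probe_id < schedule_.size());
        const std::string& current = schedule_[probe_id];
        const std::string& target =
            schedule_[(probe_id + 1) % schedule_.size()];
        if (live_.count(target) != 0) return;  // already live: skip
        if (live_.size() >= capacity_) {
            const std::string victim = pick_lru();
            if (victim == current) return;     // never evict current_fqn
            live_.erase(victim);
        }
        ++stats_.prefetch_attempted;           // before the H2D is issued
        if (!h2d_ok) return;                   // error swallowed
        live_[target] = ++tick_;
        ++stats_.prefetch_succeeded;           // after the copy is queued
    }

    bool is_live(const std::string& fqn) const {
        return live_.count(fqn) != 0;
    }
    const MockStats& stats() const { return stats_; }

private:
    // Least recently used live FQN; only called when live_ is non-empty.
    std::string pick_lru() const {
        auto best = live_.begin();
        for (auto it = live_.begin(); it != live_.end(); ++it) {
            if (it->second < best->second) best = it;
        }
        return best->first;
    }

    std::vector<std::string> schedule_;
    std::size_t capacity_;
    std::unordered_map<std::string, std::uint64_t> live_;  // FQN -> tick
    std::uint64_t tick_ = 0;
    MockStats stats_;
};
```

In the mock, a failed issue leaves ``attempted - succeeded`` at 1, and
a 1-entry capacity with a 2-FQN schedule exercises the guard: the only
eviction candidate is the FQN that was just served, so the prefetch is
dropped and that FQN stays live.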
Banner flips from "POOL+LRU+DUMMIES WIRED" to
"POOL+LRU+DUMMIES+PREFETCH WIRED" in both session.h and
weight_offload.h. The "Depth-1 prefetch" entry moves from
"NOT YET WIRED" to the resolved list.
Tests:
* Existing 5 pool tests still pass.
* NEW ``test_prefetch_converts_second_probe_to_pool_hit``:
on _TwoWeightModel (2 distinct probed FQNs) under a budget
that comfortably fits 2+ weights, asserts pool_misses ==
1 (just the first cold weight) and prefetch_succeeded >= 1
— proving the second probe hit because the prior serve
prefetched it. Test name and docstring make explicit that
"pool hit" doesn't mean "no stall": the hit path still
does cudaStreamWaitEvent on the ready_event, so the
consuming kernel can stall briefly if the prefetch H2D
hasn't finished. A true no-stall assertion needs
wall-clock measurement (separate workstream).