
Commit 922adad

gasoonjia authored
Per weight constant cache (#18901)
**Problem**: Multi-method AOTI models (e.g., Qwen3.5 MoE with separate prefill/decode methods) load the full weight blob independently for each method, even when they share identical weights. This causes duplicate GPU allocations: Qwen3.5 MoE peaked at ~35 GB, making it impossible to run on a single 24 GB GPU (e.g., a 4090).

**Solution**: Introduce a per-weight FQN-keyed constant cache in `CudaBackend`. The first method loads its constants from the blob and caches them. Subsequent methods with matching FQNs skip blob loading entirely and reuse the cached GPU tensors via `update_user_managed_constant_buffer_pairs`. A legacy fallback path is preserved for older AOTI models without constant management APIs.

**Results**: Peak GPU memory: 35.4 GB → 17.6 GB (-50%)

Co-authored-by: gasoonjia <gasoonjia@fb.com>
1 parent 62d8f21 commit 922adad

5 files changed: 428 additions & 163 deletions


.ci/scripts/test_model_e2e.sh

Lines changed: 21 additions & 0 deletions
@@ -397,6 +397,27 @@ if [ -n "$EXPECTED_OUTPUT" ]; then
 else
   echo "SUCCESS: Runner completed successfully"
 fi
+
+# Validate GPU peak memory usage for models with known memory budgets.
+# The runner prints "GPU peak memory usage: XXXX.X MiB" at the end.
+case "$MODEL_NAME" in
+  qwen3_5_moe)
+    MAX_MEMORY_MIB=20480 # 20 GB — must fit on a single GPU (e.g. 4090)
+    PEAK_MEM=$(echo "$OUTPUT" | grep -oP 'GPU peak memory usage: \K[0-9.]+' || true)
+    if [ -n "$PEAK_MEM" ]; then
+      # Compare as integers (truncate decimals)
+      PEAK_MEM_INT=${PEAK_MEM%%.*}
+      if [ "$PEAK_MEM_INT" -gt "$MAX_MEMORY_MIB" ]; then
+        echo "FAIL: GPU peak memory ${PEAK_MEM} MiB exceeds budget ${MAX_MEMORY_MIB} MiB"
+        exit 1
+      else
+        echo "Success: GPU peak memory ${PEAK_MEM} MiB within budget (max ${MAX_MEMORY_MIB} MiB)"
+      fi
+    else
+      echo "WARNING: GPU peak memory usage not found in output"
+    fi
+    ;;
+esac
 echo "::endgroup::"

 popd
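The budget check in the CI script above can be exercised standalone. This sketch runs the same extraction and comparison against a hard-coded sample line; `OUTPUT` and the 20480 MiB budget are stand-ins here, and `grep -oP` (the `\K` reset) requires GNU grep:

```shell
OUTPUT="GPU peak memory usage: 17612.3 MiB"  # sample runner output line
MAX_MEMORY_MIB=20480

# Extract the numeric value after the fixed prefix (GNU grep -P).
PEAK_MEM=$(echo "$OUTPUT" | grep -oP 'GPU peak memory usage: \K[0-9.]+' || true)
# Truncate decimals so the shell can compare as integers.
PEAK_MEM_INT=${PEAK_MEM%%.*}

if [ "$PEAK_MEM_INT" -gt "$MAX_MEMORY_MIB" ]; then
  echo "FAIL: ${PEAK_MEM} MiB exceeds ${MAX_MEMORY_MIB} MiB"
else
  echo "OK: ${PEAK_MEM} MiB <= ${MAX_MEMORY_MIB} MiB"
fi
```

For the sample value this prints `OK: 17612.3 MiB <= 20480 MiB`, matching the post-fix peak of 17.6 GB reported in the commit message.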

backends/aoti/aoti_backend.py

Lines changed: 0 additions & 11 deletions
@@ -25,7 +25,6 @@

 class COMPILE_SPEC_KEYS(Enum):
     METHOD_NAME = "method_name"
-    SHARE_KV_CACHE_ACROSS_METHODS = "share_kv_cache_across_methods"


 @experimental(
@@ -287,13 +286,3 @@ def method_name_from_compile_specs(
         raise RuntimeError(
             f"Could not find method name in compile specs: {compile_specs}"
         )
-
-    @classmethod
-    def generate_share_kv_cache_compile_spec(cls) -> CompileSpec:
-        """
-        Generate a CompileSpec to enable cross-method KV cache sharing.
-        """
-        return CompileSpec(
-            COMPILE_SPEC_KEYS.SHARE_KV_CACHE_ACROSS_METHODS.value,
-            bytes([1]),
-        )
