
Commit 8575a7d

gtaCopilot authored and committed
intel_gpu: fix unnecessary tmp_out buffer per-layer in paged_attention
paged_attention_opt__multi_tokens allocates a tmp_out scratch buffer sized total_tokens * heads_num * v_head_size * num_of_partitions * sizeof(float). For Qwen3-30B with chunk_size=4096 and an 8K KV context, this is 2 GB per layer. With 48 layers all executing sequentially, this totalled 96 GB of demand-paged USM device allocation. On Intel iGPU (ARLS, i915 driver), the driver pins the entire allocation as Unevictable on first GPU access regardless of pages touched, causing CL_OUT_OF_RESOURCES on a 31 GB machine.

Root cause: can_share_internal_buffer(false) in paged_attention_node unconditionally blocked the memory pool for ALL internal buffers. This was added in PR openvinotoolkit#33204 to prevent CPU/GPU races on lockable buffers (blocks_indexes_start/end, blocked_gws_subseq_mapping) written by prepare_internal_buffers(). However, it also blocked pool reuse for non-lockable GPU-only buffers (exp_sums, max_logits, tmp_out), which are safe to share across sequential layers.

Fix:
- Remove can_share_internal_buffer(false) from paged_attention_node; per-buffer lockability is already tracked via BufferDescriptor::m_lockable, so CPU-written (lockable=true, usm_host) buffers remain non-shareable while GPU-only (lockable=false, usm_device) buffers can be reused from the pool.
- In allocate_internal_buffers(): pass buffer_descs[i].m_lockable to the allocate_internal_buffer() call (previously dropped, causing the wrong allocation type on initial allocation).

Result: 48 layers share one 2 GB tmp_out buffer instead of allocating 48 separate 2 GB buffers. Peak Unevictable drops from an OOM crash (~28+ GB) to ~18.9 GB on ARLS (Intel Arc 8086:7d67, Arrow Lake-S iGPU, 31 GB).

Verified: Qwen3-30B-A3B-Instruct-2507-int4-ov with chunk_size=4096, an 8K prompt, and ContinuousBatching on ARLS completes successfully with exit code 0 and 20 coherent output tokens. Not affected on ARLH (supports_immad=true takes the micro_sdpa path, which does not allocate tmp_out at all).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 8a030fd commit 8575a7d

2 files changed

Lines changed: 5 additions & 3 deletions


src/plugins/intel_gpu/src/graph/paged_attention.cpp

Lines changed: 3 additions & 1 deletion
@@ -13,7 +13,9 @@ GPU_DEFINE_PRIMITIVE_TYPE_ID(paged_attention)
 
 paged_attention_node::typed_program_node(const std::shared_ptr<paged_attention> prim, program& prog)
     : parent(prim, prog) {
-    can_share_internal_buffer(false);
+    // Internal buffer sharing is controlled per-buffer via m_lockable in allocate_internal_buffer():
+    // - lockable (CPU-written) buffers: not shared (avoids CPU/GPU race in prepare_internal_buffers)
+    // - non-lockable (GPU-only) buffers: reused from pool (e.g. 2GB tmp_out shared across layers)
 }
 
 layout paged_attention_inst::calc_output_layout(const paged_attention_node& /*node*/, kernel_impl_params const& impl_param) {

src/plugins/intel_gpu/src/graph/primitive_inst.cpp

Lines changed: 2 additions & 2 deletions
@@ -2492,7 +2492,7 @@ memory::ptr primitive_inst::allocate_internal_buffer(const layout& layout, size_
         *_node,
         layout,
         alloc_type,
-        can_share_internal_buffer(),
+        !lockable && can_share_internal_buffer(),
         _runtime_memory_dependencies,
         reset,
         _intermediates_memory.size() > idx ? _intermediates_memory[idx].get() : nullptr);
@@ -2513,7 +2513,7 @@ void primitive_inst::allocate_internal_buffers(bool reset) {
     for (size_t i = 0; i < buffer_descs.size(); ++i) {
         if (buffer_descs[i].m_layout.get_linear_size() == 0)
             continue;
-        intermediates_memory.push_back(allocate_internal_buffer(buffer_descs[i].m_layout, i, reset));
+        intermediates_memory.push_back(allocate_internal_buffer(buffer_descs[i].m_layout, i, reset, buffer_descs[i].m_lockable));
         _max_intermediates_memory_sizes.push_back(intermediates_memory[i]->size());
     }
     _intermediates_memory = intermediates_memory;
