
Commit 8575a7d

gtaCopilot authored and committed
intel_gpu: fix unnecessary tmp_out buffer per-layer in paged_attention
paged_attention_opt__multi_tokens allocates a tmp_out scratch buffer sized total_tokens * heads_num * v_head_size * num_of_partitions * sizeof(float). For Qwen3-30B with chunk_size=4096 and an 8K KV context, this is 2 GB per layer. With 48 layers all executing sequentially, this totalled 96 GB of demand-paged USM device allocation. On Intel iGPU (ARLS, i915 driver), the driver pins the entire allocation as Unevictable on first GPU access regardless of pages touched, causing CL_OUT_OF_RESOURCES on a 31 GB machine.

Root cause: can_share_internal_buffer(false) in paged_attention_node unconditionally blocked the memory pool for ALL internal buffers. This was added in PR openvinotoolkit#33204 to prevent CPU/GPU races on lockable buffers (blocks_indexes_start/end, blocked_gws_subseq_mapping) written by prepare_internal_buffers(). However, it also blocked pool reuse for non-lockable GPU-only buffers (exp_sums, max_logits, tmp_out), which are safe to share across sequential layers.

Fix:
- Remove can_share_internal_buffer(false) from paged_attention_node; per-buffer lockability is already tracked via BufferDescriptor::m_lockable, so CPU-written (lockable=true, usm_host) buffers remain non-shareable while GPU-only (lockable=false, usm_device) buffers can be reused from the pool.
- In allocate_internal_buffers(): pass buffer_descs[i].m_lockable to the allocate_internal_buffer() call (previously dropped, causing the wrong allocation type on initial allocation).

Result: 48 layers share one 2 GB tmp_out buffer instead of allocating 48 separate 2 GB buffers. Peak Unevictable drops from an OOM crash (~28+ GB) to ~18.9 GB on ARLS (Intel Arc 8086:7d67, Arrow Lake-S iGPU, 31 GB).

Verified: Qwen3-30B-A3B-Instruct-2507-int4-ov with chunk_size=4096, an 8K prompt, and ContinuousBatching on ARLS completes successfully with exit code 0 and 20 coherent output tokens. Not affected on ARLH (supports_immad=true takes the micro_sdpa path, which does not allocate tmp_out at all).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 8a030fd commit 8575a7d

2 files changed

Lines changed: 5 additions & 3 deletions


src/plugins/intel_gpu/src/graph/paged_attention.cpp

Lines changed: 3 additions & 1 deletion
@@ -13,7 +13,9 @@ GPU_DEFINE_PRIMITIVE_TYPE_ID(paged_attention)
 
 paged_attention_node::typed_program_node(const std::shared_ptr<paged_attention> prim, program& prog)
     : parent(prim, prog) {
-    can_share_internal_buffer(false);
+    // Internal buffer sharing is controlled per-buffer via m_lockable in allocate_internal_buffer():
+    // - lockable (CPU-written) buffers: not shared (avoids CPU/GPU race in prepare_internal_buffers)
+    // - non-lockable (GPU-only) buffers: reused from pool (e.g. 2GB tmp_out shared across layers)
 }
 
 layout paged_attention_inst::calc_output_layout(const paged_attention_node& /*node*/, kernel_impl_params const& impl_param) {

src/plugins/intel_gpu/src/graph/primitive_inst.cpp

Lines changed: 2 additions & 2 deletions
@@ -2492,7 +2492,7 @@ memory::ptr primitive_inst::allocate_internal_buffer(const layout& layout, size_
         *_node,
         layout,
         alloc_type,
-        can_share_internal_buffer(),
+        !lockable && can_share_internal_buffer(),
         _runtime_memory_dependencies,
         reset,
         _intermediates_memory.size() > idx ? _intermediates_memory[idx].get() : nullptr);
@@ -2513,7 +2513,7 @@ void primitive_inst::allocate_internal_buffers(bool reset) {
     for (size_t i = 0; i < buffer_descs.size(); ++i) {
         if (buffer_descs[i].m_layout.get_linear_size() == 0)
             continue;
-        intermediates_memory.push_back(allocate_internal_buffer(buffer_descs[i].m_layout, i, reset));
+        intermediates_memory.push_back(allocate_internal_buffer(buffer_descs[i].m_layout, i, reset, buffer_descs[i].m_lockable));
         _max_intermediates_memory_sizes.push_back(intermediates_memory[i]->size());
     }
     _intermediates_memory = intermediates_memory;
