
[GPU] Alter reference NonZero implementation to parallelize counting#35157

Closed
mdvoretc-intel wants to merge 4 commits into openvinotoolkit:master from mdvoretc-intel:non_zero_emergency

Conversation

@mdvoretc-intel
Contributor

Details:

  • This change alters the reference implementation of the NonZero operation to allow the gather_nonzero operation to utilize multiple hardware threads.
  • The reference implementation is altered (rather than fully refactored) because a defined API already exists between count_nonzero and gather_nonzero; ideally that API would be refactored to be internal to separate multi-kernel implementations.
  • Said refactor is not implemented at this time due to request urgency.
  • Separate tests for gather_nonzero are deleted due to the change in the API that is internal to the node.

Tickets:

AI Assistance:

  • AI assistance used: no

@github-actions github-actions Bot added the category: GPU OpenVINO GPU plugin label Apr 3, 2026
@sys-openvino-ci sys-openvino-ci added the ExternalIntelPR External contributor from Intel label Apr 3, 2026
@mdvoretc-intel mdvoretc-intel marked this pull request as ready for review April 7, 2026 09:45
@mdvoretc-intel mdvoretc-intel requested review from a team as code owners April 7, 2026 09:45
@mdvoretc-intel
Contributor Author

@isanghao please review.

Contributor

Copilot AI left a comment

Pull request overview

This PR updates the Intel GPU plugin’s reference NonZero implementation so gather_nonzero can use multiple work-items (within a single workgroup) by changing the internal contract between count_nonzero and gather_nonzero to use per-work-item prefix offsets.

Changes:

  • Change count_nonzero_ref to write work-item prefix offsets (and total count) into its output buffer.
  • Update gather_nonzero_ref dispatch and kernel logic to partition the input across work-items and use the prefix offsets for deterministic output writes.
  • Update count_nonzero output layout to a fixed 1024-element buffer and remove dedicated gather_nonzero unit tests.
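The count→gather contract described above can be modeled on the host side. This is a minimal sketch, not the actual OpenCL kernels; the function name gather_nonzero_model and the flat-index output are illustrative assumptions. Each simulated work-item counts nonzeros in its partition, an exclusive scan turns the counts into prefix offsets (the values count_nonzero would store in its output buffer), and the gather phase writes each partition's indices starting at its own offset, which makes the output order deterministic without atomics.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the count->gather prefix-offset contract.
std::vector<std::size_t> gather_nonzero_model(const std::vector<float>& input,
                                              std::size_t num_work_items) {
    std::size_t n = input.size();
    std::size_t chunk = (n + num_work_items - 1) / num_work_items;

    // Count phase: per-work-item nonzero counts over disjoint partitions.
    std::vector<std::size_t> counts(num_work_items, 0);
    for (std::size_t wi = 0; wi < num_work_items; ++wi)
        for (std::size_t i = wi * chunk; i < std::min(n, (wi + 1) * chunk); ++i)
            if (input[i] != 0.0f) ++counts[wi];

    // Exclusive scan -> prefix offsets (what count_nonzero would output).
    std::vector<std::size_t> offsets(num_work_items, 0);
    for (std::size_t wi = 1; wi < num_work_items; ++wi)
        offsets[wi] = offsets[wi - 1] + counts[wi - 1];

    // Gather phase: each work-item writes from its own offset, no atomics.
    std::vector<std::size_t> out(offsets.back() + counts.back());
    for (std::size_t wi = 0; wi < num_work_items; ++wi) {
        std::size_t pos = offsets[wi];
        for (std::size_t i = wi * chunk; i < std::min(n, (wi + 1) * chunk); ++i)
            if (input[i] != 0.0f) out[pos++] = i;
    }
    return out;
}
```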

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file

  • src/plugins/intel_gpu/tests/unit/test_cases/non_zero_gpu_test.cpp: Removes standalone gather_nonzero unit tests previously validating gather behavior.
  • src/plugins/intel_gpu/src/kernel_selector/kernels/non_zero/gather_nonzero_kernel_ref.cpp: Updates dispatch (gws/lws) to enable multi-work-item execution (capped at 1024).
  • src/plugins/intel_gpu/src/kernel_selector/kernels/non_zero/count_nonzero_kernel_ref.cpp: Caps workgroup width to 1024 to match the new count→gather contract.
  • src/plugins/intel_gpu/src/kernel_selector/cl_kernels/gather_nonzero_ref.cl: Implements parallel input partitioning and uses prefix offsets for output placement.
  • src/plugins/intel_gpu/src/kernel_selector/cl_kernels/count_nonzero_ref.cl: Computes per-work-item prefix offsets via work-group scan and stores them in output.
  • src/plugins/intel_gpu/src/graph/non_zero.cpp: Changes count_nonzero output layout from scalar to a fixed 1024-element buffer.

Comment thread src/plugins/intel_gpu/src/kernel_selector/cl_kernels/gather_nonzero_ref.cl Outdated
Comment on lines 159 to 163
}
}

template<typename T>
void test_gather_non_zero(layout in_layout, std::vector<T> in_data) {
    auto& engine = get_test_engine();
    auto input_mem = engine.allocate_memory(in_layout);
    auto count_non_zero = ov::reference::non_zero_get_count<T>(in_data.data(), in_layout.get_shape());
    auto in_rank = in_layout.get_shape().size();
    std::vector<int32_t> expected_results(count_non_zero * in_rank);
    ov::reference::non_zero<T, int32_t>(in_data.data(), expected_results.data(), in_layout.get_shape());

    auto output_shape_layout = layout{ov::PartialShape{1}, data_types::i32, format::bfyx};
    auto output_shape_mem = engine.allocate_memory(output_shape_layout);
    set_values(input_mem, in_data);

    std::vector<int32_t> output_shape_data = {(int32_t)count_non_zero};

    set_values(output_shape_mem, output_shape_data);

    topology topology;
    topology.add(input_layout("InputData", in_layout));
    topology.add(data("OutputShape", output_shape_mem));
    topology.add(
        gather_nonzero("gather_nonzero", input_info("InputData"), input_info("OutputShape"))
    );

    network network(engine, topology, get_test_default_config(engine));

    network.set_input_data("InputData", input_mem);
    auto outputs = network.execute();
    auto output = outputs.at("gather_nonzero").get_memory();
    cldnn::mem_lock<int32_t> output_ptr(output, get_test_stream());
    cldnn::mem_lock<int32_t> shape_ptr(output_shape_mem, get_test_stream());

    for (size_t i = 0; i < expected_results.size(); ++i) {
        ASSERT_EQ(expected_results[i], output_ptr[i]);
    }
}

TEST(test_gather_non_zero, 4d_fp32_1_3_3_1) {
std::vector<float> in_data = {
0.1f, 0.2f, 0.3f, 0.0f,
0.0f, 0.4f, 0.1f, 0.9f, 0.10f
};
test_gather_non_zero<float>(layout{ov::PartialShape{1, 3, 3, 1}, data_types::f32, format::bfyx}, in_data);
}

TEST(test_gather_non_zero, 4d_fp32_2_4_3_2) {
std::vector<float> in_data = {
0.1f, 0.2f, 0.3f, 0.0f, 12.0f, 2.0f, 0.4f, 0.1f,
1.9f, 0.10f, 1.0f, 0.0f, 0.1f, 0.2f, 0.0f, 100.0f,
0.0001f, 0.0f, 2.9f, 0.2f, 4.0f, 0.0f, 9.1f, 0.9f,
100.0f, 0.4f, 0.1f, 0.3f, 0.0f, 24.2f, 1.23f, 0.0f,
4.0f, 0.0f, 3.1f, 0.9f, 0.10f, 49.2f, 0.0f, 0.3f,
100.0f, 0.4f, 0.1f, 0.9f, 0.1f, 33.12f, 12.1f, 0.0001f
};
test_gather_non_zero<float>(layout{ov::PartialShape{2, 4, 3, 2}, data_types::f32, format::bfyx}, in_data);
}
TEST(test_gather_non_zero, 4d_fp16_2_4_3_2) {
std::vector<ov::float16> in_data = {
0.1f, 0.2f, 0.3f, 0.0f, 12.0f, 2.0f, 0.4f, 0.1f,
1.9f, 0.10f, 1.0f, 0.0f, 0.1f, 0.2f, 0.0f, 100.0f,
0.0001f, 0.0f, 2.9f, 0.2f, 4.0f, 0.0f, 9.1f, 0.9f,
100.0f, 0.4f, 0.1f, 0.3f, 0.0f, 24.2f, 1.23f, 0.0f,
4.0f, 0.0f, 3.1f, 0.9f, 0.10f, 49.2f, 0.0f, 0.3f,
100.0f, 0.4f, 0.1f, 0.9f, 0.1f, 33.12f, 12.1f, 0.0001f
};
test_gather_non_zero<ov::float16>(layout{ov::PartialShape{2, 4, 3, 2}, data_types::f16, format::bfyx}, in_data);
}

TEST(test_gather_non_zero, 5d_fp32_1_3_3_2_2) {
std::vector<float> in_data = {
0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
0.0f, 0.0f, 0.1f, 0.9f, 0.10f, 0.001f,
8.0f, 3.0f, 0.1f, 0.00001f, 0.10f, 0.001f,
0.1f, -0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
0.0f, 0.0f, 0.1f, 0.9f, 0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
8.0f, 3.0f, 0.1f, 0.00001f, 0.10f, 0.001f,
0.1f, -0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
};
test_gather_non_zero<float>(layout{ov::PartialShape{1, 3, 4, 2, 2}, data_types::f32, format::bfzyx}, in_data);
}

TEST(test_gather_non_zero, 6d_fp16_2_3_1_3_2_4) {
std::vector<float> in_data = {
0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
1.0f, 0.0f, 0.1f, 0.9f, 0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
19.0f, 0.0f, 0.1f, 0.9f, 0.10f, -0.001f,
0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
8.0f, 3.0f, 0.1f, 0.00001f, 0.10f, 0.001f,
0.1f, -0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
13.0f, 1.0f, 0.1f, 0.9f, 0.10f, 0.001f,
11.1f, 0.2f, 0.3f, 66.0f, 12.1f, 11.1f,
0.0f, 0.0001f, 0.1f, 0.9f, 0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 2.0f, 12.1f, 11.1f,
0.0f, 0.0f, 0.1f, 0.9f, 0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
-13.0f, 1.0f, 0.1f, 0.9f, 0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 66.0f, 12.1f, 11.1f,
0.0f, 0.001f, 0.1f, 0.9f, 0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 2.0f, 12.1f, 11.1f,
0.1f, 1.2f, 0.3f, 99.0f, 12.1f, 11.1f,
100.0f, 0.0f, 0.1f, 0.9f, 0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
13.0f, 1.0f, 0.1f, 0.9f, -0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 66.0f, 12.1f, 11.1f,
0.0f, 0.0001f, 0.1f, 0.9f, 0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 2.0f, 12.1f, 11.1f,
};
test_gather_non_zero<float>(layout{ov::PartialShape{2, 3, 1, 3, 2, 4}, data_types::f32, format::bfwzyx}, in_data);
}

TEST(non_zero_gpu, dynamic) {
auto& engine = get_test_engine();

Copilot AI Apr 8, 2026

[HIGH] The dedicated gather_nonzero unit tests were removed, but this PR changes the internal count→gather contract and introduces a new work partitioning/prefix-sum scheme (including the 1024 work-item cap). Existing non_zero tests here are relatively small and don’t exercise the >1024-element path or the non-local-memory path that used to be covered. Please add a regression test that validates NonZero correctness for an input size >1024 (and ideally another case that forces USE_LOCAL_MEM off) to prevent silent wrong-index regressions.
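The suggested regression check can be sketched in a self-contained form: compare a partitioned gather, capped at 1024 work-items as in this PR, against a plain sequential NonZero on an input larger than 1024 elements. The partitioning below is a host-side model with hypothetical names, not the repo's kernels or test helpers.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sequential reference: flat indices of all nonzero elements, in order.
std::vector<std::size_t> nonzero_sequential(const std::vector<float>& in) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < in.size(); ++i)
        if (in[i] != 0.0f) out.push_back(i);
    return out;
}

// Host-side model of the partitioned path: count per partition, scan into
// prefix offsets, then gather each partition's indices from its own offset.
std::vector<std::size_t> nonzero_partitioned(const std::vector<float>& in) {
    std::size_t wi_count = std::min<std::size_t>(in.size(), 1024);  // the PR's cap
    std::size_t chunk = (in.size() + wi_count - 1) / wi_count;
    std::vector<std::size_t> offsets(wi_count + 1, 0);
    for (std::size_t wi = 0; wi < wi_count; ++wi) {
        std::size_t c = 0;
        for (std::size_t i = wi * chunk; i < std::min(in.size(), (wi + 1) * chunk); ++i)
            if (in[i] != 0.0f) ++c;
        offsets[wi + 1] = offsets[wi] + c;
    }
    std::vector<std::size_t> out(offsets.back());
    for (std::size_t wi = 0; wi < wi_count; ++wi) {
        std::size_t pos = offsets[wi];
        for (std::size_t i = wi * chunk; i < std::min(in.size(), (wi + 1) * chunk); ++i)
            if (in[i] != 0.0f) out[pos++] = i;
    }
    return out;
}
```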

Comment on lines 19 to 30
layout count_nonzero_inst::calc_output_layout(count_nonzero_node const& node, kernel_impl_params const& impl_param) {
    assert(static_cast<bool>(node.get_primitive()->output_data_types[0]) == false &&
           "Output data type forcing is not supported for count_nonzero_node!");
-   return layout{cldnn::data_types::i32, cldnn::format::bfyx, tensor{1, 1, 1, 1}};
+   return layout{cldnn::data_types::i32, cldnn::format::bfyx, tensor{1, 1, 1, 1024}};
}

template<typename ShapeType>
std::vector<layout> count_nonzero_inst::calc_output_layouts(count_nonzero_node const& /*node*/, kernel_impl_params const& impl_param) {
    assert(static_cast<bool>(impl_param.desc->output_data_types[0]) == false &&
           "Output data type forcing is not supported for count_nonzero_node!");
-   return {layout{ov::PartialShape{1}, cldnn::data_types::i32, cldnn::format::bfyx}};
+   return {layout{ov::PartialShape{1024}, cldnn::data_types::i32, cldnn::format::bfyx}};
}

Copilot AI Apr 8, 2026

[MEDIUM] count_nonzero now advertises a fixed 1024-element output buffer. This hard-coded magic size is duplicated across graph layout and kernel dispatch (and will increase host-side reads like read_vector(...) that only need element 0). Consider centralizing the constant (or deriving the output layout from the chosen dispatch size) to avoid drift and to keep shape-inference overhead minimal.
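One way to centralize the constant, sketched under the assumption of a shared header that both the graph code and the kernel selector could include; the constant name and helper below are hypothetical, not the actual repo API.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical shared constant replacing the duplicated literal 1024:
// the count_nonzero output buffer holds one prefix offset per work-item.
inline constexpr std::size_t NONZERO_MAX_WORK_ITEMS = 1024;

// Dispatch width derived from the same constant (using the std::min style
// suggested in review), so the output layout and the kernel dispatch size
// cannot drift apart.
inline std::size_t nonzero_dispatch_size(std::size_t logical_size) {
    return std::min(logical_size, NONZERO_MAX_WORK_ITEMS);
}
```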

Contributor

@isanghao isanghao left a comment

I think this is injecting too much technical debt into OV code base.

  • Duplicated usage of 1024
  • The layout contract between count_nonzero and gather_nonzero is very difficult to understand and maintain. For performance optimization, I would suggest implementing a proper fix.
  • Reduced test coverage
  • AFAIR, count_nonzero and gather_nonzero are separated because we need to allocate the output buffer of this primitive. I am afraid it would be complicated to implement in a single primitive because we don't have any information about the output buffer size when preparing the primitive execution. What do you think, @e-ddykim?

dispatchData.gws = {1, 1, 1};
dispatchData.lws = {1, 1, 1};
// Set 1 work group to avoid synchronization issue for summation of nonzero counting.
size_t max_dim_size = (input.LogicalSize() > params.engineInfo.maxWorkGroupSize) ? params.engineInfo.maxWorkGroupSize : input.LogicalSize();
Contributor

nit: std::min(params.engineInfo.maxWorkGroupSize, input.LogicalSize())

@mdvoretc-intel
Contributor Author

@isanghao

The layout contract between count_nonzero and gather_nonzero is very difficult to understand and maintain. For performance optimization, I would suggest implementing a proper fix.

The problem here is that the contract appears to be defined on the core level, with the reference implementation used in the tests returning only the overall count. This is not suitable for any parallelized implementation of gather_nonzero, where there needs to be an alignment between the input and output positions for each thread.

If the only problem here is output allocation, then it could be possible to make a new two-kernel gather_nonzero implementation that re-runs the count to get the full information required, introducing redundant computation in order to maintain the existing API. Do you believe that is the better approach?

@mdvoretc-intel
Contributor Author

@isanghao If a refactor is required, please review #35239.

@p-durandin p-durandin closed this Apr 22, 2026

Labels

category: GPU OpenVINO GPU plugin ExternalIntelPR External contributor from Intel
