
[GPU] Alter reference NonZero implementation to parallelize counting#35157

Closed
mdvoretc-intel wants to merge 4 commits into openvinotoolkit:master from mdvoretc-intel:non_zero_emergency

Conversation

@mdvoretc-intel
Contributor

Details:

  • This change alters the reference implementation of the NonZero operation to allow the gather_nonzero operation to utilize multiple hardware threads.
  • The reference implementation is altered (rather than fully refactored) because a defined API already exists between count_nonzero and gather_nonzero; ideally that API would be refactored to be internal to separate multi-kernel implementations.
  • Said refactor is not implemented at this time due to request urgency.
  • Separate tests for gather_nonzero are deleted due to the change in the API that is internal to the node.

Tickets:

AI Assistance:

  • AI assistance used: no

@github-actions github-actions Bot added the category: GPU OpenVINO GPU plugin label Apr 3, 2026
@sys-openvino-ci sys-openvino-ci added the ExternalIntelPR External contributor from Intel label Apr 3, 2026
@mdvoretc-intel mdvoretc-intel marked this pull request as ready for review April 7, 2026 09:45
@mdvoretc-intel mdvoretc-intel requested review from a team as code owners April 7, 2026 09:45
@mdvoretc-intel
Contributor Author

@isanghao please review.

Contributor

Copilot AI left a comment

Pull request overview

This PR updates the Intel GPU plugin’s reference NonZero implementation so gather_nonzero can use multiple work-items (within a single workgroup) by changing the internal contract between count_nonzero and gather_nonzero to use per-work-item prefix offsets.

Changes:

  • Change count_nonzero_ref to write work-item prefix offsets (and total count) into its output buffer.
  • Update gather_nonzero_ref dispatch and kernel logic to partition the input across work-items and use the prefix offsets for deterministic output writes.
  • Update count_nonzero output layout to a fixed 1024-element buffer and remove dedicated gather_nonzero unit tests.
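The count→gather contract described above can be modeled on the host side. This is a minimal sketch, not the actual OpenCL kernels; the function name gather_nonzero_model and the flat-index output are illustrative assumptions. Each simulated work-item counts nonzeros in its partition, an exclusive scan turns the counts into prefix offsets (the values count_nonzero would store in its output buffer), and the gather phase writes each partition's indices starting at its own offset, which makes the output order deterministic without atomics.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the count->gather prefix-offset contract.
std::vector<std::size_t> gather_nonzero_model(const std::vector<float>& input,
                                              std::size_t num_work_items) {
    std::size_t n = input.size();
    std::size_t chunk = (n + num_work_items - 1) / num_work_items;

    // Count phase: per-work-item nonzero counts over disjoint partitions.
    std::vector<std::size_t> counts(num_work_items, 0);
    for (std::size_t wi = 0; wi < num_work_items; ++wi)
        for (std::size_t i = wi * chunk; i < std::min(n, (wi + 1) * chunk); ++i)
            if (input[i] != 0.0f) ++counts[wi];

    // Exclusive scan -> prefix offsets (what count_nonzero would output).
    std::vector<std::size_t> offsets(num_work_items, 0);
    for (std::size_t wi = 1; wi < num_work_items; ++wi)
        offsets[wi] = offsets[wi - 1] + counts[wi - 1];

    // Gather phase: each work-item writes from its own offset, no atomics.
    std::vector<std::size_t> out(offsets.back() + counts.back());
    for (std::size_t wi = 0; wi < num_work_items; ++wi) {
        std::size_t pos = offsets[wi];
        for (std::size_t i = wi * chunk; i < std::min(n, (wi + 1) * chunk); ++i)
            if (input[i] != 0.0f) out[pos++] = i;
    }
    return out;
}
```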

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file

  • src/plugins/intel_gpu/tests/unit/test_cases/non_zero_gpu_test.cpp: Removes standalone gather_nonzero unit tests previously validating gather behavior.
  • src/plugins/intel_gpu/src/kernel_selector/kernels/non_zero/gather_nonzero_kernel_ref.cpp: Updates dispatch (gws/lws) to enable multi-work-item execution (capped at 1024).
  • src/plugins/intel_gpu/src/kernel_selector/kernels/non_zero/count_nonzero_kernel_ref.cpp: Caps workgroup width to 1024 to match the new count→gather contract.
  • src/plugins/intel_gpu/src/kernel_selector/cl_kernels/gather_nonzero_ref.cl: Implements parallel input partitioning and uses prefix offsets for output placement.
  • src/plugins/intel_gpu/src/kernel_selector/cl_kernels/count_nonzero_ref.cl: Computes per-work-item prefix offsets via work-group scan and stores them in output.
  • src/plugins/intel_gpu/src/graph/non_zero.cpp: Changes count_nonzero output layout from scalar to a fixed 1024-element buffer.

Comment thread src/plugins/intel_gpu/src/kernel_selector/cl_kernels/gather_nonzero_ref.cl Outdated
Comment on lines 159 to 163
}
}

template<typename T>
void test_gather_non_zero(layout in_layout, std::vector<T> in_data) {
    auto& engine = get_test_engine();
    auto input_mem = engine.allocate_memory(in_layout);
    auto count_non_zero = ov::reference::non_zero_get_count<T>(in_data.data(), in_layout.get_shape());
    auto in_rank = in_layout.get_shape().size();
    std::vector<int32_t> expected_results(count_non_zero * in_rank);
    ov::reference::non_zero<T, int32_t>(in_data.data(), expected_results.data(), in_layout.get_shape());

    auto output_shape_layout = layout{ov::PartialShape{1}, data_types::i32, format::bfyx};
    auto output_shape_mem = engine.allocate_memory(output_shape_layout);
    set_values(input_mem, in_data);

    std::vector<int32_t> output_shape_data = {(int32_t)count_non_zero};

    set_values(output_shape_mem, output_shape_data);

    topology topology;
    topology.add(input_layout("InputData", in_layout));
    topology.add(data("OutputShape", output_shape_mem));
    topology.add(
        gather_nonzero("gather_nonzero", input_info("InputData"), input_info("OutputShape"))
    );

    network network(engine, topology, get_test_default_config(engine));

    network.set_input_data("InputData", input_mem);
    auto outputs = network.execute();
    auto output = outputs.at("gather_nonzero").get_memory();
    cldnn::mem_lock<int32_t> output_ptr(output, get_test_stream());
    cldnn::mem_lock<int32_t> shape_ptr(output_shape_mem, get_test_stream());

    for (size_t i = 0; i < expected_results.size(); ++i) {
        ASSERT_EQ(expected_results[i], output_ptr[i]);
    }
}

TEST(test_gather_non_zero, 4d_fp32_1_3_3_1) {
std::vector<float> in_data = {
0.1f, 0.2f, 0.3f, 0.0f,
0.0f, 0.4f, 0.1f, 0.9f, 0.10f
};
test_gather_non_zero<float>(layout{ov::PartialShape{1, 3, 3, 1}, data_types::f32, format::bfyx}, in_data);
}

TEST(test_gather_non_zero, 4d_fp32_2_4_3_2) {
std::vector<float> in_data = {
0.1f, 0.2f, 0.3f, 0.0f, 12.0f, 2.0f, 0.4f, 0.1f,
1.9f, 0.10f, 1.0f, 0.0f, 0.1f, 0.2f, 0.0f, 100.0f,
0.0001f, 0.0f, 2.9f, 0.2f, 4.0f, 0.0f, 9.1f, 0.9f,
100.0f, 0.4f, 0.1f, 0.3f, 0.0f, 24.2f, 1.23f, 0.0f,
4.0f, 0.0f, 3.1f, 0.9f, 0.10f, 49.2f, 0.0f, 0.3f,
100.0f, 0.4f, 0.1f, 0.9f, 0.1f, 33.12f, 12.1f, 0.0001f
};
test_gather_non_zero<float>(layout{ov::PartialShape{2, 4, 3, 2}, data_types::f32, format::bfyx}, in_data);
}
TEST(test_gather_non_zero, 4d_fp16_2_4_3_2) {
std::vector<ov::float16> in_data = {
0.1f, 0.2f, 0.3f, 0.0f, 12.0f, 2.0f, 0.4f, 0.1f,
1.9f, 0.10f, 1.0f, 0.0f, 0.1f, 0.2f, 0.0f, 100.0f,
0.0001f, 0.0f, 2.9f, 0.2f, 4.0f, 0.0f, 9.1f, 0.9f,
100.0f, 0.4f, 0.1f, 0.3f, 0.0f, 24.2f, 1.23f, 0.0f,
4.0f, 0.0f, 3.1f, 0.9f, 0.10f, 49.2f, 0.0f, 0.3f,
100.0f, 0.4f, 0.1f, 0.9f, 0.1f, 33.12f, 12.1f, 0.0001f
};
test_gather_non_zero<ov::float16>(layout{ov::PartialShape{2, 4, 3, 2}, data_types::f16, format::bfyx}, in_data);
}

TEST(test_gather_non_zero, 5d_fp32_1_3_3_2_2) {
std::vector<float> in_data = {
0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
0.0f, 0.0f, 0.1f, 0.9f, 0.10f, 0.001f,
8.0f, 3.0f, 0.1f, 0.00001f, 0.10f, 0.001f,
0.1f, -0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
0.0f, 0.0f, 0.1f, 0.9f, 0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
8.0f, 3.0f, 0.1f, 0.00001f, 0.10f, 0.001f,
0.1f, -0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
};
test_gather_non_zero<float>(layout{ov::PartialShape{1, 3, 4, 2, 2}, data_types::f32, format::bfzyx}, in_data);
}

TEST(test_gather_non_zero, 6d_fp16_2_3_1_3_2_4) {
std::vector<float> in_data = {
0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
1.0f, 0.0f, 0.1f, 0.9f, 0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
19.0f, 0.0f, 0.1f, 0.9f, 0.10f, -0.001f,
0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
8.0f, 3.0f, 0.1f, 0.00001f, 0.10f, 0.001f,
0.1f, -0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
13.0f, 1.0f, 0.1f, 0.9f, 0.10f, 0.001f,
11.1f, 0.2f, 0.3f, 66.0f, 12.1f, 11.1f,
0.0f, 0.0001f, 0.1f, 0.9f, 0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 2.0f, 12.1f, 11.1f,
0.0f, 0.0f, 0.1f, 0.9f, 0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
-13.0f, 1.0f, 0.1f, 0.9f, 0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 66.0f, 12.1f, 11.1f,
0.0f, 0.001f, 0.1f, 0.9f, 0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 2.0f, 12.1f, 11.1f,
0.1f, 1.2f, 0.3f, 99.0f, 12.1f, 11.1f,
100.0f, 0.0f, 0.1f, 0.9f, 0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
13.0f, 1.0f, 0.1f, 0.9f, -0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 66.0f, 12.1f, 11.1f,
0.0f, 0.0001f, 0.1f, 0.9f, 0.10f, 0.001f,
0.1f, 0.2f, 0.3f, 2.0f, 12.1f, 11.1f,
};
test_gather_non_zero<float>(layout{ov::PartialShape{2, 3, 1, 3, 2, 4}, data_types::f32, format::bfwzyx}, in_data);
}

TEST(non_zero_gpu, dynamic) {
auto& engine = get_test_engine();

Copilot AI Apr 8, 2026

[HIGH] The dedicated gather_nonzero unit tests were removed, but this PR changes the internal count→gather contract and introduces a new work partitioning/prefix-sum scheme (including the 1024 work-item cap). Existing non_zero tests here are relatively small and don’t exercise the >1024-element path or the non-local-memory path that used to be covered. Please add a regression test that validates NonZero correctness for an input size >1024 (and ideally another case that forces USE_LOCAL_MEM off) to prevent silent wrong-index regressions.
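The suggested regression check can be sketched in a self-contained form: compare a partitioned gather, capped at 1024 work-items as in this PR, against a plain sequential NonZero on an input larger than 1024 elements. The partitioning below is a host-side model with hypothetical names, not the repo's kernels or test helpers.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sequential reference: flat indices of all nonzero elements, in order.
std::vector<std::size_t> nonzero_sequential(const std::vector<float>& in) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < in.size(); ++i)
        if (in[i] != 0.0f) out.push_back(i);
    return out;
}

// Host-side model of the partitioned path: count per partition, scan into
// prefix offsets, then gather each partition's indices from its own offset.
std::vector<std::size_t> nonzero_partitioned(const std::vector<float>& in) {
    std::size_t wi_count = std::min<std::size_t>(in.size(), 1024);  // the PR's cap
    std::size_t chunk = (in.size() + wi_count - 1) / wi_count;
    std::vector<std::size_t> offsets(wi_count + 1, 0);
    for (std::size_t wi = 0; wi < wi_count; ++wi) {
        std::size_t c = 0;
        for (std::size_t i = wi * chunk; i < std::min(in.size(), (wi + 1) * chunk); ++i)
            if (in[i] != 0.0f) ++c;
        offsets[wi + 1] = offsets[wi] + c;
    }
    std::vector<std::size_t> out(offsets.back());
    for (std::size_t wi = 0; wi < wi_count; ++wi) {
        std::size_t pos = offsets[wi];
        for (std::size_t i = wi * chunk; i < std::min(in.size(), (wi + 1) * chunk); ++i)
            if (in[i] != 0.0f) out[pos++] = i;
    }
    return out;
}
```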

Comment on lines 19 to 30
layout count_nonzero_inst::calc_output_layout(count_nonzero_node const& node, kernel_impl_params const& impl_param) {
    assert(static_cast<bool>(node.get_primitive()->output_data_types[0]) == false &&
           "Output data type forcing is not supported for count_nonzero_node!");
-   return layout{cldnn::data_types::i32, cldnn::format::bfyx, tensor{1, 1, 1, 1}};
+   return layout{cldnn::data_types::i32, cldnn::format::bfyx, tensor{1, 1, 1, 1024}};
}

template<typename ShapeType>
std::vector<layout> count_nonzero_inst::calc_output_layouts(count_nonzero_node const& /*node*/, kernel_impl_params const& impl_param) {
    assert(static_cast<bool>(impl_param.desc->output_data_types[0]) == false &&
           "Output data type forcing is not supported for count_nonzero_node!");
-   return {layout{ov::PartialShape{1}, cldnn::data_types::i32, cldnn::format::bfyx}};
+   return {layout{ov::PartialShape{1024}, cldnn::data_types::i32, cldnn::format::bfyx}};
}

Copilot AI Apr 8, 2026

[MEDIUM] count_nonzero now advertises a fixed 1024-element output buffer. This hard-coded magic size is duplicated across graph layout and kernel dispatch (and will increase host-side reads like read_vector(...) that only need element 0). Consider centralizing the constant (or deriving the output layout from the chosen dispatch size) to avoid drift and to keep shape-inference overhead minimal.
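One way to centralize the constant, sketched under the assumption of a shared header that both the graph code and the kernel selector could include; the constant name and helper below are hypothetical, not the actual repo API.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical shared constant replacing the duplicated literal 1024:
// the count_nonzero output buffer holds one prefix offset per work-item.
inline constexpr std::size_t NONZERO_MAX_WORK_ITEMS = 1024;

// Dispatch width derived from the same constant (using the std::min style
// suggested in review), so the output layout and the kernel dispatch size
// cannot drift apart.
inline std::size_t nonzero_dispatch_size(std::size_t logical_size) {
    return std::min(logical_size, NONZERO_MAX_WORK_ITEMS);
}
```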

Contributor

@isanghao isanghao left a comment

I think this is injecting too much technical debt into OV code base.

  • Duplicated usage of 1024
  • The layout contract between count_nonzero and gather_nonzero is very difficult to understand and maintain. For performance optimization, I would suggest implementing a proper fix.
  • Reduced test coverage
  • AFAIR, count_nonzero and gather_nonzero are separated because we need to allocate the output buffer of this primitive. I am afraid it would be complicated to implement in a single primitive because we don't have any information about the output buffer size when preparing the primitive execution. What do you think, @e-ddykim?

dispatchData.gws = {1, 1, 1};
dispatchData.lws = {1, 1, 1};
// Set 1 work group to avoid synchronization issue for summation of nonzero counting.
size_t max_dim_size = (input.LogicalSize() > params.engineInfo.maxWorkGroupSize) ? params.engineInfo.maxWorkGroupSize : input.LogicalSize();
Contributor

nit: std::min(params.engineInfo.maxWorkGroupSize, input.LogicalSize())

@mdvoretc-intel
Contributor Author

@isanghao

The layout contract between count_nonzero and gather_nonzero is very difficult to understand and maintain. For performance optimization, I would suggest implementing a proper fix.

The problem here is that the contract appears to be defined on the core level, with the reference implementation used in the tests returning only the overall count. This is not suitable for any parallelized implementation of gather_nonzero, where there needs to be an alignment between the input and output positions for each thread.

If the only problem here is output allocation, then it could be possible to make a new two-kernel gather_nonzero implementation that re-runs the count to get the full information required, introducing redundant computation in order to maintain the existing API. Do you believe that is the better approach?

@mdvoretc-intel
Contributor Author

@isanghao If a refactor is required, please review #35239.

@p-durandin p-durandin closed this Apr 22, 2026

Labels

category: GPU OpenVINO GPU plugin ExternalIntelPR External contributor from Intel
