[GPU] Alter reference NonZero implementation to parallelize counting #35157
mdvoretc-intel wants to merge 4 commits into openvinotoolkit:master
Conversation
@isanghao please review.
Pull request overview
This PR updates the Intel GPU plugin’s reference NonZero implementation so gather_nonzero can use multiple work-items (within a single workgroup) by changing the internal contract between count_nonzero and gather_nonzero to use per-work-item prefix offsets.
Changes:
- Change `count_nonzero_ref` to write per-work-item prefix offsets (and the total count) into its output buffer.
- Update the `gather_nonzero_ref` dispatch and kernel logic to partition the input across work-items and use the prefix offsets for deterministic output writes.
- Update the `count_nonzero` output layout to a fixed 1024-element buffer and remove the dedicated `gather_nonzero` unit tests.
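The count→gather prefix-offset contract described above can be sketched host-side as follows. This is a minimal model in plain C++, not the actual kernel code; the function name, chunking scheme, and buffer layout are illustrative assumptions:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side model of the new contract: each work-item counts the
// nonzeros in its slice of the input, and an exclusive scan over those counts
// yields the offset at which each work-item starts writing in gather_nonzero.
// The final scan element is the total nonzero count.
std::vector<int> count_nonzero_offsets(const std::vector<float>& input,
                                       std::size_t num_work_items) {
    std::vector<int> counts(num_work_items, 0);
    std::size_t chunk = (input.size() + num_work_items - 1) / num_work_items;
    for (std::size_t wi = 0; wi < num_work_items; ++wi) {
        std::size_t begin = wi * chunk;
        std::size_t end = std::min(begin + chunk, input.size());
        for (std::size_t i = begin; i < end; ++i)
            if (input[i] != 0.0f) ++counts[wi];
    }
    // Exclusive scan: offsets[wi] is where work-item wi begins writing.
    std::vector<int> offsets(num_work_items + 1, 0);
    std::partial_sum(counts.begin(), counts.end(), offsets.begin() + 1);
    return offsets;  // offsets.back() is the total nonzero count
}
```

Because each work-item's write range is fixed by the scan before gathering starts, the output ordering is deterministic regardless of how the work-items are scheduled.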
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/plugins/intel_gpu/tests/unit/test_cases/non_zero_gpu_test.cpp | Removes standalone gather_nonzero unit tests previously validating gather behavior. |
| src/plugins/intel_gpu/src/kernel_selector/kernels/non_zero/gather_nonzero_kernel_ref.cpp | Updates dispatch (gws/lws) to enable multi-work-item execution (capped at 1024). |
| src/plugins/intel_gpu/src/kernel_selector/kernels/non_zero/count_nonzero_kernel_ref.cpp | Caps workgroup width to 1024 to match the new count→gather contract. |
| src/plugins/intel_gpu/src/kernel_selector/cl_kernels/gather_nonzero_ref.cl | Implements parallel input partitioning and uses prefix offsets for output placement. |
| src/plugins/intel_gpu/src/kernel_selector/cl_kernels/count_nonzero_ref.cl | Computes per-work-item prefix offsets via work-group scan and stores them in output. |
| src/plugins/intel_gpu/src/graph/non_zero.cpp | Changes count_nonzero output layout from scalar to a fixed 1024-element buffer. |
Removed from `src/plugins/intel_gpu/tests/unit/test_cases/non_zero_gpu_test.cpp`:

```cpp
template<typename T>
void test_gather_non_zero(layout in_layout, std::vector<T> in_data) {
    auto& engine = get_test_engine();
    auto input_mem = engine.allocate_memory(in_layout);
    auto count_non_zero = ov::reference::non_zero_get_count<T>(in_data.data(), in_layout.get_shape());
    auto in_rank = in_layout.get_shape().size();
    std::vector<int32_t> expected_results(count_non_zero * in_rank);
    ov::reference::non_zero<T, int32_t>(in_data.data(), expected_results.data(), in_layout.get_shape());

    auto output_shape_layout = layout{ov::PartialShape{1}, data_types::i32, format::bfyx};
    auto output_shape_mem = engine.allocate_memory(output_shape_layout);
    set_values(input_mem, in_data);

    std::vector<int32_t> output_shape_data = {(int32_t)count_non_zero};

    set_values(output_shape_mem, output_shape_data);

    topology topology;
    topology.add(input_layout("InputData", in_layout));
    topology.add(data("OutputShape", output_shape_mem));
    topology.add(
        gather_nonzero("gather_nonzero", input_info("InputData"), input_info("OutputShape"))
    );

    network network(engine, topology, get_test_default_config(engine));

    network.set_input_data("InputData", input_mem);
    auto outputs = network.execute();
    auto output = outputs.at("gather_nonzero").get_memory();
    cldnn::mem_lock<int32_t> output_ptr(output, get_test_stream());
    cldnn::mem_lock<int32_t> shape_ptr(output_shape_mem, get_test_stream());

    for (size_t i = 0; i < expected_results.size(); ++i) {
        ASSERT_EQ(expected_results[i], output_ptr[i]);
    }
}

TEST(test_gather_non_zero, 4d_fp32_1_3_3_1) {
    std::vector<float> in_data = {
        0.1f, 0.2f, 0.3f, 0.0f,
        0.0f, 0.4f, 0.1f, 0.9f, 0.10f
    };
    test_gather_non_zero<float>(layout{ov::PartialShape{1, 3, 3, 1}, data_types::f32, format::bfyx}, in_data);
}

TEST(test_gather_non_zero, 4d_fp32_2_4_3_2) {
    std::vector<float> in_data = {
        0.1f, 0.2f, 0.3f, 0.0f, 12.0f, 2.0f, 0.4f, 0.1f,
        1.9f, 0.10f, 1.0f, 0.0f, 0.1f, 0.2f, 0.0f, 100.0f,
        0.0001f, 0.0f, 2.9f, 0.2f, 4.0f, 0.0f, 9.1f, 0.9f,
        100.0f, 0.4f, 0.1f, 0.3f, 0.0f, 24.2f, 1.23f, 0.0f,
        4.0f, 0.0f, 3.1f, 0.9f, 0.10f, 49.2f, 0.0f, 0.3f,
        100.0f, 0.4f, 0.1f, 0.9f, 0.1f, 33.12f, 12.1f, 0.0001f
    };
    test_gather_non_zero<float>(layout{ov::PartialShape{2, 4, 3, 2}, data_types::f32, format::bfyx}, in_data);
}

TEST(test_gather_non_zero, 4d_fp16_2_4_3_2) {
    std::vector<ov::float16> in_data = {
        0.1f, 0.2f, 0.3f, 0.0f, 12.0f, 2.0f, 0.4f, 0.1f,
        1.9f, 0.10f, 1.0f, 0.0f, 0.1f, 0.2f, 0.0f, 100.0f,
        0.0001f, 0.0f, 2.9f, 0.2f, 4.0f, 0.0f, 9.1f, 0.9f,
        100.0f, 0.4f, 0.1f, 0.3f, 0.0f, 24.2f, 1.23f, 0.0f,
        4.0f, 0.0f, 3.1f, 0.9f, 0.10f, 49.2f, 0.0f, 0.3f,
        100.0f, 0.4f, 0.1f, 0.9f, 0.1f, 33.12f, 12.1f, 0.0001f
    };
    test_gather_non_zero<ov::float16>(layout{ov::PartialShape{2, 4, 3, 2}, data_types::f16, format::bfyx}, in_data);
}

TEST(test_gather_non_zero, 5d_fp32_1_3_3_2_2) {
    std::vector<float> in_data = {
        0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
        0.0f, 0.0f, 0.1f, 0.9f, 0.10f, 0.001f,
        8.0f, 3.0f, 0.1f, 0.00001f, 0.10f, 0.001f,
        0.1f, -0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
        0.0f, 0.0f, 0.1f, 0.9f, 0.10f, 0.001f,
        0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
        8.0f, 3.0f, 0.1f, 0.00001f, 0.10f, 0.001f,
        0.1f, -0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
    };
    test_gather_non_zero<float>(layout{ov::PartialShape{1, 3, 4, 2, 2}, data_types::f32, format::bfzyx}, in_data);
}

TEST(test_gather_non_zero, 6d_fp16_2_3_1_3_2_4) {
    std::vector<float> in_data = {
        0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
        1.0f, 0.0f, 0.1f, 0.9f, 0.10f, 0.001f,
        0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
        19.0f, 0.0f, 0.1f, 0.9f, 0.10f, -0.001f,
        0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
        8.0f, 3.0f, 0.1f, 0.00001f, 0.10f, 0.001f,
        0.1f, -0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
        13.0f, 1.0f, 0.1f, 0.9f, 0.10f, 0.001f,
        11.1f, 0.2f, 0.3f, 66.0f, 12.1f, 11.1f,
        0.0f, 0.0001f, 0.1f, 0.9f, 0.10f, 0.001f,
        0.1f, 0.2f, 0.3f, 2.0f, 12.1f, 11.1f,
        0.0f, 0.0f, 0.1f, 0.9f, 0.10f, 0.001f,
        0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
        -13.0f, 1.0f, 0.1f, 0.9f, 0.10f, 0.001f,
        0.1f, 0.2f, 0.3f, 66.0f, 12.1f, 11.1f,
        0.0f, 0.001f, 0.1f, 0.9f, 0.10f, 0.001f,
        0.1f, 0.2f, 0.3f, 2.0f, 12.1f, 11.1f,
        0.1f, 1.2f, 0.3f, 99.0f, 12.1f, 11.1f,
        100.0f, 0.0f, 0.1f, 0.9f, 0.10f, 0.001f,
        0.1f, 0.2f, 0.3f, 0.0f, 12.1f, 11.1f,
        13.0f, 1.0f, 0.1f, 0.9f, -0.10f, 0.001f,
        0.1f, 0.2f, 0.3f, 66.0f, 12.1f, 11.1f,
        0.0f, 0.0001f, 0.1f, 0.9f, 0.10f, 0.001f,
        0.1f, 0.2f, 0.3f, 2.0f, 12.1f, 11.1f,
    };
    test_gather_non_zero<float>(layout{ov::PartialShape{2, 3, 1, 3, 2, 4}, data_types::f32, format::bfwzyx}, in_data);
}
```
[HIGH] The dedicated gather_nonzero unit tests were removed, but this PR changes the internal count→gather contract and introduces a new work partitioning/prefix-sum scheme (including the 1024 work-item cap). Existing non_zero tests here are relatively small and don’t exercise the >1024-element path or the non-local-memory path that used to be covered. Please add a regression test that validates NonZero correctness for an input size >1024 (and ideally another case that forces USE_LOCAL_MEM off) to prevent silent wrong-index regressions.
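The kind of coverage this comment asks for can be modeled host-side roughly as follows. This is a plain C++ sketch with hypothetical helper names; a real regression test would go through the cldnn network harness as the removed tests did:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Straightforward serial reference count of nonzero elements.
int serial_nonzero_count(const std::vector<float>& in) {
    int n = 0;
    for (float v : in)
        if (v != 0.0f) ++n;
    return n;
}

// Model of the partitioned scheme: with more elements than the work-item cap,
// each work-item processes a multi-element chunk; the partitioned total must
// match the serial reference regardless of the partitioning.
int partitioned_nonzero_count(const std::vector<float>& in, std::size_t work_items) {
    std::size_t chunk = (in.size() + work_items - 1) / work_items;
    int total = 0;
    for (std::size_t wi = 0; wi < work_items; ++wi) {
        std::size_t begin = wi * chunk;
        std::size_t end = std::min(begin + chunk, in.size());
        for (std::size_t i = begin; i < end; ++i)
            if (in[i] != 0.0f) ++total;
    }
    return total;
}
```

The interesting input sizes are those above 1024, where each work-item must iterate over more than one element, since that is the path the small existing tests never reach.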
From `src/plugins/intel_gpu/src/graph/non_zero.cpp`:

```diff
 layout count_nonzero_inst::calc_output_layout(count_nonzero_node const& node, kernel_impl_params const& impl_param) {
     assert(static_cast<bool>(node.get_primitive()->output_data_types[0]) == false &&
            "Output data type forcing is not supported for count_nonzero_node!");
-    return layout{cldnn::data_types::i32, cldnn::format::bfyx, tensor{1, 1, 1, 1}};
+    return layout{cldnn::data_types::i32, cldnn::format::bfyx, tensor{1, 1, 1, 1024}};
 }

 template<typename ShapeType>
 std::vector<layout> count_nonzero_inst::calc_output_layouts(count_nonzero_node const& /*node*/, kernel_impl_params const& impl_param) {
     assert(static_cast<bool>(impl_param.desc->output_data_types[0]) == false &&
            "Output data type forcing is not supported for count_nonzero_node!");
-    return {layout{ov::PartialShape{1}, cldnn::data_types::i32, cldnn::format::bfyx}};
+    return {layout{ov::PartialShape{1024}, cldnn::data_types::i32, cldnn::format::bfyx}};
 }
```
[MEDIUM] count_nonzero now advertises a fixed 1024-element output buffer. This hard-coded magic size is duplicated across graph layout and kernel dispatch (and will increase host-side reads like read_vector(...) that only need element 0). Consider centralizing the constant (or deriving the output layout from the chosen dispatch size) to avoid drift and to keep shape-inference overhead minimal.
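One way to centralize the constant, as suggested above, is a single shared definition that both the graph-level layout and the kernel-selector dispatch reference. The header location and names below are hypothetical, not existing OpenVINO symbols:

```cpp
#include <cstddef>

// Sketch: keep the work-item cap / buffer size in one place so the layout in
// non_zero.cpp and the dispatch caps in the kernel_selector cannot drift.
namespace cldnn::detail {
inline constexpr std::size_t nonzero_count_buffer_size = 1024;
}  // namespace cldnn::detail

// calc_output_layout / calc_output_layouts and the gws/lws capping code would
// then all reference cldnn::detail::nonzero_count_buffer_size instead of
// repeating the literal 1024.
```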
Commit: …nzero_ref.cl

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
I think this is injecting too much technical debt into the OV code base:
- Duplicated usage of 1024.
- The layout contract between `count_nonzero` and `gather_nonzero` is very difficult to understand/maintain. For performance optimization, I would suggest implementing a proper fix.
- Reduced test coverage.

AFAIR, `count_nonzero` and `gather_nonzero` are separated because we need to allocate the output buffer of this primitive. I am afraid it would be complicated to implement in a single primitive because we don't have any information about the output buffer size when preparing the primitive execution. How do you think about this @e-ddykim?
From `count_nonzero_kernel_ref.cpp`:

```cpp
    dispatchData.gws = {1, 1, 1};
    dispatchData.lws = {1, 1, 1};
    // Set 1 work group to avoid synchronization issue for summation of nonzero counting.
    size_t max_dim_size = (input.LogicalSize() > params.engineInfo.maxWorkGroupSize) ? params.engineInfo.maxWorkGroupSize : input.LogicalSize();
```
nit: `std::min(params.engineInfo.maxWorkGroupSize, input.LogicalSize())`
The problem here is that the contract appears to be defined at the core level, with the reference implementation used in the tests returning only the overall count. This is not suitable for any parallelized implementation.

If the only problem here is output allocation, then it could be possible to make a new two-kernel implementation.
Details:
- Parallelizes the reference `gather_nonzero` operation to utilize multiple hardware threads.
- Changes the internal contract between `count_nonzero` and `gather_nonzero`, which ideally should be fully refactored to be internal to separate multi-kernel implementations.
- Unit tests for `gather_nonzero` are deleted due to the change in the API that is internal to the node.

Tickets:
AI Assistance: