refactor(kheavyhash): reuse common OpenCL helpers + add resolver tests by yuzi-co · Pull Request #184 · luminousmining/miner

yuzi-co · 2026-06-19T14:10:23Z

What

Two related changes to the kHeavyHash backend:

1. Reuse common OpenCL helpers + split the Result struct

The kHeavyHash OpenCL kernel redefined rotl64/loadLe64/storeLe256 and the
Result struct inline. This drops the duplicates and aligns with the
ethash/autolykos/progpow convention:

kheavyhash.cl: drop inline helpers; call shared rol_u64
(kernel/common/rotate_byte.cl) and load_le_u64/store_le_u256
(kernel/common/load_store_le.cl). Bodies are bit-identical.
New kheavyhash_result.cl: the Result struct + publishHit, mirroring
ethash_result.cl / autolykos_v2_result.cl.
ResolverAmdKHeavyHash::buildSearch and the OpenCL KAT harness append the
shared helpers + result + body in dependency order.
CMake deploys kheavyhash_result.cl and exposes KH_CL_DIR /
KH_CL_COMMON_DIR for the KAT.

2. Resolver unit tests

New resolver/amd/tests/kheavyhash.cpp and resolver/nvidia/tests/kheavyhash.cpp
exercise findNonce end to end (all-0xFF boundary → every nonce is a hit →
a share is submitted).

Verification

Built via the windows-cross docker toolchain and run on real hardware
(AMD Radeon RX 9070 XT, gfx1201):

OpenClKat.powHashMatchesReference / kHeavyHashMatchesReference /
heavyHashMatchesReference — PASS: the refactored kernel program builds
and is bit-identical to the CPU reference.
Search/SearchKernel.reportsHitAtWinningNonce /
noHitWhenPowExceedsTarget — PASS: end-to-end winning-nonce + boundary.
ResolverKHeavyHashAmdTest.findNonce — PASS.
All 14 kHeavyHash unit tests green.
ResolverKHeavyHashNvidiaTest.findNonce compiles and links (clang-CUDA);
not run here (no NVIDIA device on the test rig).

Since the kernel bodies are bit-identical (KAT-gated), no hashrate change is
expected versus the previous self-contained kernel.

…uminousmining#179) The kHeavyHash kernel redefined helpers and the Result struct inline. Use the shared common helpers instead and move the Result struct out into a per-algo file, matching the ethash/autolykos/progpow convention. - kheavyhash.cl: drop inline rotl64/loadLe64/storeLe256; call the shared rol_u64 (kernel/common/rotate_byte.cl) and load_le_u64/store_le_u256 (kernel/common/load_store_le.cl). Bodies are bit-identical, so the KAT-gated kernel stays bit-identical. - New kheavyhash_result.cl: the Result struct + publishHit, mirroring ethash_result.cl / autolykos_v2_result.cl. - ResolverAmdKHeavyHash::buildSearch now appends rotate_byte.cl, load_store_le.cl and kheavyhash_result.cl ahead of the kernel body. - OpenCL KAT harness resolves and appends the same file set (the in-tree source layout and the deployed kernel/ tree differ, so each file is resolved independently). - CMake deploys kheavyhash_result.cl and exposes KH_CL_DIR / KH_CL_COMMON_DIR for the KAT. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ing#176) The kHeavyHash AMD and NVIDIA resolvers had no resolver-level test, unlike every other algorithm. Add one each, following the blake3 resolver-test pattern: build a job with a maximum (all-0xFF) boundary so every scanned nonce is a hit, run the resolver end to end, and assert a share is submitted (paramSubmit array non-empty). Files auto-register via the tests/ file(GLOB). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The headerHash/boundary assignments in the new AMD and NVIDIA resolver tests were wrapped onto two lines; the single-line form is 113 columns (under the 120 limit), so per CODING_STYLE.md it stays on one line.

luminousmining · 2026-06-19T15:15:47Z

@@ -0,0 +1,24 @@
+// Result buffer shared with the host (mirrors algo::kheavyhash::Result in


Remove comment

Done — comment removed.

luminousmining · 2026-06-19T15:16:05Z

+// Result buffer shared with the host (mirrors algo::kheavyhash::Result in
+// sources/algo/kheavyhash/result.hpp). MAX_RESULT is overridable by the host
+// kernel generator (addDefine); the fallback keeps standalone builds valid.
+#ifndef MAX_RESULT


No overide, define only from kernelGenerator

Done — dropped the #ifndef MAX_RESULT fallback. The value is always supplied by the kernel generator (addDefine), like the other OpenCL kernels.

luminousmining · 2026-06-19T15:17:35Z

+} Result;
+
+
+inline void publishHit(__global Result* result, ulong const nonce)


inline
void publishHit(__global Result* result, ulong const nonce)

Done — inline now on its own line.

luminousmining · 2026-06-19T15:24:52Z

+#define MAX_RESULT 4
+#endif
+
+typedef struct __attribute__((aligned(8)))


Using uchar for found creates 3 bytes of implicit padding before count to satisfy its 4-byte alignment requirement:

Offset 0 : found (1 byte) Offset 1 : [pad] (3 bytes) ← implicit padding Offset 4 : count (4 bytes) Offset 8 : nonces (...)

In OpenCL, this is dangerous — the host compiler and the GPU driver compiler may handle padding differently, leading to mismatched struct layouts and corrupted buffer reads. Using uint for found removes the padding entirely and guarantees a consistent layout on both sides.

Also, we should consider switching from uint to cl_uint (and cl_ulong for ulong) on the host side. These are the portable types defined by the OpenCL spec and guarantee an exact match with the device types, regardless of the platform.

Good catch. Widened found from uchar to uint (host Result::found from bool to uint32_t), so the implicit 3-byte padding before count is gone and the host/device layouts match exactly.

I kept uint32_t/uint64_t rather than cl_uint/cl_ulong on the host: result.hpp is shared with the CUDA build (guarded by __LIB_CUDA), which has no OpenCL headers, and every other algo's result.hpp (ethash/blake3/progpow/autolykos) uses the fixed-width types too. Happy to revisit if you'd prefer the cl_* types here.

luminousmining · 2026-06-19T15:34:07Z

        ${OpenCL_INCLUDE_DIRS})
    target_compile_definitions(${UNIT_TEST_EXE}
-        PRIVATE KH_CL_PATH="${CMAKE_CURRENT_SOURCE_DIR}/kheavyhash.cl")
+        PRIVATE KH_CL_PATH="${CMAKE_CURRENT_SOURCE_DIR}/kheavyhash.cl"


I really don't understand the need for this.
I don't need such workarounds for the other algorithms.
I will look into this to understand either the problem or the misunderstanding.
But for progpow (and its derivatives) / ethash / autolykos I never needed this.
Everything is already copied into the kernel folder via CMakeLists.txt.

Exemple:

set(OUT_PROGPOW ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/kernel/progpow) file(MAKE_DIRECTORY ${OUT_PROGPOW}) set(PROGPOW_FILES progpow_result.cl evrprogpow_functions.cl firopow_functions.cl progpow_functions.cl kawpow_functions.cl meowpow_functions.cl progpow.cl ) foreach(FILE ${PROGPOW_FILES}) add_custom_command( OUTPUT ${OUT_PROGPOW}/${FILE} COMMAND ${CMAKE_COMMAND} -E copy_if_different ${CMAKE_CURRENT_SOURCE_DIR}/${FILE} ${OUT_PROGPOW}/${FILE} DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/${FILE} COMMENT "Copy ${FILE}" ) endforeach() add_custom_target(copy_progpow_opencl_files ALL DEPENDS ${OUT_PROGPOW}/progpow_result.cl ${OUT_PROGPOW}/evrprogpow_functions.cl ${OUT_PROGPOW}/firopow_functions.cl ${OUT_PROGPOW}/progpow_functions.cl ${OUT_PROGPOW}/kawpow_functions.cl ${OUT_PROGPOW}/meowpow_functions.cl ${OUT_PROGPOW}/progpow.cl )

Left as-is for now — holding the .cl copy mechanism pending your AMD review, per your note below.

luminousmining · 2026-06-19T15:35:27Z

+// The search kernel is split across shared helpers (rol_u64, load/store LE), the
+// per-algo Result struct, and the kheavyhash body — appended in this order so each
+// definition precedes its use, mirroring ResolverAmdKHeavyHash::buildSearch().
+static std::array<std::string, 4> resolveKernelPaths()


Waiting for my review on the .cl file copy.
I need to run some tests on AMD on my side.

Understood — leaving the kernel-path stitching as-is until you've run your AMD tests on the .cl copy approach.

luminousmining · 2026-06-19T15:37:50Z

-    }
-}
-
+// rotl64 / loadLe64 / storeLe256 now come from the shared OpenCL helpers


remove comments

Done — comments removed.

luminousmining · 2026-06-19T15:38:12Z

-
 // Real mining kernel: each work-item tries nonce = startNonce + global_id(0).
 // On a hit (pow <= target, little-endian) it publishes its nonce into result.
+// The Result struct and publishHit() come from kheavyhash_result.cl, appended


remove comment

Done — comment removed.

…mments, notFindNonce tests) - kheavyhash_result.cl: drop header comment and MAX_RESULT fallback define (the value is always supplied by the kernel generator via addDefine); widen found from uchar to uint to remove implicit struct padding so the host and device Result layouts match exactly; put inline on its own line. - result.hpp: widen Result::found from bool to uint32_t to mirror the device uint (kept uint32_t rather than cl_uint because this header is shared with the CUDA build, which has no OpenCL headers). - opencl_kat_test.cpp: mirror the found widening in the KAT Result struct. - kheavyhash.cl: drop the appended-by-resolver meta-comments. - amd/nvidia tests: add notFindNonce coverage (minimum target -> no submit).

luminousmining · 2026-06-20T19:43:10Z

        // sources/algo/kheavyhash/opencl/kheavyhash.cl (found | count | nonces).
        struct alignas(8) Result
        {
-            bool     found{ false };


you can keep bool = 4 bytes, uint32_t = 4 bytes.
bool is more understandable than uint32_t for this feature.

Done — reverted host-side Result::found to bool (commit 20ce8be). The device-side struct keeps uint found; with bool the host still lands count at offset 4 (1-byte bool + 3 padding), so the layout matches the device exactly. Rebuilt cross-amd and re-ran on RX 9070 XT: all 13 kHeavyHash tests green, including the OpenCL SearchKernel KAT that round-trips this struct.

luminousmining · 2026-06-20T19:43:57Z

    struct alignas(8) Result
    {
-        uint8_t  found{ 0u };
+        uint32_t found{ 0u };


bool 4 bytes
uint32_t 4 bytes
use bool please

Done — switched the KAT mirror struct back to bool found as well (commit 20ce8be), keeping it in lockstep with algo::kheavyhash::Result. Search/SearchKernel KATs pass bit-exact on RX 9070 XT.

The device-side Result keeps uint found; on the host bool still places count at offset 4 (1-byte bool + 3 padding), matching the device layout, and reads more clearly for a found flag. Per review feedback.

yuzi-co and others added 3 commits June 19, 2026 16:38

style(kheavyhash): collapse over-wrapped test assignments to one line

e0dd1d7

The headerHash/boundary assignments in the new AMD and NVIDIA resolver tests were wrapped onto two lines; the single-line form is 113 columns (under the 120 limit), so per CODING_STYLE.md it stays on one line.

luminousmining reviewed Jun 19, 2026

View reviewed changes

luminousmining reviewed Jun 20, 2026

View reviewed changes

refactor(kheavyhash): use bool for host-side Result::found

20ce8be

The device-side Result keeps uint found; on the host bool still places count at offset 4 (1-byte bool + 3 padding), matching the device layout, and reads more clearly for a found flag. Per review feedback.

luminousmining merged commit 454ec51 into luminousmining:main Jun 22, 2026

		@@ -0,0 +1,24 @@
		// Result buffer shared with the host (mirrors algo::kheavyhash::Result in

		} Result;


		inline void publishHit(__global Result* result, ulong const nonce)

Conversation

yuzi-co commented Jun 19, 2026

What

1. Reuse common OpenCL helpers + split the Result struct

2. Resolver unit tests

Verification

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants