refactor(kheavyhash): reuse common OpenCL helpers + add resolver tests#184
Conversation
…uminousmining#179) The kHeavyHash kernel redefined helpers and the Result struct inline. Use the shared common helpers instead and move the Result struct out into a per-algo file, matching the ethash/autolykos/progpow convention. - kheavyhash.cl: drop inline rotl64/loadLe64/storeLe256; call the shared rol_u64 (kernel/common/rotate_byte.cl) and load_le_u64/store_le_u256 (kernel/common/load_store_le.cl). Bodies are bit-identical, so the KAT-gated kernel stays bit-identical. - New kheavyhash_result.cl: the Result struct + publishHit, mirroring ethash_result.cl / autolykos_v2_result.cl. - ResolverAmdKHeavyHash::buildSearch now appends rotate_byte.cl, load_store_le.cl and kheavyhash_result.cl ahead of the kernel body. - OpenCL KAT harness resolves and appends the same file set (the in-tree source layout and the deployed kernel/ tree differ, so each file is resolved independently). - CMake deploys kheavyhash_result.cl and exposes KH_CL_DIR / KH_CL_COMMON_DIR for the KAT. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ing#176) The kHeavyHash AMD and NVIDIA resolvers had no resolver-level test, unlike every other algorithm. Add one each, following the blake3 resolver-test pattern: build a job with a maximum (all-0xFF) boundary so every scanned nonce is a hit, run the resolver end to end, and assert a share is submitted (paramSubmit array non-empty). Files auto-register via the tests/ file(GLOB). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The headerHash/boundary assignments in the new AMD and NVIDIA resolver tests were wrapped onto two lines; the single-line form is 113 columns (under the 120 limit), so per CODING_STYLE.md it stays on one line.
| @@ -0,0 +1,24 @@ | |||
| // Result buffer shared with the host (mirrors algo::kheavyhash::Result in | |||
There was a problem hiding this comment.
Done — comment removed.
| // Result buffer shared with the host (mirrors algo::kheavyhash::Result in | ||
| // sources/algo/kheavyhash/result.hpp). MAX_RESULT is overridable by the host | ||
| // kernel generator (addDefine); the fallback keeps standalone builds valid. | ||
| #ifndef MAX_RESULT |
There was a problem hiding this comment.
No overide, define only from kernelGenerator
There was a problem hiding this comment.
Done — dropped the #ifndef MAX_RESULT fallback. The value is always supplied by the kernel generator (addDefine), like the other OpenCL kernels.
| } Result; | ||
|
|
||
|
|
||
| inline void publishHit(__global Result* result, ulong const nonce) |
There was a problem hiding this comment.
inline
void publishHit(__global Result* result, ulong const nonce)
There was a problem hiding this comment.
Done — inline now on its own line.
| #define MAX_RESULT 4 | ||
| #endif | ||
|
|
||
| typedef struct __attribute__((aligned(8))) |
There was a problem hiding this comment.
Using uchar for found creates 3 bytes of implicit padding before count to satisfy its 4-byte alignment requirement:
Offset 0 : found (1 byte)
Offset 1 : [pad] (3 bytes) ← implicit padding
Offset 4 : count (4 bytes)
Offset 8 : nonces (...)
In OpenCL, this is dangerous — the host compiler and the GPU driver compiler may handle padding differently, leading to mismatched struct layouts and corrupted buffer reads. Using uint for found removes the padding entirely and guarantees a consistent layout on both sides.
Also, we should consider switching from uint to cl_uint (and cl_ulong for ulong) on the host side. These are the portable types defined by the OpenCL spec and guarantee an exact match with the device types, regardless of the platform.
There was a problem hiding this comment.
Good catch. Widened found from uchar to uint (host Result::found from bool to uint32_t), so the implicit 3-byte padding before count is gone and the host/device layouts match exactly.
I kept uint32_t/uint64_t rather than cl_uint/cl_ulong on the host: result.hpp is shared with the CUDA build (guarded by __LIB_CUDA), which has no OpenCL headers, and every other algo's result.hpp (ethash/blake3/progpow/autolykos) uses the fixed-width types too. Happy to revisit if you'd prefer the cl_* types here.
| ${OpenCL_INCLUDE_DIRS}) | ||
| target_compile_definitions(${UNIT_TEST_EXE} | ||
| PRIVATE KH_CL_PATH="${CMAKE_CURRENT_SOURCE_DIR}/kheavyhash.cl") | ||
| PRIVATE KH_CL_PATH="${CMAKE_CURRENT_SOURCE_DIR}/kheavyhash.cl" |
There was a problem hiding this comment.
I really don't understand the need for this.
I don't need such workarounds for the other algorithms.
I will look into this to understand either the problem or the misunderstanding.
But for progpow (and its derivatives) / ethash / autolykos I never needed this.
Everything is already copied into the kernel folder via CMakeLists.txt.
Exemple:
set(OUT_PROGPOW ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/kernel/progpow)
file(MAKE_DIRECTORY ${OUT_PROGPOW})
set(PROGPOW_FILES
progpow_result.cl
evrprogpow_functions.cl
firopow_functions.cl
progpow_functions.cl
kawpow_functions.cl
meowpow_functions.cl
progpow.cl
)
foreach(FILE ${PROGPOW_FILES})
add_custom_command(
OUTPUT ${OUT_PROGPOW}/${FILE}
COMMAND ${CMAKE_COMMAND} -E copy_if_different ${CMAKE_CURRENT_SOURCE_DIR}/${FILE} ${OUT_PROGPOW}/${FILE}
DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/${FILE}
COMMENT "Copy ${FILE}"
)
endforeach()
add_custom_target(copy_progpow_opencl_files ALL
DEPENDS
${OUT_PROGPOW}/progpow_result.cl
${OUT_PROGPOW}/evrprogpow_functions.cl
${OUT_PROGPOW}/firopow_functions.cl
${OUT_PROGPOW}/progpow_functions.cl
${OUT_PROGPOW}/kawpow_functions.cl
${OUT_PROGPOW}/meowpow_functions.cl
${OUT_PROGPOW}/progpow.cl
)
There was a problem hiding this comment.
Left as-is for now — holding the .cl copy mechanism pending your AMD review, per your note below.
| // The search kernel is split across shared helpers (rol_u64, load/store LE), the | ||
| // per-algo Result struct, and the kheavyhash body — appended in this order so each | ||
| // definition precedes its use, mirroring ResolverAmdKHeavyHash::buildSearch(). | ||
| static std::array<std::string, 4> resolveKernelPaths() |
There was a problem hiding this comment.
Waiting for my review on the .cl file copy.
I need to run some tests on AMD on my side.
There was a problem hiding this comment.
Understood — leaving the kernel-path stitching as-is until you've run your AMD tests on the .cl copy approach.
| } | ||
| } | ||
|
|
||
| // rotl64 / loadLe64 / storeLe256 now come from the shared OpenCL helpers |
There was a problem hiding this comment.
Done — comments removed.
|
|
||
| // Real mining kernel: each work-item tries nonce = startNonce + global_id(0). | ||
| // On a hit (pow <= target, little-endian) it publishes its nonce into result. | ||
| // The Result struct and publishHit() come from kheavyhash_result.cl, appended |
There was a problem hiding this comment.
Done — comment removed.
…mments, notFindNonce tests) - kheavyhash_result.cl: drop header comment and MAX_RESULT fallback define (the value is always supplied by the kernel generator via addDefine); widen found from uchar to uint to remove implicit struct padding so the host and device Result layouts match exactly; put inline on its own line. - result.hpp: widen Result::found from bool to uint32_t to mirror the device uint (kept uint32_t rather than cl_uint because this header is shared with the CUDA build, which has no OpenCL headers). - opencl_kat_test.cpp: mirror the found widening in the KAT Result struct. - kheavyhash.cl: drop the appended-by-resolver meta-comments. - amd/nvidia tests: add notFindNonce coverage (minimum target -> no submit).
| // sources/algo/kheavyhash/opencl/kheavyhash.cl (found | count | nonces). | ||
| struct alignas(8) Result | ||
| { | ||
| bool found{ false }; |
There was a problem hiding this comment.
you can keep bool = 4 bytes, uint32_t = 4 bytes.
bool is more understandable than uint32_t for this feature.
There was a problem hiding this comment.
Done — reverted host-side Result::found to bool (commit 20ce8be). The device-side struct keeps uint found; with bool the host still lands count at offset 4 (1-byte bool + 3 padding), so the layout matches the device exactly. Rebuilt cross-amd and re-ran on RX 9070 XT: all 13 kHeavyHash tests green, including the OpenCL SearchKernel KAT that round-trips this struct.
| struct alignas(8) Result | ||
| { | ||
| uint8_t found{ 0u }; | ||
| uint32_t found{ 0u }; |
There was a problem hiding this comment.
bool 4 bytes
uint32_t 4 bytes
use bool please
There was a problem hiding this comment.
Done — switched the KAT mirror struct back to bool found as well (commit 20ce8be), keeping it in lockstep with algo::kheavyhash::Result. Search/SearchKernel KATs pass bit-exact on RX 9070 XT.
The device-side Result keeps uint found; on the host bool still places count at offset 4 (1-byte bool + 3 padding), matching the device layout, and reads more clearly for a found flag. Per review feedback.
What
Two related changes to the kHeavyHash backend:
1. Reuse common OpenCL helpers + split the Result struct
The kHeavyHash OpenCL kernel redefined
rotl64/loadLe64/storeLe256and theResultstruct inline. This drops the duplicates and aligns with theethash/autolykos/progpow convention:
kheavyhash.cl: drop inline helpers; call sharedrol_u64(
kernel/common/rotate_byte.cl) andload_le_u64/store_le_u256(
kernel/common/load_store_le.cl). Bodies are bit-identical.kheavyhash_result.cl: theResultstruct +publishHit, mirroringethash_result.cl/autolykos_v2_result.cl.ResolverAmdKHeavyHash::buildSearchand the OpenCL KAT harness append theshared helpers + result + body in dependency order.
kheavyhash_result.cland exposesKH_CL_DIR/KH_CL_COMMON_DIRfor the KAT.2. Resolver unit tests
New
resolver/amd/tests/kheavyhash.cppandresolver/nvidia/tests/kheavyhash.cppexercise
findNonceend to end (all-0xFFboundary → every nonce is a hit →a share is submitted).
Verification
Built via the windows-cross docker toolchain and run on real hardware
(AMD Radeon RX 9070 XT, gfx1201):
OpenClKat.powHashMatchesReference/kHeavyHashMatchesReference/heavyHashMatchesReference— PASS: the refactored kernel program buildsand is bit-identical to the CPU reference.
Search/SearchKernel.reportsHitAtWinningNonce/noHitWhenPowExceedsTarget— PASS: end-to-end winning-nonce + boundary.ResolverKHeavyHashAmdTest.findNonce— PASS.ResolverKHeavyHashNvidiaTest.findNoncecompiles and links (clang-CUDA);not run here (no NVIDIA device on the test rig).
Since the kernel bodies are bit-identical (KAT-gated), no hashrate change is
expected versus the previous self-contained kernel.