Skip to content

refactor(kheavyhash): reuse common OpenCL helpers + add resolver tests#184

Merged
luminousmining merged 5 commits into
luminousmining:mainfrom
yuzi-co:feat/kheavyhash-opencl-common
Jun 22, 2026
Merged

refactor(kheavyhash): reuse common OpenCL helpers + add resolver tests#184
luminousmining merged 5 commits into
luminousmining:mainfrom
yuzi-co:feat/kheavyhash-opencl-common

Conversation

@yuzi-co

@yuzi-co yuzi-co commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

What

Two related changes to the kHeavyHash backend:

1. Reuse common OpenCL helpers + split the Result struct

The kHeavyHash OpenCL kernel redefined rotl64/loadLe64/storeLe256 and the
Result struct inline. This drops the duplicates and aligns with the
ethash/autolykos/progpow convention:

  • kheavyhash.cl: drop inline helpers; call shared rol_u64
    (kernel/common/rotate_byte.cl) and load_le_u64/store_le_u256
    (kernel/common/load_store_le.cl). Bodies are bit-identical.
  • New kheavyhash_result.cl: the Result struct + publishHit, mirroring
    ethash_result.cl / autolykos_v2_result.cl.
  • ResolverAmdKHeavyHash::buildSearch and the OpenCL KAT harness append the
    shared helpers + result + body in dependency order.
  • CMake deploys kheavyhash_result.cl and exposes KH_CL_DIR /
    KH_CL_COMMON_DIR for the KAT.

2. Resolver unit tests

New resolver/amd/tests/kheavyhash.cpp and resolver/nvidia/tests/kheavyhash.cpp
exercise findNonce end to end (all-0xFF boundary → every nonce is a hit →
a share is submitted).

Verification

Built via the windows-cross docker toolchain and run on real hardware
(AMD Radeon RX 9070 XT, gfx1201):

  • OpenClKat.powHashMatchesReference / kHeavyHashMatchesReference /
    heavyHashMatchesReferencePASS: the refactored kernel program builds
    and is bit-identical to the CPU reference.
  • Search/SearchKernel.reportsHitAtWinningNonce /
    noHitWhenPowExceedsTargetPASS: end-to-end winning-nonce + boundary.
  • ResolverKHeavyHashAmdTest.findNoncePASS.
  • All 14 kHeavyHash unit tests green.
  • ResolverKHeavyHashNvidiaTest.findNonce compiles and links (clang-CUDA);
    not run here (no NVIDIA device on the test rig).

Since the kernel bodies are bit-identical (KAT-gated), no hashrate change is
expected versus the previous self-contained kernel.

yuzi-co and others added 3 commits June 19, 2026 16:38
…uminousmining#179)

The kHeavyHash kernel redefined helpers and the Result struct inline. Use
the shared common helpers instead and move the Result struct out into a
per-algo file, matching the ethash/autolykos/progpow convention.

- kheavyhash.cl: drop inline rotl64/loadLe64/storeLe256; call the shared
  rol_u64 (kernel/common/rotate_byte.cl) and load_le_u64/store_le_u256
  (kernel/common/load_store_le.cl). Bodies are bit-identical, so the
  KAT-gated kernel stays bit-identical.
- New kheavyhash_result.cl: the Result struct + publishHit, mirroring
  ethash_result.cl / autolykos_v2_result.cl.
- ResolverAmdKHeavyHash::buildSearch now appends rotate_byte.cl,
  load_store_le.cl and kheavyhash_result.cl ahead of the kernel body.
- OpenCL KAT harness resolves and appends the same file set (the in-tree
  source layout and the deployed kernel/ tree differ, so each file is
  resolved independently).
- CMake deploys kheavyhash_result.cl and exposes KH_CL_DIR /
  KH_CL_COMMON_DIR for the KAT.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ing#176)

The kHeavyHash AMD and NVIDIA resolvers had no resolver-level test, unlike
every other algorithm. Add one each, following the blake3 resolver-test
pattern: build a job with a maximum (all-0xFF) boundary so every scanned
nonce is a hit, run the resolver end to end, and assert a share is
submitted (paramSubmit array non-empty).

Files auto-register via the tests/ file(GLOB).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The headerHash/boundary assignments in the new AMD and NVIDIA resolver
tests were wrapped onto two lines; the single-line form is 113 columns
(under the 120 limit), so per CODING_STYLE.md it stays on one line.
@@ -0,0 +1,24 @@
// Result buffer shared with the host (mirrors algo::kheavyhash::Result in

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — comment removed.

// Result buffer shared with the host (mirrors algo::kheavyhash::Result in
// sources/algo/kheavyhash/result.hpp). MAX_RESULT is overridable by the host
// kernel generator (addDefine); the fallback keeps standalone builds valid.
#ifndef MAX_RESULT

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No overide, define only from kernelGenerator

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — dropped the #ifndef MAX_RESULT fallback. The value is always supplied by the kernel generator (addDefine), like the other OpenCL kernels.

} Result;


inline void publishHit(__global Result* result, ulong const nonce)

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline
void publishHit(__global Result* result, ulong const nonce)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — inline now on its own line.

#define MAX_RESULT 4
#endif

typedef struct __attribute__((aligned(8)))

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using uchar for found creates 3 bytes of implicit padding before count to satisfy its 4-byte alignment requirement:

Offset 0 : found  (1 byte)
Offset 1 : [pad]  (3 bytes) ← implicit padding
Offset 4 : count  (4 bytes)
Offset 8 : nonces (...)

In OpenCL, this is dangerous — the host compiler and the GPU driver compiler may handle padding differently, leading to mismatched struct layouts and corrupted buffer reads. Using uint for found removes the padding entirely and guarantees a consistent layout on both sides.

Also, we should consider switching from uint to cl_uint (and cl_ulong for ulong) on the host side. These are the portable types defined by the OpenCL spec and guarantee an exact match with the device types, regardless of the platform.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Widened found from uchar to uint (host Result::found from bool to uint32_t), so the implicit 3-byte padding before count is gone and the host/device layouts match exactly.

I kept uint32_t/uint64_t rather than cl_uint/cl_ulong on the host: result.hpp is shared with the CUDA build (guarded by __LIB_CUDA), which has no OpenCL headers, and every other algo's result.hpp (ethash/blake3/progpow/autolykos) uses the fixed-width types too. Happy to revisit if you'd prefer the cl_* types here.

Comment thread sources/resolver/amd/tests/kheavyhash.cpp
${OpenCL_INCLUDE_DIRS})
target_compile_definitions(${UNIT_TEST_EXE}
PRIVATE KH_CL_PATH="${CMAKE_CURRENT_SOURCE_DIR}/kheavyhash.cl")
PRIVATE KH_CL_PATH="${CMAKE_CURRENT_SOURCE_DIR}/kheavyhash.cl"

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't understand the need for this.
I don't need such workarounds for the other algorithms.
I will look into this to understand either the problem or the misunderstanding.
But for progpow (and its derivatives) / ethash / autolykos I never needed this.
Everything is already copied into the kernel folder via CMakeLists.txt.

Exemple:

set(OUT_PROGPOW ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/kernel/progpow)

file(MAKE_DIRECTORY ${OUT_PROGPOW})

set(PROGPOW_FILES
  progpow_result.cl
  evrprogpow_functions.cl
  firopow_functions.cl
  progpow_functions.cl
  kawpow_functions.cl
  meowpow_functions.cl
  progpow.cl
)

foreach(FILE ${PROGPOW_FILES})
  add_custom_command(
    OUTPUT ${OUT_PROGPOW}/${FILE}
    COMMAND ${CMAKE_COMMAND} -E copy_if_different ${CMAKE_CURRENT_SOURCE_DIR}/${FILE} ${OUT_PROGPOW}/${FILE}
    DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/${FILE}
    COMMENT "Copy ${FILE}"
  )
endforeach()

add_custom_target(copy_progpow_opencl_files ALL
  DEPENDS
    ${OUT_PROGPOW}/progpow_result.cl
    ${OUT_PROGPOW}/evrprogpow_functions.cl
    ${OUT_PROGPOW}/firopow_functions.cl
    ${OUT_PROGPOW}/progpow_functions.cl
    ${OUT_PROGPOW}/kawpow_functions.cl
    ${OUT_PROGPOW}/meowpow_functions.cl
    ${OUT_PROGPOW}/progpow.cl
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left as-is for now — holding the .cl copy mechanism pending your AMD review, per your note below.

// The search kernel is split across shared helpers (rol_u64, load/store LE), the
// per-algo Result struct, and the kheavyhash body — appended in this order so each
// definition precedes its use, mirroring ResolverAmdKHeavyHash::buildSearch().
static std::array<std::string, 4> resolveKernelPaths()

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Waiting for my review on the .cl file copy.
I need to run some tests on AMD on my side.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood — leaving the kernel-path stitching as-is until you've run your AMD tests on the .cl copy approach.

}
}

// rotl64 / loadLe64 / storeLe256 now come from the shared OpenCL helpers

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove comments

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — comments removed.


// Real mining kernel: each work-item tries nonce = startNonce + global_id(0).
// On a hit (pow <= target, little-endian) it publishes its nonce into result.
// The Result struct and publishHit() come from kheavyhash_result.cl, appended

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — comment removed.

Comment thread sources/resolver/nvidia/tests/kheavyhash.cpp
…mments, notFindNonce tests)

- kheavyhash_result.cl: drop header comment and MAX_RESULT fallback define
  (the value is always supplied by the kernel generator via addDefine);
  widen found from uchar to uint to remove implicit struct padding so the
  host and device Result layouts match exactly; put inline on its own line.
- result.hpp: widen Result::found from bool to uint32_t to mirror the device
  uint (kept uint32_t rather than cl_uint because this header is shared with
  the CUDA build, which has no OpenCL headers).
- opencl_kat_test.cpp: mirror the found widening in the KAT Result struct.
- kheavyhash.cl: drop the appended-by-resolver meta-comments.
- amd/nvidia tests: add notFindNonce coverage (minimum target -> no submit).
// sources/algo/kheavyhash/opencl/kheavyhash.cl (found | count | nonces).
struct alignas(8) Result
{
bool found{ false };

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can keep bool = 4 bytes, uint32_t = 4 bytes.
bool is more understandable than uint32_t for this feature.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — reverted host-side Result::found to bool (commit 20ce8be). The device-side struct keeps uint found; with bool the host still lands count at offset 4 (1-byte bool + 3 padding), so the layout matches the device exactly. Rebuilt cross-amd and re-ran on RX 9070 XT: all 13 kHeavyHash tests green, including the OpenCL SearchKernel KAT that round-trips this struct.

struct alignas(8) Result
{
uint8_t found{ 0u };
uint32_t found{ 0u };

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool 4 bytes
uint32_t 4 bytes
use bool please

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — switched the KAT mirror struct back to bool found as well (commit 20ce8be), keeping it in lockstep with algo::kheavyhash::Result. Search/SearchKernel KATs pass bit-exact on RX 9070 XT.

The device-side Result keeps uint found; on the host bool still places
count at offset 4 (1-byte bool + 3 padding), matching the device layout,
and reads more clearly for a found flag. Per review feedback.
@luminousmining luminousmining merged commit 454ec51 into luminousmining:main Jun 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants