bench(amd): add Autolykos v2 throughput benchmark #173
bdb4b05 to 5d62eb0
Closes the last coverage gap in the AMD benchmark suite: autolykos_v2 was
the only mined AMD algorithm with no in-tree benchmark, while NVIDIA already
had one. Rescued from the stale bench/amd-ethash-progpow-ab branch (the only
unmerged content on it) and normalized to the sibling-benchmark conventions.
The benchmark reuses the production resolver kernels (DAG build, search,
verify) byte-for-byte via the kernel generator, so it measures exactly what
the miner runs. The DAG is built once (untimed) and the two hot-loop kernels
are timed independently; the grid is expressed in nonce terms so the report
reads nonces/s.
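The nonces/s figure follows directly from the grid size and the measured time. A minimal sketch, assuming a hypothetical helper `noncesPerSecond` (not part of this PR) and wall-clock timing around the timed launches:

```cpp
#include <cstdint>

// Hypothetical helper, for illustration only: converts one timed stage
// into a rate. gridNonces is the number of nonces covered by one launch,
// launches is the number of timed launches, and elapsedSeconds is the
// total wall time spent in those launches.
double noncesPerSecond(std::uint64_t const gridNonces,
                       std::uint32_t const launches,
                       double const elapsedSeconds)
{
    if (elapsedSeconds <= 0.0)
    {
        return 0.0; // degenerate measurement: avoid dividing by zero
    }
    return static_cast<double>(gridNonces) * launches / elapsedSeconds;
}
```

For example, 10 launches of a 1000-nonce grid in 2 seconds give 5000 nonces/s. The untimed DAG build would sit outside the measured interval.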
- sources/benchmark/amd/autolykos_v2.cpp: runAmdAutolykos() (new)
- workflow.{hpp,cpp}: declare + call it in runAmd()
- config.json: amd.autolykos_v2 entry, disabled by default like its siblings
- config.cpp: register autolykos_v2 in the AMD defaults (parity with NVIDIA)
Verified on RX 9070 XT: autolykos_v2_search ~1.05 GH/s, autolykos_v2_verify
~69 MH/s.
Add autolykos_v2 to the benchmark source-tree and runAmd() call-graph diagrams, a row in the kernel coverage table, and an AMD RX 9070 XT reference-results section (search ~1.05 GH/s, verify ~69 MH/s).
The AMD autolykos_v2 benchmark assembled the live production kernels (kernel/autolykos/*.cl). Move to self-contained snapshots under benchmark/opencl/autolykos_v2/, matching the kheavyhash/ethash convention:
- autolykos_v2_lm1.cl - search throughput kernel (entry renamed)
- autolykos_v2_verify.cl
- autolykos_v2_dag.cl - untimed DAG fill
Each is a frozen copy of the production assembly (result + rotate_byte + blake2b [+ blake2b_compress] + var_global + search/verify/dag), so the benchmark never tracks live mining-kernel edits.
Register the files in the opencl copy target, point the driver at kernel/autolykos_v2/*, and rename the search stage autolykos_v2_search -> autolykos_v2_lm1 (config.json + docs).
Verified on RX 9070 XT (gfx1201): autolykos_v2_lm1 ~1.03 GH/s, verify ~65 MH/s.
5d62eb0 to dd65791
| Ethash | 2 (lm_0, lm_1) | barrier vs sub_group_barrier |
| ProgPOW | 2 (lm_0, lm_1) | barrier vs sub_group_barrier |
| kHeavyHash | 6 (lm0–lm5) | ALU-bound; `v_dot4` matmul + keccak midstate ([study](reasearch_and_development/kheavyhash/amd.md)) |
| Autolykos V2 | 2 (search, verify) | production Ergo kernels; DAG-bound throughput coverage |
I should see search_lm and verify_lm.
| Kernel | Hashrate (steady) | Notes |
|-----------------------|-------------------|-------|
| `autolykos_v2_lm1` | **~1.05 GH/s** | blake2b prehash over the DAG → BHashes; tracks the GPU boost clock, so it varies run to run |
| `autolykos_v2_verify` | ~69 MH/s | final blake2b + boundary test over the full grid |
    0x1F83D9ABFB41BD6B, 0x5BE0CD19137E2179
};
__kernel
void autolykos_v2_verify(
autolykos_v2_verify -> autolykos_v2_verify_lm0
    ulong nonces[4];
} t_result;
inline
ulong rol_u64(
Already exists in rotate_byte.cl.
inline
ulong ror_u64(
Already exists in rotate_byte.cl.
inline
uint rol_u32(uint x, uint n)
Already exists in rotate_byte.cl.
inline
uint ror_u32(uint x, uint n)
Already exists in rotate_byte.cl.
inline
uint bswap32(uint const x)
…, rename kernels
- Drop inlined rol/ror/bswap helpers from the dag/search/verify kernels;
the driver now prepends the shared kernel/common/rotate_byte.cl, matching
the production resolver's assembly order.
- Verify kernel uses the shared kernel/common/result.cl t_result struct
(driver appends it); the local copy and the unused struct in the search
kernel are removed.
- Uppercase the function-style macros and the blake2b IV constant:
reverseBytesInt -> REVERSE_BYTES_INT, fn_Add -> FN_ADD, ivals -> IVALS.
- Rename the timed kernels and their files to the *_lm0 convention:
autolykos_v2_lm1{.cl,} -> autolykos_v2_search_lm0,
autolykos_v2_verify{.cl,} -> autolykos_v2_verify_lm0.
Driver setKernelName/appendFile/runStage labels, the opencl CMakeLists
copy list and BENCHMARK.md updated to match.
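For readers without the tree at hand, the rotate helpers the review refers to can be sketched as follows. This is a host-side C++ approximation under stated assumptions: only the names `rol_u64`, `ror_u64`, `rol_u32` and `ror_u32` come from the diff; the bodies below are an illustration, not the contents of rotate_byte.cl.

```cpp
#include <cstdint>

// Illustrative bodies only; the shared rotate_byte.cl may implement these
// differently (for example via the OpenCL rotate() built-in).
// The shift counts are masked so that n == 0 never shifts by the full width.
constexpr std::uint64_t rol_u64(std::uint64_t const x, std::uint32_t const n)
{
    return (x << (n & 63u)) | (x >> ((64u - n) & 63u));
}

constexpr std::uint64_t ror_u64(std::uint64_t const x, std::uint32_t const n)
{
    return (x >> (n & 63u)) | (x << ((64u - n) & 63u));
}

constexpr std::uint32_t rol_u32(std::uint32_t const x, std::uint32_t const n)
{
    return (x << (n & 31u)) | (x >> ((32u - n) & 31u));
}

constexpr std::uint32_t ror_u32(std::uint32_t const x, std::uint32_t const n)
{
    return (x >> (n & 31u)) | (x << ((32u - n) & 31u));
}
```

Keeping one shared copy of such helpers, prepended by the driver, is what lets the dag/search/verify kernels drop their inlined duplicates.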
Pushed 99dce0d addressing the review. Common code dedup
Uppercase
Renames — one note: your suggested targets had a few typos (
Driver Happy to switch the names if you prefer a different scheme.
inline
void blake2b_compress(ulong *h, const ulong *m, ulong t, ulong f)
Const order: ulong const* const m
}
__kernel
void autolykos_v2_build_dag(
    __global uint* restrict dag,
__global uint* restrict dag, -> __global uint* const restrict dag, ?
// Hash constant message
//====================================================================//
ulong ctr = 0;
for (int x = 1; x < 16; ++x, ++ctr)
${OUT_COMMON}/kheavyhash/kHeavyHash_lm3.cl
${OUT_COMMON}/kheavyhash/kHeavyHash_lm4.cl
${OUT_COMMON}/kheavyhash/kHeavyHash_lm5.cl
${OUT_COMMON}/autolykos_v2/autolykos_v2_dag.cl
kheavyhash/kHeavyHash_lm3.cl
kheavyhash/kHeavyHash_lm4.cl
kheavyhash/kHeavyHash_lm5.cl
autolykos_v2/autolykos_v2_dag.cl
…64 helper, unroll attributes
- blake2b_compress: const-correct signature (ulong* const h, ulong const* const m, ulong const t/f); build_dag dag arg is now __global uint* const restrict
- add bswap64 to common/opencl/rotate_byte.cl and use it (plus existing bswap32) in the DAG kernel instead of inlined as_ulong/as_uchar8 byte-swaps
- DAG loop counters renamed to i; add the missing unroll hint on the message-init loop
- replace #pragma unroll / unroll N with __attribute__((opencl_unroll_hint[(N)])) across dag/search/verify kernels (house convention)
- 2 blank lines before each __kernel
- CMakeLists: order autolykos_v2 entries alphabetically (after common/)
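The 64-bit byte swap mentioned above can be sketched as follows. This is a host-side C++ approximation: only the name `bswap64` comes from the commit message; the body is an assumption and may differ from what was added to rotate_byte.cl.

```cpp
#include <cstdint>

// Illustrative body only: reverses the eight bytes of a 64-bit word by
// masking each byte and shifting it to its mirrored position.
constexpr std::uint64_t bswap64(std::uint64_t const x)
{
    return ((x & 0x00000000000000FFull) << 56)
         | ((x & 0x000000000000FF00ull) << 40)
         | ((x & 0x0000000000FF0000ull) << 24)
         | ((x & 0x00000000FF000000ull) << 8)
         | ((x & 0x000000FF00000000ull) >> 8)
         | ((x & 0x0000FF0000000000ull) >> 24)
         | ((x & 0x00FF000000000000ull) >> 40)
         | ((x & 0xFF00000000000000ull) >> 56);
}
```

A named helper of this kind replaces the inlined as_ulong/as_uchar8 byte-swaps at each call site, and applying it twice returns the original value.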
Pushed const-correctness
bswap helpers
loop counters / unroll
layout
CMakeLists
One open question: the verify/search
If you can create common functions that can be used in other kernels and that are not "specific" to AutolykosV2, then I would be happy to have those functions in If you are referring to: This part is too specific to
The search and verify benchmark kernels carried a local REVERSE_BYTES_INT(input, output) macro that is byte-for-byte identical to the bswap32 helper already provided by common/opencl/rotate_byte.cl (both reverse the four bytes of a uint). rotate_byte.cl is already prepended to every autolykos_v2 benchmark kernel, so route the call sites through bswap32 and drop the duplicate macro from both files. The B2B_* blake2b macros remain AutolykosV2-specific. No functional change: GPU-verified on an RX 9070 XT (search ~1.49 GH/s, verify ~69 MH/s), kernels compile and run unchanged.
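The equivalence this commit relies on can be illustrated in host-side C++. The macro body below is a hypothetical stand-in (the PR does not show the removed definition), and `bswap32` is sketched with an assumed body; only the two names and the two-argument macro shape come from the commit message.

```cpp
#include <cstdint>

// Hypothetical stand-in for the removed macro: writes the byte-reversed
// value of `input` into `output`.
#define REVERSE_BYTES_INT(input, output)                \
    do {                                                \
        (output) = (((input) & 0x000000FFu) << 24)      \
                 | (((input) & 0x0000FF00u) << 8)       \
                 | (((input) & 0x00FF0000u) >> 8)       \
                 | (((input) & 0xFF000000u) >> 24);     \
    } while (0)

// Assumed shape of the shared helper: same byte reversal, but as a function
// that returns the result instead of writing to an output argument.
inline std::uint32_t bswap32(std::uint32_t const x)
{
    return ((x & 0x000000FFu) << 24)
         | ((x & 0x0000FF00u) << 8)
         | ((x & 0x00FF0000u) >> 8)
         | ((x & 0xFF000000u) >> 24);
}

// Before: REVERSE_BYTES_INT(word, swapped);
// After:  swapped = bswap32(word);
inline bool sameResult(std::uint32_t const word)
{
    std::uint32_t viaMacro = 0u;
    REVERSE_BYTES_INT(word, viaMacro);
    return viaMacro == bswap32(word);
}
```

Routing call sites through the function form also removes the macro's output-argument style, which is easier to read at the call site.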
The kernels were renamed to autolykos_v2_search_lm0 / autolykos_v2_verify_lm0, but the AMD benchmark config still referenced the pre-rename names (autolykos_v2_lm1 / autolykos_v2_verify). isKernelEnabled therefore matched nothing and the autolykos_v2 table rendered empty. Update the config to the current kernel names so the benchmark times the kernels the driver actually builds.
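The failure mode described here is an exact-name lookup that silently matches nothing. A minimal sketch, assuming a simplified `isKernelEnabled` that takes the list of names enabled in the config (the real signature is not shown in this PR) and a hypothetical `countTimedStages` helper:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Simplified model, for illustration only: a stage is enabled when its
// exact name appears in the config's list of enabled kernels.
bool isKernelEnabled(std::vector<std::string> const& enabledInConfig,
                     std::string const& kernelName)
{
    return std::find(enabledInConfig.begin(), enabledInConfig.end(), kernelName)
        != enabledInConfig.end();
}

// Hypothetical helper: counts how many of the driver's stages would be timed.
std::size_t countTimedStages(std::vector<std::string> const& enabledInConfig,
                             std::vector<std::string> const& driverStages)
{
    std::size_t timed = 0;
    for (std::string const& stage : driverStages)
    {
        if (isKernelEnabled(enabledInConfig, stage))
        {
            ++timed;
        }
    }
    return timed;
}
```

With the pre-rename names in the config and the `*_lm0` names in the driver, the count is zero and the results table stays empty; updating the config restores both stages.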
Done — pushed Generic helper →
Also fixed: the AMD benchmark GPU re-verified on an RX 9070 XT (gfx1201):
What
Adds an in-tree AMD throughput benchmark for Autolykos v2 — the last
AMD-mined algorithm without one. It reuses the production resolver kernels
(DAG build, search, verify) byte-for-byte via the kernel generator, so it
measures exactly what the miner runs.
The DAG is built once (untimed setup); the two hot-loop kernels,
autolykos_v2_search and autolykos_v2_verify, are timed independently,
with the grid expressed in nonce terms so the report reads nonces/s.
Changes
- sources/benchmark/amd/autolykos_v2.cpp: runAmdAutolykos() (new)
- sources/benchmark/workflow.{hpp,cpp}: declare + call it in runAmd()
- sources/benchmark/config.json: amd.autolykos_v2 entry (disabled by default, like its siblings)
- sources/benchmark/config.cpp: register autolykos_v2 in the AMD defaults (parity with NVIDIA)
- documentation/BENCHMARK.md: source-tree / call-graph diagrams, kernel-table row, RX 9070 XT reference results
sources/benchmark/amd/CMakeLists.txt globs *.cpp, and the .cl files
auto-stage into bin/kernel/autolykos/ via the existing ALL target, so no
build-system changes are required.
Verification
Built for AMD and run on an RX 9070 XT:
- autolykos_v2_search
- autolykos_v2_verify
Issue
Part of #89 — completes AMD → AutolykosV2. (AMD Ethash, still unchecked
there, already landed in #142; details in an issue comment.)