
bench(amd): add Autolykos v2 throughput benchmark#173

Merged
luminousmining merged 7 commits into luminousmining:main from yuzi-co:bench/amd-autolykos-v2 on Jun 14, 2026

Conversation

@yuzi-co yuzi-co commented Jun 11, 2026

Contributor

What

Adds an in-tree AMD throughput benchmark for Autolykos v2 — the last
AMD-mined algorithm without one. It reuses the production resolver kernels
(DAG build, search, verify) byte-for-byte via the kernel generator, so it
measures exactly what the miner runs.

The DAG is built once (untimed setup); the two hot-loop kernels —
autolykos_v2_search and autolykos_v2_verify — are timed independently,
with the grid expressed in nonce terms so the report reads nonces/s.
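
As a rough illustration of reporting in nonce terms, the sketch below converts a timed run into a nonces/s figure. The helper names (`noncesPerSecond`, `formatHashrate`) and the launch counts are assumptions made for this example; they are not taken from the benchmark driver.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: turns a timed kernel run into nonces per second.
// noncesPerLaunch is the grid size expressed in nonces, launches is the
// number of timed launches, elapsed is the total time spent in them.
inline double noncesPerSecond(std::uint64_t const noncesPerLaunch,
                              std::uint64_t const launches,
                              std::chrono::nanoseconds const elapsed)
{
    if (elapsed.count() <= 0)
    {
        return 0.0; // nothing was timed, avoid dividing by zero
    }
    double const seconds = std::chrono::duration<double>(elapsed).count();
    return static_cast<double>(noncesPerLaunch) * static_cast<double>(launches) / seconds;
}

// Hypothetical helper: formats a rate with an H/s, kH/s, MH/s or GH/s suffix.
inline std::string formatHashrate(double rate)
{
    char const* const units[] = { "H/s", "kH/s", "MH/s", "GH/s" };
    int index = 0;
    while (rate >= 1000.0 && index < 3)
    {
        rate /= 1000.0;
        ++index;
    }
    char buffer[64];
    std::snprintf(buffer, sizeof(buffer), "%.2f %s", rate, units[index]);
    return std::string(buffer);
}
```

With made-up inputs of 1,048,576 nonces per launch and 1000 launches completed in one second, `formatHashrate(noncesPerSecond(...))` yields `1.05 GH/s`. Timing search and verify separately just means calling this once per kernel with that kernel's own elapsed time.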

Changes

  • sources/benchmark/amd/autolykos_v2.cpp: runAmdAutolykos() (new)
  • sources/benchmark/workflow.{hpp,cpp}: declare + call it in runAmd()
  • sources/benchmark/config.json: amd.autolykos_v2 entry (disabled by default, like its siblings)
  • sources/benchmark/config.cpp: register autolykos_v2 in the AMD defaults (parity with NVIDIA)
  • documentation/BENCHMARK.md: source-tree / call-graph diagrams, kernel-table row, RX 9070 XT reference results

sources/benchmark/amd/CMakeLists.txt globs *.cpp, and the .cl files
auto-stage into bin/kernel/autolykos/ via the existing ALL target — no
build-system changes required.

Verification

Built for AMD and run on an RX 9070 XT:

| Kernel | Hashrate |
|---------------------|------------|
| autolykos_v2_search | ~1.05 GH/s |
| autolykos_v2_verify | ~69 MH/s |

Issue

Part of #89 — completes AMD → AutolykosV2. (AMD Ethash, still unchecked
there, already landed in #142; details in an issue comment.)

@yuzi-co yuzi-co mentioned this pull request Jun 11, 2026
6 tasks
@yuzi-co yuzi-co force-pushed the bench/amd-autolykos-v2 branch from bdb4b05 to 5d62eb0 Compare June 12, 2026 05:36
Comment thread sources/benchmark/amd/autolykos_v2.cpp
yuzi-co added 3 commits June 12, 2026 12:52
Closes the last coverage gap in the AMD benchmark suite: autolykos_v2 was
the only mined AMD algorithm with no in-tree benchmark, while NVIDIA already
had one. Rescued from the stale bench/amd-ethash-progpow-ab branch (the only
unmerged content on it) and normalized to the sibling-benchmark conventions.

The benchmark reuses the production resolver kernels (DAG build, search,
verify) byte-for-byte via the kernel generator, so it measures exactly what
the miner runs. The DAG is built once (untimed) and the two hot-loop kernels
are timed independently; the grid is expressed in nonce terms so the report
reads nonces/s.

- sources/benchmark/amd/autolykos_v2.cpp: runAmdAutolykos() (new)
- workflow.{hpp,cpp}: declare + call it in runAmd()
- config.json: amd.autolykos_v2 entry, disabled by default like its siblings
- config.cpp: register autolykos_v2 in the AMD defaults (parity with NVIDIA)

Verified on RX 9070 XT: autolykos_v2_search ~1.05 GH/s, autolykos_v2_verify
~69 MH/s.

Add autolykos_v2 to the benchmark source-tree and runAmd() call-graph
diagrams, a row in the kernel coverage table, and an AMD RX 9070 XT
reference-results section (search ~1.05 GH/s, verify ~69 MH/s).

The AMD autolykos_v2 benchmark assembled the live production kernels
(kernel/autolykos/*.cl). Move to self-contained snapshots under
benchmark/opencl/autolykos_v2/, matching the kheavyhash/ethash convention:

- autolykos_v2_lm1.cl    - search throughput kernel (entry renamed)
- autolykos_v2_verify.cl
- autolykos_v2_dag.cl     - untimed DAG fill

Each is a frozen copy of the production assembly (result + rotate_byte +
blake2b [+ blake2b_compress] + var_global + search/verify/dag), so the
benchmark never tracks live mining-kernel edits. Register the files in the
opencl copy target, point the driver at kernel/autolykos_v2/*, and rename the
search stage autolykos_v2_search -> autolykos_v2_lm1 (config.json + docs).

Verified on RX 9070 XT (gfx1201): autolykos_v2_lm1 ~1.03 GH/s, verify ~65 MH/s.
@yuzi-co yuzi-co force-pushed the bench/amd-autolykos-v2 branch from 5d62eb0 to dd65791 Compare June 12, 2026 10:10
Comment thread documentation/BENCHMARK.md Outdated
| Ethash | 2 (lm_0, lm_1) | barrier vs sub_group_barrier |
| ProgPOW | 2 (lm_0, lm_1) | barrier vs sub_group_barrier |
| kHeavyHash | 6 (lm0–lm5) | ALU-bound; `v_dot4` matmul + keccak midstate ([study](reasearch_and_development/kheavyhash/amd.md)) |
| Autolykos V2 | 2 (search, verify) | production Ergo kernels — DAG-bound throughput coverage |

Owner

I should see search_lm and verify_lm.

Comment thread documentation/BENCHMARK.md Outdated
| Kernel | Hashrate (steady) | Notes |
|-----------------------|-------------------|-------|
| `autolykos_v2_lm1` | **~1.05 GH** | blake2b prehash over the DAG → BHashes; tracks the GPU boost clock, so it varies run to run |
| `autolykos_v2_verify` | ~69 MH | final blake2b + boundary test over the full grid |

Owner

verify_lm0

Comment thread documentation/BENCHMARK.md Outdated

| Kernel | Hashrate (steady) | Notes |
|-----------------------|-------------------|-------|
| `autolykos_v2_lm1` | **~1.05 GH** | blake2b prehash over the DAG → BHashes; tracks the GPU boost clock, so it varies run to run |

Owner

ssearch_lm0

0x1F83D9ABFB41BD6B, 0x5BE0CD19137E2179
};
__kernel
void autolykos_v2_verify(

Owner

autolykos_v2_verify -> autolykos_v2_verify_lm0

ulong nonces[4];
} t_result;
inline
ulong rol_u64(

Owner

already exists in rotate_byte.cl



inline
ulong ror_u64(

Owner

already exists in rotate_byte.cl



inline
uint rol_u32(uint x, uint n)

Owner

already exists in rotate_byte.cl



inline
uint ror_u32(uint x, uint n)

Owner

already exists in rotate_byte.cl



inline
uint bswap32(uint const x)

Owner

move to rotate_byte.cl

Comment thread sources/benchmark/opencl/autolykos_v2/autolykos_v2_dag.cl
…, rename kernels

- Drop inlined rol/ror/bswap helpers from the dag/search/verify kernels;
  the driver now prepends the shared kernel/common/rotate_byte.cl, matching
  the production resolver's assembly order.
- Verify kernel uses the shared kernel/common/result.cl t_result struct
  (driver appends it); the local copy and the unused struct in the search
  kernel are removed.
- Uppercase the function-style macros and the blake2b IV constant:
  reverseBytesInt -> REVERSE_BYTES_INT, fn_Add -> FN_ADD, ivals -> IVALS.
- Rename the timed kernels and their files to the *_lm0 convention:
  autolykos_v2_lm1{.cl,} -> autolykos_v2_search_lm0,
  autolykos_v2_verify{.cl,} -> autolykos_v2_verify_lm0.
  Driver setKernelName/appendFile/runStage labels, the opencl CMakeLists
  copy list and BENCHMARK.md updated to match.
@yuzi-co

yuzi-co commented Jun 12, 2026

Contributor Author

Pushed 99dce0d addressing the review.

Common code dedup

  • Removed the inlined rol_u64/ror_u64/rol_u32/ror_u32/bswap32 helpers from the dag, search and verify kernels. The driver now prepends kernel/common/rotate_byte.cl to each KernelGenerator, same as ResolverAmdAutolykosV2 does.
  • The verify kernel now uses the shared kernel/common/result.cl t_result struct (driver appends it before the kernel body). The local struct copy is gone, and the unused struct in the search kernel was dropped too.
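
To make the ordering constraint concrete, here is a minimal sketch of concatenating kernel source parts so that shared helpers precede the kernel body. `SourceAssembler` is a hypothetical stand-in, not the project's KernelGenerator; the part contents are placeholders, and the relative order of the two shared files is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a kernel source generator: parts are emitted in
// the order they were appended, so helpers must be appended before the
// kernel body that calls them.
class SourceAssembler
{
public:
    void append(std::string const& name, std::string const& source)
    {
        parts.push_back("// ---- " + name + " ----\n" + source + "\n");
    }

    std::string build() const
    {
        std::string out;
        for (std::string const& part : parts)
        {
            out += part;
        }
        return out;
    }

private:
    std::vector<std::string> parts;
};

// Assembles a verify-kernel translation unit from placeholder sources.
inline std::string assembleVerifySketch()
{
    SourceAssembler assembler;
    assembler.append("rotate_byte.cl", "/* shared rol/ror/bswap helpers */");
    assembler.append("result.cl", "/* shared t_result struct */");
    assembler.append("autolykos_v2_verify_lm0.cl", "/* kernel body using the helpers above */");
    return assembler.build();
}
```

Because OpenCL C needs a declaration before first use, dropping the inlined helpers from the kernel files only works if the shared files end up earlier in the assembled string, which is what the append order guarantees here.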

Uppercase

  • reverseBytesInt → REVERSE_BYTES_INT, fn_Add → FN_ADD, ivals → IVALS.

Renames — one note: your suggested targets had a few typos (autolykoss, ssearch, earch) and mixed lm0/lm1, so I normalised to the *_lm0 convention used by the other algos. Please confirm these are what you meant:

  • file autolykos_v2_lm1.cl → autolykos_v2_search_lm0.cl, kernel autolykos_v2_search_lm0
  • file autolykos_v2_verify.cl → autolykos_v2_verify_lm0.cl, kernel autolykos_v2_verify_lm0
  • autolykos_v2_dag.cl kept its name (untimed setup, not a timed lm variant)

Driver setKernelName/appendFile/runStage labels, the benchmark/opencl/CMakeLists.txt copy list, and BENCHMARK.md (variants column + per-kernel table) all updated to match.

Happy to switch the names if you prefer a different scheme.



inline
void blake2b_compress(ulong *h, const ulong *m, ulong t, ulong f)

Owner

const order: ulong const* const m



inline
void blake2b_compress(ulong *h, const ulong *m, ulong t, ulong f)

Owner

t and f must be const



inline
void blake2b_compress(ulong *h, const ulong *m, ulong t, ulong f)

Owner

ulong* const h

}
__kernel
void autolykos_v2_build_dag(
__global uint* restrict dag,

Owner

__global uint* restrict dag, -> __global uint* const restrict dag, ?

// Hash constant message
//====================================================================//
ulong ctr = 0;
for (int x = 1; x < 16; ++x, ++ctr)

Owner

int i

Comment thread sources/benchmark/opencl/autolykos_v2/autolykos_v2_search_lm0.cl
Comment thread sources/benchmark/opencl/autolykos_v2/autolykos_v2_search_lm0.cl
Comment thread sources/benchmark/opencl/autolykos_v2/autolykos_v2_verify_lm0.cl
Comment thread sources/benchmark/opencl/CMakeLists.txt Outdated
${OUT_COMMON}/kheavyhash/kHeavyHash_lm3.cl
${OUT_COMMON}/kheavyhash/kHeavyHash_lm4.cl
${OUT_COMMON}/kheavyhash/kHeavyHash_lm5.cl
${OUT_COMMON}/autolykos_v2/autolykos_v2_dag.cl

Owner

Alphabetical order

Comment thread sources/benchmark/opencl/CMakeLists.txt Outdated
kheavyhash/kHeavyHash_lm3.cl
kheavyhash/kHeavyHash_lm4.cl
kheavyhash/kHeavyHash_lm5.cl
autolykos_v2/autolykos_v2_dag.cl

Owner

Alphabetical order

…64 helper, unroll attributes

- blake2b_compress: const-correct signature (ulong* const h, ulong const* const m,
  ulong const t/f); build_dag dag arg is now __global uint* const restrict
- add bswap64 to common/opencl/rotate_byte.cl and use it (plus existing bswap32)
  in the DAG kernel instead of inlined as_ulong/as_uchar8 byte-swaps
- DAG loop counters renamed to i; add the missing unroll hint on the message-init loop
- replace #pragma unroll / unroll N with __attribute__((opencl_unroll_hint[(N)]))
  across dag/search/verify kernels (house convention)
- 2 blank lines before each __kernel
- CMakeLists: order autolykos_v2 entries alphabetically (after common/)
@yuzi-co

yuzi-co commented Jun 14, 2026

Contributor Author

Pushed 3a543e2 addressing the latest review.

const-correctness

  • blake2b_compress signature is now void blake2b_compress(ulong* const h, ulong const* const m, ulong const t, ulong const f).
  • autolykos_v2_build_dag takes __global uint* const restrict dag.
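
For readers less used to east-const style, the host-side C++ sketch below shows what each qualifier in such a signature permits. `toyCompress` is a made-up function and is not blake2b; only the parameter qualifiers mirror the signature above.

```cpp
#include <cstdint>
#include <type_traits>

// Toy stand-in for a compression step. It is NOT blake2b; only the
// parameter qualifiers are the point of this example.
inline void toyCompress(std::uint64_t* const h,       // const pointer, writable state
                        std::uint64_t const* const m, // const pointer, read-only message
                        std::uint64_t const t,        // by-value, not modified in the body
                        std::uint64_t const f)        // by-value, not modified in the body
{
    h[0] ^= m[0] + t; // allowed: the words h points to are mutable
    h[1] ^= m[1] + f;
    // m[0] = 0;    // would not compile: message words are read-only
    // h = nullptr; // would not compile: the pointer itself is const
    // t += 1;      // would not compile: t is const inside the body
}

// In C++, top-level const on a parameter is not part of the function type,
// so these qualifiers constrain the body without affecting callers.
static_assert(std::is_same<void(std::uint64_t* const, std::uint64_t const),
                           void(std::uint64_t*, std::uint64_t)>::value,
              "top-level const does not change the function type");
```

Only the `const` to the left of the `*` in `m` changes what callers may pass without a cast; the rest documents and enforces intent inside the function.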

bswap helpers

  • Added bswap64 to sources/common/opencl/rotate_byte.cl (alongside the existing bswap32). The DAG kernel now calls bswap64(...)/bswap32(...) instead of the inlined as_ulong(as_uchar8(...).s76543210) / as_uint(as_uchar4(...).s3210) byte-swaps. rotate_byte.cl is already prepended to all three kernels, so the helper is in scope everywhere.
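
As a reference for what these byte swaps compute, here is a host-side C++ sketch using shifts and masks. The names `hostBswap32` and `hostBswap64` are invented for this example and are not the OpenCL helpers in rotate_byte.cl, which rely on vector swizzles instead.

```cpp
#include <cstdint>

// Reverses the four bytes of a 32-bit word: 0xAABBCCDD becomes 0xDDCCBBAA.
inline std::uint32_t hostBswap32(std::uint32_t const x)
{
    return ((x & 0x000000FFu) << 24)
         | ((x & 0x0000FF00u) << 8)
         | ((x & 0x00FF0000u) >> 8)
         | ((x & 0xFF000000u) >> 24);
}

// Reverses the eight bytes of a 64-bit word by swapping each 32-bit half
// and then exchanging the two halves.
inline std::uint64_t hostBswap64(std::uint64_t const x)
{
    std::uint64_t const low = hostBswap32(static_cast<std::uint32_t>(x));
    std::uint64_t const high = hostBswap32(static_cast<std::uint32_t>(x >> 32));
    return (low << 32) | high;
}
```

Applying either function twice returns the original value, which makes for a cheap sanity check when comparing a host reference against kernel output.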

loop counters / unroll

  • DAG loop counters renamed to i; added the missing unroll hint on the message-init loop.
  • Replaced every #pragma unroll / #pragma unroll N with __attribute__((opencl_unroll_hint)) / __attribute__((opencl_unroll_hint(N))) across the dag/search/verify kernels.

layout

  • 2 blank lines before each __kernel.

CMakeLists

  • Ordered the autolykos_v2 entries alphabetically (placed after the common/ block) in both BENCHMARK_OPENCL_FILES and the copy_benchmark_opencl DEPENDS list. I left the pre-existing progpow/kheavyhash order untouched since it's outside this PR — happy to fully sort the list if you'd prefer.

One open question: the verify/search B2B_* macros and REVERSE_BYTES_INT still use inline byte ops rather than the common helpers — I kept those as-is to keep these frozen snapshots byte-identical to the production assembly, but I can route them through rotate_byte.cl too if you'd like consistency there.

@luminousmining

Owner

If you can create common functions that can be used in other kernels and that are not "specific" to AutolykosV2, then I would be happy to have those functions in rotate_byte.cl.

If you are referring to:

#define B2B_G(v, a, b, c, d, x, y)

This part is too specific to AutolykosV2 and should therefore be kept exclusively for AutolykosV2.

yuzi-co added 2 commits June 14, 2026 10:54
The search and verify benchmark kernels carried a local
REVERSE_BYTES_INT(input, output) macro that is byte-for-byte identical to
the bswap32 helper already provided by common/opencl/rotate_byte.cl (both
reverse the four bytes of a uint). rotate_byte.cl is already prepended to
every autolykos_v2 benchmark kernel, so route the call sites through
bswap32 and drop the duplicate macro from both files. The B2B_* blake2b
macros remain AutolykosV2-specific.

No functional change: GPU-verified on an RX 9070 XT (search ~1.49 GH/s,
verify ~69 MH/s), kernels compile and run unchanged.

The kernels were renamed to autolykos_v2_search_lm0 /
autolykos_v2_verify_lm0, but the AMD benchmark config still referenced the
pre-rename names (autolykos_v2_lm1 / autolykos_v2_verify). isKernelEnabled
therefore matched nothing and the autolykos_v2 table rendered empty.
Update the config to the current kernel names so the benchmark times the
kernels the driver actually builds.
@yuzi-co

yuzi-co commented Jun 14, 2026

Contributor Author

Done — pushed 5ce5b34.

Generic helper → rotate_byte.cl
The only generic byte op left in the search/verify kernels was REVERSE_BYTES_INT(input, output), which is byte-for-byte identical to the bswap32 already in common/opencl/rotate_byte.cl (both reverse the 4 bytes of a uint). Since rotate_byte.cl is already prepended to every autolykos_v2 benchmark kernel, I routed all call sites through bswap32 and dropped the duplicate macro from both files.

B2B_G / B2B_* — kept AutolykosV2-specific as you asked; they stay in the kernels.

Also fixed: the AMD benchmark config.json still referenced the pre-rename kernel names (autolykos_v2_lm1 / autolykos_v2_verify), so isKernelEnabled matched nothing and the autolykos table rendered empty. Updated it to autolykos_v2_search_lm0 / autolykos_v2_verify_lm0.
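
The failure mode is easy to reproduce in isolation. The sketch below assumes the enable check is an exact name lookup; `BenchConfigSketch`, `isKernelEnabledSketch` and `countRunnable` are hypothetical names, and the real isKernelEnabled may be implemented differently.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of the benchmark config: the kernel names it enables.
struct BenchConfigSketch
{
    std::set<std::string> enabledKernels;
};

// Assumed behaviour: a kernel runs only if its exact name is in the config.
inline bool isKernelEnabledSketch(BenchConfigSketch const& config,
                                  std::string const& kernelName)
{
    return config.enabledKernels.count(kernelName) != 0;
}

// Counts how many of the kernels the driver builds would actually be timed.
inline std::size_t countRunnable(BenchConfigSketch const& config,
                                 std::vector<std::string> const& builtKernels)
{
    std::size_t runnable = 0;
    for (std::string const& name : builtKernels)
    {
        if (isKernelEnabledSketch(config, name))
        {
            ++runnable;
        }
    }
    return runnable;
}
```

With the pre-rename names in the config and the post-rename names in the driver, `countRunnable` returns 0 and the table stays empty without any error being raised; a startup check that warns when an enabled name matches no built kernel would surface this kind of drift earlier.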

GPU re-verified on an RX 9070 XT (gfx1201): autolykos_v2_search_lm0 ~1.49 GH/s, autolykos_v2_verify_lm0 ~69 MH/s — kernels compile and run unchanged.

@luminousmining luminousmining merged commit 665718d into luminousmining:main Jun 14, 2026
12 checks passed
@yuzi-co yuzi-co deleted the bench/amd-autolykos-v2 branch June 19, 2026 12:40