Integrate ARM64 NUDUPL and add MASM VDF path and harden Windows CI/builds by hoffmang9 · Pull Request #300 · Chia-Network/chiavdf

hoffmang9 · 2026-02-06T03:50:54Z

Summary

Integrate the ARM64 NUDUPL VDF path and related core plumbing, including the new base-integer VDF implementation and listener/profile support.
Expand and harden cross-platform build support across CMake, Makefile.vdf-client, Python packaging metadata, and Rust workflow pieces.
Improve Windows compatibility in asm/runtime paths (ABI handling, AVX-related behavior, and portability fixes) needed for reliable Windows execution.
Update CI workflows (test.yaml, rust.yml) and project docs to match the current Windows runner/toolchain setup and remove stale perf-only assumptions.

Test plan

GitHub Actions test.yaml passes on Ubuntu, macOS, and Windows.
GitHub Actions rust.yml passes.
vdf_client, vdf_bench, and VDF tests build and run in CI for the enabled configurations.

Note

High Risk
Touches performance-critical VDF/asm codegen, platform ABI/stack handling, and proof-threading/callback synchronization, so regressions could affect correctness or stability across architectures despite added CI coverage.

Overview
Adds Windows coverage to CI for vdf_client/tests/bench by extending test.yaml with a Windows matrix, Boost/MPIR caching + setup, CMake/Ninja/clang-cl builds, Windows test execution, and platform-specific benchmarking; rust.yml and cibuildwheel macOS setup are hardened (idempotent installs/retries, ensure CMake availability).

Modernizes the CMake build to support standalone builds of vdf_client, vdf_bench, tests, and hardware tools via new BUILD_* options, plus an optional generated GNU-asm pipeline (compile_asm) including Windows-specific linking/env handling and Boost header discovery.

Refactors core C++ for portability and correctness: AVX feature detection becomes std::call_once + std::atomic with new runtime env flags (enable/disable/force AVX2 and AVX-512 IFMA) and optional logging; asm generation and VM/prologue/stack/register handling gain Windows ABI support; multiple codepaths are gated behind CHIA_DISABLE_ASM, and several proof/callback APIs switch to safer value/copy semantics with bounds checks, locking, and proper thread lifecycle management.

^{Written by Cursor Bugbot for commit 46bd58e. This will update automatically on new commits. Configure here.}

hoffmang9 · 2026-02-07T23:55:12Z

@cursor review

hoffmang9 · 2026-02-09T22:48:22Z

@cursor review

… docs. This captures the current branch updates, including the TwoWesolowski position-locking fix and related CMake/parameter/readme adjustments for current CI work. Co-authored-by: Cursor <cursoragent@cursor.com>

Prevent `emu_hw_test` and `emu_hw_vdf_client` from being defined on Windows so CMake does not try to compile sources that depend on POSIX headers. Co-authored-by: Cursor <cursoragent@cursor.com>

Set forms_capacity when allocating FastAlgorithmCallback forms so all WesolowskiCallback subclasses consistently initialize capacity metadata for safe bounds checks. Co-authored-by: Cursor <cursoragent@cursor.com>

Add the same a_end_index range guard used on Linux before the CMP/JE chain so out-of-range values jump to the error path instead of falling through to a kernel label. Co-authored-by: Cursor <cursoragent@cursor.com>

Match the Makefile behavior so compile_asm emits Mach-O-compatible assembly on Intel macOS instead of falling back to Linux/ELF code paths. Co-authored-by: Cursor <cursoragent@cursor.com>

Allow asm_function Windows stack-arg loading to be toggled for internal call sites, and drop dead <excpt.h> from threading since no SEH constructs are used. Co-authored-by: Cursor <cursoragent@cursor.com>

Keep Linux absolute addressing and restore macOS/Linux jump-dispatch parity with main, while making detached-thread fallback logging debug-only. Co-authored-by: Cursor <cursoragent@cursor.com>

hoffmang9 · 2026-02-13T05:58:46Z

Code Review: `nudupl-ci-windows-runner` branch by Opus 4.6

Summary

This branch ports the x86 ASM pipeline to Windows (Win64 ABI with clang-cl + GNU assembler), adds a non-ASM NUDUPL fallback for ARM/non-x86, and introduces numerous hardening improvements (thread safety, bounds checks, thread-local scratch contexts). Reviewed 37 changed files, ~1,500 new lines.

1. Linux/macOS ASM Paths: No Performance Regression

Verdict: No regressions. Some minor hot-path improvements.

All changes to the ASM code generator are gated behind #ifdef CHIA_WINDOWS or #if defined(CHIA_WINDOWS) with corresponding #else/#elif leaving the existing Linux and macOS code paths byte-for-byte identical. Specifically:

asm_vm.h -- The non-Windows prologue (MOV RAX, RSP; AND RSP, -64; SUB RSP, 64; MOV [RSP], RAX) and epilogue (MOV RSP, [RSP]) are enclosed in the #else of #ifdef CHIA_WINDOWS. The callee-saved register set and argument register set for SysV are untouched.
asm_types.h -- Spill offset calculation value - spill_bytes (negative) is preserved in the #else branch.
asm_base.h -- RIP-relative (macOS) vs absolute (Linux) addressing patterns are preserved. Windows gets .section .rdata,"dr" but the other branches are untouched.
asm_gcd_unsigned.h -- Linux's O(1) jump table dispatch is in the #else branch, fully preserved.
All other asm_gcd_*.h files -- Windows additions are purely additive.

Hot-path improvements that help all platforms

nucomp.h (qfb_nudupl): mpz_init/mpz_clear of 15 mpz_t temporaries per call replaced with thread_local scratch context. This is a meaningful win for ARM's NUDUPL path and affects any platform using C++ nudupl.
xgcd_partial.c: Three mpz_tdiv_q_2exp + mpz_get_ui calls per inner loop iteration replaced with the inline chiavdf_mpz_extract_uword_from_shift_nonneg that reads limbs directly. Thread-local q/r temporaries eliminate per-call alloc churn. Good improvement.
nucomp.h: mpz_fdiv_r replaced with mpz_tdiv_r + sign fixup -- valid because the modulus a1 is always positive. Typically faster.
generic.h: static to thread_local for stringstream helpers. Correctness fix (was data-race on concurrent calls), not in hot path.

2. Windows ASM Port: Correctness Review

Overall verdict: The core Win64 ABI implementation is correct and well-structured.

2a. Win64 ABI Compliance in `asm_vm.h` -- Correct

The prologue/epilogue correctly handles all Win64 ABI requirements:

Callee-saved registers: Line 44 correctly adds RSI and RDI to all_save_regs (these are volatile on SysV but non-volatile on Win64).
Argument registers: Line 49 correctly uses RCX, RDX, R8, R9 for args 1-4, with R10/R11 reserved for args 5-6.
Stack arguments: Lines 108-113 correctly load args 5+ from [RSP+0x28]/[RSP+0x30] (the Win64 stack arg area past the shadow space) before any PUSH instructions modify RSP. The gcd_base function (6 args, called from C) is the primary consumer and this is correct for it.
XMM6-XMM15 save/restore: Lines 145-166 (save) and 201-222 (restore) correctly preserve all 10 non-volatile XMM registers. The code properly uses VMOVDQU/YMM for AVX codegen and MOVDQU/XMM for non-AVX -- correct.
Stack frame layout: The frame calculation at line 136-137 properly accounts for saved_rsp(8) + spill_bytes + shadow_space(0x20) + xmm_save_area. The 64-byte alignment is maintained correctly.

2b. Spill Area Layout in `asm_types.h` -- Correct

On Windows, spill addresses use [RSP + value + 8] (positive offset, above the saved RSP at [RSP]). Non-Windows uses [RSP + value - spill_bytes] (negative offset). Both correctly map the spill area. The +8 accounts for the saved original RSP stored at [RSP+0].

2c. Read-only Data Sections -- Correct

All constant data (lookup tables, mask constants) correctly uses .section .rdata,"dr" for Windows PE/COFF, which is the standard read-only data section. This is the correct counterpart to .text 1 (Linux) and .text (macOS).

2d. RIP-relative Addressing -- Correct

Windows correctly uses [RIP+label] for all data accesses, matching the macOS pattern. This is required for Windows PE/COFF. The changes in asm_base.h, asm_gcd_128.h, asm_gcd_base_continued_fractions.h, asm_gcd_base_divide_table.h, and asm_avx512_ifma.h all look correct.

3. Issues Found

ISSUE 1 (Bug, Pre-existing, Medium Severity): macOS gcd_unsigned dispatch off-by-one

In src/asm_gcd_unsigned.h, the macOS dispatch compares spill_a_end_index (a 0-based index, per comment at line 364) against size = end_index + 1:

#elif defined(CHIAOSX)
        APPEND_M(str( ".text" ));

        APPEND_M(str( "MOV `tmp, `spill_a_end_index" ));

        for (int end_index=0;end_index<int_size;++end_index) {
            int size=end_index+1;
            // ...
            APPEND_M(str( "CMP `tmp, #", size ));       // <-- compares against size, not end_index
            APPEND_M(str( "JE ")+asmprefix+str("multiply_uv_size_#", mapped_size ));
        }

This is an off-by-one: when spill_a_end_index = N, it matches at end_index = N-1 (where size = N), dispatching to the multiply function for N limbs instead of N+1 limbs.

This causes incorrect dispatch at mapped-size boundaries. For example with spill_a_end_index = 4 (5 limbs, should dispatch to multiply_uv_size_8), macOS instead dispatches to multiply_uv_size_4, processing only 4 limbs and missing a potentially non-zero limb.

The new Windows code correctly uses end_index instead of size. Recommended fix for macOS:

// Fix: compare against end_index, not size
APPEND_M(str( "CMP `tmp, #", end_index ));

This is pre-existing (present on main) and not introduced by this PR, but since the Windows path gets it right, it would be good to fix macOS in this PR too.

ISSUE 2 (Performance, Minor): Windows gcd_unsigned uses CMP chain instead of jump table

Linux uses an O(1) indirect jump table; Windows (and macOS) use O(n) CMP+JE chains. With int_size = 20, this is up to 20 comparisons per dispatch. In the hot GCD loop, this adds a small overhead per iteration. The branch predictor will likely mitigate this significantly, and the PE/COFF relocations may make jump tables impractical on Windows, but it's worth noting.

The bounds check (CMP tmp, int_size-1; JA error) before the chain is a good defensive addition that Linux lacks.

ISSUE 3 (Latent, Low): AVX512 functions on Windows untested

The AVX512 add_* and multiply_* functions (5 C-ABI args) correctly use windows_stack_args=true (default) to load arg 5 from [RSP+0x28]. The C-to-asm linkage is correct. However, Windows CI explicitly disables AVX512:

"CHIA_ENABLE_AVX512_IFMA=0" | Out-File -Append -FilePath $env:GITHUB_ENV

So the AVX512 ASM path is compiled but never exercised on Windows. This is fine for now but should get test coverage if AVX512 is ever enabled on Windows.

ISSUE 4 (Hardening, Low): Missing bounds check in `OneWesolowskiCallback::OnIteration`

TwoWesolowskiCallback::OnIteration has a bounds check against forms_capacity:

if (pos < 0 || static_cast<size_t>(pos) >= forms_capacity) {
    throw std::runtime_error("TwoWesolowskiCallback::OnIteration out of bounds");
}

But OneWesolowskiCallback::OnIteration (line 104-117 of callback.h) does not, despite having similar index arithmetic. For consistency and safety, consider adding one.

ISSUE 5 (Correctness, Low): `d_windows_stack_args` parameter is unused

The windows_stack_args parameter was added to asm_function but no caller ever passes false. The comment at line 106-107 says "Internal asm-to-asm calls may pass args 5/6 in R10/R11 and opt out" -- but this never happens in practice. If internal asm-to-asm calls with 5+ args are never planned, consider removing this parameter to reduce cognitive load. If they are planned, consider adding an assert or guard.

4. Other Observations

Good changes

parameters.h: The move to std::atomic<bool> + std::call_once for AVX flag initialization is correct and eliminates a potential data race. The env_flag() helper for runtime tuning (CHIA_DISABLE_AVX2, etc.) is well-designed.
callback.h: The std::mutex in TwoWesolowskiCallback with GetFormCopy() returning by value eliminates a use-after-free / data race hazard. The cached_form approach in TwoWesolowskiProver is safe given the single-threaded proof generation loop.
include.h: The __clang__ && __SIZEOF_INT128__ check to use native 128-bit integers on clang-cl is correct -- clang-cl supports __int128 even though MSVC doesn't.
xgcd_partial.c: The MSVC _BitScanReverse/_BitScanReverse64 fallbacks for __builtin_clz* are correct. The 32-bit target fallback with two 32-bit scans is also correct.
2weso_test.cpp: The RAII stop_worker lambda with try/catch ensures the VDF worker thread is always joined, preventing test hangs on failure.
prover_test.cpp: The CHIAVDF_PROVER_TEST_FAST environment variable is a good approach for CI -- the 5-minute soak test was likely dominating CI time.
CMakeLists.txt: The comprehensive Windows build support with compile_asm integration, MPIR discovery, and the modular option system (BUILD_VDF_CLIENT, BUILD_VDF_TESTS, etc.) is well-structured.
Benchmark iterations bumped from 400K to 2M in CI -- good for getting more stable perf numbers.

Nit

In vdf.h, repeated_square_nudupl takes integer& D and integer& L as non-const, and the caller uses const_cast:

integer& D_nc = const_cast<integer&>(D);
integer& L_nc = const_cast<integer&>(L);
repeated_square_nudupl(f, D_nc, L_nc, ...);

If nudupl_form and reduce don't actually modify D/L, these should take const integer& to avoid the const_cast.

5. Summary Table

Area	Verdict
Linux ASM perf	No regression
macOS ASM perf	No regression
Windows ASM correctness	Correct (Win64 ABI properly implemented)
Windows ASM perf	Good; minor cost from CMP chain vs jump table
Thread safety	Improved (mutex, atomics, thread_local)
Build system	Well-structured Windows CMake + CI support
Test coverage	Good; prover_test fast mode is a good addition

Recommendation: The branch is in good shape for merge. The one actionable item is fixing the macOS gcd_unsigned dispatch off-by-one (Issue 1) since you're already touching that code and the Windows version gets it right.

This replies to Opus review feedback by fixing macOS gcd_unsigned end-index dispatch, aligning bounds checks across platforms and callbacks, enabling Windows AVX512 CI coverage, and removing unused asm/cast paths. Co-authored-by: Cursor <cursoragent@cursor.com>

Make prover form retrieval value-based and throw on thread-start failure so TwoWesolowski recursion preserves parallel proof generation instead of silently serializing work. Co-authored-by: Cursor <cursoragent@cursor.com>

Require OSXSAVE and XCR0 ZMM/opmask state in init_avx_flags() before enabling AVX-512 IFMA to prevent illegal-instruction crashes on unsupported OS configurations. Co-authored-by: Cursor <cursoragent@cursor.com>

hoffmang9 · 2026-02-13T07:12:08Z

@cursor review

cursor

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

wjblanke · 2026-02-13T19:43:05Z

    uint64_t switch_index;
    int64_t switch_iters;
    uint32_t kl;
+    std::mutex forms_mutex;


these locks are really going to slow things down. im not sure they are needed?

Planning to Implement atomics + publish invariant in TwoWesolowskiCallback. Looks a smidge better?

wjblanke · 2026-02-13T19:47:41Z

                    if (!PerformExtraStep()) return;
-                    tmp = GetForm(i);
-                    nucomp_form(ys[b], ys[b], *tmp, D, L);
+                    form tmp = GetForm(i);


this is pretty curious. it would be nice to have things left as a pointer instead of making a copy into an instance but maybe the old code isnt safe

hoffmang9 marked this pull request as draft February 6, 2026 04:03

hoffmang9 added 3 commits February 6, 2026 18:38

Windows asm/ABI and runtime fixes

f4bd736

CMake: Windows + perf experiment options

e7f37d7

CI: perf experiments and Windows benchmarking

003ecf3

hoffmang9 force-pushed the nudupl-ci-windows-runner branch from 2a3e3ec to 003ecf3 Compare February 7, 2026 02:39

hoffmang9 added 5 commits February 7, 2026 15:25

ci: streamline test workflow

b1d9057

build: remove perf-only CMake toggles

71121ab

core: gate AVX logging behind env flag

679a387

asm: document Windows ABI restore

70c50ac

docs: remove Windows perf notes

e89c805

hoffmang9 force-pushed the nudupl-ci-windows-runner branch from 0206ae9 to e89c805 Compare February 7, 2026 23:31

hoffmang9 added 2 commits February 7, 2026 15:39

Nit note in README and clean up some language in test.yaml

980b697

Fix step name

d04e5fb

hoffmang9 added the enhancement New feature or request label Feb 7, 2026

cursor Bot reviewed Feb 8, 2026

View reviewed changes

Comment thread .github/workflows/test.yaml

Comment thread .github/workflows/test.yaml

Comment thread src/asm_base.h Outdated

Comment thread .github/workflows/test.yaml Outdated

hoffmang9 added 10 commits February 9, 2026 12:53

various workflow clean ups and fix AVX512 flag

ad87252

fix windows runners in test.yaml

5a7ee42

more cmake instruction issues

880d11f

harden finding boost on Mac intel

f3a5229

now cmake isn't always there...

d713c5e

more install cmake

8fc91e6

fix ASAN, remove TSAN for windows runners

313c353

initiate windows dev env on ASAN runner

a9c401f

more ASAN windows issues - harden brew handling in cibuildwheel

e4d821e

giving up on ASAN

7df51ce

hoffmang9 changed the title ~~Add Windows x86-64 CI using clang + GNU asm for vdf tools~~ Integrate ARM64 NUDUPL VDF path and harden Windows CI/builds Feb 9, 2026

cursor Bot reviewed Feb 9, 2026

View reviewed changes

Comment thread src/CMakeLists.txt

Comment thread src/vdf_base_integer.cpp Outdated

Add required linker flag for Windows asm - fix unchecked vector access

f0e2ce6