Skip to content

Integrate ARM64 NUDUPL and add MASM VDF path and harden Windows CI/builds#300

Merged
hoffmang9 merged 51 commits into
mainfrom
nudupl-ci-windows-runner
Feb 13, 2026
Merged

Integrate ARM64 NUDUPL and add MASM VDF path and harden Windows CI/builds#300
hoffmang9 merged 51 commits into
mainfrom
nudupl-ci-windows-runner

Conversation

@hoffmang9
Copy link
Copy Markdown
Member

@hoffmang9 hoffmang9 commented Feb 6, 2026

Summary

  • Integrate the ARM64 NUDUPL VDF path and related core plumbing, including the new base-integer VDF implementation and listener/profile support.
  • Expand and harden cross-platform build support across CMake, Makefile.vdf-client, Python packaging metadata, and Rust workflow pieces.
  • Improve Windows compatibility in asm/runtime paths (ABI handling, AVX-related behavior, and portability fixes) needed for reliable Windows execution.
  • Update CI workflows (test.yaml, rust.yml) and project docs to match the current Windows runner/toolchain setup and remove stale perf-only assumptions.

Test plan

  • GitHub Actions test.yaml passes on Ubuntu, macOS, and Windows.
  • GitHub Actions rust.yml passes.
  • vdf_client, vdf_bench, and VDF tests build and run in CI for the enabled configurations.

Note

High Risk
Touches performance-critical VDF/asm codegen, platform ABI/stack handling, and proof-threading/callback synchronization, so regressions could affect correctness or stability across architectures despite added CI coverage.

Overview
Adds Windows coverage to CI for vdf_client/tests/bench by extending test.yaml with a Windows matrix, Boost/MPIR caching + setup, CMake/Ninja/clang-cl builds, Windows test execution, and platform-specific benchmarking; rust.yml and cibuildwheel macOS setup are hardened (idempotent installs/retries, ensure CMake availability).

Modernizes the CMake build to support standalone builds of vdf_client, vdf_bench, tests, and hardware tools via new BUILD_* options, plus an optional generated GNU-asm pipeline (compile_asm) including Windows-specific linking/env handling and Boost header discovery.

Refactors core C++ for portability and correctness: AVX feature detection becomes std::call_once + std::atomic with new runtime env flags (enable/disable/force AVX2 and AVX-512 IFMA) and optional logging; asm generation and VM/prologue/stack/register handling gain Windows ABI support; multiple codepaths are gated behind CHIA_DISABLE_ASM, and several proof/callback APIs switch to safer value/copy semantics with bounds checks, locking, and proper thread lifecycle management.

Written by Cursor Bugbot for commit 46bd58e. This will update automatically on new commits. Configure here.

@hoffmang9 hoffmang9 marked this pull request as draft February 6, 2026 04:03
@hoffmang9 hoffmang9 force-pushed the nudupl-ci-windows-runner branch from 2a3e3ec to 003ecf3 Compare February 7, 2026 02:39
@hoffmang9 hoffmang9 force-pushed the nudupl-ci-windows-runner branch from 0206ae9 to e89c805 Compare February 7, 2026 23:31
@hoffmang9 hoffmang9 added the enhancement New feature or request label Feb 7, 2026
@hoffmang9
Copy link
Copy Markdown
Member Author

@cursor review

Comment thread .github/workflows/test.yaml
Comment thread .github/workflows/test.yaml
Comment thread src/asm_base.h Outdated
Comment thread .github/workflows/test.yaml Outdated
@hoffmang9 hoffmang9 changed the title Add Windows x86-64 CI using clang + GNU asm for vdf tools Integrate ARM64 NUDUPL VDF path and harden Windows CI/builds Feb 9, 2026
@hoffmang9
Copy link
Copy Markdown
Member Author

@cursor review

Comment thread src/CMakeLists.txt
Comment thread src/vdf_base_integer.cpp Outdated
Comment thread src/CMakeLists.txt Outdated
Comment thread src/parameters.h
Comment thread src/callback.h
… docs.

This captures the current branch updates, including the TwoWesolowski position-locking fix and related CMake/parameter/readme adjustments for current CI work.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread src/CMakeLists.txt Outdated
Comment thread src/CMakeLists.txt
Prevent `emu_hw_test` and `emu_hw_vdf_client` from being defined on Windows so CMake does not try to compile sources that depend on POSIX headers.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread src/callback.h
Set forms_capacity when allocating FastAlgorithmCallback forms so all WesolowskiCallback subclasses consistently initialize capacity metadata for safe bounds checks.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread src/CMakeLists.txt
Comment thread src/asm_gcd_unsigned.h
hoffmang9 and others added 2 commits February 12, 2026 00:32
Add the same a_end_index range guard used on Linux before the CMP/JE chain so out-of-range values jump to the error path instead of falling through to a kernel label.

Co-authored-by: Cursor <cursoragent@cursor.com>
Match the Makefile behavior so compile_asm emits Mach-O-compatible assembly on Intel macOS instead of falling back to Linux/ELF code paths.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread src/asm_vm.h
Comment thread src/threading.h Outdated
Allow asm_function Windows stack-arg loading to be toggled for internal call sites, and drop dead <excpt.h> from threading since no SEH constructs are used.

Co-authored-by: Cursor <cursoragent@cursor.com>
@hoffmang9 hoffmang9 changed the title Integrate ARM64 NUDUPL VDF path and harden Windows CI/builds Integrate ARM64 NUDUPL and add MASM VDF path and harden Windows CI/builds Feb 12, 2026
Comment thread src/asm_base.h
Comment thread src/parameters.h
Comment thread src/provers.h Outdated
Comment thread src/vdf.h
Keep Linux absolute addressing and restore macOS/Linux jump-dispatch parity with main, while making detached-thread fallback logging debug-only.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread src/provers.h Outdated
Comment thread src/provers.h
Comment thread src/provers.h Outdated
Comment thread src/provers.h
@hoffmang9
Copy link
Copy Markdown
Member Author

Code Review: nudupl-ci-windows-runner branch by Opus 4.6

Summary

This branch ports the x86 ASM pipeline to Windows (Win64 ABI with clang-cl + GNU assembler), adds a non-ASM NUDUPL fallback for ARM/non-x86, and introduces numerous hardening improvements (thread safety, bounds checks, thread-local scratch contexts). Reviewed 37 changed files, ~1,500 new lines.


1. Linux/macOS ASM Paths: No Performance Regression

Verdict: No regressions. Some minor hot-path improvements.

All changes to the ASM code generator are gated behind #ifdef CHIA_WINDOWS or #if defined(CHIA_WINDOWS) with corresponding #else/#elif leaving the existing Linux and macOS code paths byte-for-byte identical. Specifically:

  • asm_vm.h -- The non-Windows prologue (MOV RAX, RSP; AND RSP, -64; SUB RSP, 64; MOV [RSP], RAX) and epilogue (MOV RSP, [RSP]) are enclosed in the #else of #ifdef CHIA_WINDOWS. The callee-saved register set and argument register set for SysV are untouched.

  • asm_types.h -- Spill offset calculation value - spill_bytes (negative) is preserved in the #else branch.

  • asm_base.h -- RIP-relative (macOS) vs absolute (Linux) addressing patterns are preserved. Windows gets .section .rdata,"dr" but the other branches are untouched.

  • asm_gcd_unsigned.h -- Linux's O(1) jump table dispatch is in the #else branch, fully preserved.

  • All other asm_gcd_*.h files -- Windows additions are purely additive.

Hot-path improvements that help all platforms

  • nucomp.h (qfb_nudupl): mpz_init/mpz_clear of 15 mpz_t temporaries per call replaced with thread_local scratch context. This is a meaningful win for ARM's NUDUPL path and affects any platform using C++ nudupl.

  • xgcd_partial.c: Three mpz_tdiv_q_2exp + mpz_get_ui calls per inner loop iteration replaced with the inline chiavdf_mpz_extract_uword_from_shift_nonneg that reads limbs directly. Thread-local q/r temporaries eliminate per-call alloc churn. Good improvement.

  • nucomp.h: mpz_fdiv_r replaced with mpz_tdiv_r + sign fixup -- valid because the modulus a1 is always positive. Typically faster.

  • generic.h: static to thread_local for stringstream helpers. Correctness fix (was data-race on concurrent calls), not in hot path.


2. Windows ASM Port: Correctness Review

Overall verdict: The core Win64 ABI implementation is correct and well-structured.

2a. Win64 ABI Compliance in asm_vm.h -- Correct

The prologue/epilogue correctly handles all Win64 ABI requirements:

  • Callee-saved registers: Line 44 correctly adds RSI and RDI to all_save_regs (these are volatile on SysV but non-volatile on Win64).

  • Argument registers: Line 49 correctly uses RCX, RDX, R8, R9 for args 1-4, with R10/R11 reserved for args 5-6.

  • Stack arguments: Lines 108-113 correctly load args 5+ from [RSP+0x28]/[RSP+0x30] (the Win64 stack arg area past the shadow space) before any PUSH instructions modify RSP. The gcd_base function (6 args, called from C) is the primary consumer and this is correct for it.

  • XMM6-XMM15 save/restore: Lines 145-166 (save) and 201-222 (restore) correctly preserve all 10 non-volatile XMM registers. The code properly uses VMOVDQU/YMM for AVX codegen and MOVDQU/XMM for non-AVX -- correct.

  • Stack frame layout: The frame calculation at line 136-137 properly accounts for saved_rsp(8) + spill_bytes + shadow_space(0x20) + xmm_save_area. The 64-byte alignment is maintained correctly.

2b. Spill Area Layout in asm_types.h -- Correct

On Windows, spill addresses use [RSP + value + 8] (positive offset, above the saved RSP at [RSP]). Non-Windows uses [RSP + value - spill_bytes] (negative offset). Both correctly map the spill area. The +8 accounts for the saved original RSP stored at [RSP+0].

2c. Read-only Data Sections -- Correct

All constant data (lookup tables, mask constants) correctly uses .section .rdata,"dr" for Windows PE/COFF, which is the standard read-only data section. This is the correct counterpart to .text 1 (Linux) and .text (macOS).

2d. RIP-relative Addressing -- Correct

Windows correctly uses [RIP+label] for all data accesses, matching the macOS pattern. This is required for Windows PE/COFF. The changes in asm_base.h, asm_gcd_128.h, asm_gcd_base_continued_fractions.h, asm_gcd_base_divide_table.h, and asm_avx512_ifma.h all look correct.


3. Issues Found

ISSUE 1 (Bug, Pre-existing, Medium Severity): macOS gcd_unsigned dispatch off-by-one

In src/asm_gcd_unsigned.h, the macOS dispatch compares spill_a_end_index (a 0-based index, per comment at line 364) against size = end_index + 1:

#elif defined(CHIAOSX)
        APPEND_M(str( ".text" ));

        APPEND_M(str( "MOV `tmp, `spill_a_end_index" ));

        for (int end_index=0;end_index<int_size;++end_index) {
            int size=end_index+1;
            // ...
            APPEND_M(str( "CMP `tmp, #", size ));       // <-- compares against size, not end_index
            APPEND_M(str( "JE ")+asmprefix+str("multiply_uv_size_#", mapped_size ));
        }

This is an off-by-one: when spill_a_end_index = N, it matches at end_index = N-1 (where size = N), dispatching to the multiply function for N limbs instead of N+1 limbs.

This causes incorrect dispatch at mapped-size boundaries. For example with spill_a_end_index = 4 (5 limbs, should dispatch to multiply_uv_size_8), macOS instead dispatches to multiply_uv_size_4, processing only 4 limbs and missing a potentially non-zero limb.

The new Windows code correctly uses end_index instead of size. Recommended fix for macOS:

// Fix: compare against end_index, not size
APPEND_M(str( "CMP `tmp, #", end_index ));

This is pre-existing (present on main) and not introduced by this PR, but since the Windows path gets it right, it would be good to fix macOS in this PR too.

ISSUE 2 (Performance, Minor): Windows gcd_unsigned uses CMP chain instead of jump table

Linux uses an O(1) indirect jump table; Windows (and macOS) use O(n) CMP+JE chains. With int_size = 20, this is up to 20 comparisons per dispatch. In the hot GCD loop, this adds a small overhead per iteration. The branch predictor will likely mitigate this significantly, and the PE/COFF relocations may make jump tables impractical on Windows, but it's worth noting.

The bounds check (CMP tmp, int_size-1; JA error) before the chain is a good defensive addition that Linux lacks.

ISSUE 3 (Latent, Low): AVX512 functions on Windows untested

The AVX512 add_* and multiply_* functions (5 C-ABI args) correctly use windows_stack_args=true (default) to load arg 5 from [RSP+0x28]. The C-to-asm linkage is correct. However, Windows CI explicitly disables AVX512:

"CHIA_ENABLE_AVX512_IFMA=0" | Out-File -Append -FilePath $env:GITHUB_ENV

So the AVX512 ASM path is compiled but never exercised on Windows. This is fine for now but should get test coverage if AVX512 is ever enabled on Windows.

ISSUE 4 (Hardening, Low): Missing bounds check in OneWesolowskiCallback::OnIteration

TwoWesolowskiCallback::OnIteration has a bounds check against forms_capacity:

if (pos < 0 || static_cast<size_t>(pos) >= forms_capacity) {
    throw std::runtime_error("TwoWesolowskiCallback::OnIteration out of bounds");
}

But OneWesolowskiCallback::OnIteration (line 104-117 of callback.h) does not, despite having similar index arithmetic. For consistency and safety, consider adding one.

ISSUE 5 (Correctness, Low): d_windows_stack_args parameter is unused

The windows_stack_args parameter was added to asm_function but no caller ever passes false. The comment at line 106-107 says "Internal asm-to-asm calls may pass args 5/6 in R10/R11 and opt out" -- but this never happens in practice. If internal asm-to-asm calls with 5+ args are never planned, consider removing this parameter to reduce cognitive load. If they are planned, consider adding an assert or guard.


4. Other Observations

Good changes

  • parameters.h: The move to std::atomic<bool> + std::call_once for AVX flag initialization is correct and eliminates a potential data race. The env_flag() helper for runtime tuning (CHIA_DISABLE_AVX2, etc.) is well-designed.

  • callback.h: The std::mutex in TwoWesolowskiCallback with GetFormCopy() returning by value eliminates a use-after-free / data race hazard. The cached_form approach in TwoWesolowskiProver is safe given the single-threaded proof generation loop.

  • include.h: The __clang__ && __SIZEOF_INT128__ check to use native 128-bit integers on clang-cl is correct -- clang-cl supports __int128 even though MSVC doesn't.

  • xgcd_partial.c: The MSVC _BitScanReverse/_BitScanReverse64 fallbacks for __builtin_clz* are correct. The 32-bit target fallback with two 32-bit scans is also correct.

  • 2weso_test.cpp: The RAII stop_worker lambda with try/catch ensures the VDF worker thread is always joined, preventing test hangs on failure.

  • prover_test.cpp: The CHIAVDF_PROVER_TEST_FAST environment variable is a good approach for CI -- the 5-minute soak test was likely dominating CI time.

  • CMakeLists.txt: The comprehensive Windows build support with compile_asm integration, MPIR discovery, and the modular option system (BUILD_VDF_CLIENT, BUILD_VDF_TESTS, etc.) is well-structured.

  • Benchmark iterations bumped from 400K to 2M in CI -- good for getting more stable perf numbers.

Nit

In vdf.h, repeated_square_nudupl takes integer& D and integer& L as non-const, and the caller uses const_cast:

integer& D_nc = const_cast<integer&>(D);
integer& L_nc = const_cast<integer&>(L);
repeated_square_nudupl(f, D_nc, L_nc, ...);

If nudupl_form and reduce don't actually modify D/L, these should take const integer& to avoid the const_cast.


5. Summary Table

Area Verdict
Linux ASM perf No regression
macOS ASM perf No regression
Windows ASM correctness Correct (Win64 ABI properly implemented)
Windows ASM perf Good; minor cost from CMP chain vs jump table
Thread safety Improved (mutex, atomics, thread_local)
Build system Well-structured Windows CMake + CI support
Test coverage Good; prover_test fast mode is a good addition

Recommendation: The branch is in good shape for merge. The one actionable item is fixing the macOS gcd_unsigned dispatch off-by-one (Issue 1) since you're already touching that code and the Windows version gets it right.

This replies to Opus review feedback by fixing macOS gcd_unsigned end-index dispatch, aligning bounds checks across platforms and callbacks, enabling Windows AVX512 CI coverage, and removing unused asm/cast paths.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread src/parameters.h Outdated
hoffmang9 and others added 2 commits February 12, 2026 22:21
Make prover form retrieval value-based and throw on thread-start failure so TwoWesolowski recursion preserves parallel proof generation instead of silently serializing work.

Co-authored-by: Cursor <cursoragent@cursor.com>
Require OSXSAVE and XCR0 ZMM/opmask state in init_avx_flags() before enabling AVX-512 IFMA to prevent illegal-instruction crashes on unsupported OS configurations.

Co-authored-by: Cursor <cursoragent@cursor.com>
@hoffmang9
Copy link
Copy Markdown
Member Author

@cursor review

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

@hoffmang9 hoffmang9 merged commit 767a534 into main Feb 13, 2026
89 of 91 checks passed
@hoffmang9 hoffmang9 deleted the nudupl-ci-windows-runner branch February 13, 2026 19:28
Comment thread src/callback.h
uint64_t switch_index;
int64_t switch_iters;
uint32_t kl;
std::mutex forms_mutex;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these locks are really going to slow things down. im not sure they are needed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Planning to Implement atomics + publish invariant in TwoWesolowskiCallback. Looks a smidge better?

Comment thread src/prover_parallel.hpp
if (!PerformExtraStep()) return;
tmp = GetForm(i);
nucomp_form(ys[b], ys[b], *tmp, D, L);
form tmp = GetForm(i);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty curious. it would be nice to have things left as a pointer instead of making a copy into an instance but maybe the old code isnt safe

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants