feat: kv-store over IPC; aztec-kvdb binary; LMDBStore NAPI scaffold [PR 4] #23238
Closed
charlielye wants to merge 0 commit into
This issue was automatically closed because it was referenced in PR #23469 which has been merged to the default branch.
1c8d4f9 to 6867e96
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request on Jun 4, 2026
…AztecProtocol#23469)

## Summary

`aztec start --local-network` reliably SIGBUSes a few blocks into a run on macOS arm64 (since `v5.0.0-nightly.20260520`, i.e. after AztecProtocol#21625 shipped the `shared_ptr` use-after-free fix). This is a **different** fault from the one AztecProtocol#21625 fixed: a stack-guard violation (stack overflow) on a `nodejs_module.node` worker thread running AVM-simulation code, not a use-after-free. This pins an explicit, generous stack size on the `ThreadedAsyncOperation` worker thread.

## Root cause

`ThreadedAsyncOperation::Queue()` (introduced in AztecProtocol#21138) runs the AVM simulation (`_fn`) directly on a bare `std::thread(...).detach()`. A `std::thread` uses the OS default stack for non-main threads, which is **512 KB on macOS** versus **8 MB on Linux**. The AVM-simulation call chain is deep enough to overflow 512 KB, so on macOS arm64 the worker writes into its stack-guard page and the process aborts with:

```
EXC_BAD_ACCESS / SIGBUS, KERN_PROTECTION_FAILURE
"Could not determine thread index for stack guard region"
#0   _platform_memmove
#1.. nodejs_module.node  bb::nodejs (AVM simulation path)
```

Linux is unaffected because its 8 MB default is comfortably large. The previous `AsyncOperation` path never hit this either: it ran on the libuv threadpool, whose threads are sized from `RLIMIT_STACK` (8 MB soft on macOS), not the 512 KB raw-thread default.

## Fix

`std::thread` can't set a stack size, so launch the worker via `pthreads` with `pthread_attr_setstacksize` pinned to a generous `WORKER_STACK_SIZE` (32 MB, which is 4× the 8 MB that the libuv path proved sufficient, with headroom for deeper future call chains). Falls back to a default-stack `std::thread` only if pthreads is unavailable (`_WIN32`) or `pthread_create` fails.

The shared_ptr lifetime model from AztecProtocol#21625 is preserved exactly: both the worker lambda and the `BlockingCall` completion callback still capture `self`, so this does not reintroduce the use-after-free. Only the thread-launch mechanism changed.

## Testing

- The full bb build is too heavy to run in this session, so this is **not yet a local end-to-end repro/fix verification**; it relies on CI for compilation and on a macOS arm64 `aztec start --local-network` run to confirm the crash is gone.
- The pthread/`std::function` trampoline was compiled and run standalone under `-std=c++20 -Wall -Wextra -Werror`: the worker thread receives a 32 MB stack (`pthread_get_stacksize_np` reports `33554432`), and the work runs and completes.
- **Requested:** verify against tonight's nightly on macOS arm64 (M3), the reporter's exact repro.

## Notes for reviewers

- Targets `next` (not `merge-train/barretenberg`) to match AztecProtocol#21625's base and to make the nightly, since this is an urgent release-affecting crash. Happy to retarget if you'd prefer it go through the merge train.
- 32 MB is a deliberate over-provision; if you'd rather mirror the libuv path precisely we could instead size from `getrlimit(RLIMIT_STACK)`. The fixed constant is simpler and the virtual reservation only commits pages as touched.
- The longer-term fix is the NAPI→IPC migration (AztecProtocol#21331 / AztecProtocol#23196 / AztecProtocol#23238), which removes this in-process worker entirely. This is a targeted stop-gap for the shipping NAPI path.

Related: AztecProtocol#21138 (introduced the threaded model), AztecProtocol#21625 (use-after-free fix), AztecProtocol#21629 (open alternative).

---

*Created by [claudebox](https://claudebox.work/v2/sessions/4bd36dc505c20254) · group: `slackbot`*
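
The pthread-based launch described under Fix can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `launch_detached_worker`, `worker_trampoline`, and `demo_touch_stack` are hypothetical names, and only `WORKER_STACK_SIZE` and its 32 MB value are taken from the description above. The demo captures its promise through a `shared_ptr` by value, loosely echoing the "capture `self`" lifetime model.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>

#ifndef _WIN32
#include <pthread.h>
#endif

// 32 MB, the figure quoted in the commit message above.
constexpr std::size_t WORKER_STACK_SIZE = 32UL * 1024 * 1024;

#ifndef _WIN32
// pthread entry point (hypothetical name): takes ownership of the
// heap-allocated callable, runs it, and frees it on exit.
extern "C" void* worker_trampoline(void* arg)
{
    std::unique_ptr<std::function<void()>> work(static_cast<std::function<void()>*>(arg));
    (*work)();
    return nullptr;
}
#endif

// Runs `work` on a detached thread. Returns true if the thread was created
// with the explicit stack size, false if it fell back to std::thread.
bool launch_detached_worker(std::function<void()> work)
{
    assert(work && "work must be callable");
#ifndef _WIN32
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) == 0) {
        auto* heap_work = new std::function<void()>(std::move(work));
        pthread_t tid;
        bool ok = pthread_attr_setstacksize(&attr, WORKER_STACK_SIZE) == 0 &&
                  pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED) == 0 &&
                  pthread_create(&tid, &attr, worker_trampoline, heap_work) == 0;
        pthread_attr_destroy(&attr);
        if (ok) {
            return true;
        }
        // The thread never started, so reclaim the callable for the fallback.
        work = std::move(*heap_work);
        delete heap_work;
    }
#endif
    std::thread(std::move(work)).detach();
    return false;
}

// Usage sketch: touch a 4 MB stack buffer on the worker, more than the
// 512 KB macOS default cited above, and report how many bytes were written.
std::size_t demo_touch_stack()
{
    auto result = std::make_shared<std::promise<std::size_t>>();
    std::future<std::size_t> done = result->get_future();
    launch_detached_worker([result] {
        constexpr std::size_t N = 4UL * 1024 * 1024;
        volatile char buffer[N];
        for (std::size_t i = 0; i < N; ++i) {
            buffer[i] = 1;
        }
        result->set_value(N);
    });
    return done.get();
}
```

The callable is handed to the thread through a heap allocation because `pthread_create` only accepts a `void*` argument. If thread creation fails, ownership returns to the caller so the default-stack fallback can still run the work.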
Summary
Stacked on PR 3b (#23196). Adds an out-of-process `aztec-kvdb` binary that owns LMDB and serves it over UDS / MPSC SHM with the same `TypedMessage` wire format that `yarn-project/native/MsgpackChannel` already speaks. AztecLMDBStoreV2 will be migrated from `NativeLMDBStore` (NAPI) to `KvdbBackend` (IPC) in a follow-up commit on this same PR.
This first commit is inert: the binary builds and is shipped, but nothing in yarn-project uses it yet.
Why
After PR 3b lands, only one load-bearing NAPI consumer remains: the LMDB store used by archiver, p2p, pxe, slasher, and validator-ha-signer. Moving it out of NAPI removes the last embedding of a C++ subsystem in the Node.js process. The NAPI module is reduced to a thin SHM transport stub (`MsgpackClient`/`MsgpackClientAsync`, ~400 LOC) — pure IPC plumbing, no domain logic.
SHM is the production transport (~1–10 µs round-trip via futex doorbell); UDS is the dev/test fallback. This also activates the SHM code-path that has been dead in production since PR 1.
What's in this PR (so far)
C++:
TypeScript (bb.js):
What's still ahead in this PR
Stack
Test plan