Skip to content

Fully switch to async KVStore persistence#919

Merged
tnull merged 17 commits into
lightningdevkit:mainfrom
tnull:2026-06-async-kvstore-persistence
Jun 9, 2026
Merged

Fully switch to async KVStore persistence#919
tnull merged 17 commits into
lightningdevkit:mainfrom
tnull:2026-06-async-kvstore-persistence

Conversation

@tnull

@tnull tnull commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

We planned to make the full switch to the async KVStore for a while. Here we do just that: first step-by-step moving the remaining dependencies over to use KVStore, then finally dropping KVStoreSync and all implementations.

Depends on lightningdevkit/rust-lightning#4658 (will drop the DROPME commit once that's merged).

@ldk-reviews-bot

ldk-reviews-bot commented Jun 3, 2026

Copy link
Copy Markdown

👋 Thanks for assigning @joostjager as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@tnull tnull marked this pull request as draft June 3, 2026 11:00
@tnull tnull force-pushed the 2026-06-async-kvstore-persistence branch 2 times, most recently from 42c16e2 to 4371f6d Compare June 3, 2026 11:25
@tnull tnull mentioned this pull request Jun 3, 2026
9 tasks
@tnull tnull force-pushed the 2026-06-async-kvstore-persistence branch 2 times, most recently from 05ebc28 to 911c169 Compare June 3, 2026 12:15
@tnull tnull added this to the 0.8 milestone Jun 3, 2026
@tnull tnull force-pushed the 2026-06-async-kvstore-persistence branch 5 times, most recently from 1b4d24f to c62a503 Compare June 3, 2026 12:28
@tnull tnull marked this pull request as ready for review June 3, 2026 12:28
@tnull tnull requested a review from joostjager June 3, 2026 12:28

@joostjager joostjager left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall happy with this simplification. Big changeset though, even though not all that much is happening. Risk nonetheless.

Bigger open question for me is what the latest is on runtime abstraction and how that is going to work in combination with potential wasm support.

Also, are there sync public api calls that can now be made async such as export_pathfinding_scores?

Comment thread src/payment/asynchronous/static_invoice_store.rs Outdated
Comment thread src/peer_store.rs
Comment thread src/event.rs
Comment thread src/data_store.rs
Comment thread src/data_store.rs
Comment thread src/io/vss_store.rs Outdated
Comment thread src/io/test_utils.rs
signer_provider: &'a test_utils::TestKeysInterface,
}

impl<K: KVStore + Sync> TestMonitorUpdatePersister<'_, K> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit message says dropping the trait, but this looks more substantial?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, move it to another commit for now, but will also put up a minor change in LDK so we can reuse the test utils wihtout rewriting this.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should allow us to avoid it: lightningdevkit/rust-lightning#4665

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will allow us to simply do:

  type TestMonitorPersister<K> = MonitorUpdatingPersisterAsync<
        TestStoreRef<K>,
        TestFutureSpawner,
        &'static test_utils::TestLogger,
        &'static test_utils::TestKeysInterface,
        &'static test_utils::TestKeysInterface,
        &'static test_utils::TestBroadcaster,
        &'static test_utils::TestFeeEstimator,
  >;

  type TestAsyncPersister<K> = lightning::chain::chainmonitor::AsyncPersister<
        TestStoreRef<K>,
        TestFutureSpawner,
        &'static test_utils::TestLogger,
        &'static test_utils::TestKeysInterface,
        &'static test_utils::TestKeysInterface,
        &'static test_utils::TestBroadcaster,
        &'static test_utils::TestFeeEstimator,
  >;
(..)
  let monitor_persister = TestMonitorPersister::new(
        store,
        TestFutureSpawner::new(Arc::clone(&runtime)),
        &chanmon_cfg.logger,
        max_pending_updates,
        &chanmon_cfg.keys_manager,
        &chanmon_cfg.keys_manager,
        &chanmon_cfg.tx_broadcaster,
        &chanmon_cfg.fee_estimator,
  );

  let event_notifier = Arc::new(Notifier::new());
  let persister = lightning::chain::chainmonitor::AsyncPersister::new_test(
        monitor_persister,
        Arc::clone(&event_notifier),
  );

(...)

  test_utils::TestChainMonitor::new_with_event_notifier(
        Some(&chanmon_cfg.chain_source),
        &chanmon_cfg.tx_broadcaster,
        &chanmon_cfg.logger,
        &chanmon_cfg.fee_estimator,
        &persister.persister,
        &chanmon_cfg.keys_manager,
        Arc::clone(&persister.event_notifier),
  )

.. rather than the custom TestMonitorPersister

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will await the discussion there. Would be nice to use the incremental persister in tests.

Comment thread src/io/test_utils.rs
pub(crate) fn create_persister<'a, K: KVStoreSync + Sync>(
store: &'a K, chanmon_cfg: &'a TestChanMonCfg, max_pending_updates: u64,
pub(crate) fn create_persister<'a, K: KVStore + Sync>(
store: &'a K, chanmon_cfg: &'a TestChanMonCfg, _max_pending_updates: u64,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't max pending updates exercised in a test?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but we can't use it right now as LDK seems not to expose the right APIs to use async path with the test utils, will open a fix in a bit.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it tracked somewhere?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread src/io/vss_store.rs
);

let runtime_handle = internal_runtime.handle();
let schema_version = tokio::task::block_in_place(|| {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it a problem that setup can't fail fast here anymore?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We fail fast elsewhere via setup_schema_version.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't the schema_version then be stored as a normal field? It seems the code is currently halfway between lazy init and explicit setup.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, no, we need to discover it based on the server-side state for backwards compatibility, as unfortunately schema version wasn't always there.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you removed the lazy init and made it a field? That was the suggestion indeed

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you removed the lazy init and made it a field? That was the suggestion indeed

Well, we just returned to the previous behavior using the internal runtime to block_on on the version negotiation. We'll need to return to the lazy init likely in the next step though.

Comment thread src/io/vss_store.rs Outdated
@tnull tnull force-pushed the 2026-06-async-kvstore-persistence branch 2 times, most recently from 290b0a5 to 9ad8f2d Compare June 3, 2026 15:42

@benthecarman benthecarman left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the vss ci job hung for 2 hours hopefully a transient issue

Comment thread src/peer_store.rs
Comment thread src/io/utils.rs
@benthecarman

Copy link
Copy Markdown
Contributor

codex seems to agree on the locks

  • [P2] Serialize peer-store persistence updates — /home/ben/projects/ldk-node/src/peer_store.rs:43-52
    When two peer-store mutations overlap, this now releases the peer map lock before the async write completes, so an
    older snapshot can persist after a newer one. For example, concurrent add_peer calls can leave memory with both
    peers but the stored peer list with only the first one, causing reconnect peers to be missing after restart. Please
    serialize the mutation plus persistence, as DataStore now does with a separate async mutation lock.

  • [P2] Preserve node-metrics update ordering — /home/ben/projects/ldk-node/src/io/utils.rs:346-351
    If two node-metrics updates run concurrently, the lock is released before the async KV write, so an earlier encoded
    snapshot can finish after a later one and overwrite it. This can drop one of the timestamp fields on disk and make
    the node believe a sync/broadcast task has not run after restart. The previous code explicitly held the write lock
    through persistence to avoid this, so this path needs equivalent serialization for the async write.

@tnull tnull force-pushed the 2026-06-async-kvstore-persistence branch from 9ad8f2d to 8b93895 Compare June 5, 2026 08:04
@tnull

tnull commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator Author

As discussed offline, now updated:

  1. Re-added the internal runtimes for Postgres / VSS to avoid tokio getting stuck if we end up using block_on while the same thread holds the IO driver. We'll be able to revert this/drop the runtimes once we go fully async API in a following PR.
  2. Added mutation locks for node metrics and peer store persistence for now, but we'll be able to simplify more once we switch to tokio::sync, once we go fully-async.
  3. Added a commit that enables eager handoff of the IO driver if we're built under cfg(tokio_unstable) and enables that for bindings builds.

@tnull tnull requested review from benthecarman and joostjager June 5, 2026 08:25
@tnull tnull force-pushed the 2026-06-async-kvstore-persistence branch from 326cbf7 to be2b74d Compare June 5, 2026 08:35
@tnull

tnull commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator Author

Hmm, seems Github CI might have an outage, at least it's not running right now..

@tnull tnull removed the request for review from joostjager June 5, 2026 08:55
Comment thread src/runtime.rs
Comment thread src/peer_store.rs
Comment thread src/runtime.rs
Handle(tokio::runtime::Handle),
}

pub(crate) struct StoreRuntime {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also use some serious documentation on why it is needed, even if temporary.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also wondering if it isn't possible to unify this with the existing Runtime wrapper? That might also make the various modes more clear: current runtime, passed runtime, new runtime, or some combo.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also wondering if it isn't possible to unify this with the existing Runtime wrapper? That might also make the various modes more clear: current runtime, passed runtime, new runtime, or some combo.

Considered that, but given the plan is to remove it ASAP, I decided to keep it separate which will make the cleanup much more straightforward.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think clean up is really much different, but in the mean time reusing the Runtime does make it more clear that we have different variations of the same concept. Maybe it takes longer than we'd like to get rid of it. And can reuse the thread naming thing too.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume my last comment didn't change your view on it. It's temporary, so it's fine.

Comment thread src/runtime.rs
format!("{}-{}", thread_name_prefix, id)
})
.worker_threads(worker_threads)
.max_blocking_threads(worker_threads)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for this setting?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't matter as we shouldn't hit blocking calls in the persistence pipeline anyways now.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still curious why it is needed

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still curious why it is needed

Well, I don't think it is (given we should never use blocking threads on these runtimes), but also doesn't hurt to bound the number to be more in-line with the API-configured limit.

Comment thread src/runtime.rs
Comment thread src/io/vss_store.rs
tokio::task::block_in_place(move || drop(internal_runtime));
if let Some(runtime) = self.internal_runtime.take() {
if let Ok(runtime) = Arc::try_unwrap(runtime) {
runtime.shutdown_background();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean if we skip shutdown here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I'm following what you mean with 'skip' in this context?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the Drop path conditionally does not call shutdown_background_tasks, so the question is what state/resources/tasks may remain when that branch is taken?

Comment thread src/io/postgres_store/mod.rs Outdated
Comment thread src/runtime.rs
runtime: Option<tokio::runtime::Runtime>,
}

impl StoreRuntime {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name describes usage, not really what it does. IsolatedRuntime?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the thing itself also doesn't isolate the runtime, no? It's just used in an isolated manner?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this improves anything? Note it's not the object that isolates anything really, but its usage.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just mean that if you want to use this generally-usable object for something else, you need to rename it even though the code stays the same.

task.await.map_err(|e| {
io::Error::new(
io::ErrorKind::Other,
format!("PostgreSQL runtime task failed: {}", e),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have the same message for all of these, would be better if it had the function name in it for easier debugging. however this is removed in the next pr so not very important

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on the duplication

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we had helper previously, but Joost asked to drop it and move it to StoreRuntime, which however doesn't work well either (given the error messages are store-specific), so decided to simply inline them, esp. given it's just 5 lines, and hiding them in a helper just obfuscates what happens concretely, i.e., what runtime exactly is used to await what etc..

Comment thread src/io/vss_store.rs
.await
});
task.await.map_err(|e| {
io::Error::new(io::ErrorKind::Other, format!("VSS runtime task failed: {}", e))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread tests/common/mod.rs
}
}

struct TestSyncStoreInner {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this just be called TestStoreInner/TestStore since its all async now? Also noticed I missed adding postgres to this

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this just be called TestStoreInner/TestStore since its all async now?

No, this would be pretty confusing, as the Sync part of it is important (i.e., the purpose is to assert all stores are kept in-sync). Note we also have TestStore in LDK test utils which we used to use here, but which unfortunately is now unusable for us as we decided that KVStore path it requires some additional per-operation confirmations that we won't be able to run in LDK Node, hence why we introduced InMemoryStore.

Also noticed I missed adding postgres to this

Ah, right, feels a bit orthogonal to this PR, feel free to open a quick follow-up.

joostjager
joostjager previously approved these changes Jun 9, 2026

@joostjager joostjager left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally happy with where this landed. No blockers.

One process point: the PR is quite large and that makes review and re-review inefficient. Also managing the many discussing threads becomes unwieldy and risks loosing track of something that might be important. In my opinion better to avoid in the future.

I think the extensive comments that you and/or AI added are good context for future humans or AIs reading it.

I’ll take a look at the follow-up PR too, to see how the final async end state takes shape.

@tnull tnull force-pushed the 2026-06-async-kvstore-persistence branch from 9e095d1 to 78708a3 Compare June 9, 2026 07:54
@tnull

tnull commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author

Generally happy with where this landed. No blockers.

One process point: the PR is quite large and that makes review and re-review inefficient. Also managing the many discussing threads becomes unwieldy and risks loosing track of something that might be important. In my opinion better to avoid in the future.

I think the extensive comments that you and/or AI added are good context for future humans or AIs reading it.

I’ll take a look at the follow-up PR too, to see how the final async end state takes shape.

Squashed fixups without further changes. Given how invasive the changeset is I'd like to move on this one rather fast, any remaining comments / improvements can happen in follow-ups IMO.

@tnull tnull requested a review from joostjager June 9, 2026 07:58
@tnull

tnull commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author

Ah, grr, still blocked on lightningdevkit/rust-lightning#4661.

@tnull tnull removed the request for review from joostjager June 9, 2026 07:59
tnull added 17 commits June 9, 2026 10:13
Bump LDK to 3dfcc4cca1866c5e5d4d4eaf3b82e09584e2ce5c

Co-Authored-By: HAL 9000
Read and write BDK wallet state through async KVStore helpers while
keeping the current WalletPersister entry points bridged through the
node runtime. This reduces the wallet persistence surface that still
depends on KVStoreSync.

Co-Authored-By: HAL 9000
Static invoice persistence already runs from async handlers, so use
KVStore directly instead of routing those reads and writes through
the blocking KVStoreSync trait.

Co-Authored-By: HAL 9000
Persist peer store updates through async KVStore operations. The
synchronous node APIs keep bridging at their runtime boundary while
async event handling awaits peer persistence directly.

Co-Authored-By: HAL 9000
Persist DataStore mutations through async KVStore operations while
keeping the existing synchronous APIs bridged through the node
runtime. Async event handling now awaits payment store writes
directly.

Co-Authored-By: HAL 9000
Persist node metric updates through async KVStore writes and await
them from the chain, gossip, and scoring tasks. This removes the
remaining blocking metrics writer while keeping the helper name
stable.

Co-Authored-By: HAL 9000
Persist the on-chain wallet through BDK's AsyncWalletPersister so
wallet state writes use the async KVStore path. Existing synchronous
wallet APIs keep bridging through the node runtime until their
callers are made async.

Co-Authored-By: HAL 9000
Open filesystem stores through the async LDK migration helper so
v1-to-v2 store migration no longer depends on the blocking KVStoreSync
migration path.

Co-Authored-By: HAL 9000
Move the existing in-memory test store into a shared module without
changing its behavior. This lets integration tests reuse it while
keeping the later async TestSyncStore change separate.

Co-Authored-By: HAL 9000
Exercise async KVStore operations in TestSyncStore and filesystem migration tests while keeping the temporary sync comparison path until the final KVStoreSync removal.

Co-Authored-By: HAL 9000
Use a local async KVStore-backed monitor persister in store tests so the later blocking KVStore removal can drop the old synchronous MonitorUpdatingPersister path without mixing the test adapter into that change.

Co-Authored-By: HAL 9000
Drop the remaining synchronous KV store trait bounds and implementations. After the preceding migrations, custom stores only need to provide async KVStore persistence, and pathfinding score export also reads from the async store.

Co-Authored-By: HAL 9000
Add a crate-local runtime wrapper for store backends that need to keep their I/O isolated while shutting down safely from async contexts.

Co-Authored-By: HAL 9000
Enable Tokio's eager driver handoff when building with tokio_unstable so node-owned runtimes can use the dedicated driver handoff path where available.

Build binding artifacts and selected CI coverage with tokio_unstable so the cfg-gated runtime path remains exercised.

Co-Authored-By: HAL 9000
Give ldk-node-created runtime workers a stable ldk-node-prefixed thread name so runtime activity is easier to identify in debuggers and logs, matching the store runtime naming convention.

Co-Authored-By: HAL 9000
@tnull tnull force-pushed the 2026-06-async-kvstore-persistence branch from 78708a3 to a3e7653 Compare June 9, 2026 08:15
@tnull

tnull commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author

Ah, grr, still blocked on lightningdevkit/rust-lightning#4661.

Okay, now converted the DROPME commit to an actual bump commit for the 0.3 revision post-merge of that PR.

@tnull tnull requested a review from joostjager June 9, 2026 08:17

@joostjager joostjager left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No CI. I assume the tests pass locally?

@tnull tnull merged commit 00dba45 into lightningdevkit:main Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants