Skip to content

Deterministic runtime crate#5016

Open
Shubham8287 wants to merge 53 commits into
masterfrom
shub/sim
Open

Deterministic runtime crate#5016
Shubham8287 wants to merge 53 commits into
masterfrom
shub/sim

Conversation

@Shubham8287
Copy link
Copy Markdown
Contributor

@Shubham8287 Shubham8287 commented May 13, 2026

Description of Changes.

Introduces deterministic runtime crate.
Integrate it with RelationalDB.

I think best steps to review:

  • Read the README of runtime crate.
  • Look at the integration with existing crates - durability, core, snapshot, etc.
  • Read runtime crate's code.

Draft branch to Test code - #5019

API and ABI breaking changes

NA

Expected complexity level and risk

Does not intend to change any production functionality, but it's big code.

Testing

  • new crate contains unit and integration tests.
  • Existing tests should work for production.

@Shubham8287 Shubham8287 self-assigned this May 14, 2026
Comment thread crates/runtime/src/lib.rs
Tokio(tokio::task::JoinHandle<T>),
#[cfg(feature = "simulation")]
Simulation(sim::JoinHandle<T>),
Detached(PhantomData<T>),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is interesting and could use some commentary!

Comment thread crates/runtime/src/sim_std.rs
Comment thread crates/runtime/DETERMINISM_COVERAGE.md
Comment thread crates/runtime/src/sim/time/mod.rs Outdated
Comment thread crates/snapshot/tests/remote.rs
let original = PTHREAD_ATTR_INIT.get_or_init(|| unsafe {
// `RTLD_NEXT` skips this interposed function and finds the libc
// implementation that would have been called without the simulator.
let ptr = libc::dlsym(libc::RTLD_NEXT, c"pthread_attr_init".as_ptr().cast());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wild.

Copy link
Copy Markdown
Contributor Author

@Shubham8287 Shubham8287 May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overriding libc methods turning out to be limited for detecting source of nondeterminism. I am experimenting with eBPF, which looks more powerful.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL about eBPF. Interesting.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be used to do so many things, https://github.com/clockworklabs/SpacetimeDB/pull/5177/changes#diff-8c7a3b299146fed405fed8a131643bb6aaac19152857cb129826ef17457d1a89. This is some experimental low level code to detect futex call (under simulation, this means determinism leakage).

@Shubham8287 Shubham8287 requested review from jsdt and removed request for Centril May 28, 2026 15:50
Comment thread crates/runtime/README.md

- **Zero dependency.** The simulation core in `sim/` is already `no_std + alloc`. The `sim_std` module is a thin OS-facing wrapper — the std dependency lives there, not in the simulation core itself. It stays until the application logic above this crate also moves to `no_std`.

## Current Limitations
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You touched on this in the point about it being a single-threaded runtime, but this is also limited in how granularly it can interleave executions of different tasks, which is going to make it impossible to produce some race conditions that would be possible in a multi-threaded runtime. If I'm reading this correctly, the sim executor picks one of the tasks, then it polls that task (which runs the task until it finishes or an await returns Pending), then it picks a task again. This means we can only mess with scheduling when a task yields.

For a simple example of a case of an execution that we can't simulate:

async fn task_a(mut rx: Receiver<Msg>) {
    log("A: waiting");
    rx.recv().await;      // Pending until Task B sends.
    log("A: received a message");
    do_a_work();          // Cannot run until Task B yields or finishes.
}

async fn task_b(mut tx: Sender<Msg>) {
    log("B: running");
    tx.send(Msg).await;   // Ready immediately; wakes Task A but does not yield.
    do_b_work();      // Always runs before Task A resumes.
    log("B: finished working");
}

We could never produce:

A: waiting
B: running
A: received a message
B: finished working

So if there is a bug when do_a_work runs before do_b_work, we would never be able to catch it.

I think the fix for this specific case would be to have channels be part of the runtime crate, so that when send is called, we can insert a yield_now call in the sim version (if send is async).

More generally, if the goals is to be able to simulate any possible race condition, we need to be able to suspend/interleave execution at any point in the code that either affects other tasks/threads or is affected by other tasks/threads.

For things with an async api, having a sim version that inserts a yield_now will make it possible to interleave, but addressing this for sync apis that affect other threads will require changing the executor. Even channels have some sync functions that we would want to be able to suspend (like try_recv or send on an UnboundedSender).

If we want this to work for sync functions that interact with other tasks, then the sim executor will need more than one thread.

Copy link
Copy Markdown
Contributor Author

@Shubham8287 Shubham8287 Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we also discussed this out of band, but putting it here too:

More generally, if the goals is to be able to simulate any possible race condition, we need to be able to suspend/interleave execution at any point in the code that either affects other tasks/threads or is affected by other tasks/threads.

Agreed. With the current DB architecture, a single-threaded executor cannot simulate every possible race condition.

That said, I think the broader goal should be to move toward an architecture where data races are impossible by design. In practice, that means moving toward a single-threaded runtime model, which also aligns with our thread-per-core principle.

Practically, I think this would mean allowing only the database thread to touch the datastore, while potentially keeping separate threads for things like ws message handoff. If the “deep core” part of our DB runs single-threaded, we should be able to simulate all of it. The parts that must remain multi-threaded should live outside the “core,” and have to be tackled later.

If we want this to work for sync functions that interact with other tasks, then the sim executor will need more than one thread.

I think putting more than one thread in executor make the tests not replayable by seed.

So, maybe I think current scope should be to keep executor single-threaded only, even if does not let us explore all race conditions, it is still very helpful for other classes of bugs - Schema bugs, replayability bugs, fault-tolerance related, etc from day 1.

Copy link
Copy Markdown
Contributor Author

@Shubham8287 Shubham8287 Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I’m also checking out https://docs.rs/loom/latest/loom/ to understand how it detects multithread-specific bugs while still being deterministic, and whether we can leverage something from it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think putting more than one thread in executor make the tests not replayable by seed.

I'm sure that is not the case. We can have multiple threads while staying replayable/deterministic as long as we ensure that only one thread is unparked at any given time. I'm working on a PoC of that today.

I think current scope should be to keep executor single-threaded only, even if does not let us explore all race conditions, it is still very helpful for other classes of bugs - Schema bugs, replayability bugs, fault-tolerance related, etc from day 1.

I think that is reasonable. Maybe if the multi-threaded version is easy to implement, it would make sense to add it sooner, but it can probably be added later without breaking what we have now.

That said, I think the broader goal should be to move toward an architecture where data races are impossible by design. In practice, that means moving toward a single-threaded runtime model, which also aligns with our thread-per-core principle.

Making races impossible sounds great, but even with a thread-per-core design, we will almost certainly still have multiple threads interacting with each other.

Copy link
Copy Markdown
Contributor Author

@Shubham8287 Shubham8287 Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure that is not the case. We can have multiple threads while staying replayable/deterministic as long as we ensure that only one thread is unparked at any given time. I'm working on a PoC of that today.

Right, that's a good approach which you already explain but missed from my mind :)

Comment thread crates/runtime/src/lib.rs
Tokio(TokioHandle),
#[cfg(feature = "simulation")]
Simulation(sim::Handle),
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want this comment to block merging this PR, so feel free to close it, but I'm curious about the chose of enums vs traits to implement this. Seems reasonable, but could you comment on that choice briefly?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong reasons, but this felt more natural due to small set of variants. Using trait would have require to wrap TokioHandle in some newtype to be able to implement custom trait.

/// Paused-node tasks are diverted into that node's paused buffer instead of
/// being polled immediately.
fn run_all_ready(&self) {
while let Some(runnable) = self.queue.try_recv_random(&self.rng) {
Copy link
Copy Markdown
Contributor

@cloutiertyler cloutiertyler Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed, I also think that round robin order is correct and possibly preferrable (within a single node).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would round-robin order be preferrable? That seems like it would limit the set race conditions that can be simulated significantly.

@@ -0,0 +1,51 @@
use crate::sim::Runtime;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: You might consider having sim be a separate crate, because eventually you're going to want to mock out things like the disk/network/etc. which isn't really part of the runtime.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all very similar to what I have in my experimental project.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all reasonable, but I do want to flag that we should work towards just not using these things inside the deep database at all.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's how this crate is structured. everything inside sim module is no_std. sim_std is helper wrapper until dependent code also uses std.

Copy link
Copy Markdown
Contributor

@cloutiertyler cloutiertyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite good overall. I think it's a good start.

Copy link
Copy Markdown
Contributor

@jsdt jsdt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a solid start

};
}

if self.time.wake_next_timer() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we only advance timers when none of the other tasks are runnable, are we going to be unable to simulate some timing scenarios? We could randomly decide to wake a timer instead of running a task sometimes (in run_all_ready), though that might just cause a lot of time outs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right! I think more subtle way to do this would be to occasionally advance time inside run_all_ready by few microseconds. This can be done by calling self.time.advace() api. Which will also take care to wake the registered sleepers, if needed.
This will advance the clock more naturally, and let timer tasks to be waken up along with other tasks?

time.wake_next_timer() thing could still be there when no tasks presents, to accelerate the time.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think time.wake_next_timer() does need to be there so avoid getting stuck (for cases where we are actually hitting a time out).

The idea of randomly/periodically advancing the clock makes sense to me. We could have a parameter like polls_per_second or avg_time_per_poll to control how quickly the clock advances as tasks are executed. With a avg_time_per_poll: Duration parameter, we could just advance time by that amount after every piece of work, or we could make it random with something like avg_time_per_poll.mul_f64(-random::<f64>().ln()).


## Status Definitions

- `Controlled`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need one more status type to distinguish between a source of nondeterminism that we can make deterministic (which is what you are calling controlled), vs a source of non-determinism which we can deterministically simulate all possible outcomes.

Some examples that are deterministic with this framework, but can result in race conditions that we can't create are channels and atomic variables.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants