Deterministic runtime crate#5016
Conversation
Co-authored-by: Shubham Mishra <shivam828787@gmail.com> Signed-off-by: Shubham Mishra <shivam828787@gmail.com>
Signed-off-by: Shubham Mishra <shivam828787@gmail.com>
| Tokio(tokio::task::JoinHandle<T>), | ||
| #[cfg(feature = "simulation")] | ||
| Simulation(sim::JoinHandle<T>), | ||
| Detached(PhantomData<T>), |
There was a problem hiding this comment.
This one is interesting and could use some commentary!
| let original = PTHREAD_ATTR_INIT.get_or_init(|| unsafe { | ||
| // `RTLD_NEXT` skips this interposed function and finds the libc | ||
| // implementation that would have been called without the simulator. | ||
| let ptr = libc::dlsym(libc::RTLD_NEXT, c"pthread_attr_init".as_ptr().cast()); |
There was a problem hiding this comment.
overriding libc methods turning out to be limited for detecting source of nondeterminism. I am experimenting with eBPF, which looks more powerful.
There was a problem hiding this comment.
TIL about eBPF. Interesting.
There was a problem hiding this comment.
It can be used to do so many things, https://github.com/clockworklabs/SpacetimeDB/pull/5177/changes#diff-8c7a3b299146fed405fed8a131643bb6aaac19152857cb129826ef17457d1a89. This is some experimental low level code to detect futex call (under simulation, this means determinism leakage).
|
|
||
| - **Zero dependency.** The simulation core in `sim/` is already `no_std + alloc`. The `sim_std` module is a thin OS-facing wrapper — the std dependency lives there, not in the simulation core itself. It stays until the application logic above this crate also moves to `no_std`. | ||
|
|
||
| ## Current Limitations |
There was a problem hiding this comment.
You touched on this in the point about it being a single-threaded runtime, but this is also limited in how granularly it can interleave executions of different tasks, which is going to make it impossible to produce some race conditions that would be possible in a multi-threaded runtime. If I'm reading this correctly, the sim executor picks one of the tasks, then it polls that task (which runs the task until it finishes or an await returns Pending), then it picks a task again. This means we can only mess with scheduling when a task yields.
For a simple example of a case of an execution that we can't simulate:
async fn task_a(mut rx: Receiver<Msg>) {
log("A: waiting");
rx.recv().await; // Pending until Task B sends.
log("A: received a message");
do_a_work(); // Cannot run until Task B yields or finishes.
}
async fn task_b(mut tx: Sender<Msg>) {
log("B: running");
tx.send(Msg).await; // Ready immediately; wakes Task A but does not yield.
do_b_work(); // Always runs before Task A resumes.
log("B: finished working");
}
We could never produce:
A: waiting
B: running
A: received a message
B: finished working
So if there is a bug when do_a_work runs before do_b_work, we would never be able to catch it.
I think the fix for this specific case would be to have channels be part of the runtime crate, so that when send is called, we can insert a yield_now call in the sim version (if send is async).
More generally, if the goals is to be able to simulate any possible race condition, we need to be able to suspend/interleave execution at any point in the code that either affects other tasks/threads or is affected by other tasks/threads.
For things with an async api, having a sim version that inserts a yield_now will make it possible to interleave, but addressing this for sync apis that affect other threads will require changing the executor. Even channels have some sync functions that we would want to be able to suspend (like try_recv or send on an UnboundedSender).
If we want this to work for sync functions that interact with other tasks, then the sim executor will need more than one thread.
There was a problem hiding this comment.
As we also discussed this out of band, but putting it here too:
More generally, if the goals is to be able to simulate any possible race condition, we need to be able to suspend/interleave execution at any point in the code that either affects other tasks/threads or is affected by other tasks/threads.
Agreed. With the current DB architecture, a single-threaded executor cannot simulate every possible race condition.
That said, I think the broader goal should be to move toward an architecture where data races are impossible by design. In practice, that means moving toward a single-threaded runtime model, which also aligns with our thread-per-core principle.
Practically, I think this would mean allowing only the database thread to touch the datastore, while potentially keeping separate threads for things like ws message handoff. If the “deep core” part of our DB runs single-threaded, we should be able to simulate all of it. The parts that must remain multi-threaded should live outside the “core,” and have to be tackled later.
If we want this to work for sync functions that interact with other tasks, then the sim executor will need more than one thread.
I think putting more than one thread in executor make the tests not replayable by seed.
So, maybe I think current scope should be to keep executor single-threaded only, even if does not let us explore all race conditions, it is still very helpful for other classes of bugs - Schema bugs, replayability bugs, fault-tolerance related, etc from day 1.
There was a problem hiding this comment.
Btw, I’m also checking out https://docs.rs/loom/latest/loom/ to understand how it detects multithread-specific bugs while still being deterministic, and whether we can leverage something from it.
There was a problem hiding this comment.
I think putting more than one thread in executor make the tests not replayable by seed.
I'm sure that is not the case. We can have multiple threads while staying replayable/deterministic as long as we ensure that only one thread is unparked at any given time. I'm working on a PoC of that today.
I think current scope should be to keep executor single-threaded only, even if does not let us explore all race conditions, it is still very helpful for other classes of bugs - Schema bugs, replayability bugs, fault-tolerance related, etc from day 1.
I think that is reasonable. Maybe if the multi-threaded version is easy to implement, it would make sense to add it sooner, but it can probably be added later without breaking what we have now.
That said, I think the broader goal should be to move toward an architecture where data races are impossible by design. In practice, that means moving toward a single-threaded runtime model, which also aligns with our thread-per-core principle.
Making races impossible sounds great, but even with a thread-per-core design, we will almost certainly still have multiple threads interacting with each other.
There was a problem hiding this comment.
I'm sure that is not the case. We can have multiple threads while staying replayable/deterministic as long as we ensure that only one thread is unparked at any given time. I'm working on a PoC of that today.
Right, that's a good approach which you already explain but missed from my mind :)
| Tokio(TokioHandle), | ||
| #[cfg(feature = "simulation")] | ||
| Simulation(sim::Handle), | ||
| } |
There was a problem hiding this comment.
I don't want this comment to block merging this PR, so feel free to close it, but I'm curious about the chose of enums vs traits to implement this. Seems reasonable, but could you comment on that choice briefly?
There was a problem hiding this comment.
No strong reasons, but this felt more natural due to small set of variants. Using trait would have require to wrap TokioHandle in some newtype to be able to implement custom trait.
| /// Paused-node tasks are diverted into that node's paused buffer instead of | ||
| /// being polled immediately. | ||
| fn run_all_ready(&self) { | ||
| while let Some(runnable) = self.queue.try_recv_random(&self.rng) { |
There was a problem hiding this comment.
As we discussed, I also think that round robin order is correct and possibly preferrable (within a single node).
There was a problem hiding this comment.
Why would round-robin order be preferrable? That seems like it would limit the set race conditions that can be simulated significantly.
| @@ -0,0 +1,51 @@ | |||
| use crate::sim::Runtime; | |||
There was a problem hiding this comment.
NIT: You might consider having sim be a separate crate, because eventually you're going to want to mock out things like the disk/network/etc. which isn't really part of the runtime.
There was a problem hiding this comment.
This is all very similar to what I have in my experimental project.
There was a problem hiding this comment.
This is all reasonable, but I do want to flag that we should work towards just not using these things inside the deep database at all.
There was a problem hiding this comment.
Yeah, that's how this crate is structured. everything inside sim module is no_std. sim_std is helper wrapper until dependent code also uses std.
cloutiertyler
left a comment
There was a problem hiding this comment.
This is quite good overall. I think it's a good start.
jsdt
left a comment
There was a problem hiding this comment.
This looks like a solid start
| }; | ||
| } | ||
|
|
||
| if self.time.wake_next_timer() { |
There was a problem hiding this comment.
If we only advance timers when none of the other tasks are runnable, are we going to be unable to simulate some timing scenarios? We could randomly decide to wake a timer instead of running a task sometimes (in run_all_ready), though that might just cause a lot of time outs.
There was a problem hiding this comment.
Oh, right! I think more subtle way to do this would be to occasionally advance time inside run_all_ready by few microseconds. This can be done by calling self.time.advace() api. Which will also take care to wake the registered sleepers, if needed.
This will advance the clock more naturally, and let timer tasks to be waken up along with other tasks?
time.wake_next_timer() thing could still be there when no tasks presents, to accelerate the time.
There was a problem hiding this comment.
I think time.wake_next_timer() does need to be there so avoid getting stuck (for cases where we are actually hitting a time out).
The idea of randomly/periodically advancing the clock makes sense to me. We could have a parameter like polls_per_second or avg_time_per_poll to control how quickly the clock advances as tasks are executed. With a avg_time_per_poll: Duration parameter, we could just advance time by that amount after every piece of work, or we could make it random with something like avg_time_per_poll.mul_f64(-random::<f64>().ln()).
|
|
||
| ## Status Definitions | ||
|
|
||
| - `Controlled` |
There was a problem hiding this comment.
I think we need one more status type to distinguish between a source of nondeterminism that we can make deterministic (which is what you are calling controlled), vs a source of non-determinism which we can deterministically simulate all possible outcomes.
Some examples that are deterministic with this framework, but can result in race conditions that we can't create are channels and atomic variables.
Description of Changes.
Introduces deterministic runtime crate.
Integrate it with RelationalDB.
I think best steps to review:
durability,core,snapshot, etc.Draft branch to Test code - #5019
API and ABI breaking changes
NA
Expected complexity level and risk
Does not intend to change any production functionality, but it's big code.
Testing