Add docs/design directory along with one initial design doc.#26621
Add docs/design directory along with one initial design doc.#26621sbc100 merged 1 commit intoemscripten-core:mainfrom
Conversation
18b8ce5 to
b08e859
Compare
|
I add some more context to the README. I didn't completely remove the reference to AI here, since that is one of the main motivations here, but I did re-phrase to make it clear its only part of the rationale. |
f0eefd3 to
bb72808
Compare
kripken
left a comment
There was a problem hiding this comment.
lgtm in general, and also the first design doc seems generally good to me, though I'm not an expert on the locking stuff.
| The current implementation uses a 1ms wakeup interval for the main runtime | ||
| thread and a 100ms interval for cancellable pthreads. This leads to unnecessary |
There was a problem hiding this comment.
This is going to be out of date once the new design lands. Should this be phrased as "the previous implementation" or edited out entirely to future-proof the doc?
There was a problem hiding this comment.
I'm not really sure how to address this problem. The perhaps by updating these documents if/when the bugs get closed. i.e. perhaps when Draft is changed to Done or Complete?
|
|
||
| // A counter that is incremented every time the thread wakes up from a futex wait. | ||
| // Used by wakers to ensure the target thread has actually acknowledged the wake. | ||
| _Atomic(uint32_t) wait_counter; |
There was a problem hiding this comment.
Maybe this should be named wake_counter.
There was a problem hiding this comment.
Its the waiter that increments it though. i.e. it counts how much times a given threads has called futex_wait.
There was a problem hiding this comment.
Yes, but it increments upon waking. The current name suggests to me that it should be incremented when going to sleep, but in fact it is incremented when waking.
| 1. **Pre-check**: Check `wait_reasons`. If non-zero, handle the reasons (e.g., process mailbox or handle cancellation). | ||
| 2. **Publish**: Set `waiting_on_address = addr`. | ||
| 3. **Counter Snapshot**: Read `current_counter = wait_counter`. | ||
| 4. **Double-check**: This is critical to avoid the race where a reason was added just before `waiting_on_address` was set. If `wait_reasons` is now non-zero, clear `waiting_on_address` and go to step 1. | ||
| 5. **Wait**: Call `ret = __builtin_wasm_memory_atomic_wait32(addr, val, timeout)`. | ||
| 6. **Unpublish**: | ||
| - Set `waiting_on_address = NULL`. | ||
| - Atomically increment `wait_counter`. | ||
| 7. **Post-check**: Check `wait_reasons`. If non-zero, handle the reasons. | ||
| 8. **Return**: Return the result of the wait to the caller. | ||
| - If `ret == ATOMICS_WAIT_OK`, return `0`. | ||
| - If `ret == ATOMICS_WAIT_TIMED_OUT`, return `-ETIMEDOUT`. | ||
| - If `ret == ATOMICS_WAIT_NOT_EQUAL`, return `-EWOULDBLOCK`. |
There was a problem hiding this comment.
As a meta-comment, I'm not sure it's worth checking in a design doc if it is going to contain precise pseudocode listing like this. It might as well be real code.
| 3. If `target_addr` is not NULL: | ||
| - Read `start_c = target->wait_counter` (`SEQ_CST`). | ||
| - Enter a loop: | ||
| - Call `emscripten_futex_wake(target_addr, 1)`. |
There was a problem hiding this comment.
Why wake just one waiter instead of all waiters at once? If all were woken at once, we could get rid of the wait_counter entirely, right? And maybe get rid of waiting_on_address as well?
There was a problem hiding this comment.
I agree, I don't see any reason not to wake all waiters.
Regarding getting rid waiting_on_address .. I don't see how that is possible? If you need to wake a given thread you need to know what address its currently waiting on don't you?
Regarding getting rid of wait_counter, did you grok the reason why its actually needed? i.e. to fix the possible race between setting waiting_on_address and the actual atomic.wait instruction. Maybe you have some other idea about how to solve that race?
There was a problem hiding this comment.
Ok, I got it. We need the waiting_on_address so we know what address to notify when we want to send a waiting thread a message. It's possible we could send the notification before the waiting thread actually goes to sleep, so to protect against that we notify it repeatedly until we observe it changing its wake count, at which point we know it will proceed to handle the notification.
But instead of a wake count, perhaps we could just have the waiting thread clear its wait_reasons when it reads them after waking up. Then the notifying thread can just observe the wait_reasons get cleared to know the target thread has woken up.
| ### 5. Overlapping and Spurious Wakeups | ||
| The design must handle cases where "real" wakeups (triggered by the application) and "side-channel" wakeups (cancellation/mailbox) occur simultaneously. | ||
|
|
||
| 1. **Spurious Wakeups for Other Threads**: If multiple threads are waiting on the same address (e.g., a shared mutex), a side-channel `atomic_wake(addr, 1)` targeted at Thread A might be delivered by the kernel to Thread B. |
There was a problem hiding this comment.
There is no kernel in this case. Is this describing how native futexes work?
There was a problem hiding this comment.
I guess "kernel" should be replaced by "host" here. Would that makes sense here. This sentence does make sense to me with that replacment.
Its referring to the fact that normal wakups can overlap with the side channel wakeups.
|
|
||
| 1. **Pre-check**: Check `wait_reasons`. If non-zero, handle the reasons (e.g., process mailbox or handle cancellation). | ||
| 2. **Publish**: Set `waiting_on_address = addr`. | ||
| 3. **Counter Snapshot**: Read `current_counter = wait_counter`. |
There was a problem hiding this comment.
This doesn't appear to be used for anything. (But it would be easier to be sure if looking at real code.)
| 1. **Pre-check**: Check `wait_reasons`. If non-zero, handle the reasons (e.g., process mailbox or handle cancellation). | ||
| 2. **Publish**: Set `waiting_on_address = addr`. | ||
| 3. **Counter Snapshot**: Read `current_counter = wait_counter`. | ||
| 4. **Double-check**: This is critical to avoid the race where a reason was added just before `waiting_on_address` was set. If `wait_reasons` is now non-zero, clear `waiting_on_address` and go to step 1. |
There was a problem hiding this comment.
This could be slightly simplified by using the low bits of waiting_on_address as the wait_reason, so both could atomically checked and updated in a CAS loop.
There was a problem hiding this comment.
Actually I think we can just remove wait_reasons completely.
That was something that Gemini added rather than me, and I don't think its actually safe since we could have multiple wakers trying to wake a sleeping thread for different reasons.
This is also why the counter is needed since it removes the multiple-wakers problem.
| 3. If `target_addr` is not NULL: | ||
| - Read `start_c = target->wait_counter` (`SEQ_CST`). | ||
| - Enter a loop: | ||
| - Call `emscripten_futex_wake(target_addr, 1)`. |
There was a problem hiding this comment.
Ok, I got it. We need the waiting_on_address so we know what address to notify when we want to send a waiting thread a message. It's possible we could send the notification before the waiting thread actually goes to sleep, so to protect against that we notify it repeatedly until we observe it changing its wake count, at which point we know it will proceed to handle the notification.
But instead of a wake count, perhaps we could just have the waiting thread clear its wait_reasons when it reads them after waking up. Then the notifying thread can just observe the wait_reasons get cleared to know the target thread has woken up.
|
Proposed simpler protocol using a bit instead of a counter: |
|
I like it! |
No description provided.