Skip to content

Add docs/design directory along with one initial design doc.#26621

Merged
sbc100 merged 1 commit intoemscripten-core:mainfrom
sbc100:design_docs
Apr 6, 2026
Merged

Add docs/design directory along with one initial design doc.#26621
sbc100 merged 1 commit intoemscripten-core:mainfrom
sbc100:design_docs

Conversation

@sbc100
Copy link
Copy Markdown
Collaborator

@sbc100 sbc100 commented Apr 3, 2026

No description provided.

@sbc100 sbc100 requested review from kripken and tlively April 3, 2026 18:32
@sbc100 sbc100 force-pushed the design_docs branch 3 times, most recently from 18b8ce5 to b08e859 Compare April 6, 2026 16:01
@sbc100
Copy link
Copy Markdown
Collaborator Author

sbc100 commented Apr 6, 2026

I add some more context to the README. I didn't completely remove the reference to AI here, since that is one of the main motivations here, but I did re-phrase to make it clear its only part of the rationale.

@sbc100 sbc100 force-pushed the design_docs branch 2 times, most recently from f0eefd3 to bb72808 Compare April 6, 2026 18:35
@sbc100 sbc100 requested a review from kripken April 6, 2026 18:35
Copy link
Copy Markdown
Member

@kripken kripken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm in general, and also the first design doc seems generally good to me, though I'm not an expert on the locking stuff.

@sbc100 sbc100 merged commit ecb9e39 into emscripten-core:main Apr 6, 2026
8 of 20 checks passed
@sbc100 sbc100 deleted the design_docs branch April 6, 2026 18:46
Comment on lines +14 to +15
The current implementation uses a 1ms wakeup interval for the main runtime
thread and a 100ms interval for cancellable pthreads. This leads to unnecessary
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be out of date once the new design lands. Should this be phrased as "the previous implementation" or edited out entirely to future-proof the doc?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure how to address this problem. The perhaps by updating these documents if/when the bugs get closed. i.e. perhaps when Draft is changed to Done or Complete?


// A counter that is incremented every time the thread wakes up from a futex wait.
// Used by wakers to ensure the target thread has actually acknowledged the wake.
_Atomic(uint32_t) wait_counter;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be named wake_counter.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its the waiter that increments it though. i.e. it counts how much times a given threads has called futex_wait.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but it increments upon waking. The current name suggests to me that it should be incremented when going to sleep, but in fact it is incremented when waking.

Comment on lines +64 to +76
1. **Pre-check**: Check `wait_reasons`. If non-zero, handle the reasons (e.g., process mailbox or handle cancellation).
2. **Publish**: Set `waiting_on_address = addr`.
3. **Counter Snapshot**: Read `current_counter = wait_counter`.
4. **Double-check**: This is critical to avoid the race where a reason was added just before `waiting_on_address` was set. If `wait_reasons` is now non-zero, clear `waiting_on_address` and go to step 1.
5. **Wait**: Call `ret = __builtin_wasm_memory_atomic_wait32(addr, val, timeout)`.
6. **Unpublish**:
- Set `waiting_on_address = NULL`.
- Atomically increment `wait_counter`.
7. **Post-check**: Check `wait_reasons`. If non-zero, handle the reasons.
8. **Return**: Return the result of the wait to the caller.
- If `ret == ATOMICS_WAIT_OK`, return `0`.
- If `ret == ATOMICS_WAIT_TIMED_OUT`, return `-ETIMEDOUT`.
- If `ret == ATOMICS_WAIT_NOT_EQUAL`, return `-EWOULDBLOCK`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a meta-comment, I'm not sure it's worth checking in a design doc if it is going to contain precise pseudocode listing like this. It might as well be real code.

3. If `target_addr` is not NULL:
- Read `start_c = target->wait_counter` (`SEQ_CST`).
- Enter a loop:
- Call `emscripten_futex_wake(target_addr, 1)`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wake just one waiter instead of all waiters at once? If all were woken at once, we could get rid of the wait_counter entirely, right? And maybe get rid of waiting_on_address as well?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I don't see any reason not to wake all waiters.

Regarding getting rid waiting_on_address .. I don't see how that is possible? If you need to wake a given thread you need to know what address its currently waiting on don't you?

Regarding getting rid of wait_counter, did you grok the reason why its actually needed? i.e. to fix the possible race between setting waiting_on_address and the actual atomic.wait instruction. Maybe you have some other idea about how to solve that race?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you grok

Probably not :)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I got it. We need the waiting_on_address so we know what address to notify when we want to send a waiting thread a message. It's possible we could send the notification before the waiting thread actually goes to sleep, so to protect against that we notify it repeatedly until we observe it changing its wake count, at which point we know it will proceed to handle the notification.

But instead of a wake count, perhaps we could just have the waiting thread clear its wait_reasons when it reads them after waking up. Then the notifying thread can just observe the wait_reasons get cleared to know the target thread has woken up.

### 5. Overlapping and Spurious Wakeups
The design must handle cases where "real" wakeups (triggered by the application) and "side-channel" wakeups (cancellation/mailbox) occur simultaneously.

1. **Spurious Wakeups for Other Threads**: If multiple threads are waiting on the same address (e.g., a shared mutex), a side-channel `atomic_wake(addr, 1)` targeted at Thread A might be delivered by the kernel to Thread B.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no kernel in this case. Is this describing how native futexes work?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess "kernel" should be replaced by "host" here. Would that makes sense here. This sentence does make sense to me with that replacment.

Its referring to the fact that normal wakups can overlap with the side channel wakeups.

sbc100 added a commit to sbc100/emscripten that referenced this pull request Apr 6, 2026
sbc100 added a commit to sbc100/emscripten that referenced this pull request Apr 6, 2026
sbc100 added a commit to sbc100/emscripten that referenced this pull request Apr 6, 2026
sbc100 added a commit to sbc100/emscripten that referenced this pull request Apr 6, 2026

1. **Pre-check**: Check `wait_reasons`. If non-zero, handle the reasons (e.g., process mailbox or handle cancellation).
2. **Publish**: Set `waiting_on_address = addr`.
3. **Counter Snapshot**: Read `current_counter = wait_counter`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't appear to be used for anything. (But it would be easier to be sure if looking at real code.)

1. **Pre-check**: Check `wait_reasons`. If non-zero, handle the reasons (e.g., process mailbox or handle cancellation).
2. **Publish**: Set `waiting_on_address = addr`.
3. **Counter Snapshot**: Read `current_counter = wait_counter`.
4. **Double-check**: This is critical to avoid the race where a reason was added just before `waiting_on_address` was set. If `wait_reasons` is now non-zero, clear `waiting_on_address` and go to step 1.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be slightly simplified by using the low bits of waiting_on_address as the wait_reason, so both could atomically checked and updated in a CAS loop.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think we can just remove wait_reasons completely.

That was something that Gemini added rather than me, and I don't think its actually safe since we could have multiple wakers trying to wake a sleeping thread for different reasons.

This is also why the counter is needed since it removes the multiple-wakers problem.

3. If `target_addr` is not NULL:
- Read `start_c = target->wait_counter` (`SEQ_CST`).
- Enter a loop:
- Call `emscripten_futex_wake(target_addr, 1)`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I got it. We need the waiting_on_address so we know what address to notify when we want to send a waiting thread a message. It's possible we could send the notification before the waiting thread actually goes to sleep, so to protect against that we notify it repeatedly until we observe it changing its wake count, at which point we know it will proceed to handle the notification.

But instead of a wake count, perhaps we could just have the waiting thread clear its wait_reasons when it reads them after waking up. Then the notifying thread can just observe the wait_reasons get cleared to know the target thread has woken up.

@tlively
Copy link
Copy Markdown
Member

tlively commented Apr 6, 2026

Proposed simpler protocol using a bit instead of a counter:

// Side-channel notify a target thread after enqueuing work.
addr = fetch_or(&target->wait_addr, NOTIFY_BIT);
if (addr == NULL) {
  // The thread is not waiting. We can leave the notify bit set.
  return;
}
if ((addr & NOTIFY_BIT) != 0) {
  // Some other thread is already notifying this target.
  return;
}
// We set the bit and are responsible for notifying the target. 
while (atomic_load(&target->wait_addr) == (addr | NOTIFY_BIT)) {
  // The target is not awake yet.
  notify_all(addr)
  sched_yield()
}
// Futex
while (!cas_weak(&self->wait_addr, NULL, addr)) {
  // Handle notifications
  atomic_store(&target->wait_addr, NULL)
  read_mailbox()
}
// From here, anyone trying to side-channel notify us will keep trying until they observe us wake up.
ret = atomic_wait(addr, expected, timeout)
if (atomic_exchange(&self->wait_addr, NULL) != addr) {
  // Handle notifications received while sleeping.
  read_mailbox();
}
return ret;

@sbc100
Copy link
Copy Markdown
Collaborator Author

sbc100 commented Apr 6, 2026

I like it!

sbc100 added a commit to sbc100/emscripten that referenced this pull request Apr 7, 2026
sbc100 added a commit to sbc100/emscripten that referenced this pull request Apr 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants