Fix pthreads normal-mutex debug implementation #26622
Fix pthreads normal-mutex debug implementation #26622slowriot wants to merge 10 commits intoemscripten-core:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a debug-only regression in Emscripten’s musl pthread normal-mutex path under -pthread (notably affecting dlmalloc with -sWASM_WORKERS) by restoring historical _m_lock semantics while keeping same-thread relock deadlock trapping via debug-only thread-local ownership tracking. It also adds browser regression tests covering the allocator abort and the subsequent stale-owner deadlock false-positive.
Changes:
- Restore
PTHREAD_MUTEX_NORMALdebug behavior to keep_m_lockin the historical0/EBUSYencoding while maintaining deadlock trapping. - Add debug-only thread-local tracking of held normal mutexes using
_m_prev/_m_next, and use it for deadlock assertions. - Add two browser regression tests for the allocator regression and the reporting/teardown false-positive.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
system/lib/libc/musl/src/thread/pthread_mutex_trylock.c |
Implements debug normal-mutex lock-word preservation and hooks normal-mutex ownership tracking. |
system/lib/libc/musl/src/thread/pthread_mutex_unlock.c |
Hooks normal-mutex unlock to remove from debug thread-local ownership tracking. |
system/lib/libc/musl/src/thread/pthread_mutex_timedlock.c |
Updates debug deadlock assertions to consult thread-local normal-mutex ownership list. |
system/lib/libc/musl/src/internal/pthread_impl.h |
Adds debug-only thread-local list + helpers for normal-mutex ownership tracking. |
test/test_browser.py |
Registers two new browser regression tests under the wasm worker suite. |
test/wasm_worker/pthread_mutex_debug_allocator_regression.cpp |
New browser repro focused on allocator churn under -pthread + wasm workers. |
test/wasm_worker/pthread_mutex_debug_reporting_teardown.cpp |
New browser repro focused on the follow-on stale-owner deadlock assert. |
AUTHORS |
Adds contributor entry. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…/emscripten into pthread-mutex-debug-fix
|
It appears the failing test is failing on |
|
I'm not sure the root cause analysis here is correct/complete. My guess is that the issue here is simply that pthead_mutex API was never really tested or guardnattes to work when called from a wasm worker. In particular IIUC the pthread_self() function always return NULL for all wasm workers, which is likely to confuse pretty much all pthread APIs. The fact that normal pthread_mutex lock happened to work before #24607 seems like more of a lucky coincidence perhaps? In particular I don't under understand "The problem is that this changed the observable semantics of the mutex lock word for normal mutexes in debug builds". Surely the contents of the mutex word should be exterally observed by the user program? Its purely internal the pthread implementation right? When you say "That broke dlmalloc under -pthread debug system-library builds".. I'm not sure I understand why it broke exactly? Do you know why it breaks in this one configuration? What exactly is the issue? |
|
I think its rather just revert #24607 than add more complex emscripten patches to the internal pthread API. However, I think it might be better to try to make it so that calling pthead APIs from wasm workers is actually well defined. |
|
@sbc100 apologies if I've been unclear - I'm only becoming familiar with this code as I'm working through it, and learning more as I go. My understanding at this point is:
The changes in this PR stop The advantage over simply reverting, is that this change preserves the intended post-#24607 debug behaviour: same-thread relock of a normal mutex still traps instead of silently hanging forever. Simply reverting would discard that safety improvement. |
This PR fixes a regression in Emscripten's debug-only
PTHREAD_MUTEX_NORMALpath that showed up under-pthread, and adds regression coverage for the two failure modes we reduced from the original application.See #26619 for details on the regression.
Problem
In
4.0.11, Emscripten changed debug builds soPTHREAD_MUTEX_NORMALno longer always takes the old fasta_cas(..., 0, EBUSY)path. The motivation was valid: in debug builds, a same-thread relock of a normal mutex should trap instead of hanging forever.The problem is that this changed the observable semantics of the mutex lock word for normal mutexes in debug builds. In particular, normal mutexes started carrying owner-style state through the slow path instead of preserving the historical
0/EBUSYencoding.That broke
dlmallocunder-pthreaddebug system-library builds.dlmallocuses normal pthread mutexes for its internal locks, and with the new debug path we were able to trigger allocator aborts with a minimal browser repro consisting only of:-pthread-sWASM_WORKERSNo application logic, rendering, or complex containers were required.
The first part of the fix was therefore to restore historical
_m_locksemantics for normal mutexes in debug Emscripten builds while keeping the new debug deadlock trapping behavior.However, that exposed a second issue.
To preserve the debug deadlock check without using
_m_lock, the intermediate fix stored normal-mutex ownership in_m_count. That was enough to fix the allocator regression, but it introduced a second false-positive deadlock under contention:_m_countis shared mutable state, so contending threads can observe stale owner information and incorrectly conclude that the current thread already owns the mutex.This second issue showed up in the harness-facing repro as a deadlock assert during steady-state execution/reporting, even though the mutex was not actually being recursively acquired by the same thread.
Root Cause
There are really two related problems here:
Debug normal mutexes were no longer preserving the historical
0/EBUSY_m_lockencoding.That broke internal users such as
dlmalloc._m_countis not a safe place to track debug-only ownership for normal mutexes under contention.It is shared state, so reading it from another thread is inherently racy and can produce stale-owner false positives.
So the correct design is:
_m_locksemantics unchanged for normal mutexesFix
This PR implements that design.
Changes:
system/lib/libc/musl/src/thread/pthread_mutex_trylock.csystem/lib/libc/musl/src/thread/pthread_mutex_unlock.csystem/lib/libc/musl/src/thread/pthread_mutex_timedlock.csystem/lib/libc/musl/src/internal/pthread_impl.hWhat the fix does:
PTHREAD_MUTEX_NORMALstill keeps_m_lockin the historical0/EBUSYform.struct pthread._m_countand not_m_lock.Why this design:
4.0.11debug change: same-thread relock of a normal mutex still traps instead of silently hanging._m_locksemantics thatdlmallocdepended on.A subtle but important implementation detail is that debug normal-mutex ownership is tracked in a per-thread list anchored on
struct pthread, rather than in a separate_Thread_localvariable or in shared mutex fields such as_m_lock/_m_count. This preserves the historical mutex lock-word semantics for normal mutexes, avoids racy shared ownership state, and also avoids introducing extra TLS state that brokeother.test_threadprofilerduring CI.Why not simpler alternatives
I intentionally did not take these approaches:
Reverting the
4.0.11debug change entirely.That would discard the intended "trap instead of deadlock" behavior that motivated the change.
Keeping
_m_countas the owner field.That fixes the first failure but causes the second one because
_m_countis shared and racy.Disabling assertions / forcing NDEBUG.
That works around the issue but does not fix the debug runtime.
The patch here is meant to be a real upstream fix, not a workaround.
Tests
This PR adds two browser regression tests in
test/test_browser.py:test/wasm_worker/pthread_mutex_debug_allocator_regression.cppThis captures the original allocator failure mode: debug
-pthread + -sWASM_WORKERSplus trivial main/worker allocation traffic must stay clean.test/wasm_worker/pthread_mutex_debug_reporting_teardown.cppThis captures the stale-owner false-positive that appeared after the first fix was applied.
I also verified that the existing deadlock behavior is still intact via:
other.test_pthread_mutex_deadlockother.test_threadprofilerSo the final state is:
--threadprofilerstartup behavior preservedValidation
The fix is validated with:
browser.test_wasm_worker_pthread_mutex_debug_allocator_regressionbrowser.test_wasm_worker_pthread_mutex_debug_reporting_teardownother.test_pthread_mutex_deadlockother.test_threadprofiler