Skip to content

spotupnp: fix teardown deadlock when removing a renderer#65

Open
007hacky007 wants to merge 1 commit into
philippe44:masterfrom
007hacky007:fix/spotupnp-renderer-teardown-deadlock
Open

spotupnp: fix teardown deadlock when removing a renderer#65
007hacky007 wants to merge 1 commit into
philippe44:masterfrom
007hacky007:fix/spotupnp-renderer-teardown-deadlock

Conversation

@007hacky007
Copy link
Copy Markdown

Hit this on my setup: one of my UPnP renderers vanished from Spotify Connect and spotupnp sat at 100% CPU until I restarted it. Traced it to a lock-ordering bug between UpdateThread and the CSpotPlayer task.

What happens

UpdateThread holds Device->Mutex across spotDeletePlayer(). For example the presence-timeout path at spotupnp.c:811:

pthread_mutex_lock(&Device->Mutex);
spotDeletePlayer(Device->SpotPlayer);   // blocks inside ~CSpotPlayer
DelMRDevice(Device);

~CSpotPlayer() sets isRunning = false, unregisters mDNS (so the renderer disappears from Spotify Connect), then waits on runningMutex for runTask() to finish. But runTask()'s teardown drives shadowRequest(), which also wants Device->Mutex. That lock is still held by UpdateThread, which is itself waiting on the destructor. Classic AB/BA deadlock.

The 100% CPU falls out of the same state: once isRunning=false, CSpotPlayer::writePCM() returns 0 for every write, and cspot's TrackPlayer data-feed loop retries without backoff. That thread pegs a core for as long as the destructor is parked, i.e. until restart.

The logs line up with this: you see player <X> deletion pending but never the subsequent disconnecting player <X> - runTask() can't exit.

Fix

Drop Device->Mutex before spotDeletePlayer() at every site that currently holds it across the call:

  • spotupnp.c: presence-timeout removal, BYE_BYE removal, Sonos master-to-slave transition.
  • mr_util.c: FlushMRDevices() on shutdown.

Pattern:

dying = Device->SpotPlayer;
Device->SpotPlayer = NULL;
pthread_mutex_unlock(&Device->Mutex);
spotDeletePlayer(dying);
pthread_mutex_lock(&Device->Mutex);   // only where DelMRDevice follows

Device->SpotPlayer is nulled out before unlocking so any concurrent consumer on Device->Mutex sees a stable "no player" sentinel instead of a pointer to a dying object. Most of those consumers already null-check via notify(). The exception is spotGetMetaForUrl() (called from ActionHandler at spotupnp.c:560), so add an if (!self) return false; at the top of getMetaForUrl() in spotify.cpp.

DelMRDevice()'s contract is unchanged: still expects Device->Mutex held on entry, still returns it unlocked. The reacquire before DelMRDevice() preserves that.

Notes

After the fix, the renderer comes back on its own via the next UPnP search cycle (about 30 s), no restart required.

This doesn't prevent the presence-timeout path from firing in the first place - if LastSeen is only refreshed from SSDP and SSDP is late or filtered, a healthy renderer can still be flagged as gone and then re-added a cycle later. A separate patch refreshing LastSeen from UPnP events and successful control responses handles that and is complementary to this one.

UpdateThread holds Device->Mutex across spotDeletePlayer(). ~CSpotPlayer()
then waits on runningMutex for CSpotPlayer::runTask() to exit, but the
task's teardown path drives shadowRequest() which itself needs Device->Mutex.
Classic AB/BA deadlock: the renderer disappears from Spotify, mDNS is
unregistered, and the process stays stuck (with CPU pegged by the cspot
TrackPlayer data-feed loop against a writePCM that now returns 0) until
restart.

Drop Device->Mutex before spotDeletePlayer() in every site that currently
holds it across the call:

  * presence-timeout removal in UpdateThread (spotupnp.c)
  * SSDP BYE_BYE removal (spotupnp.c)
  * Sonos master-to-slave transition (spotupnp.c)
  * process-shutdown sweep in FlushMRDevices (mr_util.c)

Device->SpotPlayer is cleared to NULL before unlocking so any concurrent
consumer sees a consistent sentinel instead of a dying object. getMetaForUrl
in spotify.cpp gains a null guard so that sentinel is safe for its one
caller (spotGetMetaForUrl via ActionHandler) which, unlike every other path,
does not go through the already-null-safe notify().

DelMRDevice()'s contract is unchanged: it still expects Device->Mutex held
on entry and returns it unlocked. The reacquire before DelMRDevice()
preserves that contract.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant