Skip to content

fix(event cache): don't deadlock when event-focused cache and redecryption are invoked#6542

Merged
jmartinesp merged 1 commit intomatrix-org:mainfrom
bnjbvr:bnjbvr/fix-deadlock-with-ef-cache-and-redecryptor
May 6, 2026
Merged

fix(event cache): don't deadlock when event-focused cache and redecryption are invoked#6542
jmartinesp merged 1 commit intomatrix-org:mainfrom
bnjbvr:bnjbvr/fix-deadlock-with-ef-cache-and-redecryptor

Conversation

@bnjbvr
Copy link
Copy Markdown
Contributor

@bnjbvr bnjbvr commented May 5, 2026

The following ABBA situation would happen before this fix:

  • an event-focused cache starts pagination (holding onto the event-focused inner lock), and then takes the event cache room state lock (while saving the events it's received in the pagination).
  • redecryption takes the event cache room state lock, to update internal room events; then it would try to go through all the event-focused caches to replace their UTDs (thus taking all the inner locks, in turn).

So, having an event-focused cache paginating while the redecryption happens would cause a deadlock. The fix consists in changing the order of operations when re-decrypting: instead of holding onto the event cache room state lock all the time, first perform all the replacements necessary in the room cache, and clone the event-focused / pinned-event locks; then, replace UTDs in those (after the event cache room state lock has been released).

  • No public API changes.
  • This PR was made with the help of AI. (Used Claude to find the reason for the deadlock, based on the repro steps from the issue.)

(Likely) fixes #6524.

cc @jmartinesp @poljar


Signed-off-by: benjamin@bouvier.cc

…ption are invoked

The following ABBA situation would happen before this fix:

- an event-focused cache starts pagination (holding onto the
event-focused inner lock), and then takes the event cache room state
lock (while saving the events it's received in the pagination).
- redecryption takes the event cache room state lock, to update internal
room events; then it would try to go through all the event-focused
caches to replace their UTDs (thus taking all the inner locks, in turn).

So, having an event-focused cache paginating while the redecryption
happens would cause a deadlock. The fix consists in changing the order
of operations when re-decrypting: instead of holding onto the event
cache room state lock all the time, first perform all the replacements
necessary in the room cache, *and* clone the event-focused /
pinned-event locks; then, replace UTDs in those (after the event cache
room state lock has been released).
@bnjbvr bnjbvr requested a review from a team as a code owner May 5, 2026 17:22
@bnjbvr bnjbvr requested review from andybalaam and removed request for a team May 5, 2026 17:22
@codecov
Copy link
Copy Markdown

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 98.68421% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 89.97%. Comparing base (2ec031b) to head (bba09c9).
⚠️ Report is 9 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
crates/matrix-sdk/src/event_cache/redecryptor.rs 98.68% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6542      +/-   ##
==========================================
+ Coverage   89.92%   89.97%   +0.04%     
==========================================
  Files         381      381              
  Lines      105914   105989      +75     
  Branches   105914   105989      +75     
==========================================
+ Hits        95243    95362     +119     
+ Misses       7029     6994      -35     
+ Partials     3642     3633       -9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 5, 2026

Merging this PR will not alter performance

✅ 50 untouched benchmarks


Comparing bnjbvr:bnjbvr/fix-deadlock-with-ef-cache-and-redecryptor (bba09c9) with main (2ec031b)

Open in CodSpeed

@jmartinesp
Copy link
Copy Markdown
Contributor

jmartinesp commented May 6, 2026

Thanks, I'll test it on Element X Android, but your explanation matches what I saw when trying to reproduce the issue, so I have high hopes!


I wasn't able to reproduce with my previous setup, which was almost guaranteed to make the issue happen, so I think it's fixed!

Copy link
Copy Markdown
Contributor

@jmartinesp jmartinesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but a 2nd review from someone more familiar with r2d2 or the event caches would be appreciated!

@Hywan Hywan self-requested a review May 6, 2026 06:33
Copy link
Copy Markdown
Member

@Hywan Hywan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jmartinesp jmartinesp enabled auto-merge (rebase) May 6, 2026 08:55
@andybalaam andybalaam removed their request for review May 6, 2026 08:57
@jmartinesp jmartinesp merged commit d0cb664 into matrix-org:main May 6, 2026
80 of 82 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

A deadlock can happen when using event focused timelines

3 participants