Following up from discussion last week after we ran into some pretty dire deadlock scenarios involving partition rotation metadata locks, I wanted to investigate whether it would be worthwhile to change the way score-statistics-processor, potentially avoiding opening a transaction in the first place, or if doing so, minimising the amount of time spent within the transaction.
The imminent issue was fixed by reducing the timeout on osu-web rank lookup requests, but I don't see this as a permanent fix because:
- It is relying on things "eventually succeeding". Individual scores will fail to complete and enter the retry queue during partition switching.
- Any non-database component failure which is accessed inside the transaction could result in similar database-wise locks, causing full system downtime.
Let's assess the issues at hand:
UserStats write usages
These processor-external usages can be easily fixes, and probably should be as right now they are writing full row data even when they don't need to.
TotalScoreProcessorTests.cs (1 usages)
Commands\Maintenance\VerifyUserRankCounts.cs (5 usages)
Commands\Maintenance\VerifyUserRankedScore.cs (1 usages)
The cases below are a touch more delicate. Because of codependencies and medal awarders, we need to be updating the UserStats tracking object, or be re-querying on subsequent read usages.
Processors\HitStatisticsProcessor.cs (4 usages)
countMiss 1
count50 1
count100 1
count300 1
Processors\LastPlayedDateProcessor.cs (1 usages)
last_played
Processors\MaxComboProcessor.cs (1 usages)
max_combo
Processors\PlayCountProcessor.cs (2 usages)
playcount 1 2
Processors\PlayTimeProcessor.cs (2 usages)
total_seconds_played
Processors\RankedScoreProcessor.cs (1 usages)
ranked_score
Processors\TotalScoreProcessor.cs (6 usages)
total_score
level
Processors\UserRankCountProcessor.cs (5 usages)
xh_rank_count
x_rank_count
sh_rank_count
s_rank_count
a_rank_count
Processors\UserTotalPerformanceProcessor.cs (3 usages)
rank_score
accuracy_new
rank_score_index
READ COMMITTED is not safe
We switched to this transaction mode previously to avoid actual database deadlocks. I knew at the time that this was not robust, but it did fix the issue and in majority of cases would not cause an issue.
While I was investigating this new issue and writing up documentation, I wanted to write this:
// Opening a transaction here guarantees a mutex over the user which is being processed.
//
// This is important as queue processors are made to be run in parallel at high scale,
// which means two queue processors could end up operating on the same user in an edge case.
//
// Transaction mode READ COMMITTED means that subsequent queries may see different underlying
// data, but...
using (var transaction = conn.BeginTransaction(IsolationLevel.ReadCommitted))
And that's where I stopped because I realised that the part I wrote before the transaction explanation was actually not true. We would need to do SELECT .. FOR UPDATE or change the isolation mode to guarantee this.
Proposed paths forward
Here are our options:
- Do nothing. Things are working until they aren't.
- Will potentially still cause sentry error reports each day during partition cycling
- Any longer failures that take more than a few seconds to timeout could definitely cause system-wide failures.
- Some updates may not be correct due to the transaction mode. For instance, if two scores by the same user are processed in parallel, a single playcount increment may be missed.
- Move all write operations to local
UPDATE calls
- This fixes transaction issues for the majority case as the moment an update occurs on
user_stats, a row lock will be taken.
- Requires consideration for subsequent reads (ie medal awarder). We'd need to update the local mode in addition to the database, probably using a SELECT to ensure we have a correct state post-lock.
- More RTT to database due to increased query count.
- Does not fix whole system falling over from holding long transactions.
- Don't start transaction during processing. Store delegates of changes and apply in one go at end.
- As we're no longer taking out a transaction, there's no guarantee of data consistency during processing. But this is probably fine as long as we are reading from the
UserStats object.
- Fixes cases such as
playcount incrementing, as long as UserStats is re-retrieved before applying delegate changes.
- With the caveat that there could still be two playcount medals awarded for the same user due to lack of 100% transaction coverage. Imagine two scores being processed in parallel from the same base
playcount number.
- We will need to do a pass of
UPDATE calls which are inline in apply/revert methods.
- Proof of concept here. Have not tested yet.
Following up from discussion last week after we ran into some pretty dire deadlock scenarios involving partition rotation metadata locks, I wanted to investigate whether it would be worthwhile to change the way
score-statistics-processor, potentially avoiding opening a transaction in the first place, or if doing so, minimising the amount of time spent within the transaction.The imminent issue was fixed by reducing the timeout on osu-web rank lookup requests, but I don't see this as a permanent fix because:
Let's assess the issues at hand:
UserStats write usages
These processor-external usages can be easily fixes, and probably should be as right now they are writing full row data even when they don't need to.
TotalScoreProcessorTests.cs(1 usages)Commands\Maintenance\VerifyUserRankCounts.cs(5 usages)Commands\Maintenance\VerifyUserRankedScore.cs(1 usages)The cases below are a touch more delicate. Because of codependencies and medal awarders, we need to be updating the
UserStatstracking object, or be re-querying on subsequent read usages.Processors\HitStatisticsProcessor.cs(4 usages)countMiss1count501count1001count3001Processors\LastPlayedDateProcessor.cs(1 usages)last_playedProcessors\MaxComboProcessor.cs(1 usages)max_comboProcessors\PlayCountProcessor.cs(2 usages)playcount1 2Processors\PlayTimeProcessor.cs(2 usages)total_seconds_playedProcessors\RankedScoreProcessor.cs(1 usages)ranked_scoreProcessors\TotalScoreProcessor.cs(6 usages)total_scorelevelProcessors\UserRankCountProcessor.cs(5 usages)xh_rank_countx_rank_countsh_rank_counts_rank_counta_rank_countProcessors\UserTotalPerformanceProcessor.cs(3 usages)rank_scoreaccuracy_newrank_score_indexREAD COMMITTED is not safe
We switched to this transaction mode previously to avoid actual database deadlocks. I knew at the time that this was not robust, but it did fix the issue and in majority of cases would not cause an issue.
While I was investigating this new issue and writing up documentation, I wanted to write this:
And that's where I stopped because I realised that the part I wrote before the transaction explanation was actually not true. We would need to do
SELECT .. FOR UPDATEor change the isolation mode to guarantee this.Proposed paths forward
Here are our options:
UPDATEcallsuser_stats, a row lock will be taken.UserStatsobject.playcountincrementing, as long asUserStatsis re-retrieved before applying delegate changes.playcountnumber.UPDATEcalls which are inline in apply/revert methods.Footnotes
has dependent medal ↩ ↩2 ↩3 ↩4 ↩5
has codependent processor ↩