Skip to content

Transaction lock time reduction investigation #341

@peppy

Description

@peppy

Following up from discussion last week after we ran into some pretty dire deadlock scenarios involving partition rotation metadata locks, I wanted to investigate whether it would be worthwhile to change the way score-statistics-processor, potentially avoiding opening a transaction in the first place, or if doing so, minimising the amount of time spent within the transaction.

The imminent issue was fixed by reducing the timeout on osu-web rank lookup requests, but I don't see this as a permanent fix because:

  • It is relying on things "eventually succeeding". Individual scores will fail to complete and enter the retry queue during partition switching.
  • Any non-database component failure which is accessed inside the transaction could result in similar database-wise locks, causing full system downtime.

Let's assess the issues at hand:

UserStats write usages

These processor-external usages can be easily fixes, and probably should be as right now they are writing full row data even when they don't need to.

TotalScoreProcessorTests.cs (1 usages)
Commands\Maintenance\VerifyUserRankCounts.cs (5 usages)
Commands\Maintenance\VerifyUserRankedScore.cs (1 usages)

The cases below are a touch more delicate. Because of codependencies and medal awarders, we need to be updating the UserStats tracking object, or be re-querying on subsequent read usages.

Processors\HitStatisticsProcessor.cs (4 usages)

  • countMiss 1
  • count50 1
  • count100 1
  • count300 1
    Processors\LastPlayedDateProcessor.cs (1 usages)
  • last_played
    Processors\MaxComboProcessor.cs (1 usages)
  • max_combo
    Processors\PlayCountProcessor.cs (2 usages)
  • playcount 1 2
    Processors\PlayTimeProcessor.cs (2 usages)
  • total_seconds_played
    Processors\RankedScoreProcessor.cs (1 usages)
  • ranked_score
    Processors\TotalScoreProcessor.cs (6 usages)
  • total_score
  • level
    Processors\UserRankCountProcessor.cs (5 usages)
  • xh_rank_count
  • x_rank_count
  • sh_rank_count
  • s_rank_count
  • a_rank_count
    Processors\UserTotalPerformanceProcessor.cs (3 usages)
  • rank_score
  • accuracy_new
  • rank_score_index

READ COMMITTED is not safe

We switched to this transaction mode previously to avoid actual database deadlocks. I knew at the time that this was not robust, but it did fix the issue and in majority of cases would not cause an issue.

While I was investigating this new issue and writing up documentation, I wanted to write this:

// Opening a transaction here guarantees a mutex over the user which is being processed.
//
// This is important as queue processors are made to be run in parallel at high scale,
// which means two queue processors could end up operating on the same user in an edge case.
//
// Transaction mode READ COMMITTED means that subsequent queries may see different underlying
// data, but...
using (var transaction = conn.BeginTransaction(IsolationLevel.ReadCommitted))

And that's where I stopped because I realised that the part I wrote before the transaction explanation was actually not true. We would need to do SELECT .. FOR UPDATE or change the isolation mode to guarantee this.

Proposed paths forward

Here are our options:

  • Do nothing. Things are working until they aren't.
    • Will potentially still cause sentry error reports each day during partition cycling
    • Any longer failures that take more than a few seconds to timeout could definitely cause system-wide failures.
    • Some updates may not be correct due to the transaction mode. For instance, if two scores by the same user are processed in parallel, a single playcount increment may be missed.
  • Move all write operations to local UPDATE calls
    • This fixes transaction issues for the majority case as the moment an update occurs on user_stats, a row lock will be taken.
    • Requires consideration for subsequent reads (ie medal awarder). We'd need to update the local mode in addition to the database, probably using a SELECT to ensure we have a correct state post-lock.
    • More RTT to database due to increased query count.
    • Does not fix whole system falling over from holding long transactions.
  • Don't start transaction during processing. Store delegates of changes and apply in one go at end.
    • As we're no longer taking out a transaction, there's no guarantee of data consistency during processing. But this is probably fine as long as we are reading from the UserStats object.
    • Fixes cases such as playcount incrementing, as long as UserStats is re-retrieved before applying delegate changes.
    • With the caveat that there could still be two playcount medals awarded for the same user due to lack of 100% transaction coverage. Imagine two scores being processed in parallel from the same base playcount number.
    • We will need to do a pass of UPDATE calls which are inline in apply/revert methods.
    • Proof of concept here. Have not tested yet.

Footnotes

  1. has dependent medal 2 3 4 5

  2. has codependent processor

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions