Adjust mod multiplier recalculation processes to minimise number of required database writes#374
Draft
bdach wants to merge 5 commits into
Draft
Conversation
By itself this does nothing, the old API worked *fine* as far as the backpopulation was concerned. However doing this allows for an opportunity of automatically handling the change to mania key mod multipliers that was done live without backwards adjustments. More on this later.
…y number of rows As the inline commentary states, the number of rows that require writes of `total_score_without_mods` can be significantly reduced because of two facts: - Firstly, when it comes to scores set on stable, `total_score_without_mods` as well as `total_score` are quantities derived directly from `legacy_total_score` as well as beatmap attributes. Therefore, it is not actually necessary to write `total_score_without_mods` for those scores, and instead the score conversion algorithm can be used to recalculate the multipliers instead. - Secondly, when it comes to lazer scores specifically, this change assumes the invariant that if there are no mods on a score, then `total_score == total_score_without_mods` - so there is no point writing out values.
… key mod multipliers Migrating to the new multiplier API, which already has to handle multiplier versioning and the directly-applied change to mania key mod multipliers, means that it is now possible to leverage the versioned logic and automatically correct for the key mod multiplier change by filling out `ScoreInfo.ClientVersion` using `osu_builds`.
…necessary number of rows with `total_score_without_mods` Adds an alternate processing path for stable scores which leverages the algorithm of converting from legacy total score to implement the multiplier recalculation. The command was already skipping scores with no mods at all, so no adjustment required there.
Written using real-world cases pulled from live production scores (via data.ppy.sh dumps plus some manual trawling for the mania key mod cases).
21 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PRing early as draft primarily for visibility reasons.
The original plan for deploying the mod multiplier changes was to first ensure that all scores in the database have
total_score_without_modspresent, and then write out the updatedtotal_scoreacross the whole table as the new multipliers times that populatedtotal_score_without_modsquantity.Aside: Why is the migration not run in-place?
The primary reason for me approaching things in this way rather than running the recalculation in-place is because running the recalculation in-place would be an inherently lossy process.
Assuming everything goes right, because total score is rounded to full integers and mod multipliers are very much not integers, the possibility of rounding errors and floating-point inaccuracies is very real. Therefore, any in-place recalculation of multipliers would be a process that ideally would only ever occur once and never again, because the more times it is performed, the bigger the degradation incurred with every in-place conversion.
And that is assuming everything goes right. In case of any mistake or incorrectly-applied migration, the original data would be lost in an in-place migration, and would not be recoverable quickly or at all because of how complicated the scoring algorithm is at the explicit demand of the user base. The closest thing to a recovery procedure in that case would be replays - when there are replays - and even that would be slow.
It is my former professional experience that tells me you never want to run huge data migrations in-place which incur chances of irretrievable or difficult-to-recover-from data loss.
In recent days, an attempt was made to run the backpopulation of
total_score_without_modson the productonscorestable. Unfortunately at this time the original plan appears to be mostly unviable.The number of rows in the production
scorestable that require this backfill is estimated to be around 2.38 billion rows. This would incur a storage overhead in the hundreds of gigabytes (not only from the actual rows themselves, but also from a transient spike in the size of binlog, which is relevant when replication is depended upon) and take roughly a month to execute.This PR contains a set of changes that is designed to reduce the number of writes required to execute this migration. The cost here is the increased complexity of the migration process. While I am somewhat confident this is going to work, I will want to test this more thoroughly on the actual changes and maybe even do some test runs locally using data dumps to make sure I have not missed anything.
Additionally, comments from #269 (review) are addressed here.
As for the gory details:
Switch to new mod multiplier calculation API in backpopulation command
By itself this does nothing, the old API worked fine as far as the backpopulation was concerned. However doing this allows for an opportunity of automatically handling the change to mania key mod multipliers that was done live without backwards adjustments. More on this later.
Adjust query in backpopulation command to target the minimum necessary number of rows
As the inline commentary states, the number of rows that require writes of
total_score_without_modscan be significantly reduced because of two facts:total_score_without_modsas well astotal_scoreare quantities derived directly fromlegacy_total_scoreas well as beatmap attributes. Therefore, it is not actually necessary to writetotal_score_without_modsfor those scores, and instead the score conversion algorithm can be used to recalculate the multipliers instead.total_score == total_score_without_mods- so there is no point writing out values.Adjust backpopulation command to automatically handle change to mania key mod multipliers
Migrating to the new multiplier API, which already has to handle multiplier versioning and the directly-applied change to mania key mod multipliers, means that it is now possible to leverage the versioned logic and automatically correct for the key mod multiplier change by filling out
ScoreInfo.ClientVersionusingosu_builds.Adjust mod multiplier recalculation command to work with the minimum necessary number of rows with
total_score_without_modsAdds an alternate processing path for stable scores which leverages the algorithm of converting from legacy total score to implement the multiplier recalculation.
The command was already skipping scores with no mods at all, so no adjustment required there.
Add test coverage of commands
Written using real-world cases pulled from live production scores (via data.ppy.sh dumps plus some manual trawling for the mania key mod cases).
In the current state the most useful tests are the ones checking correctness of the mania key mod recalculation. Everything else will need updating once the new proposed values are out.