perf(worker): Bulk fetch and update testruns in flake processing#856
perf(worker): Bulk fetch and update testruns in flake processing#856sentry[bot] wants to merge 1 commit intomainfrom
Conversation
| # Bulk update all testrun outcomes in a single query | ||
| Testrun.objects.bulk_update(all_testruns, ["outcome"]) |
There was a problem hiding this comment.
Bug: The bulk_create for Flake and bulk_update for Testrun are not in an atomic transaction, risking data inconsistency if the process fails between them.
Severity: CRITICAL
Suggested Fix
Wrap the Flake.objects.bulk_create and Testrun.objects.bulk_update calls within a single atomic transaction by adding the @transaction.atomic() decorator to the process_flakes_for_commit function. This ensures that both database operations either complete successfully together or are both rolled back in case of a failure.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: apps/worker/services/test_analytics/ta_process_flakes.py#L147-L148
Potential issue: The function `process_flakes_for_commit` performs a
`Flake.objects.bulk_create` followed by a `Testrun.objects.bulk_update` without wrapping
them in a database transaction. If the worker process terminates between these two
operations, for instance due to a timeout or out-of-memory error, the `Flake` records
will be updated, but the `Testrun` outcomes will not be changed to "flaky_fail". This
results in a permanent data inconsistency between the two tables. Furthermore,
re-running the process after such a failure would cause data corruption by
double-counting flakes, as the logic is not idempotent.
Did we get this right? 👍 / 👎 to inform future reviews.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #856 +/- ##
=======================================
Coverage 92.25% 92.25%
=======================================
Files 1307 1307
Lines 48017 48021 +4
Branches 1636 1636
=======================================
+ Hits 44299 44303 +4
Misses 3407 3407
Partials 311 311
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Fixes WORKER-Y8P. The issue was that: N+1 query in
process_flakes_for_commitfetchingTestrunobjects individually for eachReportSessioncauses task timeout.get_testrunsto accept a list ofupload_idsto fetch testruns for multiple uploads in a single query.process_single_uploadto receive a pre-filtered list ofTestrunobjects directly.process_flakes_for_commit, all relevant testruns for a commit's uploads are now fetched in one go.process_single_upload.Testrunoutcome updates are now performed in a singlebulk_updatecall at the end ofprocess_flakes_for_commit, reducing database writes.This fix was generated by Seer in Sentry, triggered automatically. 👁️ Run ID: 13523599
Not quite right? Click here to continue debugging with Seer.
Legal Boilerplate
Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.
Note
Medium Risk
Changes query and update strategy for
Testrunrecords during flake processing; while intended as a performance optimization, it could affect which testruns are processed/updated and when outcomes are persisted.Overview
Speeds up flake processing by eliminating an N+1 query pattern in
process_flakes_for_commit.Instead of fetching testruns per
ReportSession, the worker now fetches all recentTestruns for the commit’s upload IDs in a single query, groups them byupload_id, and processes each upload from the pre-fetched list.Testrunoutcome writes are deferred and consolidated into a singlebulk_updateat the end of commit processing.Reviewed by Cursor Bugbot for commit 139138e. Bugbot is set up for automated code reviews on this repo. Configure here.