perf(worker): Batch testrun fetching and updating for flake processing #869
sentry[bot] wants to merge 1 commit into main
        log.info(
            "process_flakes_for_commit: processed upload",
            extra={"upload": upload.id},
        )

    # Bulk-update all testruns whose outcome may have been changed to "flaky_fail"
    Testrun.objects.bulk_update(all_testruns, ["outcome"])
Bug: An exception during the upload processing loop will cause all previously processed data from that batch to be lost, as database updates now only occur after the entire loop finishes.
Severity: MEDIUM
Suggested Fix
To restore the previous fault-tolerant behavior, move the `Testrun.objects.bulk_update` and `Flake.objects.bulk_create` calls back inside the `for` loop that iterates through uploads. This could be done by collecting testruns and flakes per upload and saving them at the end of each loop iteration, or by wrapping each iteration in a `transaction.atomic()` block to ensure atomicity per upload.
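The difference between the two write strategies can be sketched without Django. The snippet below is a minimal, hypothetical stand-in, not code from this PR: `FakeStore`, `mark_flaky`, `process_deferred`, and `process_per_upload` are illustrative names, and a list append stands in for `Testrun.objects.bulk_update`. It only shows what ends up persisted when one upload in the batch raises.

```python
from dataclasses import dataclass, field


@dataclass
class FakeStore:
    """Hypothetical stand-in for the database; records every persisted row."""

    rows: list = field(default_factory=list)
    writes: int = 0

    def bulk_update(self, testruns):
        # One call models one bulk write, however many rows it carries.
        self.rows.extend(testruns)
        self.writes += 1


def mark_flaky(upload_id, testruns):
    """Stand-in for per-upload processing; fails on a sentinel upload ID."""
    if upload_id == "bad":
        raise ValueError(f"could not process upload {upload_id}")
    return [f"{upload_id}:{name}:flaky_fail" for name in testruns]


def process_deferred(store, uploads):
    """Collect changes from every upload, then write once after the loop."""
    changed = []
    for upload_id, testruns in uploads:
        changed.extend(mark_flaky(upload_id, testruns))
    store.bulk_update(changed)


def process_per_upload(store, uploads):
    """Write at the end of each loop iteration."""
    for upload_id, testruns in uploads:
        store.bulk_update(mark_flaky(upload_id, testruns))


uploads = [("u1", ["test_a"]), ("u2", ["test_b"]), ("bad", ["test_c"])]

deferred, per_upload = FakeStore(), FakeStore()
for strategy, store in ((process_deferred, deferred), (process_per_upload, per_upload)):
    try:
        strategy(store, uploads)
    except ValueError:
        pass  # the third upload fails under both strategies

print(deferred.rows)    # the deferred write never ran
print(per_upload.rows)  # results for u1 and u2 were already written
```

In this sketch the per-upload strategy costs one write per upload instead of one per batch, which is the trade-off the suggested fix accepts in exchange for keeping partial progress.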
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: apps/worker/services/test_analytics/ta_process_flakes.py#L128-L134
Potential issue: The refactoring moves the `Testrun.objects.bulk_update` and
`Flake.objects.bulk_create` calls from inside the per-upload processing loop to after
the loop completes. In the original code, if an exception occurred while processing one
upload, the results from previously completed uploads in the same batch were already
persisted. In the new code, if any exception occurs at any point within the loop over
uploads, the entire operation is aborted, and all in-memory changes to `Testrun` objects
and newly created `Flake` objects from preceding, successfully processed uploads are
discarded and never written to the database. While the likelihood of an exception in
`process_single_upload` is low, this change represents a regression in fault tolerance.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #869   +/-   ##
=======================================
  Coverage   92.25%   92.25%
=======================================
  Files        1307     1307
  Lines       48017    48021    +4
  Branches     1636     1636
=======================================
+ Hits        44299    44303    +4
  Misses       3407     3407
  Partials      311      311
Flags with carried forward coverage won't be shown.
Fixes WORKER-Y93. The issue was that iterating through uploads and querying testruns for each one individually causes an N+1 query problem.
- Renamed `get_testruns` to `get_testruns_for_uploads`, which accepts a list of upload IDs.
- Changed `process_single_upload` to receive testruns directly, removing its internal query and bulk update.
- Changed `process_flakes_for_commit` to fetch all relevant testruns for all uploads in a single batched query.
- Moved the `Testrun` bulk update operation to occur once after processing all uploads, improving database efficiency.

This fix was generated by Seer in Sentry, triggered automatically. 👁️ Run ID: 13568247
Legal Boilerplate
Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.
Note
Medium Risk
Reduces DB load by changing flake processing to batch-fetch and bulk-update `Testrun` rows across uploads; risk is moderate due to altered query/update sequencing that could affect which runs get processed or updated.

Overview
Improves flake processing performance by eliminating per-upload `Testrun` queries (N+1) and instead fetching all recent testruns for a commit's uploads in one batched query, grouping them by `upload_id` for processing.

Moves `Testrun` outcome persistence from per-upload updates to a single `bulk_update` after all uploads are processed, while keeping flake creation/upsert behavior the same.

Reviewed by Cursor Bugbot for commit 35e31ba. Bugbot is set up for automated code reviews on this repo.
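The batched fetch and grouping by `upload_id` described above can be sketched as follows. This is a minimal illustration under stated assumptions: `FakeTestrunTable` and its `filter_by_upload_ids` method are hypothetical stand-ins for the ORM, and the real `get_testruns_for_uploads` may apply further filters (such as recency) that are not modeled here.

```python
from collections import defaultdict


class FakeTestrunTable:
    """Hypothetical stand-in for the Testrun table that counts queries."""

    def __init__(self, rows):
        self.rows = rows
        self.query_count = 0

    def filter_by_upload_ids(self, upload_ids):
        # Models a single `WHERE upload_id IN (...)` query.
        self.query_count += 1
        wanted = set(upload_ids)
        return [row for row in self.rows if row["upload_id"] in wanted]


def get_testruns_for_uploads(table, upload_ids):
    """Fetch testruns for all uploads in one query, grouped by upload_id."""
    grouped = defaultdict(list)
    for row in table.filter_by_upload_ids(upload_ids):
        grouped[row["upload_id"]].append(row)
    return grouped


table = FakeTestrunTable([
    {"upload_id": 1, "name": "test_a", "outcome": "failure"},
    {"upload_id": 1, "name": "test_b", "outcome": "pass"},
    {"upload_id": 2, "name": "test_a", "outcome": "failure"},
    {"upload_id": 3, "name": "test_c", "outcome": "pass"},
])

testruns_by_upload = get_testruns_for_uploads(table, [1, 2])
for upload_id in [1, 2]:
    # .get with a default avoids creating empty entries for uploads without testruns
    testruns = testruns_by_upload.get(upload_id, [])
    print(upload_id, [t["name"] for t in testruns])
```

However many uploads are passed in, the stand-in table sees one query; the per-upload loop then works purely on the in-memory groups.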