perf(worker): Bulk fetch and update testruns in flake processing#856

Open
sentry[bot] wants to merge 1 commit into main from seer/perf/ta-bulk-testruns

Conversation


@sentry bot commented Apr 18, 2026

Fixes WORKER-Y8P. An N+1 query in process_flakes_for_commit, which fetched Testrun objects individually for each ReportSession, was causing task timeouts.

  • Modified get_testruns to accept a list of upload_ids to fetch testruns for multiple uploads in a single query.
  • Updated process_single_upload to receive a pre-filtered list of Testrun objects directly.
  • In process_flakes_for_commit, all relevant testruns for a commit's uploads are now fetched in one go.
  • Testruns are grouped by upload ID before being passed to process_single_upload.
  • All Testrun outcome updates are now performed in a single bulk_update call at the end of process_flakes_for_commit, reducing database writes.
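The fetch-once-then-group step described above can be sketched in plain Python. This is a minimal illustration, not the worker's actual code: the dict-based rows and the upload_id/outcome field names are assumptions standing in for the Django Testrun model.

```python
from collections import defaultdict

def group_testruns_by_upload(testruns):
    """Group pre-fetched testrun rows by upload_id so each upload is
    processed from the in-memory list instead of issuing one query
    per ReportSession (the N+1 pattern this PR removes)."""
    grouped = defaultdict(list)
    for tr in testruns:
        grouped[tr["upload_id"]].append(tr)
    return grouped

# Rows as they might come back from the single bulk query.
testruns = [
    {"id": 1, "upload_id": 10, "outcome": "pass"},
    {"id": 2, "upload_id": 10, "outcome": "failure"},
    {"id": 3, "upload_id": 11, "outcome": "failure"},
]
grouped = group_testruns_by_upload(testruns)
print(sorted(grouped))   # [10, 11]
print(len(grouped[10]))  # 2
```

Each `grouped[upload_id]` list would then be handed to process_single_upload, so no further Testrun queries are needed inside the per-upload loop.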

This fix was generated by Seer in Sentry, triggered automatically. 👁️ Run ID: 13523599


Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov, and as a result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.


Note

Medium Risk
Changes query and update strategy for Testrun records during flake processing; while intended as a performance optimization, it could affect which testruns are processed/updated and when outcomes are persisted.

Overview
Speeds up flake processing by eliminating an N+1 query pattern in process_flakes_for_commit.

Instead of fetching testruns per ReportSession, the worker now fetches all recent Testruns for the commit’s upload IDs in a single query, groups them by upload_id, and processes each upload from the pre-fetched list.

Testrun outcome writes are deferred and consolidated into a single bulk_update at the end of commit processing.
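The payoff of deferring writes can be shown with a toy persistence layer that counts database round-trips. The CountingStore class and its methods are invented for this sketch; in the real code the single write is Django's Testrun.objects.bulk_update.

```python
class CountingStore:
    """Toy stand-in for the ORM that counts database round-trips."""

    def __init__(self):
        self.round_trips = 0

    def save(self, obj):
        # Old pattern: one write per testrun.
        self.round_trips += 1

    def bulk_update(self, objs, fields):
        # New pattern: one write for the whole batch.
        self.round_trips += 1

testruns = [{"id": i, "outcome": "failure"} for i in range(100)]

per_row = CountingStore()
for tr in testruns:
    tr["outcome"] = "flaky_fail"
    per_row.save(tr)

bulk = CountingStore()
for tr in testruns:
    tr["outcome"] = "flaky_fail"  # mutate in memory only
bulk.bulk_update(testruns, ["outcome"])  # persist once at the end

print(per_row.round_trips, bulk.round_trips)  # 100 1
```

Same end state in both cases, but the deferred version touches the database once per commit instead of once per testrun.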

Reviewed by Cursor Bugbot for commit 139138e. Bugbot is set up for automated code reviews on this repo.

Comment on lines +147 to +148
# Bulk update all testrun outcomes in a single query
Testrun.objects.bulk_update(all_testruns, ["outcome"])

Bug: The bulk_create for Flake and bulk_update for Testrun are not in an atomic transaction, risking data inconsistency if the process fails between them.
Severity: CRITICAL

Suggested Fix

Wrap the Flake.objects.bulk_create and Testrun.objects.bulk_update calls within a single atomic transaction by adding the @transaction.atomic() decorator to the process_flakes_for_commit function. This ensures that both database operations either complete successfully together or are both rolled back in case of a failure.
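The all-or-nothing behavior the suggested fix asks for (Django's transaction.atomic around both bulk calls) can be illustrated with stdlib sqlite3, since the worker's models aren't available here. The table and column names are invented for the sketch; only the rollback semantics carry over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flake (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE testrun (id INTEGER PRIMARY KEY, outcome TEXT)")
conn.execute("INSERT INTO testrun (id, outcome) VALUES (1, 'failure')")
conn.commit()

# Analogue of bulk_create + bulk_update inside one atomic block:
# if the process dies between the two writes, the first is rolled
# back too, so the tables never drift apart.
try:
    with conn:  # sqlite3 connection as context manager == one transaction
        conn.execute("INSERT INTO flake (id) VALUES (1)")
        raise RuntimeError("worker killed between the two writes")
except RuntimeError:
    pass

# The flake insert was rolled back along with the failed block.
print(conn.execute("SELECT COUNT(*) FROM flake").fetchone()[0])  # 0
```

With `@transaction.atomic` on process_flakes_for_commit, a crash between bulk_create and bulk_update would similarly leave neither write applied, avoiding the flake/testrun inconsistency described above.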

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: apps/worker/services/test_analytics/ta_process_flakes.py#L147-L148

Potential issue: The function `process_flakes_for_commit` performs a
`Flake.objects.bulk_create` followed by a `Testrun.objects.bulk_update` without wrapping
them in a database transaction. If the worker process terminates between these two
operations, for instance due to a timeout or out-of-memory error, the `Flake` records
will be updated, but the `Testrun` outcomes will not be changed to "flaky_fail". This
results in a permanent data inconsistency between the two tables. Furthermore,
re-running the process after such a failure would cause data corruption by
double-counting flakes, as the logic is not idempotent.

Did we get this right? 👍 / 👎 to inform future reviews.


sentry Bot commented Apr 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.25%. Comparing base (0ad8a0c) to head (139138e).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #856   +/-   ##
=======================================
  Coverage   92.25%   92.25%           
=======================================
  Files        1307     1307           
  Lines       48017    48021    +4     
  Branches     1636     1636           
=======================================
+ Hits        44299    44303    +4     
  Misses       3407     3407           
  Partials      311      311           
Flag               Coverage          Δ
workerintegration  58.53% <10.00%>   (-0.02%) ⬇️
workerunit         90.39% <100.00%>  (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

