roachtest: requeue preempted benchmarks for one non-spot retry by dt · Pull Request #170652 · cockroachdb/cockroach

dt · 2026-05-20T15:16:23Z

A benchmark's value is the per-nightly time-series data point it produces. When a benchmark runs on a spot VM and the VM is preempted, the runner today marks the failure as a flake (routed to test-eng, no GitHub issue) and moves on: --count=1 means no in-run retry, and the test selector only forces re-selection on the next nightly. The data point is silently lost.

Requeue a benchmark exactly once when a preemption is detected post-test. The retry is queued with Cluster.UseSpotVMs=false, which also serves as the loop-breaker: only spot runs can preempt, so the retry can't itself qualify for another retry. Non-benchmark tests are unaffected.

Release note: None
Epic: none

trunk-io · 2026-05-20T15:16:29Z

Merging to master in this repository is managed by Trunk.

To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

cockroach-teamcity · 2026-05-20T15:16:44Z

This change is

williamchoe3

Hey @dt thanks for opening this, if the goal is to be more resilient when spot clusters preempt but still get the cost benefit of attempting to use spot VMs then I think this change makes sense.

I do have a request though. I don't think the work.requeueForPreemptionRetry() should be in testRunner.runTest() because we're breaking an abstraction barrier i.e. runTest being solely concerned about test execution. After thinking about it, I think testRunner.runWorker() would be a better spot (1 call up). Seems like a more natural spot to add retry scheduling, maybe something like

r.runTest(...)
// Check if t.Failed() and flag
if shouldRetryBenchmarkPreemption(t, testToRun.spec, roachtestflags.RetryBenchOnNonSpot) {
    retrySpec := testToRun.spec
    retrySpec.Cluster.UseSpotVMs = false
    work.requeuePreemptionRetry(ctx, workerL, retrySpec)
}

I would also like to gate this behind a feature flag, maybe something like, which we can set to True on nightlies, but default to false.

RetryBenchOnNonSpot bool
_                       = registerRunFlag(&RetryBenchOnNonSpot, FlagInfo{
	Name: "retry-bench-on-non-spot",
	Usage: `
		Retry a Benchmark: true test once on non-spot VMs if its spot VM run
		fails due to VM preemption. Intended to preserve nightly benchmark
		datapoints without changing non-benchmark retry behavior.`,
})

There may be some side effects in testRunner.runTest() and testRunner.runWorker() that need to be handled because roachtest currently doesn't have a retry on generic failure cond feature. If the above sounds sane, I will help take a look into any possible side effects that need to be handled. Also if there's a new feature flag there's some plumbing to do on the CI side, but I can also help out with that.

Also codex pointed out the following that I haven't looked at yet

One caveat worth calling out: the PR’s current requeue approach assumes --count=1. If --count > 1, adding a duplicate work item with the same test name can interact badly with decTestLocked, which identifies work by name.

Feel free to push back on any of this.

A benchmark's value is the per-nightly time-series data point it produces. When a benchmark runs on a spot VM and the VM is preempted, the runner today marks the failure as a flake (routed to test-eng, no GitHub issue) and moves on: --count=1 means no in-run retry, and the test selector only forces re-selection on the *next* nightly. The data point is silently lost. Requeue a benchmark exactly once when a preemption is detected post-test. The retry is queued with Cluster.UseSpotVMs=false, which also serves as the loop-breaker: only spot runs can preempt, so the retry can't itself qualify for another retry. Non-benchmark tests are unaffected. Release note: None Epic: none

dt · 2026-05-20T17:10:54Z

Thanks Will — agree on (1) and (3); pushed those.

For (1), runTest no longer takes a *workPool. It now sets t.markPreempted() when the post-failure check attributes the failure to preemption, and the requeue lives in runWorker after runTest returns.

For (3), the requeue is gated on testToRun.runCount == 1. You're right that decTestLocked keys by name, so duplicating an entry while the original still has remaining runs would corrupt the per-run accounting. The requeue is targeted at the nightly's one-shot data point (where the runner uses --count=1); when a dev runs with --count > 1 they're already gathering N samples and don't need the spot-preemption do-over. Skipping there matches intent rather than being a workaround.

For (2) I am not quite seeing the need for a flag so far: we're pretty narrowly targeting just benchmarks that happen to hit spot preemptions already; if a nightly (or human engineer) is running a benchmark to get a measurement, is there some reason not to try to get them that measurement, even if an initial attempt hits a preemption?

DarrylWong · 2026-05-20T17:27:48Z

Drive by comments:

Worth noting that if a test is preempted, it is rerun the next day and guaranteed to be on a non preemptible VM.

cockroach/pkg/cmd/roachtest/test_runner.go

Line 341 in 9593896

!tests[i].IsLastFailurePreempt() &&
Agree on the flag being unnecessary. Spot VMs are already opt in only (set to true for nightlies), if you run a test locally it never uses them.
The argument against allowing retries (in my mind) is that too many causes us to hit the 24 hr team city limit. Preemptions tend to come in batches where its not uncommon to see <10 tests preempted one day and over 40 the next. Ofc, defer to test eng for what they think about this.

dt · 2026-05-20T17:48:50Z

if a test is preempted, it is rerun the next day

I think that's fine for non-benchmarks, but the benchmarks are supposed to be providing steady signal over time and help us spot when something changed. They can already be a little noisy and having missing days makes that worse when trying to see if there's a single step change or series of smaller steps over time or something.

retries ... too many

We'd just retry any given test once (since the second try won't use spot so can't also be retried), right? Since only tests marked as a benchmark can get retried and can be retried at most once it seems like it shouldn't add much to the overall run time?

Alternatively we could pull benchmark tests into a separate run if we need to but I'd rather leave that to its own change.

williamchoe3 · 2026-05-20T17:51:02Z

For (2) I am not quite seeing the need for a flag

More of a personal preference to describe behavior, but not a hard requirement. What you and Darryl said makes to me so I'm ok without it.

The argument against allowing retries (in my mind) is that too many causes us to hit the 24 hr team city limit.

I agree with that concern. I'll add a note about this change in our test selection strategy issue #169712 so this behavior change isn't overlooked during that work

williamchoe3 · 2026-05-20T17:53:09Z

We'd just retry any given test once

I assumed Darryl was talking about retries in the general case (because I mentioned that in my initial comment). But regardless it is a valid concern as we've been struggling with timeouts recently. But we also need accurate benchmarks.

williamchoe3 · 2026-05-20T17:55:13Z

Alternatively we could pull benchmark tests into a separate run if we need to but I'd rather leave that to its own change.

I had a similar idea https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1776979960500769?thread_ts=1776977242.525669&cid=C023S0V4YEB I think it's worth considering, I'll add a note to the test selection issue if it's not already there as an option

williamchoe3

Thanks for making the changes! I'll capture the conversation points in the relevant issues.

williamchoe3 · 2026-05-20T18:10:22Z

hey @dt just for my understanding

Worth noting that if a test is preempted, it is rerun the next day and guaranteed to be on a non preemptible VM.

Given the current preemption retry strategy for the next day, did you notice significant days inbetween runs? Because if so I just realized this might be more of an issue with selection.

Or is it important that specific days produce results in which case an individual preemption would be unideal.

DarrylWong · 2026-05-20T18:34:58Z

but the benchmarks are supposed to be providing steady signal over time and help us spot when something changed

Agreed but I'd point out that due to test selection, (1/3 chance for a test to be chosen every nightly) its already possible for a test to get unlucky and be skipped for an entire week. Given that we have 700+ tests, its actually pretty likely to have a handful of tests to go the entire week without a test run.

If the pain point is that we run benchmarks on an inconsistent schedule, then it seems like we need to teach test selection to have a different heuristic than just random draw.

If the pain point is that we can go long periods without a benchmark run, then it seems like the issue is not the run next day on non spot vm pattern adding an extra day, but rather that test selection can go so long without selecting a test in the first place. e.g. we need to tighten the run every 7 day requirement to say 3 for benchmarks

That being said I think fixing either issue would be a much larger scope than your changes described here. I think my comments are more geared towards aspirational long term fixes, rather than blocking requests on your PR.

dt · 2026-05-20T20:02:27Z

If the pain point is that we run benchmarks on an inconsistent schedule, then it seems like we need to teach test selection to have a different heuristic than just random draw.

Indeed, I actually thought / had heard (incorrectly) that setting Benchmark on a test already forced it into daily selection to ensure consistent datapoints. I see now this isn't currently the case; this seems like something that likely regressed when we added test selection heuristics to roachtest and stopped running all the tests -- including the benchmarks we were relying on to emit data points -- every night.

I still think this PR makes sense -- if we set out to get a run from a benchmark we should go ahead and get it and not let the fact spot instances might get killed change that -- but seems like I likely need to follow-up as well to actually get what I want, which is reliable, routine runs of benchmarks to maintain the reporting data they're intended to produce.

dt requested a review from msbutler May 20, 2026 15:16

dt requested a review from a team as a code owner May 20, 2026 15:16

dt requested review from srosenberg and williamchoe3 and removed request for a team May 20, 2026 15:16

williamchoe3 requested changes May 20, 2026

View reviewed changes

dt force-pushed the roachtest-retry-bench branch from 47adca8 to 155a03d Compare May 20, 2026 17:01

williamchoe3 approved these changes May 20, 2026

View reviewed changes

williamchoe3 mentioned this pull request May 20, 2026

roachtest: increase observability into test selection #168933

Open

williamchoe3 mentioned this pull request May 20, 2026

roachtest: replace Snowflake-based test selection with Go-side selection backed by a CRDB cluster #169712

Open

Conversation

dt commented May 20, 2026

Uh oh!

trunk-io Bot commented May 20, 2026

Uh oh!

cockroach-teamcity commented May 20, 2026

Uh oh!

williamchoe3 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dt commented May 20, 2026

Uh oh!

DarrylWong commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dt commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

williamchoe3 commented May 20, 2026

Uh oh!

williamchoe3 commented May 20, 2026

Uh oh!

williamchoe3 commented May 20, 2026

Uh oh!

williamchoe3 left a comment

Choose a reason for hiding this comment

Uh oh!

williamchoe3 commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DarrylWong commented May 20, 2026

Uh oh!

dt commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

williamchoe3 left a comment •

edited

Loading

DarrylWong commented May 20, 2026 •

edited

Loading

dt commented May 20, 2026 •

edited

Loading

williamchoe3 commented May 20, 2026 •

edited

Loading