Skip to content

roachtest: requeue preempted benchmarks for one non-spot retry#170652

Open
dt wants to merge 1 commit into
cockroachdb:masterfrom
dt:roachtest-retry-bench
Open

roachtest: requeue preempted benchmarks for one non-spot retry#170652
dt wants to merge 1 commit into
cockroachdb:masterfrom
dt:roachtest-retry-bench

Conversation

@dt
Copy link
Copy Markdown
Contributor

@dt dt commented May 20, 2026

A benchmark's value is the per-nightly time-series data point it produces. When a benchmark runs on a spot VM and the VM is preempted, the runner today marks the failure as a flake (routed to test-eng, no GitHub issue) and moves on: --count=1 means no in-run retry, and the test selector only forces re-selection on the next nightly. The data point is silently lost.

Requeue a benchmark exactly once when a preemption is detected post-test. The retry is queued with Cluster.UseSpotVMs=false, which also serves as the loop-breaker: only spot runs can preempt, so the retry can't itself qualify for another retry. Non-benchmark tests are unaffected.

Release note: None
Epic: none

@dt dt requested a review from msbutler May 20, 2026 15:16
@dt dt requested a review from a team as a code owner May 20, 2026 15:16
@dt dt requested review from srosenberg and williamchoe3 and removed request for a team May 20, 2026 15:16
@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io Bot commented May 20, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@williamchoe3 williamchoe3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @dt thanks for opening this, if the goal is to be more resilient when spot clusters preempt but still get the cost benefit of attempting to use spot VMs then I think this change makes sense.

I do have a request though. I don't think the work.requeueForPreemptionRetry() should be in testRunner.runTest() because we're breaking an abstraction barrier i.e. runTest being solely concerned about test execution. After thinking about it, I think testRunner.runWorker() would be a better spot (1 call up). Seems like a more natural spot to add retry scheduling, maybe something like

r.runTest(...)
// Check if t.Failed() and flag
if shouldRetryBenchmarkPreemption(t, testToRun.spec, roachtestflags.RetryBenchOnNonSpot) {
    retrySpec := testToRun.spec
    retrySpec.Cluster.UseSpotVMs = false
    work.requeuePreemptionRetry(ctx, workerL, retrySpec)
}

I would also like to gate this behind a feature flag, maybe something like, which we can set to True on nightlies, but default to false.

RetryBenchOnNonSpot bool
_                       = registerRunFlag(&RetryBenchOnNonSpot, FlagInfo{
	Name: "retry-bench-on-non-spot",
	Usage: `
		Retry a Benchmark: true test once on non-spot VMs if its spot VM run
		fails due to VM preemption. Intended to preserve nightly benchmark
		datapoints without changing non-benchmark retry behavior.`,
})

There may be some side effects in testRunner.runTest() and testRunner.runWorker() that need to be handled because roachtest currently doesn't have a retry on generic failure cond feature. If the above sounds sane, I will help take a look into any possible side effects that need to be handled. Also if there's a new feature flag there's some plumbing to do on the CI side, but I can also help out with that.

Also codex pointed out the following that I haven't looked at yet

One caveat worth calling out: the PR’s current requeue approach assumes --count=1. If --count > 1, adding a duplicate work item with the same test name can interact badly with decTestLocked, which identifies work by name.

Feel free to push back on any of this.

A benchmark's value is the per-nightly time-series data point it
produces. When a benchmark runs on a spot VM and the VM is preempted,
the runner today marks the failure as a flake (routed to test-eng,
no GitHub issue) and moves on: --count=1 means no in-run retry, and
the test selector only forces re-selection on the *next* nightly. The
data point is silently lost.

Requeue a benchmark exactly once when a preemption is detected
post-test. The retry is queued with Cluster.UseSpotVMs=false, which
also serves as the loop-breaker: only spot runs can preempt, so the
retry can't itself qualify for another retry. Non-benchmark tests
are unaffected.

Release note: None
Epic: none
@dt dt force-pushed the roachtest-retry-bench branch from 47adca8 to 155a03d Compare May 20, 2026 17:01
@dt
Copy link
Copy Markdown
Contributor Author

dt commented May 20, 2026

Thanks Will — agree on (1) and (3); pushed those.

For (1), runTest no longer takes a *workPool. It now sets t.markPreempted() when the post-failure check attributes the failure to preemption, and the requeue lives in runWorker after runTest returns.

For (3), the requeue is gated on testToRun.runCount == 1. You're right that decTestLocked keys by name, so duplicating an entry while the original still has remaining runs would corrupt the per-run accounting. The requeue is targeted at the nightly's one-shot data point (where the runner uses --count=1); when a dev runs with --count > 1 they're already gathering N samples and don't need the spot-preemption do-over. Skipping there matches intent rather than being a workaround.

For (2) I am not quite seeing the need for a flag so far: we're pretty narrowly targeting just benchmarks that happen to hit spot preemptions already; if a nightly (or human engineer) is running a benchmark to get a measurement, is there some reason not to try to get them that measurement, even if an initial attempt hits a preemption?

@DarrylWong
Copy link
Copy Markdown
Contributor

DarrylWong commented May 20, 2026

Drive by comments:

  1. Worth noting that if a test is preempted, it is rerun the next day and guaranteed to be on a non preemptible VM.

    !tests[i].IsLastFailurePreempt() &&

  2. Agree on the flag being unnecessary. Spot VMs are already opt in only (set to true for nightlies), if you run a test locally it never uses them.

  3. The argument against allowing retries (in my mind) is that too many causes us to hit the 24 hr team city limit. Preemptions tend to come in batches where its not uncommon to see <10 tests preempted one day and over 40 the next. Ofc, defer to test eng for what they think about this.

@dt
Copy link
Copy Markdown
Contributor Author

dt commented May 20, 2026

if a test is preempted, it is rerun the next day

I think that's fine for non-benchmarks, but the benchmarks are supposed to be providing steady signal over time and help us spot when something changed. They can already be a little noisy and having missing days makes that worse when trying to see if there's a single step change or series of smaller steps over time or something.

retries ... too many

We'd just retry any given test once (since the second try won't use spot so can't also be retried), right? Since only tests marked as a benchmark can get retried and can be retried at most once it seems like it shouldn't add much to the overall run time?

Alternatively we could pull benchmark tests into a separate run if we need to but I'd rather leave that to its own change.

@williamchoe3
Copy link
Copy Markdown
Contributor

For (2) I am not quite seeing the need for a flag

More of a personal preference to describe behavior, but not a hard requirement. What you and Darryl said makes to me so I'm ok without it.

The argument against allowing retries (in my mind) is that too many causes us to hit the 24 hr team city limit.

I agree with that concern. I'll add a note about this change in our test selection strategy issue #169712 so this behavior change isn't overlooked during that work

@williamchoe3
Copy link
Copy Markdown
Contributor

We'd just retry any given test once

I assumed Darryl was talking about retries in the general case (because I mentioned that in my initial comment). But regardless it is a valid concern as we've been struggling with timeouts recently. But we also need accurate benchmarks.

@williamchoe3
Copy link
Copy Markdown
Contributor

Alternatively we could pull benchmark tests into a separate run if we need to but I'd rather leave that to its own change.

I had a similar idea https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1776979960500769?thread_ts=1776977242.525669&cid=C023S0V4YEB I think it's worth considering, I'll add a note to the test selection issue if it's not already there as an option

Copy link
Copy Markdown
Contributor

@williamchoe3 williamchoe3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the changes! I'll capture the conversation points in the relevant issues.

@williamchoe3
Copy link
Copy Markdown
Contributor

williamchoe3 commented May 20, 2026

hey @dt just for my understanding

Worth noting that if a test is preempted, it is rerun the next day and guaranteed to be on a non preemptible VM.

Given the current preemption retry strategy for the next day, did you notice significant days inbetween runs? Because if so I just realized this might be more of an issue with selection.

Or is it important that specific days produce results in which case an individual preemption would be unideal.

@DarrylWong
Copy link
Copy Markdown
Contributor

but the benchmarks are supposed to be providing steady signal over time and help us spot when something changed

Agreed but I'd point out that due to test selection, (1/3 chance for a test to be chosen every nightly) its already possible for a test to get unlucky and be skipped for an entire week. Given that we have 700+ tests, its actually pretty likely to have a handful of tests to go the entire week without a test run.

If the pain point is that we run benchmarks on an inconsistent schedule, then it seems like we need to teach test selection to have a different heuristic than just random draw.

If the pain point is that we can go long periods without a benchmark run, then it seems like the issue is not the run next day on non spot vm pattern adding an extra day, but rather that test selection can go so long without selecting a test in the first place. e.g. we need to tighten the run every 7 day requirement to say 3 for benchmarks

That being said I think fixing either issue would be a much larger scope than your changes described here. I think my comments are more geared towards aspirational long term fixes, rather than blocking requests on your PR.

@dt
Copy link
Copy Markdown
Contributor Author

dt commented May 20, 2026

If the pain point is that we run benchmarks on an inconsistent schedule, then it seems like we need to teach test selection to have a different heuristic than just random draw.

Indeed, I actually thought / had heard (incorrectly) that setting Benchmark on a test already forced it into daily selection to ensure consistent datapoints. I see now this isn't currently the case; this seems like something that likely regressed when we added test selection heuristics to roachtest and stopped running all the tests -- including the benchmarks we were relying on to emit data points -- every night.

I still think this PR makes sense -- if we set out to get a run from a benchmark we should go ahead and get it and not let the fact spot instances might get killed change that -- but seems like I likely need to follow-up as well to actually get what I want, which is reliable, routine runs of benchmarks to maintain the reporting data they're intended to produce.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants