roachtest: requeue preempted benchmarks for one non-spot retry#170652
roachtest: requeue preempted benchmarks for one non-spot retry#170652dt wants to merge 1 commit into
Conversation
|
Merging to
After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here |
There was a problem hiding this comment.
Hey @dt thanks for opening this, if the goal is to be more resilient when spot clusters preempt but still get the cost benefit of attempting to use spot VMs then I think this change makes sense.
I do have a request though. I don't think the work.requeueForPreemptionRetry() should be in testRunner.runTest() because we're breaking an abstraction barrier i.e. runTest being solely concerned about test execution. After thinking about it, I think testRunner.runWorker() would be a better spot (1 call up). Seems like a more natural spot to add retry scheduling, maybe something like
r.runTest(...)
// Check if t.Failed() and flag
if shouldRetryBenchmarkPreemption(t, testToRun.spec, roachtestflags.RetryBenchOnNonSpot) {
retrySpec := testToRun.spec
retrySpec.Cluster.UseSpotVMs = false
work.requeuePreemptionRetry(ctx, workerL, retrySpec)
}I would also like to gate this behind a feature flag, maybe something like, which we can set to True on nightlies, but default to false.
RetryBenchOnNonSpot bool
_ = registerRunFlag(&RetryBenchOnNonSpot, FlagInfo{
Name: "retry-bench-on-non-spot",
Usage: `
Retry a Benchmark: true test once on non-spot VMs if its spot VM run
fails due to VM preemption. Intended to preserve nightly benchmark
datapoints without changing non-benchmark retry behavior.`,
})There may be some side effects in testRunner.runTest() and testRunner.runWorker() that need to be handled because roachtest currently doesn't have a retry on generic failure cond feature. If the above sounds sane, I will help take a look into any possible side effects that need to be handled. Also if there's a new feature flag there's some plumbing to do on the CI side, but I can also help out with that.
Also codex pointed out the following that I haven't looked at yet
One caveat worth calling out: the PR’s current requeue approach assumes --count=1. If --count > 1, adding a duplicate work item with the same test name can interact badly with decTestLocked, which identifies work by name.
Feel free to push back on any of this.
A benchmark's value is the per-nightly time-series data point it produces. When a benchmark runs on a spot VM and the VM is preempted, the runner today marks the failure as a flake (routed to test-eng, no GitHub issue) and moves on: --count=1 means no in-run retry, and the test selector only forces re-selection on the *next* nightly. The data point is silently lost. Requeue a benchmark exactly once when a preemption is detected post-test. The retry is queued with Cluster.UseSpotVMs=false, which also serves as the loop-breaker: only spot runs can preempt, so the retry can't itself qualify for another retry. Non-benchmark tests are unaffected. Release note: None Epic: none
|
Thanks Will — agree on (1) and (3); pushed those. For (1), For (3), the requeue is gated on For (2) I am not quite seeing the need for a flag so far: we're pretty narrowly targeting just benchmarks that happen to hit spot preemptions already; if a nightly (or human engineer) is running a benchmark to get a measurement, is there some reason not to try to get them that measurement, even if an initial attempt hits a preemption? |
|
Drive by comments:
|
I think that's fine for non-benchmarks, but the benchmarks are supposed to be providing steady signal over time and help us spot when something changed. They can already be a little noisy and having missing days makes that worse when trying to see if there's a single step change or series of smaller steps over time or something.
We'd just retry any given test once (since the second try won't use spot so can't also be retried), right? Since only tests marked as a benchmark can get retried and can be retried at most once it seems like it shouldn't add much to the overall run time? Alternatively we could pull benchmark tests into a separate run if we need to but I'd rather leave that to its own change. |
More of a personal preference to describe behavior, but not a hard requirement. What you and Darryl said makes to me so I'm ok without it.
I agree with that concern. I'll add a note about this change in our test selection strategy issue #169712 so this behavior change isn't overlooked during that work |
I assumed Darryl was talking about retries in the general case (because I mentioned that in my initial comment). But regardless it is a valid concern as we've been struggling with timeouts recently. But we also need accurate benchmarks. |
I had a similar idea https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1776979960500769?thread_ts=1776977242.525669&cid=C023S0V4YEB I think it's worth considering, I'll add a note to the test selection issue if it's not already there as an option |
williamchoe3
left a comment
There was a problem hiding this comment.
Thanks for making the changes! I'll capture the conversation points in the relevant issues.
|
hey @dt just for my understanding
Given the current preemption retry strategy for the next day, did you notice significant days inbetween runs? Because if so I just realized this might be more of an issue with selection. Or is it important that specific days produce results in which case an individual preemption would be unideal. |
Agreed but I'd point out that due to test selection, (1/3 chance for a test to be chosen every nightly) its already possible for a test to get unlucky and be skipped for an entire week. Given that we have 700+ tests, its actually pretty likely to have a handful of tests to go the entire week without a test run. If the pain point is that we run benchmarks on an inconsistent schedule, then it seems like we need to teach test selection to have a different heuristic than just random draw. If the pain point is that we can go long periods without a benchmark run, then it seems like the issue is not the run next day on non spot vm pattern adding an extra day, but rather that test selection can go so long without selecting a test in the first place. e.g. we need to tighten the run every 7 day requirement to say 3 for benchmarks That being said I think fixing either issue would be a much larger scope than your changes described here. I think my comments are more geared towards aspirational long term fixes, rather than blocking requests on your PR. |
Indeed, I actually thought / had heard (incorrectly) that setting I still think this PR makes sense -- if we set out to get a run from a benchmark we should go ahead and get it and not let the fact spot instances might get killed change that -- but seems like I likely need to follow-up as well to actually get what I want, which is reliable, routine runs of benchmarks to maintain the reporting data they're intended to produce. |
A benchmark's value is the per-nightly time-series data point it produces. When a benchmark runs on a spot VM and the VM is preempted, the runner today marks the failure as a flake (routed to test-eng, no GitHub issue) and moves on: --count=1 means no in-run retry, and the test selector only forces re-selection on the next nightly. The data point is silently lost.
Requeue a benchmark exactly once when a preemption is detected post-test. The retry is queued with Cluster.UseSpotVMs=false, which also serves as the loop-breaker: only spot runs can preempt, so the retry can't itself qualify for another retry. Non-benchmark tests are unaffected.
Release note: None
Epic: none