Skip to content

fix(prover-client): drop late stale-epoch reports + race-proof epoch teardown#23348

Draft
AztecBot wants to merge 1 commit into
merge-train/spartanfrom
claudebox/fix-merge-train-spartan-23344
Draft

fix(prover-client): drop late stale-epoch reports + race-proof epoch teardown#23348
AztecBot wants to merge 1 commit into
merge-train/spartanfrom
claudebox/fix-merge-train-spartan-23344

Conversation

@AztecBot
Copy link
Copy Markdown
Collaborator

Summary

Re-fix the flaky ProvingBroker > Retries > does not retry if job is stale (persisted variant) that caused PR #23344 to be dequeued from the merge queue (failing run, CI log key 64a972aafaa40dd0).

.test_patterns.yml previously suppressed this exact failure; #23260 removed the suppression after cleaning up in-memory state, but didn't address the underlying race between the broker's epoch cleanup and the LMDB-backed batched writes, so the flake recurred. The failure surfaces as Error('Store is closed') thrown out of BatchQueue.execProcessor → KVBrokerDatabase.commitWrites → SingleEpochDatabase.batchWrite → store.transactionAsync.

Fix

Two small, independent defenses:

  1. proving_broker.ts — in both #reportProvingJobError and #reportProvingJobSuccess, after the existing !item early-return, also drop the report when this.isJobStale(item). A late report for a job whose epoch has been (or is being) cleaned up never reaches the database.
  2. proving_broker_database/persisted.ts:
    • Track deletedEpochs: Set<number> so commitWrites short-circuits for epoch directories that are being torn down.
    • deleteAllProvingJobsOlderThanEpoch adds to deletedEpochs first, then removes from the epochs map before awaiting db.delete() (which synchronously marks the store closed and then awaits rm). This closes the window where a racing commitWrites could fetch a still-mapped, already-closed store.
    • Catch a residual Error('Store is closed') inside commitWrites as a benign drop. Any other error still propagates.

Verification

yarn workspace @aztec/prover-client test src/proving_broker/proving_broker.test.ts — 5 consecutive runs, 102/102 tests passing on each, including both in-memory and persisted variants of does not retry if job is stale.

Full analysis (failure timeline, the two races, why both defenses are needed): https://gist.github.com/AztecBot/612364332ec5a062bcf88ded9177334e

Test plan

  • Persisted-variant does not retry if job is stale passes deterministically in CI under merge-queue-heavy concurrency
  • No regressions in the rest of proving_broker.test.ts (retries, timeouts, database management blocks) — verified locally
  • Merge-queue run on this PR completes cleanly

Created by claudebox — group: slackbot

ClaudeBox log: https://claudebox.work/s/731f628c79243769?run=1

…teardown

Re-fix the flaky 'ProvingBroker > Retries > does not retry if job is stale'
test (persisted variant) that dequeued PR #23344 from the merge queue. The
failure surfaces as Error('Store is closed') from BatchQueue.execProcessor
-> KVBrokerDatabase.commitWrites -> SingleEpochDatabase.batchWrite ->
store.transactionAsync, triggered by a race between the broker's epoch
cleanupPass and the persisted DB's in-flight batched writes.

Two small, independent defenses:

- proving_broker.ts: in #reportProvingJobError and #reportProvingJobSuccess,
  after the existing !item early-return, also drop the report when
  isJobStale(item). A late report for a job whose epoch has been (or is
  being) cleaned up never reaches the database.

- proving_broker_database/persisted.ts: track deletedEpochs so commitWrites
  short-circuits for torn-down epoch directories; reorder
  deleteAllProvingJobsOlderThanEpoch to remove from the epochs map before
  await db.delete(); catch a residual 'Store is closed' inside commitWrites
  as a benign drop.

Verified locally: 5 consecutive runs of proving_broker.test.ts pass, 102/102
tests each, including both in-memory and persisted variants of the previously
flaky 'does not retry if job is stale'.
@AztecBot AztecBot added ci-draft Run CI on draft PRs. claudebox Owned by claudebox. it can push to this PR. labels May 16, 2026
@spalladino spalladino requested a review from PhilWindle May 17, 2026 14:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-draft Run CI on draft PRs. claudebox Owned by claudebox. it can push to this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant