test: make flaky deadline retry tests robust to CI speed#277
Conversation
The two deadline integration tests asserted a narrow retry-count range, but the count is wall-clock dependent: the backoff `not_before`, the deadline check in can_retry, and the SQL claim filter all use chrono::Utc::now(). On a slow/loaded CI runner each retry's DB + CPU overhead grows, so fewer retries fit in the fixed completion window; a recent run fit only ~8 and tripped the (9..12) lower bound. Re-anchor each assertion to the real invariant: a tight upper bound equal to the exponential-backoff ceiling (retries cannot run past the deadline) plus a generous lower bound that absorbs slow-runner slack. Also fixes the half-open (9..12) range that contradicted its own "9-12" comment. Test-only; no library behavior change. tokio paused-time was rejected -- the path uses chrono/SQL wall-clock, not tokio::time, and runs against real Postgres under #[sqlx::test]. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Loosens retry-count assertions in two deadline-related integration tests so they no longer flake under slow/loaded CI, by anchoring to the exponential-backoff ceiling (tight upper bound) and a generous lower bound, and fixes a half-open range that contradicted its own comment.
Changes:
test_deadline_aware_retry_stops_before_deadline:(7..=9)→(4..=8), with an updated explanatory comment.test_retry_stops_at_deadline_when_no_limits_set:(9..12)→(5..=11), fixing the half-open range vs. "9-12" comment mismatch.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Summary
This PR fixes flaky assertions in two deadline-aware retry integration tests by widening the acceptable retry-count ranges and correcting the underlying rationale. The changes are test-only and do not modify library behavior.
Verdict: Approved — the changes are correct, well-motivated, and address a real CI flakiness issue without masking bugs.
Research notes
I reviewed:
- The
can_retrylogic insrc/request/transitions.rs:272-333which calculatesnot_before = now + backoff_durationand denies retries whennot_before >= effective_deadline - The backoff calculation:
backoff_ms * (backoff_factor ^ retry_attempt), capped atmax_backoff_ms - Both test functions (
test_deadline_aware_retry_stops_before_deadlineandtest_retry_stops_at_deadline_when_no_limits_set) to verify the timeline math
Key findings from the code:
- Test 1 uses
backoff_ms=50, backoff_factor=2, max_backoff_ms=200, effective deadline = 2000ms - 500ms buffer = 1500ms - Test 2 uses the same backoff config but no buffer, so effective deadline = 2000ms
- With zero overhead, Test 1 can fit at most 8 retries (next would be at t=1550ms > 1500ms), Test 2 can fit at most 11 retries (next would be at t=2150ms > 2000ms)
Suggested next steps
No action required — this PR is ready to merge. Consider running the tests a few times on CI to confirm the flakiness is resolved.
General findings
None — the changes are narrowly scoped to test assertions and comments, and the rationale is sound.
The retry-count range assertions were wall-clock dependent: how many retries fit the fixed completion window depends on real DB/CPU cycle time, so any fixed range is CI-speed dependent (the original flake). Assert what the deadline logic actually guarantees, independent of speed: - buffered: stops before batch expiry (failed_at < batch_expires_at), exercising the 500ms stop_before_deadline_ms buffer; - no buffer: retries run up to the deadline (failed_at within one backoff step of batch_expires_at); - both: retried >= 1 time, one HTTP call per attempt, error reason set. Verified stable under single-core CPU starvation (8/8 runs). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Summary
This PR fixes flaky integration tests by replacing brittle retry-count assertions with semantic deadline invariants. The core insight is correct: retry counts are wall-clock dependent (how many DB/CPU cycles fit in the deadline window), so fixed ranges inevitably flake on slow CI.
The new assertions verify what the deadline logic actually guarantees:
- Buffered case (
stop_before_deadline_ms: Some(500)): Retries stop before batch expiry - No-buffer case: Retries run up to the deadline (within 500ms slack)
Both tests retain the call_count == retry_count + 1 invariant (one HTTP call per attempt) and error-reason checks.
Verdict: Approved — the changes are semantically correct and address the root cause of flakiness without weakening test coverage meaningfully.
Research notes
I reviewed the can_retry implementation in src/request/transitions.rs:272-333 to understand the deadline logic:
- Effective deadline =
batch_expires_at - stop_before_deadline_ms(or justbatch_expires_atif no buffer) - Retry denied when
not_before >= effective_deadline not_before = Utc::now() + backoff_duration
The daemon flow in src/daemon/mod.rs:1064-1143 shows that failed_at is set when each HTTP attempt fails, and the terminal Failed state preserves the last attempt's timestamp.
Suggested next steps
- Merge as-is — the assertions are correct and will eliminate the flakiness.
- Optional polish: Consider strengthening the buffered-case assertion slightly (see inline comment on line 870).
General findings
None — all issues are anchored to specific changed lines below.
|
|
||
| // 2. Verify HTTP call count matches retry attempts (1 initial + N retries) | ||
| // The 500ms buffer must stop the request *before* the batch expires. | ||
| assert!( |
There was a problem hiding this comment.
Non-blocking: Consider strengthening this assertion to more directly verify the buffer was effective.
Why it matters: The current assertion failed.state.failed_at < failed.state.batch_expires_at only verifies we didn't pass expiry. Combined with retry_count >= 1, it provides meaningful coverage, but doesn't strongly verify the 500ms buffer specifically. A request that failed after 1 retry at t=100ms would pass, even though the buffer should have allowed retries until ~t=1500ms.
A stronger invariant would be: failed.state.failed_at <= failed.state.batch_expires_at - chrono::Duration::milliseconds(400) (allowing 100ms slack for the final attempt's execution time). This would verify we actually stopped due to the buffer, not some other factor.
That said, the current assertion is correct and does distinguish the buffered behavior from the no-buffer case when read alongside the second test. The trade-off between assertion strength and flakiness risk is reasonable here.
Suggested fix: If you want stronger verification without reintroducing flakiness, consider:
assert!(
failed.state.failed_at <= failed.state.batch_expires_at - chrono::Duration::milliseconds(400),
"buffered retry should stop ~500ms before batch expiry: failed_at={}, expires_at={}",
failed.state.failed_at,
failed.state.batch_expires_at
);Or leave as-is if the current assertion proves stable in CI.
| ); | ||
|
|
||
| // 2. Verify HTTP call count matches retry attempts (1 initial + N retries) | ||
| // With no buffer, retries run right up to the deadline: the final attempt |
There was a problem hiding this comment.
Nit: Minor clarification on the comment.
Why it matters: The comment says "fails within one backoff step (≤200ms) of batch expiry" but the assertion uses 500ms slack. This is intentional (as noted: "500ms slack absorbs scheduling jitter"), but could be slightly clearer about why the slack is larger than the backoff step.
Suggested fix: Consider rewording to:
// With no buffer, retries run right up to the deadline: the final attempt
// fails near batch expiry (the max backoff step is 200ms, but we use 500ms
// slack to absorb scheduling jitter and DB overhead on slow CI).
Summary
Fixes the flaky
fusilladedeadline integration tests (test_deadline_aware_retry_stops_before_deadline,test_retry_stops_at_deadline_when_no_limits_set).Root cause: they asserted the number of retries that fit in a fixed completion window, but that count is wall-clock dependent — backoff
not_before, thecan_retrydeadline check, and the SQL claim filter all usechrono::Utc::now(). On a slow/loaded CI runner fewer retries fit, so a fixed range flakes (a recent run fit ~8 and tripped the(9..12)lower bound).Fix: assert the invariant the deadline logic guarantees — independent of CPU speed — instead of a magic count range:
stop_before_deadline_ms = 500): the request stops before the batch expires —failed_at < batch_expires_at.failed_atis within one backoff step (≤200ms) ofbatch_expires_at(500ms slack for jitter).call_count == retry_count + 1), and the terminal failure carries an error reason.tokio::time::pause()was considered but doesn't apply — the path useschrono/SQL wall-clock (nottokio::time) and runs against real Postgres under#[sqlx::test]; true determinism would require injecting a mock clock into production + SQL, out of scope here.Test-only; no library behavior change. Standalone, unrelated to #276.
Test plan
cargo test --test integration→ 18/18 passfailed_atinvariants hold under loadcargo fmt --checkclean;cargo clippy -- -D warnings(CI's command) clean🤖 Generated with Claude Code