Skip to content

test: make flaky deadline retry tests robust to CI speed#277

Open
pjb157 wants to merge 2 commits into
mainfrom
fix/deadline-retry-test-flakiness
Open

test: make flaky deadline retry tests robust to CI speed#277
pjb157 wants to merge 2 commits into
mainfrom
fix/deadline-retry-test-flakiness

Conversation

@pjb157
Copy link
Copy Markdown
Contributor

@pjb157 pjb157 commented May 29, 2026

Summary

Fixes the flaky fusillade deadline integration tests (test_deadline_aware_retry_stops_before_deadline, test_retry_stops_at_deadline_when_no_limits_set).

Root cause: they asserted the number of retries that fit in a fixed completion window, but that count is wall-clock dependent — backoff not_before, the can_retry deadline check, and the SQL claim filter all use chrono::Utc::now(). On a slow/loaded CI runner fewer retries fit, so a fixed range flakes (a recent run fit ~8 and tripped the (9..12) lower bound).

Fix: assert the invariant the deadline logic guarantees — independent of CPU speed — instead of a magic count range:

  • buffered (stop_before_deadline_ms = 500): the request stops before the batch expires — failed_at < batch_expires_at.
  • no buffer: retries run up to the deadline — failed_at is within one backoff step (≤200ms) of batch_expires_at (500ms slack for jitter).
  • both: retried ≥ 1 time, exactly one HTTP call per attempt (call_count == retry_count + 1), and the terminal failure carries an error reason.

tokio::time::pause() was considered but doesn't apply — the path uses chrono/SQL wall-clock (not tokio::time) and runs against real Postgres under #[sqlx::test]; true determinism would require injecting a mock clock into production + SQL, out of scope here.

Test-only; no library behavior change. Standalone, unrelated to #276.

Test plan

  • cargo test --test integration → 18/18 pass
  • Two deadline tests stable under single-core CPU starvation (8/8 runs, ~2x wall-time) — the new failed_at invariants hold under load
  • cargo fmt --check clean; cargo clippy -- -D warnings (CI's command) clean

🤖 Generated with Claude Code

The two deadline integration tests asserted a narrow retry-count range,
but the count is wall-clock dependent: the backoff `not_before`, the
deadline check in can_retry, and the SQL claim filter all use
chrono::Utc::now(). On a slow/loaded CI runner each retry's DB + CPU
overhead grows, so fewer retries fit in the fixed completion window; a
recent run fit only ~8 and tripped the (9..12) lower bound.

Re-anchor each assertion to the real invariant: a tight upper bound equal
to the exponential-backoff ceiling (retries cannot run past the deadline)
plus a generous lower bound that absorbs slow-runner slack. Also fixes the
half-open (9..12) range that contradicted its own "9-12" comment.

Test-only; no library behavior change. tokio paused-time was rejected --
the path uses chrono/SQL wall-clock, not tokio::time, and runs against
real Postgres under #[sqlx::test].

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 29, 2026 09:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Loosens retry-count assertions in two deadline-related integration tests so they no longer flake under slow/loaded CI, by anchoring to the exponential-backoff ceiling (tight upper bound) and a generous lower bound, and fixes a half-open range that contradicted its own comment.

Changes:

  • test_deadline_aware_retry_stops_before_deadline: (7..=9)(4..=8), with an updated explanatory comment.
  • test_retry_stops_at_deadline_when_no_limits_set: (9..12)(5..=11), fixing the half-open range vs. "9-12" comment mismatch.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

@doubleword-code doubleword-code Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This PR fixes flaky assertions in two deadline-aware retry integration tests by widening the acceptable retry-count ranges and correcting the underlying rationale. The changes are test-only and do not modify library behavior.

Verdict: Approved — the changes are correct, well-motivated, and address a real CI flakiness issue without masking bugs.

Research notes

I reviewed:

  • The can_retry logic in src/request/transitions.rs:272-333 which calculates not_before = now + backoff_duration and denies retries when not_before >= effective_deadline
  • The backoff calculation: backoff_ms * (backoff_factor ^ retry_attempt), capped at max_backoff_ms
  • Both test functions (test_deadline_aware_retry_stops_before_deadline and test_retry_stops_at_deadline_when_no_limits_set) to verify the timeline math

Key findings from the code:

  • Test 1 uses backoff_ms=50, backoff_factor=2, max_backoff_ms=200, effective deadline = 2000ms - 500ms buffer = 1500ms
  • Test 2 uses the same backoff config but no buffer, so effective deadline = 2000ms
  • With zero overhead, Test 1 can fit at most 8 retries (next would be at t=1550ms > 1500ms), Test 2 can fit at most 11 retries (next would be at t=2150ms > 2000ms)

Suggested next steps

No action required — this PR is ready to merge. Consider running the tests a few times on CI to confirm the flakiness is resolved.

General findings

None — the changes are narrowly scoped to test assertions and comments, and the rationale is sound.

The retry-count range assertions were wall-clock dependent: how many
retries fit the fixed completion window depends on real DB/CPU cycle
time, so any fixed range is CI-speed dependent (the original flake).

Assert what the deadline logic actually guarantees, independent of speed:
- buffered: stops before batch expiry (failed_at < batch_expires_at),
  exercising the 500ms stop_before_deadline_ms buffer;
- no buffer: retries run up to the deadline (failed_at within one backoff
  step of batch_expires_at);
- both: retried >= 1 time, one HTTP call per attempt, error reason set.

Verified stable under single-core CPU starvation (8/8 runs).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@pjb157 pjb157 changed the title test: make deadline retry-count assertions robust to slow CI test: make flaky deadline retry tests robust to CI speed May 29, 2026
Copy link
Copy Markdown

@doubleword-code doubleword-code Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This PR fixes flaky integration tests by replacing brittle retry-count assertions with semantic deadline invariants. The core insight is correct: retry counts are wall-clock dependent (how many DB/CPU cycles fit in the deadline window), so fixed ranges inevitably flake on slow CI.

The new assertions verify what the deadline logic actually guarantees:

  • Buffered case (stop_before_deadline_ms: Some(500)): Retries stop before batch expiry
  • No-buffer case: Retries run up to the deadline (within 500ms slack)

Both tests retain the call_count == retry_count + 1 invariant (one HTTP call per attempt) and error-reason checks.

Verdict: Approved — the changes are semantically correct and address the root cause of flakiness without weakening test coverage meaningfully.

Research notes

I reviewed the can_retry implementation in src/request/transitions.rs:272-333 to understand the deadline logic:

  • Effective deadline = batch_expires_at - stop_before_deadline_ms (or just batch_expires_at if no buffer)
  • Retry denied when not_before >= effective_deadline
  • not_before = Utc::now() + backoff_duration

The daemon flow in src/daemon/mod.rs:1064-1143 shows that failed_at is set when each HTTP attempt fails, and the terminal Failed state preserves the last attempt's timestamp.

Suggested next steps

  1. Merge as-is — the assertions are correct and will eliminate the flakiness.
  2. Optional polish: Consider strengthening the buffered-case assertion slightly (see inline comment on line 870).

General findings

None — all issues are anchored to specific changed lines below.

Comment thread tests/integration.rs

// 2. Verify HTTP call count matches retry attempts (1 initial + N retries)
// The 500ms buffer must stop the request *before* the batch expires.
assert!(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking: Consider strengthening this assertion to more directly verify the buffer was effective.

Why it matters: The current assertion failed.state.failed_at < failed.state.batch_expires_at only verifies we didn't pass expiry. Combined with retry_count >= 1, it provides meaningful coverage, but doesn't strongly verify the 500ms buffer specifically. A request that failed after 1 retry at t=100ms would pass, even though the buffer should have allowed retries until ~t=1500ms.

A stronger invariant would be: failed.state.failed_at <= failed.state.batch_expires_at - chrono::Duration::milliseconds(400) (allowing 100ms slack for the final attempt's execution time). This would verify we actually stopped due to the buffer, not some other factor.

That said, the current assertion is correct and does distinguish the buffered behavior from the no-buffer case when read alongside the second test. The trade-off between assertion strength and flakiness risk is reasonable here.

Suggested fix: If you want stronger verification without reintroducing flakiness, consider:

assert!(
    failed.state.failed_at <= failed.state.batch_expires_at - chrono::Duration::milliseconds(400),
    "buffered retry should stop ~500ms before batch expiry: failed_at={}, expires_at={}",
    failed.state.failed_at,
    failed.state.batch_expires_at
);

Or leave as-is if the current assertion proves stable in CI.

Comment thread tests/integration.rs
);

// 2. Verify HTTP call count matches retry attempts (1 initial + N retries)
// With no buffer, retries run right up to the deadline: the final attempt
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Minor clarification on the comment.

Why it matters: The comment says "fails within one backoff step (≤200ms) of batch expiry" but the assertion uses 500ms slack. This is intentional (as noted: "500ms slack absorbs scheduling jitter"), but could be slightly clearer about why the slack is larger than the backoff step.

Suggested fix: Consider rewording to:

// With no buffer, retries run right up to the deadline: the final attempt
// fails near batch expiry (the max backoff step is 200ms, but we use 500ms
// slack to absorb scheduling jitter and DB overhead on slow CI).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants