Skip to content

test: deterministic seeds + multi-seed robustness sweeps for flaky tests#45

Draft
Copilot wants to merge 10 commits into
mainfrom
copilot/add-experimental-tag-to-tests
Draft

test: deterministic seeds + multi-seed robustness sweeps for flaky tests#45
Copilot wants to merge 10 commits into
mainfrom
copilot/add-experimental-tag-to-tests

Conversation

Copy link
Copy Markdown

Copilot AI commented Jan 16, 2026

Summary

Replaces the original "tag flaky tests as :experimental to keep CI green" approach with something stronger: each previously-flaky test gets a deterministic seeded baseline + a multi-seed robustness sweep, both running on every PR.

A single Random.seed!(0) is reproducible on one Julia version but the downstream numerics drift across the matrix (1.10 / 1.11 / 1.12). The robustness sweep (K=20 seeds with MersenneTwister(seed), per-test pass-rate threshold) survives that drift — ≥80% of seeds pass is a much stronger guarantee than one lucky seed passes.

The sweeps cost ~5 seconds in aggregate on a ~10-minute matrix run, so there's no opt-in gate. No env vars, no tags, no machinery.

What changed

test/runtests.jl — now one line of filter:

@run_package_tests filter = ti -> !occursin("/benchmark/", ti.filename)

The /benchmark/ filter matches Piccolissimo.jl's convention — the subtree has its own Project.toml + deps + workflow, so its @testitems shouldn't be picked up by the main test run. Side benefit: this preemptively unbreaks PR #93's CI (which currently fails because benchmark/convergence/convergence.jl is being discovered).

Two flaky tests reworked. Each now has an untagged deterministic baseline + an untagged sweep:

  1. TimeConsistencyConstraint with free time optimization — was tagged :experimental (intermittent 0.066 < 1e-6). Now uses MersenneTwister(0) and a paired sweep at threshold 0.80.
  2. NonlinearKnotPointConstraint - single variable with vector syntax — observed Julia-1.10 finite-diff failure on PR benchmarks: convergence suite v1 (Ipopt + MadNLP on X gate) #93. Same treatment; sweep threshold 0.65 (norm-based finite-diff is inherently noisier).

Test helpers grew an optional rng kwarg. bilinear_dynamics_and_trajectory and test_constraint accept rng::AbstractRNG (default Random.default_rng()) so seeded reproducibility is per-test, not via global RNG state. test_constraint also gained assert::Bool=true and returns a NamedTuple including jacobian_pass / hessian_pass — sweeps inspect those directly rather than triggering @test per seed.

Statistical strength of K=20 sweeps

True pass rate P(passes at threshold 0.80) P(passes at threshold 0.65)
0.95 0.998 ~1.000
0.80 0.630 0.913
0.50 0.006 0.058

Thresholds are per-test, picked with buffer above the observed baseline rate. A regression that drops the true rate below ~0.5 fails with very high probability; a healthy test rarely false-fails.

Verified

Local Julia 1.12: 355 pass / 1 broken / 0 fail / 356 total / 10m02s

  • TimeConsistency sweep: pass_rate = 1.00 (20/20)
  • NonlinearKnotPoint sweep: pass_rate = 0.80 (16/20, threshold 0.65)

CI matrix: all green on Julia 1.10 / 1.11 / 1.12, codecov/patch, Documentation, Formatter.

Out of scope (separate work)

  • Nightly Julia 1.13-rc1: blocked by upstream JET (v0.11.3 master pins julia = "1.12"). Resolves when JET adds 1.13 support; no force_latest_compatible_version change needed on our side.
  • Shared sweep helper: when a 3rd sweep lands, the right shape of a robustness_sweep(K, threshold, predicate) helper will be obvious from three concrete callsites. Punting until then.
  • Cross-package conventions: if @gennadiryan's upcoming upfront test-item discovery + caching work introduces a sanctioned sweep primitive, we'd migrate. The one-line filter in runtests.jl is intentionally trivial to absorb.

Test plan

  • Default test suite passes locally on Julia 1.12
  • Sweeps pass at their thresholds locally on Julia 1.12
  • CI matrix (1.10 / 1.11 / 1.12) green with always-on sweeps
  • codecov/patch green (sweep code is now exercised)

🤖 Generated with Claude Code

Copilot AI and others added 3 commits January 16, 2026 20:27
…s.jl

Co-authored-by: jack-champagne <43344745+jack-champagne@users.noreply.github.com>
Co-authored-by: jack-champagne <43344745+jack-champagne@users.noreply.github.com>
Co-authored-by: jack-champagne <43344745+jack-champagne@users.noreply.github.com>
Copilot AI changed the title [WIP] Add experimental tag to flaky tests in CI workflow Add experimental tag filtering for flaky tests Jan 16, 2026
Copilot AI requested a review from jack-champagne January 16, 2026 20:37
@gennadiryan
Copy link
Copy Markdown
Member

gennadiryan commented Apr 29, 2026

@jack-champagne I have some things mocked up that will help us take advantage of this, but it comes with some additional potential overhauls to our CI. The idea is to run TestItemRunner's test item discovery routine up front to not only generate the list of test item locations, but also grab the associated code objects and cache them alongside the package itself. This would reduce CI latency spent rebuilding tests but would also allow for much more fine-grained control of which tests "apply" to a given CI run, since the tests would be getting filtered ahead of time. At present, the most obvious use case is filtering out GPU tests on certain runs (see https://github.com/harmoniqs/Piccolissimo.jl/commit/ade50e4a29225534163afe888ee7225a33d52d34), but there are many others. Another potential future use case is supporting separate development streams (e.g. in Piccolissimo, a main branch and a well-maintained altissimo_integration branch, where the former is not yet ready to merge into the latter, and where both need to be kept up to date with downstream upgrades in the meantime). More to follow (likely in a separate PR initially landing on Piccolo.jl, then propagated back down to here).

Yet another use case: temporarily gracefully skipping CI tests for particular extension packages due to that particular extension relying on a dep (QuantumToolbox.jl is the one I've got in mind today) that is behind on compat.

harmoniqs-rebase-bot and others added 5 commits May 19, 2026 06:34
…r callbacks" from `main` with the "Sparse formulations" entry and the new "Testing" documentation from the feature branch.
…test_constraint

Adds optional `rng::AbstractRNG` kwarg to the two main test helpers so seeded
reproducibility is opt-in and per-test rather than via global Random.seed!:

- `bilinear_dynamics_and_trajectory` now accepts `rng` (default
  `Random.default_rng()`); a seeded `MersenneTwister` makes every random
  draw in the trajectory deterministic.
- `test_constraint` accepts `rng` (for the Lagrange-multiplier draw) and
  `assert::Bool=true` (set false to inspect `jacobian_pass`/`hessian_pass`
  on the returned NamedTuple without registering @test outcomes — needed
  for multi-seed robustness sweeps).

The return type changes from a 4-tuple to a NamedTuple, but positional
destructuring `(a,b,c,d) = test_constraint(...)` still works since the
first four fields are unchanged. No internal callers destructure the
result anyway.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ded + sweep pairs

Generalizes the runtests.jl filter into a two-tag taxonomy:

  :experimental   excluded by default; opt in with INCLUDE_EXPERIMENTAL=1
  :robustness     excluded by default; opt in with INCLUDE_ROBUSTNESS=1

The filter remains a single closure over `ti.tags`, easy to absorb into
gennadiryan's upcoming upfront test-item discovery PR later.

For the two confirmed-flaky tests, replaces the single-seed :experimental
version with two testitems:

1. A deterministic baseline (untagged) that uses MersenneTwister(0) for
   both trajectory initialization and the multiplier μ. Runs every PR.
   A failure here is a real regression on a specific (Julia version,
   seed) pair — not RNG drift.

2. A :robustness sweep (K=20 seeds) that asserts ≥80% of seeds pass
   within the same tolerance. With K=20 and threshold=0.80, a true
   pass-rate of 95% passes ~99.8% of the time, while a regression
   that drops the true rate to 50% fails ~99.4% of the time.

Affected tests:
  - TimeConsistencyConstraint with free time optimization
    (was :experimental; drops the tag, gains a robustness twin)
  - NonlinearKnotPointConstraint - single variable with vector syntax
    (intermittent Julia-1.10 finite-diff failure; same treatment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ophy

Replaces the single-paragraph Testing section with a table of the two
opt-in env vars (INCLUDE_EXPERIMENTAL, INCLUDE_ROBUSTNESS) and a short
prose section explaining the deterministic-baseline + robustness-sweep
pattern for stochastic / numerical primitives.

Captures the rationale: a single Random.seed!(0) is reproducible on one
Julia version but drifts across the CI matrix (1.10/1.11/1.12), so we
prefer a two-layer approach over indefinite :experimental tagging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes for the multi-seed robustness sweeps:

1. JuliaFormatter folded `tags=[:robustness] begin` across a line break,
   which TestItemDetection's parser doesn't handle — the keyword arg got
   absorbed into the begin block and `default_imports` was effectively
   off, so the testitem module didn't get `using DirectTrajOpt` injected
   (manifesting as `UndefVarError: QuadraticRegularizer not defined`).
   Renamed each robustness testitem (drop em-dash, drop leading dash) so
   the macro line fits on one row and the formatter leaves it alone.

2. Wrapped the per-seed loop in `run_sweep(K::Int)` so `pass_count` and
   `failures` are function-local, eliminating the soft-scope-ambiguity
   warning.

Also lowers the NonlinearKnotPointConstraint sweep threshold from 0.80
to 0.65. The norm-based finite-diff/analytic comparison at atol=1e-3 is
genuinely noisy — observed Julia-1.12 pass rate is exactly 16/20=0.80,
right at the old threshold. A one-seed shift across the Julia matrix
would false-fail. A real regression in the analytic derivative would
drop the rate well below 0.5, still caught with high probability under
the looser gate. README updated to note that per-test thresholds are
chosen with buffer above the observed baseline.

Verified on Julia 1.12:
  - default suite: 353 pass / 1 broken / 0 fail / 0 error
  - INCLUDE_ROBUSTNESS=1: 355 pass / 1 broken / 0 fail / 0 error
    TimeConsistency sweep:        pass_rate=1.00 (20/20)
    NonlinearKnotPoint sweep:     pass_rate=0.80 (16/20)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 96.66667% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
.../constraints/linear/time_consistency_constraint.jl 92.30% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

After the deterministic+sweep rework no testitems use :experimental anymore,
and the sweeps add ~5 seconds in aggregate on top of a ~10-minute matrix —
not worth a parallel filter framework.

Drops both INCLUDE_EXPERIMENTAL and INCLUDE_ROBUSTNESS env vars and the
ti.tags closure. The robustness sweeps now run on every PR (along with their
deterministic baselines).

Keeps a `/benchmark/` path filter following Piccolissimo.jl's convention —
the benchmark subtree has its own Project.toml + deps + workflow, so its
@testItems shouldn't be discovered by `@run_package_tests`. This also
preemptively unbreaks PR #93's CI (which fails today because
benchmark/convergence/convergence.jl is being picked up).

Verified locally on Julia 1.12: 355 pass / 1 broken / 0 fail.
  TimeConsistency sweep:    pass_rate = 1.0  (20/20)
  NonlinearKnotPoint sweep: pass_rate = 0.8  (16/20, threshold 0.65)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jack-champagne jack-champagne changed the title Add experimental tag filtering for flaky tests test: deterministic seeds + multi-seed robustness sweeps for flaky tests May 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants