test: deterministic seeds + multi-seed robustness sweeps for flaky tests#45
test: deterministic seeds + multi-seed robustness sweeps for flaky tests#45Copilot wants to merge 10 commits into
Conversation
…s.jl Co-authored-by: jack-champagne <43344745+jack-champagne@users.noreply.github.com>
Co-authored-by: jack-champagne <43344745+jack-champagne@users.noreply.github.com>
Co-authored-by: jack-champagne <43344745+jack-champagne@users.noreply.github.com>
|
@jack-champagne I have some things mocked up that will help us take advantage of this, but it comes with some additional potential overhauls to our CI. The idea is to run TestItemRunner's test item discovery routine up front to not only generate the list of test item locations, but also grab the associated code objects and cache them alongside the package itself. This would reduce CI latency spent rebuilding tests but would also allow for much more fine-grained control of which tests "apply" to a given CI run, since the tests would be getting filtered ahead of time. At present, the most obvious use case is filtering out GPU tests on certain runs (see https://github.com/harmoniqs/Piccolissimo.jl/commit/ade50e4a29225534163afe888ee7225a33d52d34), but there are many others. Another potential future use case is supporting separate development streams (e.g. in Piccolissimo, a main branch and a well-maintained altissimo_integration branch, where the former is not yet ready to merge into the latter, and where both need to be kept up to date with downstream upgrades in the meantime). More to follow (likely in a separate PR initially landing on Piccolo.jl, then propagated back down to here). Yet another use case: temporarily gracefully skipping CI tests for particular extension packages due to that particular extension relying on a dep (QuantumToolbox.jl is the one I've got in mind today) that is behind on compat. |
…r callbacks" from `main` with the "Sparse formulations" entry and the new "Testing" documentation from the feature branch.
…test_constraint Adds optional `rng::AbstractRNG` kwarg to the two main test helpers so seeded reproducibility is opt-in and per-test rather than via global Random.seed!: - `bilinear_dynamics_and_trajectory` now accepts `rng` (default `Random.default_rng()`); a seeded `MersenneTwister` makes every random draw in the trajectory deterministic. - `test_constraint` accepts `rng` (for the Lagrange-multiplier draw) and `assert::Bool=true` (set false to inspect `jacobian_pass`/`hessian_pass` on the returned NamedTuple without registering @test outcomes — needed for multi-seed robustness sweeps). The return type changes from a 4-tuple to a NamedTuple, but positional destructuring `(a,b,c,d) = test_constraint(...)` still works since the first four fields are unchanged. No internal callers destructure the result anyway. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ded + sweep pairs
Generalizes the runtests.jl filter into a two-tag taxonomy:
:experimental excluded by default; opt in with INCLUDE_EXPERIMENTAL=1
:robustness excluded by default; opt in with INCLUDE_ROBUSTNESS=1
The filter remains a single closure over `ti.tags`, easy to absorb into
gennadiryan's upcoming upfront test-item discovery PR later.
For the two confirmed-flaky tests, replaces the single-seed :experimental
version with two testitems:
1. A deterministic baseline (untagged) that uses MersenneTwister(0) for
both trajectory initialization and the multiplier μ. Runs every PR.
A failure here is a real regression on a specific (Julia version,
seed) pair — not RNG drift.
2. A :robustness sweep (K=20 seeds) that asserts ≥80% of seeds pass
within the same tolerance. With K=20 and threshold=0.80, a true
pass-rate of 95% passes ~99.8% of the time, while a regression
that drops the true rate to 50% fails ~99.4% of the time.
Affected tests:
- TimeConsistencyConstraint with free time optimization
(was :experimental; drops the tag, gains a robustness twin)
- NonlinearKnotPointConstraint - single variable with vector syntax
(intermittent Julia-1.10 finite-diff failure; same treatment)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ophy Replaces the single-paragraph Testing section with a table of the two opt-in env vars (INCLUDE_EXPERIMENTAL, INCLUDE_ROBUSTNESS) and a short prose section explaining the deterministic-baseline + robustness-sweep pattern for stochastic / numerical primitives. Captures the rationale: a single Random.seed!(0) is reproducible on one Julia version but drifts across the CI matrix (1.10/1.11/1.12), so we prefer a two-layer approach over indefinite :experimental tagging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes for the multi-seed robustness sweeps:
1. JuliaFormatter folded `tags=[:robustness] begin` across a line break,
which TestItemDetection's parser doesn't handle — the keyword arg got
absorbed into the begin block and `default_imports` was effectively
off, so the testitem module didn't get `using DirectTrajOpt` injected
(manifesting as `UndefVarError: QuadraticRegularizer not defined`).
Renamed each robustness testitem (drop em-dash, drop leading dash) so
the macro line fits on one row and the formatter leaves it alone.
2. Wrapped the per-seed loop in `run_sweep(K::Int)` so `pass_count` and
`failures` are function-local, eliminating the soft-scope-ambiguity
warning.
Also lowers the NonlinearKnotPointConstraint sweep threshold from 0.80
to 0.65. The norm-based finite-diff/analytic comparison at atol=1e-3 is
genuinely noisy — observed Julia-1.12 pass rate is exactly 16/20=0.80,
right at the old threshold. A one-seed shift across the Julia matrix
would false-fail. A real regression in the analytic derivative would
drop the rate well below 0.5, still caught with high probability under
the looser gate. README updated to note that per-test thresholds are
chosen with buffer above the observed baseline.
Verified on Julia 1.12:
- default suite: 353 pass / 1 broken / 0 fail / 0 error
- INCLUDE_ROBUSTNESS=1: 355 pass / 1 broken / 0 fail / 0 error
TimeConsistency sweep: pass_rate=1.00 (20/20)
NonlinearKnotPoint sweep: pass_rate=0.80 (16/20)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
After the deterministic+sweep rework no testitems use :experimental anymore, and the sweeps add ~5 seconds in aggregate on top of a ~10-minute matrix — not worth a parallel filter framework. Drops both INCLUDE_EXPERIMENTAL and INCLUDE_ROBUSTNESS env vars and the ti.tags closure. The robustness sweeps now run on every PR (along with their deterministic baselines). Keeps a `/benchmark/` path filter following Piccolissimo.jl's convention — the benchmark subtree has its own Project.toml + deps + workflow, so its @testItems shouldn't be discovered by `@run_package_tests`. This also preemptively unbreaks PR #93's CI (which fails today because benchmark/convergence/convergence.jl is being picked up). Verified locally on Julia 1.12: 355 pass / 1 broken / 0 fail. TimeConsistency sweep: pass_rate = 1.0 (20/20) NonlinearKnotPoint sweep: pass_rate = 0.8 (16/20, threshold 0.65) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Replaces the original "tag flaky tests as
:experimentalto keep CI green" approach with something stronger: each previously-flaky test gets a deterministic seeded baseline + a multi-seed robustness sweep, both running on every PR.A single
Random.seed!(0)is reproducible on one Julia version but the downstream numerics drift across the matrix (1.10 / 1.11 / 1.12). The robustness sweep (K=20 seeds withMersenneTwister(seed), per-test pass-rate threshold) survives that drift —≥80% of seeds passis a much stronger guarantee thanone lucky seed passes.The sweeps cost ~5 seconds in aggregate on a ~10-minute matrix run, so there's no opt-in gate. No env vars, no tags, no machinery.
What changed
test/runtests.jl— now one line of filter:The
/benchmark/filter matches Piccolissimo.jl's convention — the subtree has its ownProject.toml+ deps + workflow, so its@testitems shouldn't be picked up by the main test run. Side benefit: this preemptively unbreaks PR #93's CI (which currently fails becausebenchmark/convergence/convergence.jlis being discovered).Two flaky tests reworked. Each now has an untagged deterministic baseline + an untagged sweep:
TimeConsistencyConstraint with free time optimization— was tagged:experimental(intermittent0.066 < 1e-6). Now usesMersenneTwister(0)and a paired sweep at threshold 0.80.NonlinearKnotPointConstraint - single variable with vector syntax— observed Julia-1.10 finite-diff failure on PR benchmarks: convergence suite v1 (Ipopt + MadNLP on X gate) #93. Same treatment; sweep threshold 0.65 (norm-based finite-diff is inherently noisier).Test helpers grew an optional
rngkwarg.bilinear_dynamics_and_trajectoryandtest_constraintacceptrng::AbstractRNG(defaultRandom.default_rng()) so seeded reproducibility is per-test, not via global RNG state.test_constraintalso gainedassert::Bool=trueand returns a NamedTuple includingjacobian_pass/hessian_pass— sweeps inspect those directly rather than triggering@testper seed.Statistical strength of K=20 sweeps
Thresholds are per-test, picked with buffer above the observed baseline rate. A regression that drops the true rate below ~0.5 fails with very high probability; a healthy test rarely false-fails.
Verified
Local Julia 1.12: 355 pass / 1 broken / 0 fail / 356 total / 10m02s
CI matrix: all green on Julia 1.10 / 1.11 / 1.12, codecov/patch, Documentation, Formatter.
Out of scope (separate work)
v0.11.3master pinsjulia = "1.12"). Resolves when JET adds 1.13 support; noforce_latest_compatible_versionchange needed on our side.robustness_sweep(K, threshold, predicate)helper will be obvious from three concrete callsites. Punting until then.runtests.jlis intentionally trivial to absorb.Test plan
🤖 Generated with Claude Code