test: deterministic seeds + multi-seed robustness sweeps for flaky tests by Copilot · Pull Request #45 · harmoniqs/DirectTrajOpt.jl

Copilot · 2026-01-16T20:20:19Z

Summary

Replaces the original "tag flaky tests as :experimental to keep CI green" approach with something stronger: each previously-flaky test gets a deterministic seeded baseline + a multi-seed robustness sweep, both running on every PR.

A single Random.seed!(0) is reproducible on one Julia version but the downstream numerics drift across the matrix (1.10 / 1.11 / 1.12). The robustness sweep (K=20 seeds with MersenneTwister(seed), per-test pass-rate threshold) survives that drift — ≥80% of seeds pass is a much stronger guarantee than one lucky seed passes.

The sweeps cost ~5 seconds in aggregate on a ~10-minute matrix run, so there's no opt-in gate. No env vars, no tags, no machinery.

What changed

test/runtests.jl — now one line of filter:

@run_package_tests filter = ti -> !occursin("/benchmark/", ti.filename)

The /benchmark/ filter matches Piccolissimo.jl's convention — the subtree has its own Project.toml + deps + workflow, so its @testitems shouldn't be picked up by the main test run. Side benefit: this preemptively unbreaks PR #93's CI (which currently fails because benchmark/convergence/convergence.jl is being discovered).

Two flaky tests reworked. Each now has an untagged deterministic baseline + an untagged sweep:

TimeConsistencyConstraint with free time optimization — was tagged :experimental (intermittent 0.066 < 1e-6). Now uses MersenneTwister(0) and a paired sweep at threshold 0.80.
NonlinearKnotPointConstraint - single variable with vector syntax — observed Julia-1.10 finite-diff failure on PR benchmarks: convergence suite v1 (Ipopt + MadNLP on X gate) #93. Same treatment; sweep threshold 0.65 (norm-based finite-diff is inherently noisier).

Test helpers grew an optional rng kwarg. bilinear_dynamics_and_trajectory and test_constraint accept rng::AbstractRNG (default Random.default_rng()) so seeded reproducibility is per-test, not via global RNG state. test_constraint also gained assert::Bool=true and returns a NamedTuple including jacobian_pass / hessian_pass — sweeps inspect those directly rather than triggering @test per seed.

Statistical strength of K=20 sweeps

True pass rate	P(passes at threshold 0.80)	P(passes at threshold 0.65)
0.95	0.998	~1.000
0.80	0.630	0.913
0.50	0.006	0.058

Thresholds are per-test, picked with buffer above the observed baseline rate. A regression that drops the true rate below ~0.5 fails with very high probability; a healthy test rarely false-fails.

Verified

Local Julia 1.12: 355 pass / 1 broken / 0 fail / 356 total / 10m02s

TimeConsistency sweep: pass_rate = 1.00 (20/20)
NonlinearKnotPoint sweep: pass_rate = 0.80 (16/20, threshold 0.65)

CI matrix: all green on Julia 1.10 / 1.11 / 1.12, codecov/patch, Documentation, Formatter.

Out of scope (separate work)

Nightly Julia 1.13-rc1: blocked by upstream JET (v0.11.3 master pins julia = "1.12"). Resolves when JET adds 1.13 support; no force_latest_compatible_version change needed on our side.
Shared sweep helper: when a 3rd sweep lands, the right shape of a robustness_sweep(K, threshold, predicate) helper will be obvious from three concrete callsites. Punting until then.
Cross-package conventions: if @gennadiryan's upcoming upfront test-item discovery + caching work introduces a sanctioned sweep primitive, we'd migrate. The one-line filter in runtests.jl is intentionally trivial to absorb.

Test plan

Default test suite passes locally on Julia 1.12
Sweeps pass at their thresholds locally on Julia 1.12
CI matrix (1.10 / 1.11 / 1.12) green with always-on sweeps
codecov/patch green (sweep code is now exercised)

🤖 Generated with Claude Code

…s.jl Co-authored-by: jack-champagne <43344745+jack-champagne@users.noreply.github.com>

Co-authored-by: jack-champagne <43344745+jack-champagne@users.noreply.github.com>

gennadiryan · 2026-04-29T15:50:12Z

@jack-champagne I have some things mocked up that will help us take advantage of this, but it comes with some additional potential overhauls to our CI. The idea is to run TestItemRunner's test item discovery routine up front to not only generate the list of test item locations, but also grab the associated code objects and cache them alongside the package itself. This would reduce CI latency spent rebuilding tests but would also allow for much more fine-grained control of which tests "apply" to a given CI run, since the tests would be getting filtered ahead of time. At present, the most obvious use case is filtering out GPU tests on certain runs (see https://github.com/harmoniqs/Piccolissimo.jl/commit/ade50e4a29225534163afe888ee7225a33d52d34), but there are many others. Another potential future use case is supporting separate development streams (e.g. in Piccolissimo, a main branch and a well-maintained altissimo_integration branch, where the former is not yet ready to merge into the latter, and where both need to be kept up to date with downstream upgrades in the meantime). More to follow (likely in a separate PR initially landing on Piccolo.jl, then propagated back down to here).

Yet another use case: temporarily gracefully skipping CI tests for particular extension packages due to that particular extension relying on a dep (QuantumToolbox.jl is the one I've got in mind today) that is behind on compat.

…r callbacks" from `main` with the "Sparse formulations" entry and the new "Testing" documentation from the feature branch.

@test

…test_constraint Adds optional `rng::AbstractRNG` kwarg to the two main test helpers so seeded reproducibility is opt-in and per-test rather than via global Random.seed!: - `bilinear_dynamics_and_trajectory` now accepts `rng` (default `Random.default_rng()`); a seeded `MersenneTwister` makes every random draw in the trajectory deterministic. - `test_constraint` accepts `rng` (for the Lagrange-multiplier draw) and `assert::Bool=true` (set false to inspect `jacobian_pass`/`hessian_pass` on the returned NamedTuple without registering @test outcomes — needed for multi-seed robustness sweeps). The return type changes from a 4-tuple to a NamedTuple, but positional destructuring `(a,b,c,d) = test_constraint(...)` still works since the first four fields are unchanged. No internal callers destructure the result anyway. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ded + sweep pairs Generalizes the runtests.jl filter into a two-tag taxonomy: :experimental excluded by default; opt in with INCLUDE_EXPERIMENTAL=1 :robustness excluded by default; opt in with INCLUDE_ROBUSTNESS=1 The filter remains a single closure over `ti.tags`, easy to absorb into gennadiryan's upcoming upfront test-item discovery PR later. For the two confirmed-flaky tests, replaces the single-seed :experimental version with two testitems: 1. A deterministic baseline (untagged) that uses MersenneTwister(0) for both trajectory initialization and the multiplier μ. Runs every PR. A failure here is a real regression on a specific (Julia version, seed) pair — not RNG drift. 2. A :robustness sweep (K=20 seeds) that asserts ≥80% of seeds pass within the same tolerance. With K=20 and threshold=0.80, a true pass-rate of 95% passes ~99.8% of the time, while a regression that drops the true rate to 50% fails ~99.4% of the time. Affected tests: - TimeConsistencyConstraint with free time optimization (was :experimental; drops the tag, gains a robustness twin) - NonlinearKnotPointConstraint - single variable with vector syntax (intermittent Julia-1.10 finite-diff failure; same treatment) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ophy Replaces the single-paragraph Testing section with a table of the two opt-in env vars (INCLUDE_EXPERIMENTAL, INCLUDE_ROBUSTNESS) and a short prose section explaining the deterministic-baseline + robustness-sweep pattern for stochastic / numerical primitives. Captures the rationale: a single Random.seed!(0) is reproducible on one Julia version but drifts across the CI matrix (1.10/1.11/1.12), so we prefer a two-layer approach over indefinite :experimental tagging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two fixes for the multi-seed robustness sweeps: 1. JuliaFormatter folded `tags=[:robustness] begin` across a line break, which TestItemDetection's parser doesn't handle — the keyword arg got absorbed into the begin block and `default_imports` was effectively off, so the testitem module didn't get `using DirectTrajOpt` injected (manifesting as `UndefVarError: QuadraticRegularizer not defined`). Renamed each robustness testitem (drop em-dash, drop leading dash) so the macro line fits on one row and the formatter leaves it alone. 2. Wrapped the per-seed loop in `run_sweep(K::Int)` so `pass_count` and `failures` are function-local, eliminating the soft-scope-ambiguity warning. Also lowers the NonlinearKnotPointConstraint sweep threshold from 0.80 to 0.65. The norm-based finite-diff/analytic comparison at atol=1e-3 is genuinely noisy — observed Julia-1.12 pass rate is exactly 16/20=0.80, right at the old threshold. A one-seed shift across the Julia matrix would false-fail. A real regression in the analytic derivative would drop the rate well below 0.5, still caught with high probability under the looser gate. README updated to note that per-test thresholds are chosen with buffer above the observed baseline. Verified on Julia 1.12: - default suite: 353 pass / 1 broken / 0 fail / 0 error - INCLUDE_ROBUSTNESS=1: 355 pass / 1 broken / 0 fail / 0 error TimeConsistency sweep: pass_rate=1.00 (20/20) NonlinearKnotPoint sweep: pass_rate=0.80 (16/20) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-05-20T06:34:05Z

Codecov Report

❌ Patch coverage is 96.66667% with 2 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
.../constraints/linear/time_consistency_constraint.jl	92.30%	2 Missing ⚠️

📢 Thoughts on this report? Let us know!

After the deterministic+sweep rework no testitems use :experimental anymore, and the sweeps add ~5 seconds in aggregate on top of a ~10-minute matrix — not worth a parallel filter framework. Drops both INCLUDE_EXPERIMENTAL and INCLUDE_ROBUSTNESS env vars and the ti.tags closure. The robustness sweeps now run on every PR (along with their deterministic baselines). Keeps a `/benchmark/` path filter following Piccolissimo.jl's convention — the benchmark subtree has its own Project.toml + deps + workflow, so its @testItems shouldn't be discovered by `@run_package_tests`. This also preemptively unbreaks PR #93's CI (which fails today because benchmark/convergence/convergence.jl is being picked up). Verified locally on Julia 1.12: 355 pass / 1 broken / 0 fail. TimeConsistency sweep: pass_rate = 1.0 (20/20) NonlinearKnotPoint sweep: pass_rate = 0.8 (16/20, threshold 0.65) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jack-champagne · 2026-06-09T06:28:34Z

🤖 Nightly Rebase Bot: Conflicts detected during rebase. An agent-attempted resolution is available on rebase/app/copilot-swe-agent/copilot/add-experimental-tag-to-tests. Please review before merging.

Initial plan

71b055d

Copilot AI assigned Copilot and jack-champagne Jan 16, 2026

Copilot started work on behalf of jack-champagne January 16, 2026 20:20 View session

Copilot AI and others added 3 commits January 16, 2026 20:27

Add experimental tag to flaky test and implement filtering in runtest…

c1fb1b8

…s.jl Co-authored-by: jack-champagne <43344745+jack-champagne@users.noreply.github.com>

Add testing documentation to README

fab856f

Co-authored-by: jack-champagne <43344745+jack-champagne@users.noreply.github.com>

Add safety check for tags field in test filtering

2317d72

Co-authored-by: jack-champagne <43344745+jack-champagne@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add experimental tag to flaky tests in CI workflow~~ Add experimental tag filtering for flaky tests Jan 16, 2026

Copilot AI requested a review from jack-champagne January 16, 2026 20:37

Copilot finished work on behalf of jack-champagne January 16, 2026 20:37

harmoniqs-rebase-bot and others added 5 commits May 19, 2026 06:34

[ai-rebase] Combined the detailed feature descriptions and new "Solve…

31fc0cc

…r callbacks" from `main` with the "Sparse formulations" entry and the new "Testing" documentation from the feature branch.

jack-champagne changed the title ~~Add experimental tag filtering for flaky tests~~ test: deterministic seeds + multi-seed robustness sweeps for flaky tests May 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test: deterministic seeds + multi-seed robustness sweeps for flaky tests#45

test: deterministic seeds + multi-seed robustness sweeps for flaky tests#45
jack-champagne with Copilot wants to merge 10 commits into
mainfrom
copilot/add-experimental-tag-to-tests

Copilot AI commented Jan 16, 2026 •

edited by jack-champagne

Loading

Uh oh!

gennadiryan commented Apr 29, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 20, 2026 •

edited

Loading

Uh oh!

jack-champagne commented Jun 9, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

Copilot AI commented Jan 16, 2026 • edited by jack-champagne Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Statistical strength of K=20 sweeps

Verified

Out of scope (separate work)

Test plan

Uh oh!

gennadiryan commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jack-champagne commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Copilot AI commented Jan 16, 2026 •

edited by jack-champagne

Loading

gennadiryan commented Apr 29, 2026 •

edited

Loading

codecov Bot commented May 20, 2026 •

edited

Loading

jack-champagne commented Jun 9, 2026 •

edited

Loading