CI: retry Pkg.test() once on transient PythonCall init failures #1180

Closed
ChrisRackauckas-Claude wants to merge 1 commit into SciML:master from ChrisRackauckas-Claude:ci-retry-pythoncall-init

@ChrisRackauckas-Claude
Contributor

Summary

The OptimizationSciPy (and occasionally OptimizationPyCMA) jobs intermittently fail at the precompile stage with

InitError: UndefRefError: access to undefined reference
 [11] __init__()
    @ PythonCall.JlWrap ~/.julia/packages/PythonCall/.../src/JlWrap/JlWrap.jl:70
during initialization of module JlWrap
in expression starting at .../OptimizationSciPy/src/OptimizationSciPy.jl:2

This happens with the same PythonCall version (v0.9.31) that passes in other runs: for example, the OptimizationSciPy, lts job passes on master run 24144818227 but fails on PR #1169 run 23171032225. It's upstream PythonCall flakiness around precompile cache deserialization, not a code change in this repository.

Approach

Wrap Pkg.test() so that, on a PythonCall + JlWrap / UndefRefError-shaped exception:

  1. The precompile cache directories for PythonCall, OptimizationSciPy, OptimizationPyCMA, and SciMLBasePythonCallExt are removed (they're what carry the stale JlWrap state across runs).
  2. Pkg.test() is retried exactly once.

Any other exception is rethrown immediately, so genuine test failures are still surfaced. The retry only fires when the error matches both PythonCall and one of JlWrap / UndefRefError, so it will not mask unrelated test bugs in any group.
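
The detection and retry logic described above can be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the names `is_pythoncall_init_error` and `test_with_single_retry` are hypothetical, `run_tests` stands in for `() -> Pkg.test()`, and `clear_caches` stands in for whatever removes the stale precompile directories. It also assumes the failure text is reachable from the caught exception, which may not hold if the error is only printed by a test subprocess.

```julia
# Hypothetical helper: true only when the message names PythonCall AND
# at least one of JlWrap / UndefRefError.
function is_pythoncall_init_error(msg::AbstractString)
    occursin("PythonCall", msg) || return false
    return occursin("JlWrap", msg) || occursin("UndefRefError", msg)
end

# Hypothetical wrapper: runs `run_tests`, and retries exactly once if the
# failure has the PythonCall init shape.
function test_with_single_retry(run_tests, clear_caches)
    try
        return run_tests()
    catch err
        msg = sprint(showerror, err)
        # Anything that is not the PythonCall init shape propagates unchanged.
        is_pythoncall_init_error(msg) || rethrow()
        clear_caches()
        # Second and final attempt; a failure here propagates to the caller.
        return run_tests()
    end
end
```

Under these assumptions, a CI step would call something like `test_with_single_retry(() -> Pkg.test(), clear_caches)`. Requiring both substrings keeps an unrelated `UndefRefError` in a non-Python group from triggering a retry.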

Test plan

  • Detection function unit-tested locally against the CI error string and against unrelated BoundsError / generic ErrorException (only the PythonCall init shape returns true)
  • Embedded Julia script parses (Meta.parseall) cleanly
  • YAML validates with yaml.safe_load
  • Live CI run

🤖 Generated with Claude Code

The OptimizationSciPy (and occasionally OptimizationPyCMA) jobs have
been observed to fail at the precompile stage with

  InitError: UndefRefError: access to undefined reference
  during initialization of module JlWrap
   ...
  in PythonCall.JlWrap.__init__()

with the *same* PythonCall version (0.9.31) on which other CI runs
succeed (e.g. OptimizationSciPy lts succeeds on master run 24144818227
but the same job fails on PR run 23171032225). This is an upstream
PythonCall flakiness around precompile cache deserialization, not a
code change in this repository.

Wrap `Pkg.test()` so that, on a `PythonCall` + `JlWrap` /
`UndefRefError`-shaped exception, the precompile cache for PythonCall
and the python-using subpackages is cleared and `Pkg.test()` is retried
exactly once. Any other exception is rethrown immediately, so genuine
test failures are still surfaced. The retry only fires when the error
matches both `PythonCall` and one of `JlWrap` / `UndefRefError`, so it
will not mask unrelated test bugs.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
@ChrisRackauckas-Claude
Contributor Author

Closing per maintainer feedback. Retry isn't the right approach. Replaced by #1182, which sets JULIA_NUM_PRECOMPILE_TASKS=1 for the python-using groups to serialize precompile workers. That removes the underlying PythonCall.JlWrap.__init__ parallel-init race that was the actual source of the failure.
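
For trying the same setting locally, the variable can also be scoped from within Julia. This is only a sketch: the actual change in #1182 is not shown in this thread, and `with_serial_precompile` is a hypothetical name. `JULIA_NUM_PRECOMPILE_TASKS` itself is the Julia environment variable that limits the number of parallel precompilation tasks.

```julia
# Hypothetical helper: run `f` with precompilation limited to a single task.
# `withenv` restores the previous value of the variable when `f` returns.
function with_serial_precompile(f)
    withenv("JULIA_NUM_PRECOMPILE_TASKS" => "1") do
        f()
    end
end
```

A call such as `with_serial_precompile(() -> Pkg.test())` would then run the tests with the variable set for that process and any child processes it starts.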
