
CI: serialize precompile workers for python-using groups #1182

Merged
ChrisRackauckas merged 1 commit into SciML:master from ChrisRackauckas-Claude:sci-py-robust-init
Apr 10, 2026

Conversation

@ChrisRackauckas-Claude
Contributor

Summary

Replaces #1180 (closed) with a substantive non-retry fix.

The OptimizationSciPy (and occasionally OptimizationPyCMA) jobs have been observed to fail at the precompile stage with

InitError: UndefRefError: access to undefined reference
 [1] _pyjl_new(t::Ptr{...}, ::Ptr{...}, ::Ptr{...})
     @ PythonCall.JlWrap.Cjl ~/.julia/packages/PythonCall/.../JlWrap/C.jl:24
 ...
[11] __init__()
     @ PythonCall.JlWrap ~/.julia/packages/PythonCall/.../JlWrap/JlWrap.jl:70
during initialization of module JlWrap
in expression starting at .../OptimizationSciPy/src/OptimizationSciPy.jl:2

with the same PythonCall v0.9.31 on which other CI runs succeed: the OptimizationSciPy, lts job passes on master run 24144818227, but the same job fails on PR #1169 run 23171032225.

Root cause

The intermittency, the identical package versions, and the failure happening inside PythonCall's own wrapper-type registration in JlWrap.__init__ all point to a parallel precompile race. JlWrap.__init__ (.../src/JlWrap/JlWrap.jl:54-75) does:

function __init__()
    init_base()
    init_raw()
    init_any()
    ...
    init_module()
    ...
    jl = pyjuliacallmodule
    jl.Core = Base.Core      # ← line 70: pysetattr → _pyjl_new
    jl.Base = Base
    jl.Main = Main
    ...
end

When multiple precompile workers spin up wrapper-type registration concurrently, one of them observes a not-yet-populated Python type slot inside _pyjl_new and throws UndefRefError. That explains the intermittent, every-N-th-run flake without any other moving parts.
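In miniature, the race looks like the following hypothetical Python analogue (illustrative names only, not PythonCall's actual code): a reader that runs before initialization has populated a shared slot table sees an undefined reference, while serializing init against the read removes the failure.

```python
# Hypothetical analogue of the race, in Python for illustration only.
# `type_slots` stands in for JlWrap's wrapper-type registry; `pyjl_new`
# stands in for _pyjl_new reading a slot that __init__ should have filled.
type_slots = {}

def init_wrapper_types():
    """Populate the registry; in the real bug, workers raced this step."""
    type_slots["JlBase"] = object()

def pyjl_new(name):
    """Construct a wrapper; fails if init has not run yet."""
    try:
        return type_slots[name]
    except KeyError as exc:
        raise RuntimeError("access to undefined reference") from exc

# A worker arriving before initialization sees the empty slot:
try:
    pyjl_new("JlBase")
except RuntimeError as e:
    print("racy worker:", e)  # → racy worker: access to undefined reference

# Serializing workers guarantees init completes before any construction:
init_wrapper_types()
print("serialized worker ok:", pyjl_new("JlBase") is type_slots["JlBase"])
```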

Fix

Pkg.precompile() reads JULIA_NUM_PRECOMPILE_TASKS (Julia base/precompilation.jl:437) to size its parallel-task semaphore. Setting it to 1 serializes precompile workers and removes the race.

- if: ${{ matrix.group == 'OptimizationSciPy' || matrix.group == 'OptimizationPyCMA' }}
  run: echo "JULIA_NUM_PRECOMPILE_TASKS=1" >> $GITHUB_ENV

Targeted to OptimizationSciPy and OptimizationPyCMA only. Other groups keep their default parallel precompile, so there's no global slowdown. The cost on these small Python-wrapping subpackages is a few seconds.
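A minimal sketch of the mechanism, assuming the same pattern Pkg.precompile() uses (Python here for illustration; everything except the JULIA_NUM_PRECOMPILE_TASKS name is hypothetical): the environment variable sizes a semaphore, and a value of 1 means workers run strictly one at a time.

```python
# Sketch of an env-var-sized parallel-task semaphore, mirroring how
# Pkg.precompile() bounds its workers. Python illustration; only the
# JULIA_NUM_PRECOMPILE_TASKS name comes from the real setup.
import os
import threading

os.environ["JULIA_NUM_PRECOMPILE_TASKS"] = "1"  # the PR's CI setting
ntasks = int(os.environ["JULIA_NUM_PRECOMPILE_TASKS"])
sem = threading.Semaphore(ntasks)  # at most `ntasks` workers at once

active, peak = 0, 0
counter_lock = threading.Lock()

def precompile_worker(pkg):
    """Stand-in for one precompile task; the real work is elided."""
    global active, peak
    with sem:
        with counter_lock:
            active += 1
            peak = max(peak, active)
        with counter_lock:
            active -= 1

threads = [threading.Thread(target=precompile_worker, args=(p,))
           for p in ("OptimizationSciPy", "OptimizationPyCMA", "OptimizationBase")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("peak concurrent workers:", peak)  # → peak concurrent workers: 1
```

With the variable unset (or set higher), the semaphore admits several workers at once, which is exactly the window in which the JlWrap init race can fire.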

Why not retry

A retry-on-failure approach (#1180) would have masked the race instead of removing it, and subsequent failures from unrelated bugs in those groups would also have been silently retried. Per maintainer feedback, JULIA_NUM_PRECOMPILE_TASKS=1 is the more honest, targeted fix.

Test plan

  • JULIA_NUM_PRECOMPILE_TASKS=1 confirmed to be read by Julia 1.11 (base/precompilation.jl:437)
  • YAML validates with yaml.safe_load
  • Conditional if: block scoped to the two python-using groups only
  • Live CI on this PR
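The YAML check from the test plan can be reproduced with a short script (illustrative, using PyYAML's yaml.safe_load as the PR describes; the snippet is the workflow step from above):

```python
# Reproduces the test plan's YAML sanity check with yaml.safe_load.
# The snippet is the PR's workflow step; the surrounding script is illustrative.
import yaml  # PyYAML

SNIPPET = """
- if: ${{ matrix.group == 'OptimizationSciPy' || matrix.group == 'OptimizationPyCMA' }}
  run: echo "JULIA_NUM_PRECOMPILE_TASKS=1" >> $GITHUB_ENV
"""

steps = yaml.safe_load(SNIPPET)
assert isinstance(steps, list) and len(steps) == 1
assert "OptimizationPyCMA" in steps[0]["if"]
assert "JULIA_NUM_PRECOMPILE_TASKS=1" in steps[0]["run"]
print("workflow snippet parses cleanly")
```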

🤖 Generated with Claude Code

The `OptimizationSciPy` (and occasionally `OptimizationPyCMA`) jobs have
been observed to fail at the precompile stage with

    InitError: UndefRefError: access to undefined reference
     [1] _pyjl_new(...)
         @ PythonCall.JlWrap.Cjl ~/.julia/packages/PythonCall/.../JlWrap/C.jl:24
     ...
    [11] __init__()
         @ PythonCall.JlWrap ~/.julia/packages/PythonCall/.../JlWrap/JlWrap.jl:70
    during initialization of module JlWrap

with the *same* PythonCall version (`0.9.31`) on which other CI runs
succeed (`OptimizationSciPy lts` passes on master run 24144818227 but
the same job fails on PR run 23171032225). The intermittency, identical
package versions, and the failure happening inside PythonCall's own
wrapper-type registration during JlWrap.__init__ all point at a parallel
precompile race: when multiple precompile workers spin up wrapper-type
registration concurrently, one of them observes a not-yet-populated
Python type slot and throws `UndefRefError`.

`Pkg.precompile()` reads `JULIA_NUM_PRECOMPILE_TASKS` (see
base/precompilation.jl:437) to size its parallel-task semaphore. Setting
it to `1` for the python-using jobs serializes precompile workers and
removes the race. The cost on these small subpackages is negligible.

Targeted to OptimizationSciPy and OptimizationPyCMA only — other groups
keep the default parallel precompile.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
@ChrisRackauckas ChrisRackauckas merged commit e46bc53 into SciML:master Apr 10, 2026
58 of 64 checks passed
