Error when reading worker config #78

Description

@affans

Running Distributed version 1.10, I get the following error when calling addprocs().

nested task error: could not parse 9275#172.1.1.1
        Stacktrace:
         [1] error(s::String)
           @ Base ./error.jl:35
         [2] (::SlurmClusterManager.var"#13#18"{SlurmManager, Vector{WorkerConfig}, Condition})()
           @ SlurmClusterManager ./REPL[4]:36

I added a debug statement in launch() to figure out the problem and found this:

┌ Debug: connecting to worker 1 out of 25
└ @ SlurmClusterManager REPL[4]:31
┌ Debug: Worker 1 output: julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:9276#172.1.1.1
└ @ SlurmClusterManager REPL[4]:34
┌ Debug: Worker 1 ready on host 172.1.1.1, port 9276
└ @ SlurmClusterManager REPL[4]:42
┌ Debug: connecting to worker 2 out of 25
└ @ SlurmClusterManager REPL[4]:31
┌ Debug: Worker 2 output: julia_worker:julia_worker:julia_worker:julia_worker:9274#172.1.1.1
└ @ SlurmClusterManager REPL[4]:34
┌ Debug: Worker 2 ready on host 172.1.1.1, port 9274
└ @ SlurmClusterManager REPL[4]:42
┌ Debug: connecting to worker 3 out of 25
└ @ SlurmClusterManager REPL[4]:31
┌ Debug: Worker 3 output: 9275#172.1.1.1
└ @ SlurmClusterManager REPL[4]:34
┌ Error: Error launching Slurm job
│   exception =
│    TaskFailedException
│
│        nested task error: could not parse 9275#172.1.1.1
│        Stacktrace:
│         [1] error(s::String)
│           @ Base ./error.jl:35
│         [2] (::SlurmClusterManager.var"#13#18"{SlurmManager, Vector{WorkerConfig}, Condition})()
│           @ SlurmClusterManager ./REPL[4]:36

The error itself is clear. The readline for worker 3 returns 9275#172.1.1.1, which does not contain the julia_worker: header, so the regex m = match(r".*:(\d*)#(.*) on line 206 fails: the pattern requires a colon before the port. Moreover, the lines for workers 1 and 2 contain repeated headers, like

julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:9276#172.1.1.1

and

julia_worker:julia_worker:julia_worker:julia_worker:9274#172.1.1.1
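To make the failure concrete, here is a minimal sketch that runs the pattern quoted above against the observed lines and compares it with a more tolerant one. It assumes the quoted pattern ends at `(.*)`. The `tolerant` regex and the `parse_worker_line` helper are hypothetical and are not part of SlurmClusterManager or Distributed. This only works around the symptom; it does not explain why the headers end up on the wrong lines.

```julia
# Pattern as quoted in this report (assumed to end at `(.*)`):
# it needs at least one ':' somewhere before the port.
strict = r".*:(\d*)#(.*)"

# Hypothetical tolerant pattern: zero or more headers, then port#host.
tolerant = r"^(?:julia_worker:)*(\d+)#(.+)$"

# Hypothetical helper, for illustration only.
function parse_worker_line(line::AbstractString)
    m = match(tolerant, line)
    m === nothing && error("could not parse $line")
    return (host = String(m.captures[2]), port = parse(Int, m.captures[1]))
end

worker2 = "julia_worker:julia_worker:julia_worker:julia_worker:9274#172.1.1.1"
worker3 = "9275#172.1.1.1"

match(strict, worker2) !== nothing  # true: greedy `.*:` consumes the repeated headers
match(strict, worker3) === nothing  # true: no ':' in the line, hence "could not parse"

parse_worker_line(worker2)  # (host = "172.1.1.1", port = 9274)
parse_worker_line(worker3)  # (host = "172.1.1.1", port = 9275)
```

With the tolerant pattern all three observed lines parse to the expected host and port, but the stray headers would still need an explanation.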

So it seems to me that the prints are out of order. I am not sure why this would be. The code in Distributed is

    print(out, "julia_worker:")  # print header
    print(out, "$(string(LPROC.bind_port))#") # print port
    print(out, LPROC.bind_addr)
    print(out, '\n')
    flush(out)

so I am not sure what is causing the race.
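One possible explanation, which the logs above do not confirm, is that several workers write to the same output stream and the four separate `print` calls from different workers interleave. Under that assumption, a sketch of emitting the announcement in a single write looks like the following. The `announce_line` and `announce` functions are hypothetical and are not Distributed's API. Whether a single write is delivered without interleaving still depends on the stream and the operating system.

```julia
# Hypothetical helper: build the whole announcement as one string.
announce_line(port::Integer, addr::AbstractString) =
    string("julia_worker:", port, '#', addr, '\n')

# Hypothetical replacement for the four separate print calls:
# one write call for the complete line, then flush.
function announce(out::IO, port::Integer, addr::AbstractString)
    write(out, announce_line(port, addr))
    flush(out)
    return nothing
end

buf = IOBuffer()
announce(buf, 9275, "172.1.1.1")
result = String(take!(buf))  # "julia_worker:9275#172.1.1.1\n"
```

If the interleaving disappears with a change like this, that would point at the shared stream rather than at the parsing code.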

Version info:

julia> versioninfo()
Julia Version 1.10.3
Commit 0b4590a5507 (2024-04-30 10:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, broadwell)
Threads: 32 default, 0 interactive, 16 GC (on 32 virtual cores)
Environment:
  LD_LIBRARY_PATH = /cm/shared/apps/slurm/16.05.8/lib64/slurm:/cm/shared/apps/slurm/16.05.8/lib64:/cm/shared/apps/openmpi/gcc/64/1.10.1/lib64
  JULIA_NUM_THREADS = 32
  LD_RUN_PATH = /cm/shared/apps/openmpi/gcc/64/1.10.1/lib64

julia> Distributed.VERSION
v"1.10.3"
