Using sbatch does not work well with Julia JIT #77

Description

@affans

I recently came across this package now that ClusterManagers is throwing a deprecation warning for Slurm. The new package relies on sbatch, which lets SLURM handle the resource allocation, and Julia then spawns worker processes on top of it (i.e., sbatch -> srun). While this method works well, the workflow differs from ClusterManagers, and I don't think it suits Julia's workflow, especially for prototyping and interactive use. Let me illustrate.

In the old method, my workflow was like this:

using ClusterManagers
using Distributed 
using Revise
addprocs(SlurmManager(10); kwargs...) # ask for 10 tasks; kwargs stands for any extra options

@everywhere includet("model.jl")  # includet for Revise, contains functions for long-running scripts 

function run_simulations(params)  # defined on the head node
   results = pmap(1:100) do x 
      run_long_simulation(params) # runs on the worker processes
   end 
   return results  # an array of outputs from run_long_simulation
end

function process_simulations()   # defined/runs on the head node
# process/plot simulation results
end

This workflow was great. After the initial allocation and loading the script with Revise, I could call run_simulations() and then process_simulations() (the results are now on the head node) to generate summary statistics, plots, and so on. If I needed to change parameters, I could simply call run_simulations(params) with a different set, which takes advantage of the code already compiled on the worker processes. Using Revise also meant I could edit run_long_simulation(), and the next run_simulations() call would pick those changes up (across all workers).
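To make the compiled-code point concrete, here is a minimal sketch. It uses two local workers in place of Slurm workers so it runs anywhere, and run_long_simulation is a hypothetical stand-in for the real function in model.jl. The second call usually returns faster because the workers reuse what they compiled during the first one, though the exact timings depend on the machine.

```julia
using Distributed

# Assumption: two local workers stand in for the Slurm workers.
addprocs(2)

# Hypothetical stand-in for run_long_simulation from model.jl.
@everywhere function run_long_simulation(params)
    return sum(abs2, params .* rand(length(params)))
end

function run_simulations(params; nsims = 100)   # defined on the head process
    # pmap hands each call to a worker and collects the results in order
    return pmap(1:nsims) do _
        run_long_simulation(params)
    end
end

# First call: each worker JIT-compiles run_long_simulation for Vector{Float64}.
t_first = @elapsed results = run_simulations([1.0, 2.0, 3.0])

# Second call: new parameters of the same type reuse the compiled code.
t_second = @elapsed run_simulations([4.0, 5.0, 6.0])

println("first call: ", t_first, " s, second call: ", t_second, " s")

rmprocs(workers())   # release the workers when done
```

With sbatch script.jl, every submission starts fresh Julia processes, so each run pays the first-call cost again.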

The sbatch method does not give you this flexibility. The main issues are

  • The code has to be compiled again every time you run sbatch script.jl.
  • You lose interactive flexibility: you have to save data to files, keep another instance of Julia open for analysis, etc.
  • The project directory needs care, since sbatch runs from a different working directory (yes, there is an environment variable set with the working directory, so it's manageable).
  • The initial execution of the script runs on the allocated resources instead of the head/login node.

This really hurts productivity and workflow, and it does not feel very "Julia"-like.
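For reference, the file round-trip from the second bullet looks roughly like the sketch below. The results vector and the file name results.jls are made up for illustration, and the code falls back to a temporary directory when SLURM_SUBMIT_DIR is not set so that it also runs outside a job.

```julia
using Serialization

# Hypothetical results, standing in for the output of run_simulations(params).
results = [rand(3) for _ in 1:100]

# In the batch job: write the results to disk before the job exits.
# SLURM_SUBMIT_DIR is set by Slurm inside a job; the fallback is an assumption.
outdir = get(ENV, "SLURM_SUBMIT_DIR", mktempdir())
outfile = joinpath(outdir, "results.jls")
serialize(outfile, results)

# Later, in a separate interactive Julia session: read them back for analysis.
loaded = deserialize(outfile)
println("loaded ", length(loaded), " simulation results from ", outfile)
```

In the interactive workflow this step disappears, because pmap returns the results directly into the session where the analysis happens.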

Alternative / Going back to the old method
I found a workaround to replicate the above behaviour without using sbatch. From the terminal,

(base) odinuser02@podin:~$ salloc -N 2 --ntasks-per-node=10 bash
salloc: Granted job allocation 495
(base) odinuser02@podin:~$

This drops me into an interactive session (on the head node). Now I launch Julia:

julia> using Distributed, SlurmClusterManager

julia> addprocs(SlurmManager(),
                exeflags="--project=$(ENV["SLURM_SUBMIT_DIR"])")
20-element Vector{Int64}:
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21

This lets me work interactively, the working directory and project are easy to handle (i.e., julia --project=. sets the correct project), I can use Revise, and I can prototype my model. Once my pmap returns data, I can use plotting libraries (which on my cluster are only on the head node).

julia> @everywhere println("hello from $(myid()):$(gethostname())")
hello from 1:podin
hello from 4:ops03
hello from 9:ops03
hello from 14:opsc01
hello from 6:ops03
hello from 2:ops03
hello from 10:ops03
hello from 7:ops03
hello from 21:ops03
hello from 5:ops03
hello from 11:opsc01
hello from 8:ops03

I am mainly opening this issue for awareness and to provide a method that replicates the old workflow. I think an example of using salloc in the README would be useful for a lot of folks.
