Skip to content

Add support for DistributedNext (and add DistributedNext as a weakdep)#80

Draft
DilumAluthge wants to merge 2 commits into
masterfrom
dpa/distributednext
Draft

Add support for DistributedNext (and add DistributedNext as a weakdep)#80
DilumAluthge wants to merge 2 commits into
masterfrom
dpa/distributednext

Conversation

@DilumAluthge

Copy link
Copy Markdown
Member

No description provided.

@codecov

codecov Bot commented Sep 5, 2025

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.83%. Comparing base (11d491c) to head (5c361a8).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
ext/SlurmClusterManagerDistributedNextExt.jl 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #80      +/-   ##
==========================================
- Coverage   93.61%   91.83%   -1.79%     
==========================================
  Files           2        3       +1     
  Lines          94       98       +4     
==========================================
+ Hits           88       90       +2     
- Misses          6        8       +2     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@DilumAluthge

Copy link
Copy Markdown
Member Author

@JamesWrigley @jpsamaroo Does this look correct?

@DilumAluthge

Copy link
Copy Markdown
Member Author

Hmmm. Is this going to be sufficient?

Because, if I look here:

_srun_cmd_without_env = `srun -D $exehome $exename $exeflags --worker`

I see that SlurmClusterManager is starting (under srun) each worker with julia --worker. But isn't julia --worker hardcoded to always use Distributed.jl?

@JamesWrigley

Copy link
Copy Markdown

Yeah that won't work, for DistributedNext you'd want to call get_worker_arg() and interpolate that into _srun_cmd_without_env: https://github.com/JuliaParallel/DistributedNext.jl/blob/7023c603014d0e78defedd2a3d753061faeabf32/src/managers.jl#L245

I might also suggest putting Distributed support in a package extension too, that way people only need to load one or the other packages.

@DilumAluthge

Copy link
Copy Markdown
Member Author

I might also suggest putting Distributed support in a package extension too, that way people only need to load one or the other packages.

Hmmm. Unfortunately, the SlurmClusterManager.SlurmManager struct subtypes Distributed.ClusterManager. It might be breaking to change that.

@DilumAluthge

Copy link
Copy Markdown
Member Author

Yeah that won't work, for DistributedNext you'd want to call get_worker_arg() and interpolate that into _srun_cmd_without_env: https://github.com/JuliaParallel/DistributedNext.jl/blob/7023c603014d0e78defedd2a3d753061faeabf32/src/managers.jl#L245

If we're going to need to make a change anyway, would it be worth exposing a public interface to the user, that provides a general interface for the user to specify the worker arg? E.g. something like #74?

My motivation for #74 was that I wanted to be able to start workers with julia -e 'import DilumsPackage; DilumsPackage.entrypoint()' (instead of julia --worker). It seems that DistributedNext would be a special case of this more general feature?

@JamesWrigley

Copy link
Copy Markdown

Hmmm. Unfortunately, the SlurmClusterManager.SlurmManager struct subtypes Distributed.ClusterManager. It might be breaking to change that.

Oh I forgot that's a requirement of implementing the cluster interface. But then won't it be necessary to subtype DistributedNext.ClusterManager? 🤔 Need to think about a clean way to do that...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants