Skip to content

Add multi host and multi user support to the process manager#941

Merged
PawelPlesniak merged 24 commits into
PawelPlesniakEmuhammad/SplitShellFixingfrom
emmuhamm/fix-lcs
Jun 19, 2026
Merged

Add multi host and multi user support to the process manager#941
PawelPlesniak merged 24 commits into
PawelPlesniakEmuhammad/SplitShellFixingfrom
emmuhamm/fix-lcs

Conversation

@emmuhamm

@emmuhamm emmuhamm commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Description

Fixes #926
Fixes #937

  • Blocks boot behaviour in unified shell if another session (with the same name) is already booted

  • Asks for confirmation when booting in the process manager shell if user tries to spawn a session with the same name as an existing session

  • Multi host support works by mapping localhost instances to an actual host

    • better handling with via localhost remapping in the session_dal
    • Also a remapping in the controller code
  • Multi user support works

  • extra connectivity service is ready check in the controller

  • new touch_and_chmod behaviour to enable correct permissions for read and write

    • This is done for the info.json that drunc usually spits out
  • deletes the logfile in case there was a previous iteration of it at boot now

  • checks for working directory permissions before running

  • BIG NOTE PERMISSIONS SET TO RWX FOR EVERYONE (777)!!! THIS MUST BE ACCEPTABLE

  • all this works for localhost, ehn1, and the localhost-pmaas config (where logs go to /logs)

Note about multi user support

It works!

However it has to run at a very specific level of permissions. To recap:

  • the user running the Process Manager server is deemed the 'superuser'
    • any new processes spawned by the process manager will be made in the name of the superuser
    • so this superuser will own the relevant processes
  • Any user can connect to the process manager via process-manager-shell or unified-shell

Because of the above,

  • Process Manager Server must be run in a folder that the user has rw(x?) permissions for
    • example: emmuhamm in his own home directory
  • Unified Shell / Process Manager shell must be run in a folder that both the superuser and the user running the US/PMS has rw(x?) permissios for
    • eg in an /nfs/sw/dunedaq.. folder that has 777 permissions on it

Type of change

  • Optimization
  • Bug fix
  • New feature

List of required branches from other repositories

DUNE-DAQ/daqsystemtest#322

Suggested manual testing checklist

built off 260603 nightly

Some relevant commands include

drunc-process-manager ssh-standalone 50000

drunc-unified-shell -o grpc://np04-srv-028:50000 config/daqsystemtest/example-configs.data.xml ehn1-local-1x1-config emir-test-ehn1 start-run --run-number 1 wait 20 
drunc-unified-shell -o grpc://np04-srv-028:50000 config/daqsystemtest/example-configs.data.xml local-1x1-config emir-test-local  start-run --run-number 1 wait 20
drunc-unified-shell -o grpc://np04-srv-028:50000 config/daqsystemtest/example-configs.data.xml local-1x1-config-pmaas emir-test-pmaas  start-run --run-number 1 wait 20


boot config/daqsystemtest/example-configs.data.xml ehn1-local-1x1-config emir-test-ehn3
boot config/daqsystemtest/example-configs.data.xml local-1x1-config emir-test-local-2 
boot config/daqsystemtest/example-configs.data.xml local-1x1-config-pmaas emir-test-pmaas-1

The list of steps (vague, sorry) are

  • clone the relevant branches on the relevant nightly
  • make sure you have super big permissions
  • using one user, start up the process manager
  • do the same cloning and permissions with the other user
  • using the other user, run
    • the unified shell with the three different configs
    • the pm shell and boot the three different things
  • check if having more sessions work with the port mapping
  • if you dont have the right permissions, check if the error message is clear enough and explained what the problem is

Developer checklist

Prior to marking this as "Ready for Review"

Tests ran on: NP04 - 028 from release 03 June 2026 nightly

Unit tests - some tests can't be ran on the CI. This is documented. If this PR checks a feature that can't be tested with CI, this has been marked appropriately.

Integration tests - the daqsystemtest_integtest_bundle requires a lot of resources, and connections to the EHN1 infrastructure. Check the cross referenced list if you can't run these. The developer needs to run at least the .

  • Unit tests (pytest --marker) passed
    • With relevant marker
    • Without marker
  • Integration tests passed
    • Only daqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.py
    • Full daqsystemtest_integtest_bundle.sh
  • Testing skipped as there are no core code changes in this PR, this only relates to documentation/CI workflows
  • Drunc integration tests pass (./scripts/drunc_integtest_bundle.sh)

Final checklist prior to marking this as "Ready for Review"

  • Code is clearly commented.
  • New unit tests have been added, or is documented in # ISSUE NUMBER
  • A suitable reviewer has been chosen from this list.

Reviewer checklist

  • This branch has been rebased with develop prior to testing.
  • Suggested manual tests show changes.
  • CI workflows fails documented (if present)
  • Integration tests passed (on either np0x or IC HEP clusters)
    • Use the following guidelines to determine which of the integration tests you need to run
      • You do not need to run any integration tests if
        • Code changes are not associated with src/
        • PR changes only affect docstrings
        • In this case, be sure to validate any suggested manual testing.
      • Run only the minimum integration test as daqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.py if
        • PR changes only affect a few log entries
        • PR changes are small, and do not have a large impact on the workflow (use carefully)
      • Otherwise run the full integration test bundle as daqsystemtest_integtest_bundle.sh
    • What to do if the integration tests fail?
      • Only concern yourself if failures related to drunc are in the log files
      • If non-drunc failure appears:
        • Validate failure in fresh working area
        • Contact Pawel if unsure
  • If you have ran the full integration test bundle, leave a comment on the PR stating
    • Which host the integration tests have ran on
    • [Optional] A copy of the test summary
  • Drunc integration tests pass (scripts/drunc_integtest_bundle.sh)

Once the above boxes are checked, the PR(s) can be merged following the steps below.

Prior to merging

Choose one of the following an complete all substeps
  • Changes only affect the Run Control, are in a single repository, and do not affect the end user.
    • Changes are documented in docstrings and code comments
    • Wiki has been updated if architectural or endpoint changes
  • Otherwise
    • Workflow changes demonstrated in the Change Log (if necessary)
    • Wiki has been updated (if necessary)
    • #dunedaq-integration Slack channel notified (see below)

Once completed, the reviewer can merge the PR.

Notification message for a Slack channel

Note - this should be to #dunedaq-integration for general workflow that isn't during a release candidate period, and to #daq-release-prep otherwise.

For an single merge that changes the user workflow

The CCM WG has an isolated PR ready to merge that affects user workflows. The PR is:

_URL_

I will leave time for any comments, otherwise will merge these at the end of the work day _Insert your time zone_.

For co-ordinated merge

The CCM WG has a set of co-ordinated merges ready to merge. The PRs are:

_URL_

_URL_


I will leave time for any comments, otherwise will merge these at the end of the day.

@emmuhamm emmuhamm marked this pull request as draft June 10, 2026 08:24
@emmuhamm emmuhamm self-assigned this Jun 10, 2026
@emmuhamm emmuhamm changed the base branch from develop to emmuhamm/fix-minor-split-shell-issues June 10, 2026 08:28
@emmuhamm emmuhamm force-pushed the emmuhamm/fix-lcs branch 3 times, most recently from e4cfb16 to d452d58 Compare June 12, 2026 14:17
@emmuhamm

emmuhamm commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

so most things work. things to do now is to ensure that the -pmaas config actually works independently in the daqsystemtest repo

and then later we can investigate if everything works for both ehn1 and local, using /log and not /log

well the pmaas thing worked a treat

@emmuhamm emmuhamm force-pushed the emmuhamm/fix-lcs branch 3 times, most recently from c1bc926 to 3c980ac Compare June 15, 2026 14:51
Comment thread src/drunc/process_manager/process_manager_driver.py Outdated
@emmuhamm emmuhamm mentioned this pull request Jun 17, 2026
22 tasks
@emmuhamm emmuhamm added this to the fddaq-v5.7.0 milestone Jun 17, 2026
@emmuhamm emmuhamm marked this pull request as ready for review June 17, 2026 13:23
@emmuhamm emmuhamm changed the title fix lcs wip Add multi host and multi user support to the process manager Jun 17, 2026
@emmuhamm emmuhamm requested a review from PawelPlesniak June 17, 2026 13:29
@emmuhamm

Copy link
Copy Markdown
Contributor Author

Hello hello @PawelPlesniak this is the big boi now. Ready for review to get the multi host multi user support in!

Comment thread src/drunc/controller/controller.py Outdated
# Remaps the localhost into the correct server
# and also grabs the correct port from the right environment from the config

connection_server_config = (

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the host name not the config, rename this var to be clearer.

if item.className() == "Variable":
if item.name == "CONNECTION_PORT":
item.value = new_port
def change_localhost(self, session_dal):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call this resolve_localhost for clarity

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also a resolve_localhost_to_hostname function, could these work in parallel? Seems like there are multiple functions to achieve similar functionality.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved_localhost does use the resolve_localhost_to_hostname function. We simply wrap it in dal_localhost_mapping so it can get used in changing the dal (as defined in the resolve_localhost function cuz I think it needs a little bit of additional processing.

There is a case to be had to basically upgrade the existing resolve_localhost_to_hostname so that it now uses the dal_localhost_mapping code, but I haven't investigated the use of resolve_localhost_to_hostname elsewhere in the codebase and I'd like to not make any changes to resolve_localhost_to_hostname for the time being.

if item.className() == "Variable":
if item.name == "CONNECTION_PORT":
item.value = new_port
def change_localhost(self, session_dal):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also a resolve_localhost_to_hostname function, could these work in parallel? Seems like there are multiple functions to achieve similar functionality.

Comment thread src/drunc/process_manager/ssh_process_manager.py Outdated
Comment thread src/drunc/utils/utils.py
@PawelPlesniak

Copy link
Copy Markdown
Collaborator

Sorry for the split of revised changes to comments, that is an error on my behalf

@PawelPlesniak

Copy link
Copy Markdown
Collaborator

Other notes

Blocks boot behaviour in unified shell if another session (with the same name) is already booted

This should be docstringed, with a statement that the run control will validate this in the sesssion manager in the future.

extra connectivity service is ready check in the controller

What was the motivation for this?

@PawelPlesniak

Copy link
Copy Markdown
Collaborator

Once these are all done, I will test with np04daq and myself

@emmuhamm

Copy link
Copy Markdown
Contributor Author

Other notes

Blocks boot behaviour in unified shell if another session (with the same name) is already booted

This should be docstringed, with a statement that the run control will validate this in the sesssion manager in the future.

What do you mean by docstrings? I will add some comments on the code saying that the run control will validate this in the session manager in the future, but the blocking behaviour is pretty obvious from the code itself.

extra connectivity service is ready check in the controller

What was the motivation for this?

During the localhost mapping changes, I noticed that if the ports arent mapped properly drunc kinda just gets stuck until a 2 minute or so timeout engages. Having this extra check immediately checks if the controller can actually connect to the connectivity service in less than 10 seconds, and will fail quickly if said connectivity service cant be reached for whatever reason. Can't think of a good reason to not have that (even after I hopefully fixed the LCS so I've kept it in)

@emmuhamm

Copy link
Copy Markdown
Contributor Author

Hi @PawelPlesniak, I think I've addressed most of your comments, either with a code change or a reply on the thread. Can you go over it and let me know what you think?

@PawelPlesniak

Copy link
Copy Markdown
Collaborator

Other notes

Blocks boot behaviour in unified shell if another session (with the same name) is already booted

This should be docstringed, with a statement that the run control will validate this in the sesssion manager in the future.

What do you mean by docstrings? I will add some comments on the code saying that the run control will validate this in the session manager in the future, but the blocking behaviour is pretty obvious from the code itself.

This is sufficiently clear. I meant to include a note, this absolutely should exist for now, but in the future we will want to move this to the session manager.

extra connectivity service is ready check in the controller

What was the motivation for this?

During the localhost mapping changes, I noticed that if the ports arent mapped properly drunc kinda just gets stuck until a 2 minute or so timeout engages. Having this extra check immediately checks if the controller can actually connect to the connectivity service in less than 10 seconds, and will fail quickly if said connectivity service cant be reached for whatever reason. Can't think of a good reason to not have that (even after I hopefully fixed the LCS so I've kept it in)

Thanks

@PawelPlesniak

PawelPlesniak commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Pytest failures are associated with changes in develop which are not accounted for in the working branch. As this is going into an intermediary branch, which will be used for development and testing prior to merging, this failure will be ignored for now and addressed once the branch that goes into develop will be ready. The same is true for the integration tests.

@PawelPlesniak

Copy link
Copy Markdown
Collaborator

This works as currently intended. Testing was completed as follows

  • np04daq running the process manager server on 028
  • pplesnia running the unified shell on 029

Testing with local-1x1-config successful from an area with the permissions changed.
Testing with local-1x1-config-pmaas succesful from everywhere.

The location for running the process manager is also dependent, as expeced (permissions)

Follow up notes

Blocks boot behaviour in unified shell if another session (with the same name) is already booted

Validated

Asks for confirmation when booting in the process manager shell if user tries to spawn a session with the same name as an existing session

Validated

Multi host support works by mapping localhost instances to an actual host

Validated

better handling with via localhost remapping in the session_dal

Validated

Also a remapping in the controller code

Validated

Multi user support works

Validated

extra connectivity service is ready check in the controller

Validated

new touch_and_chmod behaviour to enable correct permissions for read and write

Validated

This is done for the info.json that drunc usually spits out deletes the logfile in case there was a previous iteration of it at boot now

Validated

checks for working directory permissions before running

Validated

BIG NOTE PERMISSIONS SET TO RWX FOR EVERYONE (777)!!! THIS MUST BE ACCEPTABLE

It is not, but thats a later problem ;)

all this works for localhost, ehn1, and the localhost-pmaas config (where logs go to /logs)

Validated

Nice work!

@PawelPlesniak PawelPlesniak merged commit 9a8d141 into PawelPlesniakEmuhammad/SplitShellFixing Jun 19, 2026
2 of 4 checks passed
@PawelPlesniak PawelPlesniak deleted the emmuhamm/fix-lcs branch June 19, 2026 15:22
emmuhamm pushed a commit that referenced this pull request Jun 22, 2026
Add multi host and multi user support to the process manager
emmuhamm pushed a commit that referenced this pull request Jun 23, 2026
Add multi host and multi user support to the process manager
emmuhamm pushed a commit that referenced this pull request Jun 23, 2026
Add multi host and multi user support to the process manager
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Local Connectivity Service doesn't work for local-x-config Minor issues from PMaaS testing (20260515)

2 participants