To effectively debug or prototype a SCHISM run (or any MPI-based Azure Batch task), use an interactive "dummy wait" strategy. This lets you SSH into the exact compute node where the task is deployed to troubleshoot environment issues or test command sequences in real time.
The first step in prototyping is to submit your Batch job using a modified YAML configuration or command. Instead of executing the actual SCHISM run immediately, replace the primary command with a wait command.
- Purpose: This keeps the task in a "Running" state indefinitely, preventing Azure Batch from terminating the task and shutting down the compute node.
- Cost Awareness: Remember that while the node is "waiting," you are being charged for the compute time (e.g., approximately $4.00/hour for an H-series node).
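For example, if your job is defined in YAML, the substitution might look like the fragment below. The field names are illustrative and depend on your job template, and the 10-hour sleep is an arbitrary ceiling; adjust both to your setup.

```yaml
# Hypothetical task definition: the real SCHISM command is commented out
# and replaced with a long sleep so the node stays alive for SSH access.
tasks:
  - id: schism-debug
    # command_line: bash application_command.sh    # original run command
    command_line: /bin/bash -c "sleep 36000"       # hold the node ~10 hours
```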
Once the node is in the "Running" state, you must create a user account to log in via SSH.
- Navigate to the Nodes view in the Azure Portal or Azure Batch Explorer.
- Select the specific node and click Add User.
- Configuration: Specify a username and password. It is highly recommended to grant the user Administrative (sudo) privileges so you can manage root-level directories or install missing dependencies if needed.
- Expiry: You can set an expiry time for the user (default is often 24 hours), after which the credentials will lapse.
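If you prefer the CLI to the Portal, the same user can be created with `az batch node user create`. The pool ID, node ID, password, and expiry below are placeholders; look up the real node ID in the Nodes view.

```shell
# Create a temporary admin (sudo) user on a specific Batch node.
# Substitute your own pool ID, node ID, password, and expiry timestamp.
az batch node user create \
  --pool-id schism-pool \
  --node-id tvmps_0000000000 \
  --name debuguser \
  --password 'ChangeMe!Passw0rd' \
  --is-admin \
  --expiry-time "2024-01-02T00:00:00Z"   # credentials lapse after this point
```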
Visual Studio Code is the preferred tool for interacting with the node because it allows you to edit files and run terminals directly on the remote Linux environment.
- Host Information: Obtain the Public IP address of the node from the Azure Portal.
- SSH Configuration: Add a new entry to your VS Code SSH configuration file:
- Host Name: The node's IP address.
- User: The username you just created.
- Connection: Open a new window and connect to the host. When prompted, select Linux as the platform and enter the password you specified earlier.
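The resulting entry in your SSH configuration file (typically `~/.ssh/config`) might look like the sketch below. The host alias, IP address, and username are placeholders; note that some Batch pool configurations expose SSH through a shared IP on a non-default port, so check the node's connection information if port 22 does not work.

```
Host batch-debug-node
    HostName 20.0.0.1        # node's public IP from the Azure Portal
    User debuguser           # the user created in the previous step
    Port 22                  # may be a mapped port; check the node's connection info
```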
Once logged in, navigate to the task working directory, typically located under:
/mnt/batch/tasks/workitems/[job_id]/[task_id]/wd
To replicate the automated Azure Batch environment and setup, execute the following scripts in order:
- Source env_vars.sh: This is a critical first step. Sourcing this file loads all necessary environment variables, module paths, and application settings required for SCHISM and MPI:
source env_vars.sh
- Run coordination_command.sh: This script repeats the multi-node setup steps, such as configuring the host file for MPI and ensuring all nodes in the cluster are synchronized:
bash coordination_command.sh
- Run application_command.sh: This script contains the actual execution logic. By running or modifying it, you can manually trigger azcopy commands to bring in data or test the mpirun command with specific flags:
bash application_command.sh
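Put together, a typical debugging session on the node looks like the following sketch. The job and task IDs are placeholders for your own values.

```shell
# Move into the task's working directory (substitute your job and task IDs).
cd /mnt/batch/tasks/workitems/<job_id>/<task_id>/wd

# 1. Load environment variables, module paths, and SCHISM/MPI settings.
source env_vars.sh

# 2. Repeat the multi-node setup (MPI host file, node synchronization).
bash coordination_command.sh

# 3. Run or adapt the execution logic (azcopy data pulls, mpirun tests).
bash application_command.sh
```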
Because the "wait" process keeps the node active, the pool will not automatically scale down. Once your prototyping is complete, you must manually delete the job or resize the pool to zero to stop the billing.
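Both teardown steps can be done from the Azure CLI; the job and pool IDs below are placeholders for your own.

```shell
# Delete the job so the "wait" task stops holding the node...
az batch job delete --job-id schism-debug-job --yes

# ...and/or resize the pool to zero dedicated nodes to stop billing.
az batch pool resize --pool-id schism-pool --target-dedicated-nodes 0
```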