UFS Coastal App and spack-stack by patrick-tripp · Pull Request #150 · ioos/Cloud-Sandbox

patrick-tripp · 2026-04-21T20:41:09Z

This adds spack-stack setup for ufs and the ufs-coastal-app. UFS coastal regression tests run using cloudflow.

…ces Not Currently Available in Region (#125)
* Update CLOUD_SANDBOX_MODEL_INTEGRATION_TEMPLATE.md with Minna Ho's suggestions
* Update workflow_main.py
* Include OWP and NOS model suites as eligible job types for the workflow
* Delete cloudflow/workflows/workflow_template.py
* Eliminate original workflow template method for executing Cloudflow using the model templates provided for eligible OWP and NOS models. This will conform to the new updated instructions for integrating a new model template to Cloudflow.
* Modify SCHISM OWP cluster configuration file to reflect updated changes to use hpc node instance suite
* Update AWS Cluster start function to include user or default options to sleep and retry to inquire AWS resources if the server responds as InsufficentCapacity for the AWS region listed on the cluster configuration file
* Update the default SCHISM job configuration file to include the new options NUM_TRIES and JOB_DELAY to utilize the new cloudflow method for waiting to obtain AWS resources based on the user patience to obtain them
* Modify CLOUD_SANDBOX_MODEL_INTEGRATION_TEMPLATE.md to fix small syntax error with replacing the workflow_main.py Python executable with user specified pathway
* Modify the template_launcher.sh shell script in order to remove MPI variables predefined in the launcher script and move them to the model launcher scripts. Also, schism is now predefined to partition the number of tasks associated with each node instance in order to optimize memory resources between OpenMP and MPI protocols for slave ranks. This is required in order for SCHISM to properly work on the hpc node instance suite for large SCHISM meshes
* Modify all OWP, NOS, and default model template shell launcher scripts to use the latest spack modules on the NOS Cloud Sandbox that has the newer intel libraries eligible to run on the hpc7a node instances
* Add on new AWS Cluster cluster configuration options vm_retry_delay and vm_max_retries to workflow. Include as well an exponential backoff method in the job delay over each retry iteration. Update schism.ioos configuration template options to include new delay options for users.
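
The retry behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in cloudflow: `start_with_backoff`, `InsufficientCapacityError`, and the injectable `sleep` parameter are hypothetical names, and doubling the delay on each retry is an assumption, since the commit message only says the delay backs off exponentially. Only `vm_retry_delay` and `vm_max_retries` are taken from the commit text.

```python
import time
from typing import Callable


class InsufficientCapacityError(Exception):
    """Hypothetical stand-in for a provider 'no capacity in region' error."""


def start_with_backoff(
    start: Callable[[], str],
    vm_max_retries: int = 5,
    vm_retry_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call start() and retry on capacity errors, doubling the wait each time.

    Makes at most vm_max_retries + 1 attempts, then re-raises the last error.
    """
    delay = vm_retry_delay
    for attempt in range(vm_max_retries + 1):
        try:
            return start()
        except InsufficientCapacityError:
            if attempt == vm_max_retries:
                raise  # out of retries: let the caller handle the failure
            sleep(delay)
            delay *= 2  # assumed backoff factor; the source does not state it
    raise RuntimeError("unreachable")  # loop always returns or raises
```

Passing `sleep` as a parameter is a design choice for the sketch: it lets the retry schedule be checked without actually waiting.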

* minor notes, nosofs
* minor fix for ucla_roms
* improved add-user.py
* nceplibs spack install, mkl oneapi, cleanup
* all roms and fvcom nosofs forecast tested and run
* cleanup and doc, fixed nosofs prep builds
* readme and setup updates for nosofs
* nosofs setup and build, clean end to end
* fixed a bug with the setup_nosofs.sh

* Update CLOUD_SANDBOX_MODEL_INTEGRATION_TEMPLATE.md with Minna Ho's suggestions
* Update workflow_main.py
* Include OWP and NOS model suites as eligible job types for the workflow template method. This method will allow users to just simply plug and play their models into Cloudflow by just providing the model directory, model executable, and any model runtime dependencies.
* Delete cloudflow/workflows/workflow_template.py
* Eliminate original workflow template method for executing Cloudflow using the model templates provided for eligible OWP and NOS models. This will conform to the new updated instructions for integrating a new model template to Cloudflow.
* Modify SCHISM OWP cluster configuration file to reflect updated changes to use hpc node instance suite
* Update AWS Cluster start function to include user or default options to sleep and retry to inquire AWS resources if the server responds as InsufficentCapacity for the AWS region listed on the cluster configuration file
* Update the default SCHISM job configuration file to include the new options NUM_TRIES and JOB_DELAY to utilize the new cloudflow method for waiting to obtain AWS resources based on the user patience to obtain them
* Modify the SCHISM Template expected user job options to include NUM_TRIES and JOB_DELAY to utilize the new cloudflow method for waiting to obtain AWS resources based on the user patience to obtain them
* Modify CLOUD_SANDBOX_MODEL_INTEGRATION_TEMPLATE.md to fix small syntax error with replacing the workflow_main.py Python executable with user specified pathway
* Replace the syntax across the flows ctasks.cluster_start function to include the job class to be ingested into it so it can extract user arguments NUM_TRIES and JOB_DELAY if available
* Replace the cluster_tasks.py start function to include the job class to be ingested into it so it can extract user arguments NUM_TRIES and JOB_DELAY if available
* Modify the template_launcher.sh shell script in order to remove MPI variables predefined in the launcher script and move them to the model launcher scripts. Also, schism is now predefined to partition the number of tasks associated with each node instance in order to optimize memory resources between OpenMP and MPI protocols for slave ranks. This is required in order for SCHISM to properly work on the hpc node instance suite for large SCHISM meshes
* Modify all OWP, NOS, and default model template shell launcher scripts to use the latest spack modules on the NOS Cloud Sandbox that has the newer intel libraries eligible to run on the hpc7a node instances
* Remove previous job configuration workflow options for job delay method. Add on new AWS Cluster cluster configuration options vm_retry_delay and vm_max_retries to workflow in replacement of older job configuration options NUM_DELAY and NUM_RETRIES. Include as well an exponential backoff method in the job delay over each retry iteration. Update schism.ioos configuration template options to include new delay options for users.
* Removing renaming template job configuration files to basic, adding the new required MODEL input variable to all configuration files and moving all job configuration files to their own respective organization.
* Renaming the template classes to basic classes. Also adding on the MODEL input variable as a new standard input from the job configuration file to all model classes. Reorganized the entire JobFactory.py work to separate workflows based on the high-level model class.
* Renaming the template launcher scripts to basic launcher scripts. Restructured the entire cloudflow main workflow (tasks.py) to organize the model execution based on the jobtype variable job configuration input instead of OFS. The OFS dependency is removed entirely and is instead a supplementary variable for NOS/OFS workflows.
* Fix code bug in if/else statement within JobFactory.py revisions
* Fix MODEL class if/else bug in JobFactory.py
* Update documentation in CLOUD_SANDBOX_MODEL_INTEGRATION_TEMPLATE.md
* Reflect new changes in job class dependencies, "basic" model naming conventions, and organization of configuration files based on user affiliation.
* Replacing OFS with APP as a job configuration variable dependency in the Job classes and configuration files
* Replacing OFS variable dependency with the APP job class variable for Python notebooks
* Replaced OFS in the Job Class Python scripts, updated documentation, and replaced the Basic with Experiment job classes
* Restructured the cloudflow workflow to replace the OFS naming convention with APP. Also renamed the basic implementation to experiment, with the addition of APP for the basic run method for each model class
* Update CLOUD_SANDBOX_MODEL_INTEGRATION_TEMPLATE.md
* Adjust documentation here based on explicit changes from "basic" to "experiment" for onboarding new models.
* Update Model_Experiment_Template.py
* Update the class name to reflect the same Python script name
* Rename WRF_Hydro_Basic.py to WRF_Hydro_Experiment.py
* Updated tasks.py
* Fixed bugs in the if/else blocks for the experiment_run function
* Update ADCIRCForecast.py
* Update fcst_launcher.sh script to reflect the APP dependency and reverted from JOBTYPE
* Update APP dependency in simple_launcher.sh script to reflect the change from JOBTYPE to APP.

* eccofs cold, fcst, and continue
* eccofs package, cleanup, and README
* end to end eccofs setup and run
* fixes/cleanup after merge, retested eccofs romsforecast workflow
* removed unneeded file

* Update CLOUD_SANDBOX_MODEL_INTEGRATION_TEMPLATE.md
* Update documentation to include how to add optional arguments to the JobFactory Python class for a new or existing model class.
* Adding Python experiment workflows to cloudflow. This includes basic, dask, and mpi implementation for users. Provide an example Python script that uses basic and dask implementation as a template for users. Also refine ucla-roms workflow to conform to new jobtype and application standards in cloudflow.
* Delete cloudflow/job/jobs/MODEL_EXPERIMENTS/python_basic.experiment
* Default the job type application to be basic for users if they do not specify it in the job configuration file. Also remove ucla_roms.py job class since it's been merged with the ROMS_Experiment class
* Add cluster/configs Experiment directory and remove OWP directory. Essentially we will have default cluster configuration files for users matching the Experiment naming convention logic moving forward.
* Update dflowfm.ioos and add delay and retries to experiment file
* Update adcirc.ioos and add retries and delay to experiment file
* Update JobFactory.py reformat if/else block to not use () within logical blocks with a single criteria - PEP 8
* Update PYTHON_Experiment.py fix syntax bug for default environment Python executable and update it to conventional python3 syntax

* Update python_basic_run.sh to ensure users cannot attempt to run on multiple node instances if specified in the cluster configuration file. Python basic implementation can only run on a single node instance.

* Cluster has single readConfig function in base class
* prefect upgrade - fcst flow works
* added ssm agent install, policy permissions
* old prefect signals removed
* eccofs fcst flow tested
* LiveOcean cleanup
* liveocean hindcast_multi flow tested
* moved NOSOFS model readme to models/nosofs folder
* fixed AWSCluster typo, froze current prefect 3.6.8 and boto 1.40.22 in funcs-setup-instance.sh

Model workflows run successfully. Dask python workflows will be updated/tested in a future PR.

* using dynamoDB table and Lambda for zombies
* streamline secofs setup
* secofs clean setup and testing, ready for NOS
* re-add return to CURHOME
* minor updates: zombie_lambda, schism setup
* Update cloudflow/workflows/flows.py Co-authored-by: Jason Ducker <81377226+jduckerOWP@users.noreply.github.com>
* Update cloudflow/workflows/flows.py Co-authored-by: Jason Ducker <81377226+jduckerOWP@users.noreply.github.com>
* added a separate lambda for collecting db records, still need to add email, html/s3 support
* added s3 support, html template, and write variables out to html, should generate a human-readable report on run
* minor revisions
* add human time to DB
* Minor syntax error
* enforce DynamoDB table, use Prefect flow name in Name tag, improve Exceptions, start adding prefect server to deployment, added PREFECT readme, docs link fix, secofs testing/benchmarking
* minor cleanup
* Format PREFECTv3.md
* Update PREFECTv3.md

---------

Co-authored-by: Jason Ducker <81377226+jduckerOWP@users.noreply.github.com>
Co-authored-by: Zachary Wills <zwills@lynker.com>

* Remove initial python dask job config and add two new python dask job configs that demonstrate unique dask job applications
* Modify python job config to reflect new python example script to use
* Remove old python_dask_example.py script and add new python_examples.py script that users can use as an example to execute a basic Python script or different Python dask experiments
* Update Python job class to include extra argument, update documentation
* Add entire new functionality for starting and closing the dask scheduler and workers that is the modernized way to implement dask functionality and conforms with Prefect3 standards
* Update Python dask flow to reflect newly developed workflow and ensure try/catch statements are present in order to prevent Prefect3 zombie jobs
* Include new methods for Python runs and Intel MPI runs to try to catch compilation errors or MPI failure errors and kill the run so Prefect3 can shut down the ec2 instances instead of incidental zombie processes occurring.
* Insert new dask data and task parallelism workflows into tasks.py, which reflects the steps users must take to implement different Python dask methods. Also remove syntax errors from exception statements within experiment_run section.
* Fix syntax error in PYTHON_Experiment.py
* Add table_name option to all Experiment cluster configuration files in order to conform with the DynamoDB table name requirement for the new zombie job catcher
* Convert output directory paths to absolute paths for dask implementation
* Ensure output directory paths are absolute for Dask compatibility.
* Provide log to head node DB table output and remove DB table instance deletion flow
* Moving forward, we don't want to remove the instance data from the DB, we just mark the previous entry as "cleaned up" or some convention in the near future.
* Move log info command under the instance id for loop
* Revise python_mpi.experiment configuration inputs to make them more logical for users
* Assign db_table dictionary to the batch_writer call
* Insert Python flag to force unbuffered output and avoid zombie job
* Expand support in the r7i AWS instance family
* Replace Basic Experiments configuration files with the latest NOS Sandbox head node image id to ensure hpc7a module accessibility
* Update r7i large AWS instance family to reflect number of CPU vs VCPU
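
Two of the items above, absolute output paths for Dask and unbuffered Python output, can be illustrated together. This is a minimal sketch under stated assumptions: the commit does not name the flag, so the interpreter's `-u` option is assumed here, and `build_python_command` with its parameters is a hypothetical helper, not a function from cloudflow.

```python
import os
import sys
from typing import List


def build_python_command(script: str, output_dir: str, base_dir: str) -> List[str]:
    """Build an argument list that runs `script` unbuffered with an absolute output path.

    A relative output_dir is resolved against base_dir rather than the current
    working directory, so every worker sees the same location.
    """
    if os.path.isabs(output_dir):
        abs_out = output_dir
    else:
        abs_out = os.path.join(base_dir, output_dir)
    abs_out = os.path.normpath(abs_out)
    # -u forces unbuffered stdout/stderr (assumed to be the flag the commit means),
    # so log lines appear immediately instead of sitting in a buffer.
    return [sys.executable, "-u", script, abs_out]
```

For example, `build_python_command("run.py", "out", "/tmp/job")` yields a command whose last argument is `/tmp/job/out` on a POSIX system.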

…ulers on NOS Sandbox (#145)
* Remove initial python dask job config and add two new python dask job configs that demonstrate unique dask job applications
* Modify python job config to reflect new python example script to use
* Remove old python_dask_example.py script and add new python_examples.py script that users can use as an example to execute a basic Python script or different Python dask experiments
* Update Python job class to include extra argument, update documentation
* Add entire new functionality for starting and closing the dask scheduler and workers that is the modernized way to implement dask functionality and conforms with Prefect3 standards
* Update Python dask flow to reflect newly developed workflow and ensure try/catch statements are present in order to prevent Prefect3 zombie jobs
* Include new methods for Python runs and Intel MPI runs to try to catch compilation errors or MPI failure errors and kill the run so Prefect3 can shut down the ec2 instances instead of incidental zombie processes occurring.
* Insert new dask data and task parallelism workflows into tasks.py, which reflects the steps users must take to implement different Python dask methods. Also remove syntax errors from exception statements within experiment_run section.
* Fix syntax error in PYTHON_Experiment.py
* Add table_name option to all Experiment cluster configuration files in order to conform with the DynamoDB table name requirement for the new zombie job catcher
* Convert output directory paths to absolute paths for dask implementation
* Ensure output directory paths are absolute for Dask compatibility.
* Provide log to head node DB table output and remove DB table instance deletion flow
* Moving forward, we don't want to remove the instance data from the DB, we just mark the previous entry as "cleaned up" or some convention in the near future.
* Move log info command under the instance id for loop
* Revise python_mpi.experiment configuration inputs to make them more logical for users
* Assign db_table dictionary to the batch_writer call
* Insert Python flag to force unbuffered output and avoid zombie job
* Expand support in the r7i AWS instance family
* Replace Basic Experiments configuration files with the latest NOS Sandbox head node image id to ensure hpc7a module accessibility
* Update r7i large AWS instance family to reflect number of CPU vs VCPU
* Resolve port binding and socket issues to allow up to 10 dask schedulers to be implemented on the head node at the same time
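
One way to let several schedulers share a head node is to probe a small port range and take the first port that can be bound. The sketch below is an assumption about the general technique, not the fix made in this commit: `find_free_port` is a hypothetical helper, the limit of 10 comes from the commit text, and 8786 is used as the base because it is the usual Dask scheduler default.

```python
import socket
from typing import Optional


def find_free_port(
    base_port: int = 8786,
    max_schedulers: int = 10,
    host: str = "127.0.0.1",
) -> Optional[int]:
    """Return the first bindable port in [base_port, base_port + max_schedulers).

    Returns None when every port in the range is already in use.
    """
    for port in range(base_port, base_port + max_schedulers):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is taken, try the next one
            return port  # the probe socket is closed on leaving the with-block
    return None
```

The probe only shows that the port was free at the moment of the check; another process can still claim it before the scheduler starts, so a caller would want to handle a bind failure at launch as well.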

patrick-tripp

Self-review
Most of this is new files related to UFS Coastal. My self-review took about 15 minutes, including this comment.
Fixes/Upgrades: EFS driver install, EFA driver/utilities install scripts/setup-instance.sh
Deployment thoroughly tested.
The UFS spack-stack can be added alongside the regular setup by running scripts/add-ufs-setup.sh.
No updates to regular deployment compilers or spack in this PR.

mykelalvis

Merging this huge thing.

mykelalvis · 2026-06-02T15:25:33Z

Please don't delete the branch right now.

Michael-Lalime · 2026-06-02T15:50:37Z

Thank you!!!

patrick-tripp and others added 21 commits February 6, 2026 09:46

pre ufscoastal deployment

1aa7a98

ECCOFS dev integration with nosofs v3.6.6 framework (#128)

7e458f0

* eccofs cold, fcst, and continue
* eccofs package, cleanup, and README
* end to end eccofs setup and run
* fixes/cleanup after merge, retested eccofs romsforecast workflow
* removed unneeded file

Python Basic Experiment Node Count Flag During Execution (#130)

7a8ef0f

* Update python_basic_run.sh to ensure users cannot attempt to run on multiple node instances if specified in the cluster configuration file. Python basic implementation can only run on a single node instance.

RHEL 8.10 ufscoastal

232b335

spack-stack for ufs clean setup/build

e97e158

Changes to identify zombies (#144)

fbe8c16

adding ufs coastal setup

82f1be9

coastal roms builds, working on rt

9e0f9be

rt.sh compile job works, next, call rt.sh from cloudflow

f064682

ufs workflow

541c335

UFS compile and run rt.sh works from cloudflow

46dcab5

UFS ROMS Irene compile and run rt.sh works

3847e2f

Merge branch 'main' into ufscoastal

c9d5a91

patrick-tripp self-assigned this May 21, 2026

patrick-tripp added 8 commits May 27, 2026 14:35

bump prefect version for dependabot

f521b98

RHEL 8.10 ufscoastal

9052a4f

updated environment-vars.sh

7cf878c

prefect bump, deployment.info testing

aa40943

minor deployment fixes post-ufs

401094c

keeping older gcc and oneapi for setup-instance

ee44144

efa installer update and fixes, ufs add-on

b75e243

changed comment

58f46ef

patrick-tripp added 2 commits May 29, 2026 18:25

test deploy fixes

a3c9bb7

fixed deployment info copy to head node

f819dad

patrick-tripp marked this pull request as ready for review June 1, 2026 21:02

patrick-tripp requested review from Michael-Lalime, ZacharyWills, jduckerOWP and mykelalvis June 1, 2026 21:03

patrick-tripp commented Jun 1, 2026

mykelalvis approved these changes Jun 2, 2026

mykelalvis merged commit e097669 into main Jun 2, 2026

mykelalvis merged 31 commits into main from ufscoastal

4 participants
