UFS Coastal App and spack-stack #150

Merged: mykelalvis merged 31 commits into main from ufscoastal on Jun 2, 2026

@patrick-tripp
Member

This adds spack-stack setup for ufs and the ufs-coastal-app. UFS coastal regression tests run using cloudflow.

patrick-tripp and others added 21 commits February 6, 2026 09:46
…ces Not Currently Available in Region (#125)

* Update CLOUD_SANDBOX_MODEL_INTEGRATION_TEMPLATE.md with Minna Ho's suggestions
* Update workflow_main.py

* Include OWP and NOS model suites as eligible job types for the workflow

* Delete cloudflow/workflows/workflow_template.py

* Eliminate original workflow template method for executing Cloudflow using the model templates provided for eligible OWP and NOS models. This will conform to the new updated instructions for integrating a new model template to Cloudflow.

* Modify SCHISM OWP cluster configuration file to reflect updated changes to use hpc node instance suite

* Update AWS Cluster start function to include user or default options to sleep and retry to inquire AWS resources if the server responds as InsufficientCapacity for the AWS region listed on the cluster configuration file

* Update the default SCHISM job configuration file to include the new options NUM_TRIES and JOB_DELAY to utilize the new cloudflow method for waiting to obtain AWS resources based on the user patience to obtain them

* Modify CLOUD_SANDBOX_MODEL_INTEGRATION_TEMPLATE.md to fix small syntax error with replacing the workflow_main.py Python executable with user specified pathway

* Modify the template_launcher.sh shell script in order to remove MPI variables predefined in the launcher script and move them to the model launcher scripts. Also, schism is now predefined to partition the number of tasks associated with each node instance in order to optimize memory resources between OpenMP and MPI protocols for slave ranks. This is required in order for SCHISM to properly work on the hpc node instance suite for large SCHISM meshes

* Modify all OWP, NOS, and default model template shell launcher scripts to use the latest spack modules on the NOS Cloud Sandbox that have the newer Intel libraries eligible to run on the hpc7a node instances

* Add on new AWS Cluster cluster configuration options vm_retry_delay and vm_max_retries to workflow. Include as well an exponential backoff method in the job delay over each retry iteration. Update schism.ioos configuration template options to include new delay options for users.
* minor notes, nosofs
* minor fix for ucla_roms
* improved add-user.py
* nceplibs spack install, mkl oneapi, cleanup
* all roms and fvcom nosofs forecast tested and run
* cleanup and doc, fixed nosofs prep builds
* readme and setup updates for nosofs
* nosofs setup and build, clean end to end
* fixed a bug with the setup_nosofs.sh
* Update CLOUD_SANDBOX_MODEL_INTEGRATION_TEMPLATE.md with Minna Ho's suggestions
* Update workflow_main.py

* Include OWP and NOS model suites as eligible job types for the workflow template method. This method will allow users to just simply plug and play their models into Cloudflow by just providing the model directory, model executable, and any model runtime dependencies.

* Delete cloudflow/workflows/workflow_template.py

* Eliminate original workflow template method for executing Cloudflow using the model templates provided for eligible OWP and NOS models. This will conform to the new updated instructions for integrating a new model template to Cloudflow.

* Modify SCHISM OWP cluster configuration file to reflect updated changes to use hpc node instance suite

* Update AWS Cluster start function to include user or default options to sleep and retry to inquire AWS resources if the server responds as InsufficientCapacity for the AWS region listed on the cluster configuration file

* Update the default SCHISM job configuration file to include the new options NUM_TRIES and JOB_DELAY to utilize the new cloudflow method for waiting to obtain AWS resources based on the user patience to obtain them

* Modify the SCHISM Template expected user job options to include NUM_TRIES and JOB_DELAY to utilize the new cloudflow method for waiting to obtain AWS resources based on the user patience to obtain them

* Modify CLOUD_SANDBOX_MODEL_INTEGRATION_TEMPLATE.md to fix small syntax error with replacing the workflow_main.py Python executable with user specified pathway

* Replace the syntax across the flows ctasks.cluster_start function to include the job class to be ingested into it so it can extract user arguments NUM_TRIES and JOB_DELAY if available

* Replace the cluster_tasks.py start function to include the job class to be ingested into it so it can extract user arguments NUM_TRIES and JOB_DELAY if available

* Modify the template_launcher.sh shell script in order to remove MPI variables predefined in the launcher script and move them to the model launcher scripts. Also, schism is now predefined to partition the number of tasks associated with each node instance in order to optimize memory resources between OpenMP and MPI protocols for slave ranks. This is required in order for SCHISM to properly work on the hpc node instance suite for large SCHISM meshes

* Modify all OWP, NOS, and default model template shell launcher scripts to use the latest spack modules on the NOS Cloud Sandbox that have the newer Intel libraries eligible to run on the hpc7a node instances

* Remove previous job configuration workflow options for job delay method. Add on new AWS Cluster cluster configuration options vm_retry_delay and vm_max_retries to workflow in replacement of older job configuration options NUM_DELAY and NUM_RETRIES. Include as well an exponential backoff method in the job delay over each retry iteration. Update schism.ioos configuration template options to include new delay options for users.

* Renaming template job configuration files to basic, adding the new required MODEL input variable to all configuration files, and moving all job configuration files to their own respective organization.

* Renaming the template classes to basic classes. Also adding on the MODEL input variable as a new standard input from the job configuration file to all model classes. Reorganized the entire JobFactory.py work to separate workflows based on the high-level model class.

* Renaming the template launcher scripts to basic launcher scripts. Restructured the entire cloudflow main workflow (tasks.py) to organize the model execution based on the jobtype variable job configuration input instead of OFS. The OFS dependency is removed entirely; OFS is instead a supplementary variable for NOS/OFS workflows.

* Fix code bug in if/else statement within JobFactory.py revisions
* Fix MODEL class if/else bug in JobFactory.py
* Update documentation in CLOUD_SANDBOX_MODEL_INTEGRATION_TEMPLATE.md

* Reflect new changes in job class dependencies, "basic" model naming conventions, and organization of configuration files based on user affiliation.

* Replacing OFS with APP as a job configuration variable dependency in the Job classes and configuration files
* Replacing OFS variable dependency with the APP job class variable for Python notebooks
* Replaced OFS in the Job Class Python scripts, updated documentation, and replaced the Basic with Experiment job classes
* Restructured the cloudflow workflow to replace the OFS naming convention with APP. Also renamed the basic implementation to experiment, with the addition of APP for the basic run method for each model class

* Update CLOUD_SANDBOX_MODEL_INTEGRATION_TEMPLATE.md
* Adjust documentation here based on explicit changes from "basic" to "experiment" for onboarding new models.
* Update Model_Experiment_Template.py
* Update the class name to reflect the same Python script name
* Rename WRF_Hydro_Basic.py to WRF_Hydro_Experiment.py
* Updated tasks.py
* Fixed bugs in the if/else blocks for the experiment_run function
* Update ADCIRCForecast.py
* Update fcst_launcher.sh script to reflect the APP dependency and reverted from JOBTYPE
* Update APP dependency in simple_launcher.sh script to reflect the change from JOBTYPE to APP.
* eccofs cold, fcst, and continue

* eccofs package, cleanup, and README

* end to end eccofs setup and run

* fixes/cleanup after merge, retested eccofs romsforecast workflow

* removed unneeded file
* Update CLOUD_SANDBOX_MODEL_INTEGRATION_TEMPLATE.md

* Update documentation to include how to add optional arguments to the JobFactory Python class for a new or existing model class.

* Adding Python experiment workflows to cloudflow. This includes basic, dask, and mpi implementation for users. Provide an example Python script that uses basic and dask implementation as a template for users. Also refine ucla-roms workflow to conform to new jobtype and application standards in cloudflow.

* Delete cloudflow/job/jobs/MODEL_EXPERIMENTS/python_basic.experiment

* Default the job type application to be basic for users if they do not specify it in the job configuration file. Also remove ucla_roms.py job class since it's been merged with the ROMS_Experiment class

* Add cluster/configs Experiment directory and remove OWP directory. Essentially we will have default cluster configuration files for users matching the Experiment naming convention logic moving forward.

* Update dflowfm.ioos and add delay and retries to experiment file

* Update adcirc.ioos and add retries and delay to experiment file

* Update JobFactory.py reformat if/else block to not use () within logical blocks with a single criteria - PEP 8

* Update PYTHON_Experiment.py fix syntax bug for default environment Python executable and update it to conventional python3 syntax
* Update python_basic_run.sh to ensure users cannot attempt to run on multiple node instances if specified in the cluster configuration file. Python basic implementation can only run on a single node instance.
* Cluster has single readConfig function in base class
* prefect upgrade - fcst flow works
* added ssm agent install, policy permissions
* old prefect signals removed
* eccofs fcst flow tested
* LiveOcean cleanup
* liveocean hindcast_multi flow tested
* moved NOSOFS model readme to models/nosofs folder
* fixed AWSCluster typo, froze current prefect 3.6.8 and boto 1.40.22 in funcs-setup-instance.sh
Model workflows run successfully
Dask python workflows will be updated/tested in a future PR
* using dynamoDB table and Lambda for zombies

* streamline secofs setup

* secofs clean setup and testing, ready for NOS

* re-add return to CURHOME

* minor updates: zombie_lambda, schism setup

* Update cloudflow/workflows/flows.py

Co-authored-by: Jason Ducker <81377226+jduckerOWP@users.noreply.github.com>

* Update cloudflow/workflows/flows.py

Co-authored-by: Jason Ducker <81377226+jduckerOWP@users.noreply.github.com>

* added a separate lambda for collecting db records, still need to add email, html/s3 support

* added s3 support, html template, and write variables out to html, should generate a human-readable report on run

* minor revisions

* add human time to DB

* Minor syntax error

* enforce DynamoDB table, use Prefect flow name in Name tag, improve Exceptions, start adding prefect server to deployment, added PREFECT readme, docs link fix, secofs testing/benchmarking

* minor cleanup

* Format PREFECTv3.md

* Update PREFECTv3.md

---------

Co-authored-by: Jason Ducker <81377226+jduckerOWP@users.noreply.github.com>
Co-authored-by: Zachary Wills <zwills@lynker.com>
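
The commit messages above describe retrying the cluster start with an exponential backoff, controlled by the cluster configuration options vm_retry_delay and vm_max_retries. The PR page does not show the actual cloudflow code, so the following is only a minimal sketch of that idea. The function name `start_with_backoff`, the exception class `InsufficientCapacityError`, the doubling factor, and the reading of vm_max_retries as "retries after the first attempt" are all assumptions made for illustration.

```python
import time


class InsufficientCapacityError(Exception):
    """Hypothetical stand-in for the provider's insufficient-capacity error."""


def start_with_backoff(start_fn, vm_max_retries=3, vm_retry_delay=30, sleep=time.sleep):
    """Call start_fn, retrying on capacity errors with a delay that doubles each time.

    vm_max_retries is treated here as the number of retries after the first
    attempt, and vm_retry_delay as the initial delay in seconds (assumptions).
    """
    for attempt in range(vm_max_retries + 1):
        try:
            return start_fn()
        except InsufficientCapacityError:
            if attempt == vm_max_retries:
                raise  # out of retries, let the caller handle the failure
            # Delay grows as vm_retry_delay * 1, * 2, * 4, ...
            sleep(vm_retry_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting; a caller would normally leave it at the default.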
* Remove initial python dask job config and add two new python dask job configs that demonstrate unique dask job applications

* Modify python job config to reflect new python example script to use

* Remove old python_dask_example.py script and add new python_examples.py script that users can use as an example to execute a basic Python script or different Python dask experiments

* Update Python job class to include extra argument, update documentation

* Add entire new functionality for starting and closing the dask scheduler and workers that is the modernized way to implement dask functionality and conforms with Prefect3 standards

* Update Python dask flow to reflect newly developed workflow and ensure try/catch statements are present in order to prevent Prefect3 zombie jobs

* Include new methods for Python runs and Intel MPI runs to try to catch compilation errors or MPI failure errors and kill the run so Prefect3 can shut down the ec2 instances instead of incidental zombie processes occurring.

* Insert new dask data and task parallelism workflows into tasks.py, which reflect the steps users must take to implement different Python dask methods. Also remove syntax errors from exception statements within experiment_run section.

* Fix syntax error in PYTHON_Experiment.py

* Add table_name option to all Experiment cluster configuration files in order to conform with the DynamoDB table name requirement for the new zombie job catcher

* Convert output directory paths to absolute paths for dask implementation

* Ensure output directory paths are absolute for Dask compatibility.

* Provide log to head node DB table output and remove DB table instance deletion flow

* Moving forward, we don't want to remove the instance data from the DB, we just mark the previous entry as "cleaned up" or some convention in the near future.

* Move log info command under the instance id for loop

* Revise python_mpi.experiment configuration inputs to make them more logical for users

* Assign db_table dictionary to the batch_writer call

* Insert Python flag to force unbuffered output and avoid zombie job

* Expand support in the r7i AWS instance family

* Replace Basic Experiments configuration files with the latest NOS Sandbox head node image id to ensure hpc7a module accessibility

* Update r7i large AWS instance family to reflect number of CPU vs VCPU
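
Two of the commits above concern how user Python scripts are launched: output directory paths are made absolute for Dask, and a Python flag forces unbuffered output. As a rough illustration only, the sketch below builds a launch command that does both. The helper name `build_python_command` and the convention of passing the output directory as the script's first argument are hypothetical; `-u` is the standard CPython option for unbuffered stdout and stderr.

```python
import os
import sys


def build_python_command(script, outdir, python_exe=sys.executable):
    """Return an argv list that runs `script` unbuffered with an absolute output dir.

    Hypothetical helper: the real launcher scripts are not shown in this PR page.
    """
    # Workers may run with a different working directory, so resolve paths up front.
    abs_script = os.path.abspath(os.path.expanduser(script))
    abs_outdir = os.path.abspath(os.path.expanduser(outdir))
    # "-u" makes Python write stdout/stderr immediately instead of buffering.
    return [python_exe, "-u", abs_script, abs_outdir]
```

The returned list can be handed to `subprocess.run` as-is, which avoids shell quoting issues.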
…ulers on NOS Sandbox (#145)

* Remove initial python dask job config and add two new python dask job configs that demonstrate unique dask job applications

* Modify python job config to reflect new python example script to use

* Remove old python_dask_example.py script and add new python_examples.py script that users can use as an example to execute a basic Python script or different Python dask experiments

* Update Python job class to include extra argument, update documentation

* Add entire new functionality for starting and closing the dask scheduler and workers that is the modernized way to implement dask functionality and conforms with Prefect3 standards

* Update Python dask flow to reflect newly developed workflow and ensure try/catch statements are present in order to prevent Prefect3 zombie jobs

* Include new methods for Python runs and Intel MPI runs to try to catch compilation errors or MPI failure errors and kill the run so Prefect3 can shut down the ec2 instances instead of incidental zombie processes occurring.

* Insert new dask data and task parallelism workflows into tasks.py, which reflect the steps users must take to implement different Python dask methods. Also remove syntax errors from exception statements within experiment_run section.

* Fix syntax error in PYTHON_Experiment.py

* Add table_name option to all Experiment cluster configuration files in order to conform with the DynamoDB table name requirement for the new zombie job catcher

* Convert output directory paths to absolute paths for dask implementation

* Ensure output directory paths are absolute for Dask compatibility.

* Provide log to head node DB table output and remove DB table instance deletion flow

* Moving forward, we don't want to remove the instance data from the DB, we just mark the previous entry as "cleaned up" or some convention in the near future.

* Move log info command under the instance id for loop

* Revise python_mpi.experiment configuration inputs to make them more logical for users

* Assign db_table dictionary to the batch_writer call

* Insert Python flag to force unbuffered output and avoid zombie job

* Expand support in the r7i AWS instance family

* Replace Basic Experiments configuration files with the latest NOS Sandbox head node image id to ensure hpc7a module accessibility

* Update r7i large AWS instance family to reflect number of CPU vs VCPU

* Resolve port binding and socket issues to allow up to 10 dask schedulers to be implemented on the head node at the same time
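
The last commit message mentions allowing up to 10 dask schedulers on the head node at once. One common way to avoid port collisions, shown here purely as a sketch and not as the approach taken in this PR, is to probe a small range of ports and use the first one that can be bound. The function name `pick_scheduler_port` is hypothetical; 8786 is Dask's default scheduler port.

```python
import socket


def pick_scheduler_port(base_port=8786, max_schedulers=10, host="127.0.0.1"):
    """Return the first port in [base_port, base_port + max_schedulers) that can be bound.

    Hypothetical helper for illustration; the port is released again before
    returning, so another process could still claim it before the scheduler starts.
    """
    for port in range(base_port, base_port + max_schedulers):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port already in use, try the next one
            return port
    raise RuntimeError(
        f"no free port in {base_port}-{base_port + max_schedulers - 1}"
    )
```

Because of the gap between probing and actually starting the scheduler, a production version would likely need to retry on a bind failure as well.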
@patrick-tripp patrick-tripp self-assigned this May 21, 2026
@patrick-tripp patrick-tripp marked this pull request as ready for review June 1, 2026 21:02
Member Author

@patrick-tripp patrick-tripp left a comment

Self-review
Most of this is new files related to UFS Coastal. My self-review took about 15 minutes, including this comment.
Fixes/Upgrades: EFS driver install, EFA driver/utilities install scripts/setup-instance.sh
Deployment thoroughly tested.
Can add UFS spack-stack beside the regular setup by running scripts/add-ufs-setup.sh
No updates to regular deployment compilers or spack in this PR.

Contributor

@mykelalvis mykelalvis left a comment

Merging this huge thing.

@mykelalvis mykelalvis merged commit e097669 into main Jun 2, 2026
@mykelalvis
Contributor

Please don't delete the branch right now.

@Michael-Lalime
Contributor

Michael-Lalime commented Jun 2, 2026 via email

4 participants