
Reduce the memory usage that is important for ne1024 simulation #665

Open
sjsprecious wants to merge 4 commits into ESCOMP:main from sjsprecious:add_lnd2rof_map_files


@sjsprecious

Contributor

This PR reduces memory usage, which is critical for ultra-high-resolution simulations such as ne1024. All the code changes were made by Claude under my supervision.

The goal is to let the lnd→rof conservative coupling maps be read from offline weight files instead of being computed online, which OOM-kills the job at ne1024 during DataInitialize (ESMF_FieldRegridStore → GeomRendezvous → Zoltan_RCB). This is the implementation of the lnd2rof_consf OOM fix. There are two distinct lnd→rof maps handled, plus an aoflux memory optimization.

  1. Flux/runoff coupling map: bug fix

In main, the lnd2rof_map attribute (driven by the pre-existing LND2ROF_FMAPNAME XML var) was already read into the lnd2rof_map variable — but the five addmap_from calls for the runoff fields (Flrl_rofsur, Flrl_rofi, Flrl_rofgwl, Flrl_rofsub, Flrl_irrig) hardcoded unset, so the file was silently ignored and the map was always built online.

  2. Fraction-init map: new feature

This is a separate map (destarea, not fracarea) used during med_fraction_init, with no pre-existing namelist hook. The fix adds full new plumbing:

  • New XML entry LND2ROF_FRAC_FMAPNAME (default unset, env_run.xml, group run_domain) in config_component.xml.
  • New driver namelist lnd2rof_fmap (modify_via_xml="LND2ROF_FRAC_FMAPNAME") in namelist_definition_drv.xml.
  • med_fraction_mod.F90: reads the lnd2rof_fmap attribute via NUOPC_CompAttributeGet; if it is present, set, and the file exists on disk (inquire), it calls med_map_routehandles_init with mapfile= to read offline weights; otherwise it falls back to the original online path — no behavior change when unset. Adds a use NUOPC import and local vars (isPresent/isSet/lexist, lnd2rof_fmap).
  • med_map_mod.F90: threads a new optional, intent(in) :: mapfile argument through med_map_routehandles_initfrom_fieldbundle, forwarding it to the field-level routine (which already accepted mapfile) when present. Backward-compatible.
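
The two mechanisms in the bullets above (the present/set/exists check with an online fallback, and forwarding an optional mapfile argument) can be sketched in standalone Fortran. This is a minimal illustration with no ESMF or NUOPC dependency, not the code in this PR: the module name and the routines `use_offline_weights`, `init_from_field`, and `init_from_fieldbundle` are hypothetical stand-ins, and the two logical arguments stand in for the isPresent/isSet flags that the real code obtains from NUOPC_CompAttributeGet.

```fortran
module lnd2rof_fmap_sketch
  ! Standalone sketch only. All names are hypothetical stand-ins for the
  ! mediator routines described above; nothing here calls ESMF or NUOPC.
  implicit none
  private
  public :: use_offline_weights, init_from_fieldbundle

contains

  ! Return .true. only when the attribute is present, is set, and names a
  ! file that exists on disk. Any other case selects the online path.
  logical function use_offline_weights(is_present, is_set, fmap) result(offline)
    logical,          intent(in) :: is_present, is_set
    character(len=*), intent(in) :: fmap
    logical :: lexist

    offline = .false.
    if (.not. (is_present .and. is_set)) return
    inquire(file=trim(fmap), exist=lexist)
    offline = lexist
  end function use_offline_weights

  ! Field-level stand-in: reports which path would be taken.
  function init_from_field(mapfile) result(path)
    character(len=*), intent(in), optional :: mapfile
    character(len=7) :: path

    if (present(mapfile)) then
      path = 'offline'   ! weights would be read from mapfile
    else
      path = 'online'    ! weights would be computed at run time
    end if
  end function init_from_field

  ! Field-bundle-level stand-in: forwards mapfile only when the caller
  ! supplied it, so existing callers keep the online behavior.
  function init_from_fieldbundle(mapfile) result(path)
    character(len=*), intent(in), optional :: mapfile
    character(len=7) :: path

    if (present(mapfile)) then
      path = init_from_field(mapfile=mapfile)
    else
      path = init_from_field()
    end if
  end function init_from_fieldbundle

end module lnd2rof_fmap_sketch
```

In this sketch, `use_offline_weights(.true., .true., fmap)` is true only when `fmap` names an existing file, so a default value such as unset falls through to the online path as long as no file of that name exists.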

  3. aoflux mesh memory optimization

Replaces is_local%wrap%aoflux_mesh = ESMF_MeshCreate(lmesh, rc=rc) with is_local%wrap%aoflux_mesh = lmesh — reuses the ocean field's existing mesh handle instead of allocating a duplicate full ESMF mesh per rank at ne1024pg2. Safe because lmesh persists in FBArea(compocn) and aoflux_mesh is never destroyed.
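
The difference between reusing a handle and creating a duplicate can be illustrated with a toy handle type. This is a sketch under stated assumptions: `mesh_t`, `mesh_create`, `mesh_copy`, and `same_storage` are hypothetical names, and the type only mimics a handle that points at shared storage. It does not model ESMF internals or the behavior of ESMF_MeshCreate.

```fortran
module mesh_handle_sketch
  ! Toy model of a "handle" type: the derived type holds a pointer to the
  ! data, so intrinsic assignment copies the pointer, not the storage.
  ! Hypothetical names throughout; this is not ESMF code.
  implicit none
  private
  public :: mesh_t, mesh_create, mesh_copy, same_storage

  type :: mesh_t
    real, pointer :: coords(:) => null()
  end type mesh_t

contains

  ! Allocate new storage for n coordinates, initialized to zero.
  function mesh_create(n) result(m)
    integer, intent(in) :: n
    type(mesh_t) :: m
    allocate(m%coords(n))
    m%coords = 0.0
  end function mesh_create

  ! Deep copy: allocates a second block of storage (the extra memory cost).
  function mesh_copy(src) result(m)
    type(mesh_t), intent(in) :: src
    type(mesh_t) :: m
    allocate(m%coords(size(src%coords)))
    m%coords = src%coords
  end function mesh_copy

  ! True when both handles point at the same storage.
  logical function same_storage(a, b)
    type(mesh_t), intent(in) :: a, b
    same_storage = associated(a%coords, b%coords)
  end function same_storage

end module mesh_handle_sketch
```

With this toy type, `alias = lmesh` shares storage with `lmesh`, while `dup = mesh_copy(lmesh)` allocates a second copy. The trade-off is that an alias stays valid only while the original storage is alive, which is why the lifetime argument in the paragraph above matters.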

@billsacks billsacks requested review from billsacks and mvertens June 26, 2026 00:05
@billsacks

Member

@mvertens - If you have a chance, I'd like to hear any thoughts you have on this change. (I have just skimmed it so far - haven't had a chance to look closely yet.)

@billsacks billsacks left a comment

Member


The overall changes here look good to me. I appreciate your plugging in the use of lnd2rof_map, which seems like it previously wasn't hooked up. And I think I see why you needed to introduce a separate lnd2rof_fmap (though see my comments asking for this to be made more explicit in some documentation).

I do have a few requests... many about editing some comments, but a couple slightly more substantial... but still overall minor - for the most part this looks good to me.

Once you make the final changes, I'd like to see that at least one or two tests have been run with baseline comparisons to verify that these changes work and are bit-for-bit in out-of-the-box configurations. One test that I think would cover all of these changes would be a B compset test with the aoflux grid set to ogrid (the setting of ogrid is needed to cover the changes in med_phases_aofluxes_mod, and should still cover the other changes here). I'd like to see that run with comparisons against a baseline.

Comment thread mediator/med_phases_aofluxes_mod.F90 Outdated
Comment thread mediator/med_fraction_mod.F90 Outdated
Comment thread mediator/med_fraction_mod.F90 Outdated
Comment thread mediator/med_fraction_mod.F90
Comment thread mediator/med_fraction_mod.F90 Outdated
Comment thread cime_config/namelist_definition_drv.xml
Comment thread cime_config/config_component.xml Outdated
Comment thread cime_config/config_component.xml Outdated
Comment thread mediator/med_fraction_mod.F90 Outdated
@sjsprecious sjsprecious requested a review from billsacks June 26, 2026 20:26
@jtruesdal

Contributor

@sjsprecious and @billsacks I was also thinking about a test to show that the offline and inline (online) mappings are identical, but in the past I've run into roundoff differences between the way the model calculates the mesh and the offline result. If that's the case here, you may want to implement some type of tolerance. It may be a non-issue if they are identical in this case. Other than that, I think the code looks good with the current set of updates.

@sjsprecious

Contributor Author

Thanks @jtruesdal for your comments. I think one thing that can contribute to the difference here is that users may use different compiler/ESMF versions when generating the offline mesh.

Once Bill provides the instructions about running the tests he suggested, I will let you know at least whether the code changes affect the current behavior.

@billsacks billsacks left a comment

Member


The latest set of changes looks good - thank you @sjsprecious for making those changes.

A piece I am less confident about is whether the assignment of aoflux_mesh to lmesh (as opposed to doing an ESMF_MeshCreate there) is always safe. I am asking the ESMF group about that, along with the similar change you made in CDEPS.

Regarding testing, a good test would be SMS_D_Ld1.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-aoflux_ogrid, since this covers the change to med_phases_aofluxes_mod in addition to your other changes. This is in the prebeta test list. It hasn't been run on recent versions of the code, so you'll need to generate baselines and then run with your latest changes with comparisons against baselines.

It would also be good to run aux_cime_baselines, or at least keep a careful eye on aux_cime_baselines when they are run in their nightly run following the merge of this PR. As with the associated CDEPS PR, @fischer-ncar could provide some input here on testing.

Regarding @jtruesdal 's point: I wasn't thinking of something as rigorous as testing behavior with online vs. offline generation of mappings - I'm mainly wanting to confirm that the changes here don't break anything or change answers for the case where we're still using online mapping.


3 participants