Description of the issue:
A user was trying to run with ladjust_bury_coeff in user_nl_marbl (which is not a very common configuration); he was also trying to get 100+ SYPD out of the gx3v7 grid (which is not a very common requirement), so he was running with 288 ocean tasks. gen_pop_decomp was giving a layout that created 290 blocks, and he reported the model crashing at ecosys_driver.F90:513:
508 allocate(rmean_vals(size(marbl_instances(1)%glo_avg_rmean_interior_tendency)))
509 lscalar = .false.
510 call ecosys_running_mean_saved_state_get_var_vals('interior_tendency', lscalar, rmean_vals(:))
511 do n = 1, size(rmean_vals)
512 do iblock = 1, size(marbl_instances)
513 marbl_instances(iblock)%glo_avg_rmean_interior_tendency(n)%rmean = rmean_vals(n)
514 end do
515 end do
516 deallocate(rmean_vals)
It turns out the issue is that marbl_instances has size max_blocks_clinic (2, in his configuration), but we only want these loops running through nblocks_clinic (1 on most tasks). As a result, ladjust_bury_coeff currently can't be true if any task has nblocks_clinic < max_blocks_clinic. Fixing that moved the error to ecosys_driver.F90:640:
637 if ((size(glo_avg_fields_interior, dim=4) /= 0) .or. (size(glo_avg_fields_surface, dim=4) /= 0)) then
638 allocate(glo_avg_area_masked(nx_block, ny_block, nblocks_clinic))
639 where (land_mask(:,:,:))
640 glo_avg_area_masked(:,:,:) = TAREA(:,:,:)
641 else where
642 glo_avg_area_masked(:,:,:) = c0
643 end where
(I think the third dimensions of land_mask and TAREA are both max_blocks_clinic, while the allocate() statement for glo_avg_area_masked on line 638 uses nblocks_clinic instead.)
As you can tell, I've started working on a fix for this... I think I changed the above block to explicitly use 1:nblocks_clinic for the third dimension of land_mask in 639 and TAREA in 640, but got yet another error elsewhere.
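For reference, here is a minimal standalone sketch of the two changes described above: looping over 1:nblocks_clinic rather than size(marbl_instances), and using an explicit 1:nblocks_clinic section for the third dimension of land_mask and TAREA. It is not the actual POP code. The derived types, array sizes, and values are hypothetical stand-ins, and it assumes the crash at line 513 comes from touching a marbl_instances element whose glo_avg_rmean_interior_tendency was never allocated because the task does not own that block.

```fortran
program block_bounds_sketch
  ! Hypothetical stand-ins for the POP / MARBL declarations; nothing here
  ! is copied from ecosys_driver.F90 apart from the variable names.
  implicit none

  integer, parameter :: nx_block = 4, ny_block = 3
  integer, parameter :: max_blocks_clinic = 2   ! most blocks on any task
  integer, parameter :: nblocks_clinic = 1      ! blocks owned by this task
  real, parameter :: c0 = 0.0

  type :: rmean_type
     real :: rmean = 0.0
  end type rmean_type

  type :: instance_type
     ! Assumption: only allocated for blocks this task actually owns
     type(rmean_type), allocatable :: glo_avg_rmean_interior_tendency(:)
  end type instance_type

  type(instance_type) :: marbl_instances(max_blocks_clinic)
  logical :: land_mask(nx_block, ny_block, max_blocks_clinic)
  real :: TAREA(nx_block, ny_block, max_blocks_clinic)
  real, allocatable :: glo_avg_area_masked(:,:,:)
  real :: rmean_vals(2)
  integer :: iblock, n

  ! Made-up data so the program runs on its own
  rmean_vals = [1.5, 2.5]
  land_mask = .true.
  land_mask(1, 1, :) = .false.
  TAREA = 10.0

  ! Only the owned blocks get their running-mean storage
  do iblock = 1, nblocks_clinic
     allocate(marbl_instances(iblock)%glo_avg_rmean_interior_tendency(size(rmean_vals)))
  end do

  ! First change: stop at nblocks_clinic instead of size(marbl_instances),
  ! so the unallocated component of instance 2 is never referenced
  do n = 1, size(rmean_vals)
     do iblock = 1, nblocks_clinic
        marbl_instances(iblock)%glo_avg_rmean_interior_tendency(n)%rmean = rmean_vals(n)
     end do
  end do

  ! Second change: section the max_blocks_clinic-sized arrays so the mask
  ! and both assignments in the where construct all have the same shape
  allocate(glo_avg_area_masked(nx_block, ny_block, nblocks_clinic))
  where (land_mask(:,:,1:nblocks_clinic))
     glo_avg_area_masked(:,:,:) = TAREA(:,:,1:nblocks_clinic)
  else where
     glo_avg_area_masked(:,:,:) = c0
  end where

  print *, 'rmean(2) on block 1: ', marbl_instances(1)%glo_avg_rmean_interior_tendency(2)%rmean
  print *, 'masked area sum:     ', sum(glo_avg_area_masked)

  deallocate(glo_avg_area_masked)
end program block_bounds_sketch
```

Changing nblocks_clinic in the sketch back to a loop bound of size(marbl_instances), or dropping the 1:nblocks_clinic sections, should reproduce the same class of failure when compiled with bounds checking. That may be a quick way to check any other spots in ecosys_driver.F90 that mix the two sizes.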
The original user who reported the problem was happy to be given a 252 task layout that keeps max_blocks_clinic=1, so fixing this is not urgent. I'm putting all this detail in the issue ticket because I'm going to set it aside for a few weeks while I focus on more pressing issues, but it would probably be good to eventually come back and fix the bug.
I also think it would be useful to update the test suite to try to explicitly test cases where ladjust_bury_coeff = .true. and either some tasks have more blocks than others, or some tasks have no blocks. I expect both of those tests would fail currently.
Version:
- CESM: 2_3_beta09; I believe the first user was running CESM 2.1.x
- POP2: cesm_pop_2_1_20220322
Machine/Environment Description:
The error was reported on cheyenne, and that's also where I reproduced the issue in the latest codebase.
Any xml/namelist changes or SourceMods: