Version
main
Are there any linked Issues or Pull Requests?
https://github.com/MetOffice/jjdocs/issues/563
What happened?
While testing the JADA configuration for the benchmark, I encountered an issue with some of the domain decomposition sizes, namely 336, 672 and 2352 ranks. The issue occurs when I use the auto partitioner; all three sizes run when I use a custom partitioner instead. The JADA application has a C896 pseudo model that reads a previously run forecast and a C224 linear-adjoint model that includes 3 levels of multigrid (C112, C56, C28).
The JADA application was run with the following resources:
RANKS: Nodes/PPN/OMP_NUM_THREADS
336: 21/16/8 (JADA) + 128: 4/32/1 (XIOS).
672: 42/16/4 (JADA) + 128: 4/32/1 (XIOS).
2352: 84/28/2 (JADA) + 128: 4/32/1 (XIOS).
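As a sanity check on the JADA layouts above, here is a minimal Python sketch that confirms nodes x PPN gives the quoted rank count and derives the ranks per panel. It assumes a cubed-sphere mesh with six panels; the names `JADA_LAYOUTS`, `PANELS` and `summarise` are made up for illustration and are not part of JADA or LFRic.

```python
# Layouts quoted in this issue: total ranks -> (nodes, ranks per node, OpenMP threads)
JADA_LAYOUTS = {
    336: (21, 16, 8),
    672: (42, 16, 4),
    2352: (84, 28, 2),
}

PANELS = 6  # assumption: cubed-sphere mesh with six panels


def summarise(layouts):
    """Check nodes * PPN == ranks and derive per-panel and per-node figures."""
    rows = []
    for ranks, (nodes, ppn, threads) in layouts.items():
        if nodes * ppn != ranks:
            raise ValueError(f"{nodes} nodes x {ppn} PPN does not give {ranks} ranks")
        if ranks % PANELS:
            raise ValueError(f"{ranks} ranks do not divide evenly over {PANELS} panels")
        rows.append(
            {
                "ranks": ranks,
                "ranks_per_panel": ranks // PANELS,
                "threads_per_node": ppn * threads,
            }
        )
    return rows


if __name__ == "__main__":
    for row in summarise(JADA_LAYOUTS):
        print(row)
```

Under the six-panel assumption the three cases give 56, 112 and 392 ranks per panel respectively.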
The failure appears to be a segfault. I messaged Tom Melvin about this and he made some suggestions about the origin of the error, which are documented in: https://github.com/MetOffice/jjdocs/issues/563#issuecomment-4169103833.
It's worth noting that in the 336 case, I had originally encountered an error trap when it was run back in October:
20251115235223.923+0000:P001:ERROR: Total ranks per panel 112 must be the product of xprocs 4 and yprocs 14.
This particular error seemed to go away after the LFRic 3.0 release; I believe that is related to this change: https://code.metoffice.gov.uk/trac/lfric_apps/ticket/976. I had not tested with 672 or 2352 at that time.
Working custom partitions
336:
&partitioning
panel_decomposition='custom'
panel_xproc=4
panel_yproc=14
672:
&partitioning
panel_decomposition='custom',
panel_xproc=4,
panel_yproc=28,
2352:
&partitioning
panel_decomposition='custom',
panel_xproc=14,
panel_yproc=28,
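The arithmetic behind these custom partitions can be sketched in Python as below. This is only a rough check, assuming six cubed-sphere panels and that a Cn mesh has n x n cells per panel; the helper names are hypothetical, and whether the auto partitioner applies the same divisibility rules is not something I have confirmed.

```python
PANELS = 6  # assumption: cubed-sphere mesh with six panels

# Working custom partitions from this issue: total ranks -> (panel_xproc, panel_yproc)
PARTITIONS = {336: (4, 14), 672: (4, 28), 2352: (14, 28)}

# Cells per panel edge: C896 pseudo model, C224 linear model and its multigrid levels
RESOLUTIONS = [896, 224, 112, 56, 28]


def ranks_match(total_ranks, xproc, yproc):
    """True if xproc * yproc ranks on each of the six panels uses every rank."""
    return PANELS * xproc * yproc == total_ranks


def cells_per_rank(n, xproc, yproc):
    """Return (cells_x, cells_y) per rank on a Cn panel, or None if uneven."""
    if n % xproc or n % yproc:
        return None
    return n // xproc, n // yproc


if __name__ == "__main__":
    for ranks, (xproc, yproc) in PARTITIONS.items():
        print(ranks, "ranks match:", ranks_match(ranks, xproc, yproc))
        for n in RESOLUTIONS:
            print(f"  C{n}: {cells_per_rank(n, xproc, yproc)} cells per rank")
```

All three partitions divide every resolution evenly, but on the coarsest C28 level the 672 and 2352 cases leave a local domain only one cell wide in y.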
Relevant log output
nidc1287: rank 523 died from signal 11
nidc1100: rank 131 died from signal 15
nidc1302: rank 541 died from signal 9
Running on nid 1053
Running on xname x1020c1s5b0n1
[FAIL] jada "$ROSE_DATAC" "$MODEL_TYPE" "$OBS_TASKS_ACTIVE" "$OBS_OUTPUT" # return-code=143
2026-03-30T23:33:27Z CRITICAL - failed/EXIT