Auto partitioner fails for some MPI sizes #355

@ss421

Description

Version

main

Are there any linked Issues or Pull Requests?

https://github.com/MetOffice/jjdocs/issues/563

What happened?

While testing the JADA configuration for the benchmark, I hit a failure at some domain decomposition sizes, namely 336, 672 and 2352 MPI ranks. The failure occurs when I use the auto partitioner; the same rank counts run successfully with a custom partitioner. The JADA application has a C896 pseudo model that reads a previously run forecast and a C224 linear-adjoint model that includes 3 levels of multigrid (C112, C56, C28).

The JADA application was run with the following resources:

RANKS: Nodes/PPN/OMP_NUM_THREADS
336: 21/16/8 (JADA) + 128: 4/32/1 (XIOS).
672: 42/16/4 (JADA) + 128: 4/32/1 (XIOS).
2352: 84/28/2 (JADA) + 128: 4/32/1 (XIOS).
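
As a sanity check on the table above, here is a minimal Python sketch that recomputes the rank counts from nodes × PPN and derives the ranks per panel. It assumes the partitioning is per panel of a six-panel cubed sphere (the `PANELS` constant); the function and key names are just for illustration and are not part of JADA or LFRic.

```python
# JADA rows copied from the resource table: ranks -> (nodes, PPN, OMP_NUM_THREADS).
JADA_LAYOUTS = {
    336: (21, 16, 8),
    672: (42, 16, 4),
    2352: (84, 28, 2),
}

PANELS = 6  # assumption: six-panel cubed-sphere mesh


def summarise(layouts):
    """Return one summary dict per layout, checking ranks == nodes * PPN."""
    rows = []
    for ranks, (nodes, ppn, threads) in layouts.items():
        if nodes * ppn != ranks:
            raise ValueError(f"{ranks}: nodes * PPN gives {nodes * ppn}")
        if ranks % PANELS != 0:
            raise ValueError(f"{ranks} ranks do not split evenly over {PANELS} panels")
        rows.append(
            {
                "ranks": ranks,
                "ranks_per_panel": ranks // PANELS,
                "threads_per_node": ppn * threads,
            }
        )
    return rows


if __name__ == "__main__":
    for row in summarise(JADA_LAYOUTS):
        print(row)
```

Under that assumption the three cases correspond to 56, 112 and 392 ranks per panel.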

The failure appears to be a segfault. I messaged Tom Melvin about this and he made some suggestions about the origin of this error, which are documented in: https://github.com/MetOffice/jjdocs/issues/563#issuecomment-4169103833.

It's worth noting that in the 336 case, I had originally encountered an error trap when it was run back in October:

20251115235223.923+0000:P001:ERROR: Total ranks per panel 112 must be the product of xprocs 4 and yprocs 14.

This particular error seemed to go away after the LFRic 3.0 release; I believe that is related to this change: https://code.metoffice.gov.uk/trac/lfric_apps/ticket/976. I had not tested with 672 or 2352 at that time.

Working custom partitions

336:

&partitioning
panel_decomposition='custom',
panel_xproc=4,
panel_yproc=14,

672:

&partitioning
panel_decomposition='custom',
panel_xproc=4,
panel_yproc=28,

2352:

&partitioning
panel_decomposition='custom',
panel_xproc=14,
panel_yproc=28,
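
The working custom partitions above can be cross-checked with a small Python sketch. It confirms that `panel_xproc * panel_yproc` equals the total ranks divided by six panels, and lists which factor pairs of the per-panel rank count divide every mesh size named in the description (C896, C224, C112, C56, C28). The six-panel assumption and the "divides every mesh evenly" criterion are my own illustration; this is not the auto partitioner's actual algorithm, and it is not established here that this is what causes the segfault.

```python
import math

PANELS = 6  # assumption: six-panel cubed-sphere mesh
MESHES = (896, 224, 112, 56, 28)  # C-numbers named in the description

# Working custom partitions from this issue: ranks -> (panel_xproc, panel_yproc).
CUSTOM = {336: (4, 14), 672: (4, 28), 2352: (14, 28)}


def factor_pairs(n):
    """All (x, y) with x <= y and x * y == n."""
    return [(x, n // x) for x in range(1, math.isqrt(n) + 1) if n % x == 0]


def matches_ranks(ranks, xproc, yproc):
    """True if xproc * yproc equals the ranks available to one panel."""
    return ranks % PANELS == 0 and xproc * yproc == ranks // PANELS


def even_pairs(ranks, meshes=MESHES):
    """Factor pairs of ranks-per-panel whose factors both divide every mesh size."""
    return [
        (x, y)
        for x, y in factor_pairs(ranks // PANELS)
        if all(m % x == 0 and m % y == 0 for m in meshes)
    ]


if __name__ == "__main__":
    for ranks, (xproc, yproc) in CUSTOM.items():
        print(ranks, matches_ranks(ranks, xproc, yproc), even_pairs(ranks))
```

All three custom partitions satisfy the product check and appear in the `even_pairs` list for their rank count. A near-square split such as 7 × 8 for 56 ranks per panel does not, because 8 does not divide 28.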

Relevant log output

nidc1287: rank 523 died from signal 11
nidc1100: rank 131 died from signal 15
nidc1302: rank 541 died from signal 9
Running on nid 1053
Running on xname x1020c1s5b0n1
[FAIL] jada "$ROSE_DATAC" "$MODEL_TYPE" "$OBS_TASKS_ACTIVE" "$OBS_OUTPUT" # return-code=143
2026-03-30T23:33:27Z CRITICAL - failed/EXIT
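
For reading the log above, here is a short Python sketch that maps the signal numbers to names using the standard `signal` module (POSIX numbering assumed) and decodes the return code using the usual shell convention of 128 + signal number. The function names are only for illustration.

```python
import re
import signal

# Lines copied from the log output in this issue.
LOG = """\
nidc1287: rank 523 died from signal 11
nidc1100: rank 131 died from signal 15
nidc1302: rank 541 died from signal 9
[FAIL] jada "$ROSE_DATAC" "$MODEL_TYPE" "$OBS_TASKS_ACTIVE" "$OBS_OUTPUT" # return-code=143
"""


def decode_rank_deaths(log):
    """Return (rank, host, signal name) for each 'died from signal' line."""
    pattern = r"(\S+): rank (\d+) died from signal (\d+)"
    return [
        (int(rank), host, signal.Signals(int(sig)).name)
        for host, rank, sig in re.findall(pattern, log)
    ]


def exit_signal(return_code):
    """Signal name implied by a shell return code above 128, else None."""
    if return_code > 128:
        return signal.Signals(return_code - 128).name
    return None


if __name__ == "__main__":
    for rank, host, name in decode_rank_deaths(LOG):
        print(f"rank {rank} on {host}: {name}")
    print("return-code=143 ->", exit_signal(143))
```

On that reading, only rank 523 reports SIGSEGV; the other two ranks report SIGTERM and SIGKILL, and the return code of 143 corresponds to SIGTERM.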

Labels

bug: Something isn't working
