Dart lowres cmeps by kdraeder · Pull Request #657 · ESCOMP/CMEPS

kdraeder · 2026-05-26T22:12:09Z

This issue stems from cime #4933, which is about developing a large ensemble test
motivated by DART applications.

Because of the large ensemble, the testing will be more manageable
if it uses a coarse resolution grid. An ne3 grid is available for CAM and CTSM,
and now a ~10 degree resolution is available in MOM6 (MOM_interface #311).
These have been combined into a new CESM grid and used in ERI and MCC tests,
which also use a new testmod tailored to DART needs.
I'm open to suggestions for a shorter testmod name,
but @billsacks and I feel that it will be helpful to have DART in it.
This grid (especially the MOM6 grid) limits the tasks/instance to 12
(6 for MOM, 6 for the other components).

An MCC test for a small ensemble passes all test stages
(/glade/work/raeder/Exp/CESM+DART_testing/MCC_cG.ne3pg3_10deg.B_DART.lowres)
but ensembles which require more than 1 node mostly fail
with an error in cmeps/cesm/driver/ensemble_driver.F90.
This seems to arise from smaller ensembles fitting into a single (develop queue) node,
where the exact number of processors needed is assigned to them,
while larger ensembles need multiple (cpu/main) nodes
and more processors are assigned to the job than are requested.
For example, 40 instances request 12 x 40 = 480 processors.
This requires 4 nodes, so 4 x 128 = 512 processors are assigned.
This difference causes an error:
PetCount ( 512) - Async IOtasks ( 0) must be evenly divisable by number of members ( 40).
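The arithmetic behind this error can be reproduced with a short sketch. It assumes a simple allocation model in which the scheduler rounds the request up to whole 128-core nodes and every allocated core becomes a PET; the function name `allocated_pets` is hypothetical and not part of CMEPS or CIME.

```python
import math

def allocated_pets(n_instances, tasks_per_instance, cores_per_node):
    """Return (requested tasks, nodes, allocated tasks) under whole-node allocation.

    Assumption: the scheduler rounds the request up to full nodes and all
    allocated cores are counted, as the 512 in the error message suggests.
    """
    requested = n_instances * tasks_per_instance
    nodes = math.ceil(requested / cores_per_node)
    return requested, nodes, nodes * cores_per_node

# Numbers from the example above: 40 instances, 12 tasks each, 128 cores per node.
requested, nodes, pet_count = allocated_pets(40, 12, 128)
remainder = pet_count % 40  # leftover PETs that cannot be shared evenly by 40 members

print(f"requested={requested} nodes={nodes} allocated={pet_count} remainder={remainder}")
```

With these inputs the request is 480 tasks, the allocation is 512, and 512 leaves a remainder of 32 when divided by 40, which is why the check fails.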

When the check for this error is removed, the job goes farther,
but hangs just before the time stepping in CAM.
This can be prevented by choosing MAX_TASKS_PER_NODE so that no instance is laid out across 2 nodes.
The changes required to do this are beyond the scope of this PR,
and are handled in CESM #398.
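One way to pick such a value is to round the node size down to a whole number of instances. The sketch below is an illustration only, not the logic implemented in CESM #398; the helper `max_tasks_per_node` is hypothetical. The result for 12 tasks per instance on 128-core nodes is 120, which matches the "MAX_TASKS120" suffix of the 40-instance test case listed under "Testing performed", although that correspondence is an inference.

```python
def max_tasks_per_node(cores_per_node, tasks_per_instance):
    """Largest per-node task count that holds only whole instances.

    Hypothetical helper: not taken from CESM #398.
    """
    instances_per_node = cores_per_node // tasks_per_instance
    if instances_per_node == 0:
        raise ValueError("one instance does not fit on a single node")
    return instances_per_node * tasks_per_instance

mtpn = max_tasks_per_node(128, 12)      # 10 whole instances per node
total_tasks = 40 * 12                   # 40 instances, 12 tasks each
nodes_needed = -(-total_tasks // mtpn)  # ceiling division
tasks_used = nodes_needed * mtpn        # assumes only mtpn cores are used on each node

print(mtpn, nodes_needed, tasks_used, tasks_used % 40)
```

Under the assumption that only `mtpn` cores are used on each node, the 40-instance case uses 4 nodes x 120 = 480 tasks, every instance stays within one node, and the task count is again a multiple of the number of members.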

Description of changes

Commenting out the consistency check between PetCount and number_of_members,
if(modulo(PetCount-pio_asyncio_ntasks*number_of_members, number_of_members) .ne. 0) then
allows the test to proceed.
I could not trace the variables back through ESMF to figure out an if-test
which would handle this situation, and developers I talked to weren't certain that it's essential,
so my temporary solution is to comment out the test, without removing it.
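For reference, the condition being commented out can be mirrored outside the model. This is a Python transcription of the Fortran expression quoted above, not the driver code itself; the function name `passes_member_check` is hypothetical. Python's `%` with a positive divisor gives the same result as Fortran's `modulo` here.

```python
def passes_member_check(pet_count, async_io_tasks, n_members):
    """Mirror of the quoted test: True when the driver would NOT raise the error."""
    return (pet_count - async_io_tasks * n_members) % n_members == 0

# The failing case from the error message, and the exact-request case.
failing = passes_member_check(512, 0, 40)
passing = passes_member_check(480, 0, 40)
print(failing, passing)

# The async IO term is a multiple of n_members, so it cannot change the
# remainder: as written, the test reduces to pet_count % n_members == 0.
same_as_plain_modulo = all(
    passes_member_check(p, a, 40) == (p % 40 == 0)
    for p in range(400, 600)
    for a in range(0, 5)
)
print(same_as_plain_modulo)
```

One observation from this transcription: because `pio_asyncio_ntasks*number_of_members` is itself divisible by `number_of_members`, the expression as written only tests whether PetCount is a multiple of the number of members. Whether that was the original intent is not established in this PR.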

Specific notes

Contributors other than yourself, if any:
@billsacks @jedwards4b

CMEPS Issues Fixed (include github issue #): #461
This is also essential for issues in other components:
ESMCI/cime #4933 (overview issue)
CESM PR #398
ESMCI/ccs_config PR #285
NCAR/MOM6 #413

Are changes expected to change answers? (specify if bfb, different at roundoff, more substantial)
This is not expected to change answers in tests which ran successfully before this change.
Some tests which would not run before will now run. It's possible that some of those should not run,
but I have not looked into those.

Any User Interface Changes (namelist or namelist defaults changes)?
Users who want to run ERI or MCC tests with an ensemble which can fit some,
but not all, instances on 1 node, will need to include the test_mods developed in CESM #398
and follow the instructions for setting MAX_TASKS_PER_NODE.

Testing performed

Please describe the tests along with the target model and machine(s)
If possible, please also add hashes that were used in the testing.
Extensive testing (development) of ERI and MCC tests was conducted in a version of cesm3_0_alpha08d,
modified to enable the 10-degree MOM6 grid, using a BHIST compset, on derecho.
The relevant changes (multiple components) were imported to the cesm3_0_alpha09a tag
and tested in cases in /glade/work/raeder/Exp/CESM+DART_testing:

40 instance MCC; alpha09a_MCC_cG_C40.ne3pg3_10deg.B_DART.MAX_TASKS120
2 instance ERI; alpha09a_ERI_cG_C2_Ld8.ne3pg3_10deg.B_DART.aux_lowres

Changes are needed in multiple components: cesm, cime, cmeps, mom/MOM6.
The branches are labeled with DART_lowres_{component}.

cesm/driver/ensemble_driver.F90
Remove PETcount versus NINST test to let middle-sized tests work.
if(modulo(PetCount-pio_asyncio_ntasks*number_of_members, &
number_of_members) .ne. 0) then

billsacks · 2026-06-01T21:09:16Z

@kdraeder thank you for your continued work on this.

I had to get my head back into this. I've read back through some of the discussion from a couple of months ago. I have a couple of questions / requests:

(1) @kdraeder - My impression is that we never determined whether this error check was actually needed or not; is that your impression, too?

(2) @kdraeder - It sounds like only select tests pass with this error check removed - that you needed to make some careful changes to PE layouts for the tests to pass. Can you confirm: for these passing tests with carefully selected PE layouts, is the removal of this error check still actually needed? (I couldn't tell this for sure from reading through your comments.)

(3) @kdraeder - If we go ahead with this, please go ahead and actually remove these lines of code rather than commenting them out.

(4) @mvertens @DeniseWorthen @theurich @uturuncoglu @danrosen25 @oehmke - can any of you see a reason why the error check removed in this PR might actually be needed? My understanding of the error check is that it is effectively ensuring that we only use full nodes, but I'm not sure why that would be a requirement... though I may also be misinterpreting the intent of the error check.

kdraeder and others added 3 commits March 27, 2026 16:12

Merge branch 'ESCOMP:main' into DART_lowres_cmeps

e1ef4fd

Removed spaces which snuck into recent commit

4cc3183

kdraeder wants to merge 3 commits into ESCOMP:main from kdraeder:DART_lowres_cmeps
