Dart lowres cmeps#657
Changes are needed in multiple components: cesm, cime, cmeps, mom/MOM6.
The branches are labeled with DART_lowres_{component}.
cesm/driver/ensemble_driver.F90
Remove PETcount versus NINST test to let middle-sized tests work.
if(modulo(PetCount-pio_asyncio_ntasks*number_of_members, &
number_of_members) .ne. 0) then
@kdraeder thank you for your continued work on this. I had to get my head back into this. I've read back through some of the discussion from a couple of months ago. I have a couple of questions / requests:
(1) @kdraeder - My impression is that we never determined whether this error check was actually needed or not; is that your impression, too?
(2) @kdraeder - It sounds like only select tests pass with this error check removed - that you needed to make some careful changes to PE layouts for the tests to pass. Can you confirm: for these passing tests with carefully selected PE layouts, is the removal of this error check still actually needed? (I couldn't tell this for sure from reading through your comments.)
(3) @kdraeder - If we go ahead with this, please go ahead and actually remove these lines of code rather than commenting them out.
(4) @mvertens @DeniseWorthen @theurich @uturuncoglu @danrosen25 @oehmke - can any of you see a reason why the error check removed in this PR might actually be needed? My understanding of the error check is that it is effectively ensuring that we only use full nodes, but I'm not sure why that would be a requirement... though I may also be misinterpreting the intent of the error check.
This issue stems from cime #4933, which is about developing a large ensemble test
motivated by DART applications.
Because of the large ensemble, the testing will be more manageable
if it uses a coarse resolution grid. An ne3 grid is available for CAM and CTSM,
and now a ~10 degree resolution is available in MOM6 (MOM_interface #311).
These have been combined into a new CESM grid and used in ERI and MCC tests,
which also use a new testmod tailored to DART needs.
I'm open to suggestions for a shorter testmod name,
but @billsacks and I feel that it will be helpful to have DART in it.
This grid (especially the MOM6 grid) limits the tasks/instance to 12
(6 for MOM, 6 for the other components).
An MCC test for a small ensemble passes all test stages
(/glade/work/raeder/Exp/CESM+DART_testing/MCC_cG.ne3pg3_10deg.B_DART.lowres)
but ensembles which require more than 1 node mostly fail
with an error in cmeps/cesm/driver/ensemble_driver.F90.
This seems to arise from smaller ensembles fitting into a single (develop queue) node,
where the exact number of processors needed is assigned to them,
while larger ensembles need multiple (cpu/main) nodes
and more processors are assigned to the job than are requested.
For example, 40 instances request 12 x 40 = 480 processors.
This requires 4 nodes, so 4 x 128 = 512 processors are assigned.
This difference causes an error:
PetCount ( 512) - Async IOtasks ( 0) must be evenly divisable by number of members ( 40).
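The arithmetic behind this error can be reproduced in a few lines. This is a minimal Python sketch, not the driver code: the helper name `passes_petcount_check` is hypothetical, and the inputs are the values quoted above (12 tasks per instance, 128 processors per node, 40 instances, no async I/O tasks).

```python
import math


def passes_petcount_check(pet_count, asyncio_ntasks, number_of_members):
    """Hypothetical helper mirroring the modulo test in ensemble_driver.F90."""
    return (pet_count - asyncio_ntasks * number_of_members) % number_of_members == 0


tasks_per_instance = 12   # 6 for MOM, 6 for the other components
procs_per_node = 128      # processors per multi-node (cpu/main) node
members = 40

requested = tasks_per_instance * members           # processors the case asks for
nodes = math.ceil(requested / procs_per_node)      # whole nodes needed to hold them
assigned = nodes * procs_per_node                  # processors actually handed to the job

print("requested:", requested, "nodes:", nodes, "assigned:", assigned)
print("check with requested count:", passes_petcount_check(requested, 0, members))
print("check with assigned count: ", passes_petcount_check(assigned, 0, members))
```

The requested count satisfies the test, while the whole-node count leaves a remainder, which matches the reported failure.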
When the check for this error is removed, the job goes farther,
but hangs just before the time stepping in CAM. This can be prevented by choosing MAX_TASKS_PER_NODE in a way that prevents any instance from being laid out across 2 nodes.
The changes required to do this are beyond the scope of this PR,
and are handled in CESM #398.
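As an illustration only of why the choice of MAX_TASKS_PER_NODE matters: if instances are laid out contiguously across nodes (an assumption of this sketch, not something confirmed in this PR), then a tasks-per-node value that is a multiple of the tasks per instance keeps every instance on one node. The helper names below are hypothetical, and the settings actually used are the ones provided in CESM #398.

```python
def largest_aligned_tasks_per_node(procs_per_node, tasks_per_instance):
    """Largest tasks-per-node value that is a whole multiple of tasks_per_instance."""
    return (procs_per_node // tasks_per_instance) * tasks_per_instance


def instances_straddling_nodes(n_instances, tasks_per_instance, tasks_per_node):
    """Count instances whose first and last task fall on different nodes.

    Assumes a contiguous block layout: instance i owns tasks
    [i * tasks_per_instance, (i + 1) * tasks_per_instance - 1].
    """
    count = 0
    for i in range(n_instances):
        first = i * tasks_per_instance
        last = first + tasks_per_instance - 1
        if first // tasks_per_node != last // tasks_per_node:
            count += 1
    return count


aligned = largest_aligned_tasks_per_node(128, 12)
print("aligned tasks per node:", aligned)
print("straddling with 128 tasks/node:", instances_straddling_nodes(40, 12, 128))
print("straddling with aligned value: ", instances_straddling_nodes(40, 12, aligned))
```

Under this assumed layout, filling all 128 processors per node splits some instances across a node boundary, while the aligned value leaves a few processors idle on each node and keeps every instance whole.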
Description of changes
Commenting out the consistency check between PetCount and number_of_members,
if(modulo(PetCount-pio_asyncio_ntasks*number_of_members, number_of_members) .ne. 0) then
allows the test to proceed.
I could not trace the variables back through ESMF to figure out an if-test
which would handle this situation, and developers I talked to weren't certain that it's essential,
so my temporary solution is to comment out the test, without removing it.
Specific notes
Contributors other than yourself, if any:
@billsacks @jedwards4b
CMEPS Issues Fixed (include github issue #): #461
This is also essential for issues in other components:
ESMCI/cime #4933 (overview issue)
CESM PR #398
ESMCI/ccs_config PR #285
NCAR/MOM6 #413
Are changes expected to change answers? (specify if bfb, different at roundoff, more substantial)
This is not expected to change answers in tests which ran successfully before this change.
Some tests which would not run before will now run. It's possible that some of those should not run,
but I have not looked into those.
Any User Interface Changes (namelist or namelist defaults changes)?
Users who want to run ERI or MCC tests with an ensemble which can fit some,
but not all, instances on 1 node, will need to include the test_mods developed in CESM #398
and follow the instructions for setting MAX_TASKS_PER_NODE.
Testing performed
Please describe the tests along with the target model and machine(s)
If possible, please also add the hashes that were used in the testing.
Extensive testing (development) of ERI and MCC tests was conducted in a version of cesm3_0_alpha08d,
modified to enable the 10-degree MOM6 grid, using a BHIST compset, on derecho.
The relevant changes (multiple components) were imported to the cesm3_0_alpha09a tag
and tested in cases in /glade/work/raeder/Exp/CESM+DART_testing: