@bedroge bedroge commented Nov 7, 2025

See EESSI/software-layer#1288: the test step is failing for all x86_64 targets. I could partially reproduce it interactively, but with a different error, so I'm going to debug it further here.


Edit: to make the conclusions easier to find, I'm summarizing them here:

The ReFrame logs of the failed tests done by the bot contained PSM3 timeout messages like:

libfabric:398:1762504388::psm3:av:psmx3_epid_to_epaddr():234<warn> x86-64-generic-node1.int.aws-rocky88-202310.eessi.io:rank3: psm3_ep_connect returned error Operation timed out, remote epid=0x18d00000008:0:ffff0a000051.Try setting FI_PSM3_CONN_TIMEOUT to a larger value (current: 10 seconds). Aborting
[x86-64-generic-node1:00398] *** Process received signal ***
[x86-64-generic-node1:00398] Signal: Aborted (6)
[x86-64-generic-node1:00398] Signal code:  (-6)
libfabric:395:1762504388::psm3:av:psmx3_epid_to_epaddr():234<warn> x86-64-generic-node1.int.aws-rocky88-202310.eessi.io:rank0: psm3_ep_connect returned error Operation timed out, remote epid=0x18c00000008:0:ffff0a000051.Try setting FI_PSM3_CONN_TIMEOUT to a larger value (current: 10 seconds). Aborting
[x86-64-generic-node1:00395] *** Process received signal ***
[x86-64-generic-node1:00395] Signal: Aborted (6)
[x86-64-generic-node1:00395] Signal code:  (-6)
[x86-64-generic-node1:00398] [ 0] [x86-64-generic-node1:00400] [ 0] /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/../lib64/libc.so.6(+0x3be10) [0x14e5d6ec0e10]
[x86-64-generic-node1:00400] [ 1] [x86-64-generic-node1:00396] [ 0] /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/../lib64/libc.so.6(+0x3be10) [0x1512ef287e10]
[x86-64-generic-node1:00396] [ 1] [x86-64-generic-node1:00401] [ 0] /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/../lib64/libc.so.6(+0x3be10) [0x14eff332ce10]
[x86-64-generic-node1:00401] [ 1] [x86-64-generic-node1:00395] [ 0] /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/../lib64/libc.so.6(+0x3be10) [0x1484dfa31e10]
[x86-64-generic-node1:00395] [ 1] [x86-64-generic-node1:00402] [ 0] /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/../lib64/libc.so.6(+0x3be10) [0x14c6945f1e10]
[x86-64-generic-node1:00402] [ 1] [x86-64-generic-node1:00397] [ 0] /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/../lib64/libc.so.6(+0x3be10) [0x14a47acd7e10]
[x86-64-generic-node1:00397] [ 1] [x86-64-generic-node1:00399] [ 0] /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/../lib64/libc.so.6(+0x3be10) [0x1552fa090e10]

Initially I could not reproduce that with a manually submitted Slurm job on the AWS cluster, though I did get this:

--------------------------------------------------------------------------
A request was made to bind that would require binding
processes to more cpus than are available in your allocation:

   Application:     osu_bw
   #processes:      2
   Mapping policy:  BYCORE
   Binding policy:  CORE

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

which I could solve by setting export PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe.
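
A minimal sketch of where that setting could sit in a job script. Only the variable and its value come from the run described above; the script layout, the process count, and the guard around mpirun are assumptions for illustration.

```shell
#!/bin/bash
# Sketch only: the variable and value are from the comment above; the rest is assumed.

# Add the "oversubscribe" qualifier to the default mapping policy of the
# Open MPI runtime (PRRTE), so the launch is not rejected for lack of cores.
export PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe

# Hypothetical launch step, guarded so the snippet also runs where MPI is not loaded.
if command -v mpirun >/dev/null 2>&1 && command -v osu_bw >/dev/null 2>&1; then
    mpirun -np 2 osu_bw
else
    echo "mpirun or osu_bw not found; skipping benchmark"
fi
```

Exporting the variable in the job script keeps the override local to that job instead of changing the MPI configuration for every user.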

Then I noticed that test_suite.sh contains a workaround that sets PSM3_DEVICES, see https://github.com/EESSI/software-layer-scripts/blob/main/test_suite.sh#L174:

export PSM3_DEVICES='self,shm' # this is enough, since we only run single node for now

By adding the same setting to my job script, I could reproduce the issue and got the PSM3 timeouts as well. While looking for a solution, I came across easybuilders/easybuild-easyconfigs#18925 again and tried another workaround from that issue: setting FI_PROVIDER="^psm3" completely disables the PSM3 provider, and that solves the issue.
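
The two settings discussed above can be put side by side in a small sketch. The PSM3_MODE switch and the final echo are hypothetical and exist only for illustration; the variable names and values are the ones quoted above. Whether the old PSM3_DEVICES setting is kept next to the new one is not stated here, so the sketch sets only one of them at a time.

```shell
#!/bin/bash
# Sketch only: PSM3_MODE is a made-up switch; the exported variables and
# their values are taken from the discussion above.

if [ "${PSM3_MODE:-workaround}" = "reproduce" ]; then
    # Setting from test_suite.sh; with it, the PSM3 timeouts could be reproduced.
    export PSM3_DEVICES='self,shm'
else
    # Workaround from easybuilders/easybuild-easyconfigs#18925: the leading ^
    # makes libfabric exclude the listed provider, so PSM3 is never selected.
    export FI_PROVIDER="^psm3"
fi

echo "FI_PROVIDER=${FI_PROVIDER:-<unset>} PSM3_DEVICES=${PSM3_DEVICES:-<unset>}"
```

Running the script without PSM3_MODE set selects the workaround branch; setting PSM3_MODE=reproduce selects the configuration that showed the timeouts.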

bedroge commented Nov 7, 2025

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/generic

eessi-bot-aws bot commented Nov 7, 2025

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: generic
Building for: x86_64/generic
Job dir: /project/def-users/SHARED/jobs/2025.11/pr_121/102375

date job status comment
Nov 07 08:21:04 UTC 2025 submitted job id 102375 awaits release by job manager
Nov 07 08:21:19 UTC 2025 released job awaits launch by Slurm scheduler
Nov 07 08:22:21 UTC 2025 running job 102375 is running
Nov 07 08:24:24 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-102375.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2025.06-software-linux-x86_64-generic-17625037150.tar.gz
size: 0 MiB (114796 bytes)
entries: 91
modules under 2025.06/software/linux/x86_64/generic/modules/all
cowsay/3.04.lua
software under 2025.06/software/linux/x86_64/generic/software
cowsay/3.04
reprod directories under 2025.06/software/linux/x86_64/generic/reprod
cowsay/3.04/20251107_082150UTC
other under 2025.06/software/linux/x86_64/generic
no other files in tarball
Nov 07 08:24:24 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 4/4 test case(s) from 4 check(s) (4 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-102375.out
❌ found message matching ERROR:
❌ found message matching [\s*FAILED\s*].*Ran .* test case

bedroge commented Nov 7, 2025

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/generic

eessi-bot-aws bot commented Nov 7, 2025

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: generic
Building for: x86_64/generic
Job dir: /project/def-users/SHARED/jobs/2025.11/pr_121/102376

date job status comment
Nov 07 08:30:42 UTC 2025 submitted job id 102376 awaits release by job manager
Nov 07 08:31:27 UTC 2025 released job awaits launch by Slurm scheduler
Nov 07 08:32:29 UTC 2025 running job 102376 is running
Nov 07 08:34:32 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-102376.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2025.06-software-linux-x86_64-generic-17625043210.tar.gz
size: 0 MiB (114866 bytes)
entries: 91
modules under 2025.06/software/linux/x86_64/generic/modules/all
cowsay/3.04.lua
software under 2025.06/software/linux/x86_64/generic/software
cowsay/3.04
reprod directories under 2025.06/software/linux/x86_64/generic/reprod
cowsay/3.04/20251107_083156UTC
other under 2025.06/software/linux/x86_64/generic
no other files in tarball
Nov 07 08:34:32 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 4/4 test case(s) from 4 check(s) (4 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-102376.out
❌ found message matching ERROR:
❌ found message matching [\s*FAILED\s*].*Ran .* test case

bedroge commented Nov 7, 2025

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/generic

eessi-bot-aws bot commented Nov 7, 2025

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: generic
Building for: x86_64/generic
Job dir: /project/def-users/SHARED/jobs/2025.11/pr_121/102380

date job status comment
Nov 07 08:56:23 UTC 2025 submitted job id 102380 awaits release by job manager
Nov 07 08:56:37 UTC 2025 released job awaits launch by Slurm scheduler
Nov 07 08:57:39 UTC 2025 running job 102380 is running
Nov 07 08:58:40 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-102380.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2025.06-software-linux-x86_64-generic-17625058300.tar.gz
size: 0 MiB (114725 bytes)
entries: 91
modules under 2025.06/software/linux/x86_64/generic/modules/all
cowsay/3.04.lua
software under 2025.06/software/linux/x86_64/generic/software
cowsay/3.04
reprod directories under 2025.06/software/linux/x86_64/generic/reprod
cowsay/3.04/20251107_085705UTC
other under 2025.06/software/linux/x86_64/generic
no other files in tarball
Nov 07 08:58:40 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:x86_64_intel_haswell+default
P: latency: 2.28 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:x86_64_intel_haswell+default
P: latency: 2.2 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:x86_64_intel_haswell+default
P: latency: 0.64 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:x86_64_intel_haswell+default
P: bandwidth: 10457.29 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-102380.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

bedroge commented Nov 10, 2025

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws for:arch=x86_64/generic

eessi-bot-aws bot commented Nov 10, 2025

New job on instance eessi-bot-mc-aws for repository eessi.io-2023.06-software
Building on: generic
Building for: x86_64/generic
Job dir: /project/def-users/SHARED/jobs/2025.11/pr_121/103217

date job status comment
Nov 10 12:46:47 UTC 2025 submitted job id 103217 awaits release by job manager
Nov 10 12:47:30 UTC 2025 released job awaits launch by Slurm scheduler
Nov 10 12:52:40 UTC 2025 running job 103217 is running
Nov 10 12:56:48 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-103217.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-generic-17627791960.tar.gz
size: 0 MiB (70825 bytes)
entries: 78
modules under 2023.06/software/linux/x86_64/generic/modules/all
cowsay/3.04.lua
software under 2023.06/software/linux/x86_64/generic/software
cowsay/3.04
reprod directories under 2023.06/software/linux/x86_64/generic/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/generic
no other files in tarball
Nov 10 12:56:48 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] ( 1/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @BotBuildTests:x86_64_generic+default
P: perf: 409.317 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 2/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:x86_64_generic+default
P: perf: 441.411 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 3/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /775175bf @BotBuildTests:x86_64_generic+default
P: latency: 3.08 us (r:0, l:None, u:None)
[ OK ] ( 4/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /52707c40 @BotBuildTests:x86_64_generic+default
P: latency: 3.04 us (r:0, l:None, u:None)
[ OK ] ( 5/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /b1aacda9 @BotBuildTests:x86_64_generic+default
P: latency: 45.39 us (r:0, l:None, u:None)
[ OK ] ( 6/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /c6bad193 @BotBuildTests:x86_64_generic+default
P: latency: 5.73 us (r:0, l:None, u:None)
[ OK ] ( 7/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /15cad6c4 @BotBuildTests:x86_64_generic+default
P: latency: 0.66 us (r:0, l:None, u:None)
[ OK ] ( 8/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /6672deda @BotBuildTests:x86_64_generic+default
P: latency: 0.69 us (r:0, l:None, u:None)
[ OK ] ( 9/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /2a9a47b1 @BotBuildTests:x86_64_generic+default
P: bandwidth: 10794.19 MB/s (r:0, l:None, u:None)
[ OK ] (10/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /1b24ab8e @BotBuildTests:x86_64_generic+default
P: bandwidth: 10845.37 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-103217.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

bedroge changed the title from "debug test failures for 2025a" to "New workaround for PSM3 issues causing test failures in 2025.06" on Nov 10, 2025

@casparvl casparvl left a comment

Looks good to me. Also clear that it shouldn't reintroduce our prior issue. Approved!

@casparvl casparvl merged commit 6303643 into EESSI:main Nov 10, 2025
64 checks passed
@bedroge bedroge deleted the debug_test_failures branch November 10, 2025 14:13
boegel added the 2025.06-software.eessi.io label (2025.06 version of software.eessi.io) on Nov 10, 2025