Description
I have been seeing this specifically on GPU tests. See the logs at https://buildkite.com/julialang/luxlib-dot-jl/builds/797#0190cc64-0b5a-4e2a-9e47-795d8fa7176e/309-616
The Batch Normalization, Group Normalization, and Instance Normalization tests are reported as "DONE", but their workers never terminate, which eventually causes the job to time out. The problem doesn't show up when the same tests are run on GitHub Actions (exclusively CPU tests).
If I set the number of workers so that the tests do not run in parallel, they finish as expected. I have ReTestItems set up to run GPU tests on other repos (where it works perfectly), so I am not sure what is causing this issue.
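For reference, here is a minimal sketch of the non-parallel workaround, assuming it corresponds to the `nworkers` keyword of `ReTestItems.runtests`. The value `0` is my assumption for "not in parallel" (test items then run sequentially on the current process); the actual CI configuration may set this differently.

```julia
using ReTestItems
using LuxLib

# Assumption: nworkers = 0 runs every test item sequentially on the
# current process, so no worker processes are spawned that could hang.
ReTestItems.runtests(LuxLib; nworkers = 0)
```

With a positive `nworkers`, test items are distributed across separate worker processes, which is the setup in which the hang appears.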
P.S. This repo is amazing, it has cut down on our CI timings a great deal (and makes local testing so much easier)!