Add `broadcast_sigterm_every_n_steps` to reduce SIGTERM broadcast overhead #21640
c-pozzi wants to merge 7 commits into Lightning-AI:master
Conversation
Allow users to control how often SIGTERM status is broadcast across ranks, reducing CPU-GPU sync overhead for fast training loops while preserving the default every-step behavior. Closes Lightning-AI#21487
for more information, see https://pre-commit.ci
When broadcast_sigterm_every_n_steps > 1, SIGTERM could arrive between broadcasts near the end of an epoch. Without a forced check, rank 0 would exit while other ranks hang waiting at the next collective (e.g. validation barrier). This adds a forced broadcast whenever the epoch ends or validation is about to start.
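A minimal sketch of the gating logic this commit describes. The class and method names below are illustrative, not Lightning's actual internals: an interval-gated check that always fires when forced at a synchronization boundary (epoch end or validation start).

```python
# Hypothetical sketch of interval-gated SIGTERM broadcasting with a
# forced flush at synchronization boundaries. Names are illustrative,
# not Lightning's actual implementation.

class SigtermBroadcaster:
    def __init__(self, every_n_steps: int = 1):
        self.every_n_steps = every_n_steps
        self.steps_since_broadcast = 0

    def should_broadcast(self, force: bool = False) -> bool:
        """Broadcast on every Nth step, or immediately when forced
        (epoch end / validation start), so no rank can exit while the
        others wait at the next collective."""
        self.steps_since_broadcast += 1
        if force or self.steps_since_broadcast >= self.every_n_steps:
            self.steps_since_broadcast = 0
            return True
        return False


b = SigtermBroadcaster(every_n_steps=3)
decisions = [b.should_broadcast() for _ in range(5)]  # steps 1..5
# only step 3 broadcasts; the counter then resets
flushed = b.should_broadcast(force=True)  # epoch boundary: always broadcasts
```

With the default `every_n_steps=1` every call returns `True`, matching the current per-step behavior.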
The test was setting _devices_flag=2 to simulate multi-GPU, but the code checks trainer.world_size (from strategy.world_size) which remained 1 in the test Trainer. Mock the property directly instead.
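The pattern this fix uses — patching a read-only derived property rather than the private flag it does not depend on — can be sketched on toy classes (these stand in for the real `Trainer`/`Strategy`):

```python
# Sketch of patching a read-only property with PropertyMock, as the
# test fix does for world_size. Toy classes, not the real Trainer.
from unittest.mock import PropertyMock, patch


class Strategy:
    @property
    def world_size(self) -> int:
        return 1


class Trainer:
    def __init__(self):
        self.strategy = Strategy()

    @property
    def world_size(self) -> int:
        # Derived from the strategy, so setting a private flag on the
        # Trainer would not change it -- hence the property mock.
        return self.strategy.world_size


trainer = Trainer()
assert trainer.world_size == 1
with patch.object(Trainer, "world_size", new_callable=PropertyMock, return_value=2):
    assert trainer.world_size == 2  # code under test now sees 2 ranks
assert trainer.world_size == 1  # the patch is undone outside the context
```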
Codecov Report ✅ All modified and coverable lines are covered by tests.
Additional details and impacted files

```diff
@@ Coverage Diff @@
##           master   #21640   +/-  ##
=========================================
- Coverage      87%      79%    -8%
=========================================
  Files         270      267     -3
  Lines       23934    23885    -49
=========================================
- Hits        20713    18809  -1904
- Misses       3221     5076  +1855
```
- Fix epoch boundary test: mock the `world_size` property instead of `_devices_flag`, which didn't affect `trainer.world_size`
- Rewrite interval test to call the real `advance()` with mocked distributed calls instead of reimplementing the logic in the test
- Add a `ddp_spawn` integration test exercising real NCCL broadcasts on 2 GPUs with a non-aligned step count to trigger the epoch-end flush
This is a great way to handle it! Super thanks!
I have been thinking about this and I wonder if exposing this as a user parameter is the right abstraction. The optimal value is step-time dependent, which the user can't easily reason about. Would the maintainers prefer an adaptive internal approach that auto-tunes based on measured step time? Happy to rework if that approach seems valuable. @mojtababahrami @justusschock @deependujha
Ohh sorry, I'm not a maintainer. Just a user :)
Isn't this going to cause a problem when The first rank will raise the exception at the end of any batch, but the others will continue and then hang. I think a better approach would be a single check at the end of
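The hang described here can be illustrated without NCCL. In this simulation (all names are illustrative), "rank 0" observes the signal locally and exits before the next collective, so "rank 1" is left stranded at a barrier — here a `threading.Barrier` with a timeout stands in for the blocking NCCL collective:

```python
# Simulating the desync: rank 0 exits on SIGTERM before the next
# collective; rank 1 then waits at the barrier until the timeout
# breaks it (a real NCCL collective would hang indefinitely).
import threading

barrier = threading.Barrier(2)
outcome = {}

def rank0():
    # Raises at end of batch without telling rank 1; never reaches the barrier.
    outcome[0] = "exited on SIGTERM"

def rank1():
    try:
        barrier.wait(timeout=0.5)  # stands in for the next NCCL collective
        outcome[1] = "passed barrier"
    except threading.BrokenBarrierError:
        outcome[1] = "hung at barrier"

threads = [threading.Thread(target=rank0), threading.Thread(target=rank1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# outcome -> {0: 'exited on SIGTERM', 1: 'hung at barrier'}
```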
Hi @aagaskar, I think you meant That would make it a natural synchronization point for the SIGTERM check, ensuring all ranks observe it together and preventing the desync/hang scenario. Also, the current SIGTERM check seems quite off, as it happens at the start of the training batch, and the corresponding test is quite weird as it intentionally adds a sleep time so other batches have progressed: https://github.com/Lightning-AI/pytorch-lightning/pull/20825/changes#diff-411516b6debb3ec43a284753ff0e5341932f355b4dcc8fc8684ba0c0acf318cd
Yes, oops, that is what I meant! |
Thanks for the feedback @aagaskar and @deependujha. I ran some DDP benchmarks with real forward + backward + gradient allreduce to measure the practical overhead. I tested 4 setups on 4× A10G. For this exercise I set
With With So my takeaway is:
Results —
Results —
Willing to rework this PR around I can also share the benchmark script if useful.
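The benchmark script itself isn't included in the thread; a minimal single-process sketch of how mean step time can be measured looks like the following (the actual benchmark used DDP on 4× A10G, which is not reproduced here, and `dummy_step` is a placeholder for a real forward/backward):

```python
# Illustrative single-process harness for measuring mean step time.
# Not the benchmark used in this thread, which ran real DDP training.
import time

def measure_mean_step_time(step_fn, warmup: int = 10, iters: int = 100) -> float:
    for _ in range(warmup):  # discard warmup iterations
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    return (time.perf_counter() - start) / iters

def dummy_step():
    sum(i * i for i in range(1000))  # stand-in for forward/backward

mean_s = measure_mean_step_time(dummy_step)
```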
Hi @c-pozzi, thanks for the great work. I'm currently exploring why we even need this communication at all, and trying to fix how it was originally supposed to work (including reverting a previously merged PR).
Summary
Adds a `broadcast_sigterm_every_n_steps` parameter to the Trainer that controls how often SIGTERM status is broadcast across ranks in distributed training. Default is `1` (every step), preserving current behavior.

The NCCL broadcast in `_broadcast_sigterm_tensor` (#20825) adds a fixed-cost CPU-GPU sync every step. For fast training loops this overhead is significant relative to step time. Benchmark details and measurements are in #21487.

Tradeoff

Worst-case SIGTERM detection delay is `(N-1) × step_time`. This is safe because SIGTERM grace periods are 30–120s, and the users who benefit most (fast loops) have the smallest absolute delay (e.g., N=10, step=0.5ms → 4.5ms).

Design

The parameter lives on `Trainer`, following the existing `every_n` pattern (`log_every_n_steps`, `check_val_every_n_epoch`, `reload_dataloaders_every_n_epochs`), rather than on `Strategy`, which holds communication mechanics, not loop-frequency policy.

Test plan
Closes #21487
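The tradeoff arithmetic in the description can be checked directly; the `Trainer(...)` call in the comment is hypothetical usage of the proposed parameter, not a confirmed API:

```python
# Worst-case SIGTERM detection delay for the proposed parameter: a
# signal arriving just after a broadcast waits (N - 1) steps for the
# next one.
def worst_case_delay_s(n_steps: int, step_time_s: float) -> float:
    return (n_steps - 1) * step_time_s

# The fast-loop example from the description: N=10, step = 0.5 ms.
delay = worst_case_delay_s(10, 0.5e-3)
assert abs(delay - 4.5e-3) < 1e-9  # 4.5 ms, far below 30-120 s grace periods

# Hypothetical usage of the proposed parameter (not executed here):
# trainer = Trainer(broadcast_sigterm_every_n_steps=10)
```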
📚 Documentation preview 📚: https://pytorch-lightning--21640.org.readthedocs.build/en/21640/