
Add broadcast_sigterm_every_n_steps to reduce SIGTERM broadcast overhead#21640

Open
c-pozzi wants to merge 7 commits into Lightning-AI:master from c-pozzi:fix/sigterm-broadcast-interval-21487

Conversation

@c-pozzi (Contributor) commented Apr 4, 2026

Summary

Adds a broadcast_sigterm_every_n_steps parameter to the Trainer that controls how often SIGTERM status is broadcast across ranks in distributed training. Default is 1 (every step), preserving current behavior.

The NCCL broadcast in _broadcast_sigterm_tensor (#20825) adds a fixed-cost CPU-GPU sync every step. For fast training loops this overhead is significant relative to step time. Benchmark details and measurements in #21487.

Tradeoff

Worst-case SIGTERM detection delay is (N-1) × step_time. This is safe because SIGTERM grace periods are 30–120s, and users who benefit most (fast loops) have the smallest absolute delay (e.g., N=10, step=0.5ms → 4.5ms).

Design

The parameter lives on Trainer, following the existing every_n pattern (log_every_n_steps, check_val_every_n_epoch, reload_dataloaders_every_n_epochs), rather than on Strategy, which owns communication mechanics, not loop-frequency policy.

trainer = Trainer(broadcast_sigterm_every_n_steps=10)

Test plan

  • Validation rejects values < 1
  • Default value is 1
  • Broadcast interval logic correct for N=1, 5, 10
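To make the interval semantics concrete, here is a minimal sketch of the gate the tests above describe. The function name and exact step indexing are illustrative, not the actual Lightning implementation:

```python
def should_broadcast_sigterm(global_step: int, every_n: int) -> bool:
    """Gate the SIGTERM broadcast to every `every_n` steps.

    Hypothetical helper; the real logic lives inside the fit loop
    and may index steps differently.
    """
    if every_n < 1:
        raise ValueError("broadcast_sigterm_every_n_steps must be >= 1")
    return global_step % every_n == 0
```

With every_n=1 this reduces to the current every-step behavior.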

Closes #21487


📚 Documentation preview 📚: https://pytorch-lightning--21640.org.readthedocs.build/en/21640/

Allow users to control how often SIGTERM status is broadcast across
ranks, reducing CPU-GPU sync overhead for fast training loops while
preserving the default every-step behavior.

Closes Lightning-AI#21487
@github-actions github-actions Bot added the pl Generic label for PyTorch Lightning package label Apr 4, 2026
pre-commit-ci Bot and others added 3 commits April 4, 2026 11:14
When broadcast_sigterm_every_n_steps > 1, SIGTERM could arrive between
broadcasts near the end of an epoch. Without a forced check, rank 0
would exit while other ranks hang waiting at the next collective
(e.g. validation barrier). This adds a forced broadcast whenever the
epoch ends or validation is about to start.
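The forced-flush rule described in this commit could be sketched roughly as follows; the boolean flags are hypothetical stand-ins for the loop's actual state:

```python
def should_broadcast(global_step: int, every_n: int,
                     epoch_ending: bool, starting_validation: bool) -> bool:
    """Broadcast on the regular interval, plus a forced flush at epoch
    boundaries so rank 0 cannot exit while other ranks wait at the
    next collective."""
    if epoch_ending or starting_validation:
        return True
    return global_step % every_n == 0
```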
@c-pozzi c-pozzi marked this pull request as ready for review April 6, 2026 09:06
@c-pozzi (Contributor, Author) commented Apr 6, 2026

The test was setting _devices_flag=2 to simulate multi-GPU, but the
code checks trainer.world_size (from strategy.world_size) which
remained 1 in the test Trainer. Mock the property directly instead.
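Patching a read-only property like this is typically done with unittest.mock.PropertyMock; a self-contained sketch (the Strategy class here is a stand-in, not Lightning's):

```python
from unittest import mock

class Strategy:
    @property
    def world_size(self) -> int:
        return 1  # derived from the cluster environment in real code

strategy = Strategy()
# strategy.world_size = 2 would raise AttributeError; patch the
# property on the class instead:
with mock.patch.object(
    type(strategy), "world_size", new_callable=mock.PropertyMock
) as ws:
    ws.return_value = 2
    assert strategy.world_size == 2
assert strategy.world_size == 1  # restored after the context exits
```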
@codecov (bot) commented Apr 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79%. Comparing base (bb7820f) to head (4aa8af7).
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (bb7820f) and HEAD (4aa8af7). Click for more details.

HEAD has 2053 uploads less than BASE
| Flag | BASE (bb7820f) | HEAD (4aa8af7) |
| --- | --- | --- |
| cpu | 503 | 42 |
| python | 36 | 3 |
| lightning_fabric | 162 | 0 |
| pytest | 252 | 0 |
| python3.13 | 108 | 9 |
| lightning | 179 | 15 |
| python3.11 | 71 | 6 |
| python3.12 | 144 | 12 |
| python3.10 | 36 | 3 |
| python3.12.7 | 108 | 9 |
| pytorch2.1 | 36 | 6 |
| pytest-full | 251 | 42 |
| pytorch_lightning | 162 | 27 |
| pytorch2.7 | 18 | 3 |
| pytorch2.8 | 36 | 6 |
| pytorch2.10 | 36 | 6 |
| pytorch2.3 | 17 | 3 |
| pytorch2.2.2 | 18 | 3 |
| pytorch2.9 | 36 | 6 |
| pytorch2.5.1 | 18 | 3 |
| pytorch2.4.1 | 18 | 3 |
| pytorch2.6 | 18 | 3 |
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21640     +/-   ##
=========================================
- Coverage      87%      79%     -8%     
=========================================
  Files         270      267      -3     
  Lines       23934    23885     -49     
=========================================
- Hits        20713    18809   -1904     
- Misses       3221     5076   +1855     

c-pozzi and others added 2 commits April 6, 2026 13:26
- Fix epoch boundary test: mock world_size property instead of
  _devices_flag, which didn't affect trainer.world_size
- Rewrite interval test to call real advance() with mocked distributed
  instead of reimplementing the logic in the test
- Add ddp_spawn integration test exercising real NCCL broadcasts on
  2 GPUs with non-aligned step count to trigger epoch-end flush
@mojtababahrami

This is a great way to handle it! Super thanks!

@c-pozzi (Contributor, Author) commented Apr 13, 2026

I have been thinking about this and I wonder if exposing this as a user parameter is the right abstraction. The optimal value is step-time dependent which the user can't easily reason about. Would the maintainers prefer an adaptive internal approach that auto-tunes based on measured step time? Happy to rework if that approach seems valuable. @mojtababahrami @justusschock @deependujha

@mojtababahrami

Ohh sorry I'm not a maintainer. Just a user :)

@aagaskar

Isn't this going to cause a problem when on_advance_end hits

if not self._is_training_done and self.trainer.received_sigterm:
    raise SIGTERMException

The first rank will raise the exception at the end of any batch, but the others will continue and then hang.

I think a better approach would be a single check at the end of on_advance_end, where the existing one is. And across all ranks, the check should only run if self._should_accumulate() is True. That should minimize overhead since whenever gradients are accumulated there is already synchronization.

@deependujha (Collaborator) commented Apr 16, 2026

Hi @aagaskar, I think you meant when self._should_accumulate() is False, since that’s when gradient synchronization happens across ranks.

That would make it a natural synchronization point for the SIGTERM check, ensuring all ranks observe it together and preventing the desync/hang scenario.
It also avoids introducing another argument to the trainer if this can be handled reliably at existing sync points.
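A rough sketch of that sync-point idea, assuming simple modulo-based accumulation (the real Lightning logic also has to handle the last batches of an epoch):

```python
def is_gradient_sync_step(batch_idx: int, accumulate_grad_batches: int) -> bool:
    """True on the last micro-batch of each accumulation window,
    i.e. when _should_accumulate() would be False and gradients
    already synchronize across ranks."""
    return (batch_idx + 1) % accumulate_grad_batches == 0

def sigterm_check_due(batch_idx: int, accumulate_grad_batches: int) -> bool:
    # Piggyback the SIGTERM broadcast on the existing sync point,
    # so all ranks observe it together.
    return is_gradient_sync_step(batch_idx, accumulate_grad_batches)
```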


Also, the current SIGTERM check seems quite off, as it happens at the start of the training batch, and the corresponding test is odd in that it intentionally adds a sleep so that other batches have progressed: https://github.com/Lightning-AI/pytorch-lightning/pull/20825/changes#diff-411516b6debb3ec43a284753ff0e5341932f355b4dcc8fc8684ba0c0acf318cd

@aagaskar

Hi @aagaskar, I think you meant when self._should_accumulate() is False, since that’s when gradient synchronization happens across ranks.

Yes, oops, that is what I meant!

@c-pozzi (Contributor, Author) commented Apr 17, 2026

Thanks for the feedback @aagaskar and @deependujha.

I ran some DDP benchmarks with real forward + backward + gradient allreduce to measure the practical overhead. I tested 4 setups on 4× A10G. For this exercise I set every_n=10.

  • baseline: no broadcast
  • every_step: current Lightning behavior
  • every_n: broadcast every 10 steps
  • on_sync_step: broadcast only when _should_accumulate() is False

With accumulate_grad_batches=1, on_sync_step is basically the same as every_step, so it does not remove the overhead in the default case. For medium workloads the overhead is ~1.5–2%, but for very fast sub-ms loops it gets much larger.

With accumulate_grad_batches > 1, on_sync_step works really well and brings the overhead close to zero, since the check only happens on the natural gradient sync step.

So my takeaway is:

  • the sync-point approach seems like the better design from a correctness/API perspective,
  • but it does not seem to fully address the fast-loop overhead when accumulate_grad_batches=1,
  • while it does help a lot when gradients are accumulated.
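The takeaways above can be illustrated by counting how many broadcasts each variant issues per epoch (a toy model that ignores the epoch-end flush):

```python
def broadcasts_per_epoch(num_batches: int, strategy: str,
                         every_n: int = 10,
                         accumulate_grad_batches: int = 1) -> int:
    count = 0
    for i in range(num_batches):
        if strategy == "every_step":
            count += 1
        elif strategy == "every_n":
            count += int(i % every_n == 0)
        elif strategy == "on_sync_step":
            # broadcast only when _should_accumulate() is False
            count += int((i + 1) % accumulate_grad_batches == 0)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return count

# With accumulate_grad_batches=1, on_sync_step degenerates to every_step:
assert broadcasts_per_epoch(100, "on_sync_step") == \
       broadcasts_per_epoch(100, "every_step")
# With accumulation, it drops to one broadcast per optimizer step:
assert broadcasts_per_epoch(100, "on_sync_step", accumulate_grad_batches=4) == 25
```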

Results — accumulate_grad_batches=1 (default)

| Workload | Step time | every_step | on_sync_step | every_n |
| --- | --- | --- | --- | --- |
| small_linear (512) | 0.87 ms | +14.4% | +14.2% | -1.1% |
| medium_linear (2048) | 8.83 ms | +1.5% | +1.7% | +0.1% |
| small_mlp (1024) | 6.98 ms | +1.9% | +1.8% | -0.0% |

Results — accumulate_grad_batches=4

| Workload | Step time | every_step | on_sync_step | every_n |
| --- | --- | --- | --- | --- |
| small_linear (512) | 0.40 ms | +37.4% | -4.2% | -2.9% |
| medium_linear (2048) | 0.42 ms | +32.3% | -0.6% | +1.4% |
| small_mlp (1024) | 0.77 ms | +17.0% | -0.2% | +0.8% |

I'm willing to rework this PR around on_advance_end, triggering the SIGTERM broadcast/check only when _should_accumulate() is False, and to drop the new Trainer argument if that is the preferred direction.

I can also share the benchmark script if useful.
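For reference, per-step overhead in benchmarks like these can be measured with a simple timing harness. This is an illustrative sketch, not the actual script; on real hardware a torch.distributed broadcast would take the place of extra_fn:

```python
import time

def mean_step_time(step_fn, extra_fn=None, iters: int = 1000) -> float:
    """Average wall-clock time per step, with an optional per-step
    extra (e.g. a SIGTERM broadcast) to compare against a baseline."""
    start = time.perf_counter()
    for i in range(iters):
        step_fn(i)
        if extra_fn is not None:
            extra_fn(i)
    return (time.perf_counter() - start) / iters

baseline = mean_step_time(lambda i: sum(range(100)))
with_extra = mean_step_time(lambda i: sum(range(100)),
                            extra_fn=lambda i: None)
overhead_pct = 100.0 * (with_extra - baseline) / baseline
```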

@deependujha (Collaborator)

Hi @c-pozzi, thanks for the great work. I'm currently exploring why we even need this communication at all, and trying to fix how it was originally supposed to work (including reverting the previously merged PR).


Labels

pl Generic label for PyTorch Lightning package

Projects

None yet

Development

Successfully merging this pull request may close these issues.

forced CPU-GPU optimization at the end of every training step resulting in underutilization of GPU

4 participants