
Add broadcast_sigterm_every_n_steps to reduce SIGTERM broadcast overhead#21640

Open
c-pozzi wants to merge 7 commits into Lightning-AI:master from c-pozzi:fix/sigterm-broadcast-interval-21487

Conversation

@c-pozzi (Contributor) commented Apr 4, 2026

Summary

Adds a broadcast_sigterm_every_n_steps parameter to the Trainer that controls how often SIGTERM status is broadcast across ranks in distributed training. Default is 1 (every step), preserving current behavior.

The NCCL broadcast in _broadcast_sigterm_tensor (#20825) adds a fixed-cost CPU-GPU sync every step. For fast training loops this overhead is significant relative to step time. Benchmark details and measurements in #21487.

Tradeoff

Worst-case SIGTERM detection delay is (N-1) × step_time. This is safe because SIGTERM grace periods are 30–120s, and users who benefit most (fast loops) have the smallest absolute delay (e.g., N=10, step=0.5ms → 4.5ms).

Design

The parameter lives on Trainer, following the existing every_n pattern (log_every_n_steps, check_val_every_n_epoch, reload_dataloaders_every_n_epochs), rather than on Strategy, which owns communication mechanics, not loop-frequency policy.

trainer = Trainer(broadcast_sigterm_every_n_steps=10)

Test plan

  • Validation rejects values < 1
  • Default value is 1
  • Broadcast interval logic correct for N=1, 5, 10
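To make the interval semantics concrete, here is a minimal sketch of the gate the tests above describe. The function name and exact step indexing are illustrative, not the actual Lightning implementation:

```python
def should_broadcast_sigterm(global_step: int, every_n: int) -> bool:
    """Gate the SIGTERM broadcast to every `every_n` steps.

    Hypothetical helper; the real logic lives inside the fit loop
    and may index steps differently.
    """
    if every_n < 1:
        raise ValueError("broadcast_sigterm_every_n_steps must be >= 1")
    return global_step % every_n == 0
```

With every_n=1 this reduces to the current every-step behavior.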

Closes #21487


📚 Documentation preview 📚: https://pytorch-lightning--21640.org.readthedocs.build/en/21640/

Allow users to control how often SIGTERM status is broadcast across
ranks, reducing CPU-GPU sync overhead for fast training loops while
preserving the default every-step behavior.

Closes Lightning-AI#21487
@github-actions github-actions Bot added the pl Generic label for PyTorch Lightning package label Apr 4, 2026
pre-commit-ci Bot and others added 3 commits April 4, 2026 11:14
When broadcast_sigterm_every_n_steps > 1, SIGTERM could arrive between
broadcasts near the end of an epoch. Without a forced check, rank 0
would exit while other ranks hang waiting at the next collective
(e.g. validation barrier). This adds a forced broadcast whenever the
epoch ends or validation is about to start.
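The forced-flush rule described in this commit could be sketched roughly as follows; the boolean flags are hypothetical stand-ins for the loop's actual state:

```python
def should_broadcast(global_step: int, every_n: int,
                     epoch_ending: bool, starting_validation: bool) -> bool:
    """Broadcast on the regular interval, plus a forced flush at epoch
    boundaries so rank 0 cannot exit while other ranks wait at the
    next collective."""
    if epoch_ending or starting_validation:
        return True
    return global_step % every_n == 0
```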
@c-pozzi c-pozzi marked this pull request as ready for review April 6, 2026 09:06
@c-pozzi (Contributor, Author) commented Apr 6, 2026

The test was setting _devices_flag=2 to simulate multi-GPU, but the
code checks trainer.world_size (from strategy.world_size) which
remained 1 in the test Trainer. Mock the property directly instead.
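Patching a read-only property like this is typically done with unittest.mock.PropertyMock; a self-contained sketch (the Strategy class here is a stand-in, not Lightning's):

```python
from unittest import mock

class Strategy:
    @property
    def world_size(self) -> int:
        return 1  # derived from the cluster environment in real code

strategy = Strategy()
# strategy.world_size = 2 would raise AttributeError; patch the
# property on the class instead:
with mock.patch.object(
    type(strategy), "world_size", new_callable=mock.PropertyMock
) as ws:
    ws.return_value = 2
    assert strategy.world_size == 2
assert strategy.world_size == 1  # restored after the context exits
```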
@codecov (bot) commented Apr 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79%. Comparing base (bb7820f) to head (4aa8af7).
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (bb7820f) and HEAD (4aa8af7). Click for more details.

HEAD has 2053 uploads less than BASE
| Flag | BASE (bb7820f) | HEAD (4aa8af7) |
| --- | --- | --- |
| cpu | 503 | 42 |
| python | 36 | 3 |
| lightning_fabric | 162 | 0 |
| pytest | 252 | 0 |
| python3.13 | 108 | 9 |
| lightning | 179 | 15 |
| python3.11 | 71 | 6 |
| python3.12 | 144 | 12 |
| python3.10 | 36 | 3 |
| python3.12.7 | 108 | 9 |
| pytorch2.1 | 36 | 6 |
| pytest-full | 251 | 42 |
| pytorch_lightning | 162 | 27 |
| pytorch2.7 | 18 | 3 |
| pytorch2.8 | 36 | 6 |
| pytorch2.10 | 36 | 6 |
| pytorch2.3 | 17 | 3 |
| pytorch2.2.2 | 18 | 3 |
| pytorch2.9 | 36 | 6 |
| pytorch2.5.1 | 18 | 3 |
| pytorch2.4.1 | 18 | 3 |
| pytorch2.6 | 18 | 3 |
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21640     +/-   ##
=========================================
- Coverage      87%      79%     -8%     
=========================================
  Files         270      267      -3     
  Lines       23934    23885     -49     
=========================================
- Hits        20713    18809   -1904     
- Misses       3221     5076   +1855     

c-pozzi and others added 2 commits April 6, 2026 13:26
- Fix epoch boundary test: mock world_size property instead of
  _devices_flag, which didn't affect trainer.world_size
- Rewrite interval test to call real advance() with mocked distributed
  instead of reimplementing the logic in the test
- Add ddp_spawn integration test exercising real NCCL broadcasts on
  2 GPUs with non-aligned step count to trigger epoch-end flush
@mojtababahrami

This is a great way to handle it! Super thanks!

@c-pozzi (Contributor, Author) commented Apr 13, 2026

I have been thinking about this and I wonder if exposing this as a user parameter is the right abstraction. The optimal value is step-time dependent which the user can't easily reason about. Would the maintainers prefer an adaptive internal approach that auto-tunes based on measured step time? Happy to rework if that approach seems valuable. @mojtababahrami @justusschock @deependujha

@mojtababahrami

Ohh sorry I'm not a maintainer. Just a user :)

@aagaskar

Isn't this going to cause a problem when on_advance_end hits

if not self._is_training_done and self.trainer.received_sigterm:
    raise SIGTERMException

The first rank will raise the exception at the end of any batch, but the others will continue and then hang.

I think a better approach would be a single check at the end of on_advance_end, where the existing one is. And across all ranks, the check should only run if self._should_accumulate() is True. That should minimize overhead since whenever gradients are accumulated there is already synchronization.

@deependujha (Collaborator) commented Apr 16, 2026

Hi @aagaskar, I think you meant when self._should_accumulate() is False, since that’s when gradient synchronization happens across ranks.

That would make it a natural synchronization point for the SIGTERM check, ensuring all ranks observe it together and preventing the desync/hang scenario.
It also avoids introducing another argument to the trainer if this can be handled reliably at existing sync points.
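A rough sketch of that sync-point idea, assuming simple modulo-based accumulation (the real Lightning logic also has to handle the last batches of an epoch):

```python
def is_gradient_sync_step(batch_idx: int, accumulate_grad_batches: int) -> bool:
    """True on the last micro-batch of each accumulation window,
    i.e. when _should_accumulate() would be False and gradients
    already synchronize across ranks."""
    return (batch_idx + 1) % accumulate_grad_batches == 0

def sigterm_check_due(batch_idx: int, accumulate_grad_batches: int) -> bool:
    # Piggyback the SIGTERM broadcast on the existing sync point,
    # so all ranks observe it together.
    return is_gradient_sync_step(batch_idx, accumulate_grad_batches)
```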


Also, the current SIGTERM check seems quite off, as it happens at the start of the training batch, and the corresponding test is odd in that it intentionally adds a sleep so that other batches have progressed: https://github.com/Lightning-AI/pytorch-lightning/pull/20825/changes#diff-411516b6debb3ec43a284753ff0e5341932f355b4dcc8fc8684ba0c0acf318cd

@aagaskar

Hi @aagaskar, I think you meant when self._should_accumulate() is False, since that’s when gradient synchronization happens across ranks.

Yes, oops, that is what I meant!

@c-pozzi (Contributor, Author) commented Apr 17, 2026

Thanks for the feedback @aagaskar and @deependujha.

I ran some DDP benchmarks with real forward + backward + gradient allreduce to measure the practical overhead. I tested 4 setups on 4× A10G. For this exercise I set every_n=10.

  • baseline: no broadcast
  • every_step: current Lightning behavior
  • every_n: broadcast every 10 steps
  • on_sync_step: broadcast only when _should_accumulate() is False

With accumulate_grad_batches=1, on_sync_step is basically the same as every_step, so it does not remove the overhead in the default case. For medium workloads the overhead is ~1.5–2%, but for very fast sub-ms loops it gets much larger.

With accumulate_grad_batches > 1, on_sync_step works really well and brings the overhead close to zero, since the check only happens on the natural gradient sync step.

So my takeaway is:

  • the sync-point approach seems like the better design from a correctness/API perspective,
  • but it does not seem to fully address the fast-loop overhead when accumulate_grad_batches=1,
  • while it does help a lot when gradients are accumulated.
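The takeaways above can be illustrated by counting how many broadcasts each variant issues per epoch (a toy model that ignores the epoch-end flush):

```python
def broadcasts_per_epoch(num_batches: int, strategy: str,
                         every_n: int = 10,
                         accumulate_grad_batches: int = 1) -> int:
    count = 0
    for i in range(num_batches):
        if strategy == "every_step":
            count += 1
        elif strategy == "every_n":
            count += int(i % every_n == 0)
        elif strategy == "on_sync_step":
            # broadcast only when _should_accumulate() is False
            count += int((i + 1) % accumulate_grad_batches == 0)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return count

# With accumulate_grad_batches=1, on_sync_step degenerates to every_step:
assert broadcasts_per_epoch(100, "on_sync_step") == \
       broadcasts_per_epoch(100, "every_step")
# With accumulation, it drops to one broadcast per optimizer step:
assert broadcasts_per_epoch(100, "on_sync_step", accumulate_grad_batches=4) == 25
```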

Results — accumulate_grad_batches=1 (default)

| Workload | Step time | every_step | on_sync_step | every_n |
| --- | --- | --- | --- | --- |
| small_linear (512) | 0.87 ms | +14.4% | +14.2% | -1.1% |
| medium_linear (2048) | 8.83 ms | +1.5% | +1.7% | +0.1% |
| small_mlp (1024) | 6.98 ms | +1.9% | +1.8% | -0.0% |

Results — accumulate_grad_batches=4

| Workload | Step time | every_step | on_sync_step | every_n |
| --- | --- | --- | --- | --- |
| small_linear (512) | 0.40 ms | +37.4% | -4.2% | -2.9% |
| medium_linear (2048) | 0.42 ms | +32.3% | -0.6% | +1.4% |
| small_mlp (1024) | 0.77 ms | +17.0% | -0.2% | +0.8% |

I'm willing to rework this PR around on_advance_end, triggering the SIGTERM broadcast/check only when _should_accumulate() is False, and to drop the new Trainer argument if that is the preferred direction.

I can also share the benchmark script if useful.
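For reference, per-step overhead in benchmarks like these can be measured with a simple timing harness. This is an illustrative sketch, not the actual script; on real hardware a torch.distributed broadcast would take the place of extra_fn:

```python
import time

def mean_step_time(step_fn, extra_fn=None, iters: int = 1000) -> float:
    """Average wall-clock time per step, with an optional per-step
    extra (e.g. a SIGTERM broadcast) to compare against a baseline."""
    start = time.perf_counter()
    for i in range(iters):
        step_fn(i)
        if extra_fn is not None:
            extra_fn(i)
    return (time.perf_counter() - start) / iters

baseline = mean_step_time(lambda i: sum(range(100)))
with_extra = mean_step_time(lambda i: sum(range(100)),
                            extra_fn=lambda i: None)
overhead_pct = 100.0 * (with_extra - baseline) / baseline
```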

@deependujha (Collaborator)

Hi @c-pozzi, thanks for the great work. I'm currently exploring why we even need this communication at all, and trying to fix how it was originally supposed to work (including reverting the previously merged PR).


Labels

pl Generic label for PyTorch Lightning package

Projects

None yet

Development

Successfully merging this pull request may close these issues.

forced CPU-GPU optimization at the end of every training step resulting in underutilization of GPU

4 participants