fix: prevent replication neighbor sync from blocking shutdown under active traffic#70

Merged
jacderida merged 1 commit into rc-2026.4.1 from fix/replication-shutdown-hang-under-traffic on Apr 14, 2026
Conversation

@jacderida
Collaborator

Summary

  • The replication engine's neighbor sync loop ran run_neighbor_sync_round() outside its tokio::select! block, preventing shutdown cancellation from being noticed during long sync rounds
  • Wrap the sync round in tokio::select! with shutdown.cancelled() so in-progress operations are cancelled immediately
  • Add 10-second timeout to replication engine task joins in shutdown() as defense in depth

Context

Discovered during auto-upgrade testing on a 100-node testnet with active client uploads. The companion fix in saorsa-core (saorsa-labs/saorsa-core#fix/dht-shutdown-hang-under-traffic) resolved the DHT shutdown hang for 98% of nodes. The remaining 2% were stuck at engine.shutdown().await in the replication engine, where the neighbor sync task was mid-round when shutdown fired.

Test plan

  • Both fixes together tested on 151-node testnet (27 AWS + 14 Vultr + 15 DO standard + 16 DO NAT + 7 bootstrap) with 3 clients uploading 10MB files continuously
  • 151/151 services restarted cleanly, 0 hangs, 0 upload failures throughout the 2-hour rollout window
  • cargo check clean

Depends on: saorsa-labs/saorsa-core PR for the DHT shutdown fix (same root cause pattern, different code layer)

🤖 Generated with Claude Code

…ctive traffic

The neighbor sync loop ran `run_neighbor_sync_round()` outside of its
`tokio::select!` block. When shutdown was signalled mid-round, the task
could not notice until the entire sync round completed, which involves
multiple network round-trips to peers that may themselves be shutting
down, causing extended blocking.

Wrap the sync round in a `tokio::select!` with `shutdown.cancelled()` so
in-progress operations are cancelled immediately when shutdown fires.

Also add a 10-second timeout to the replication engine's task joins in
`shutdown()` as defense in depth, matching the same pattern applied to
`DhtNetworkManager::stop()`.

Discovered during auto-upgrade testing on a 151-node testnet with active
client uploads. The DHT shutdown fix (saorsa-core) resolved 98% of
hangs; this fix resolved the remaining 2%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jacderida jacderida force-pushed the fix/replication-shutdown-hang-under-traffic branch from 95d6dcf to 34366cf on April 14, 2026 13:33
@jacderida
Collaborator Author

Addressed the same timeout-detach issue flagged by Greptile on the companion saorsa-core PR (saorsa-labs/saorsa-core#81). The replication engine's shutdown() had the same pattern — tokio::time::timeout(dur, handle) moves the handle, so a timeout drop detaches rather than aborts. Fixed in 34366cf by passing &mut handle and calling handle.abort() on timeout.

@jacderida jacderida merged commit 87b7d91 into rc-2026.4.1 on Apr 14, 2026
1 check passed
@jacderida jacderida deleted the fix/replication-shutdown-hang-under-traffic branch April 14, 2026 13:46