fix: prevent replication neighbor sync from blocking shutdown under active traffic#70

Merged
jacderida merged 1 commit into rc-2026.4.1 from fix/replication-shutdown-hang-under-traffic on Apr 14, 2026
Conversation

@jacderida
Collaborator

Summary

  • The replication engine's neighbor sync loop ran run_neighbor_sync_round() outside its tokio::select! block, preventing shutdown cancellation from being noticed during long sync rounds
  • Wrap the sync round in tokio::select! with shutdown.cancelled() so in-progress operations are cancelled immediately
  • Add 10-second timeout to replication engine task joins in shutdown() as defense in depth

Context

Discovered during auto-upgrade testing on a 100-node testnet with active client uploads. The companion fix in saorsa-core (saorsa-labs/saorsa-core#fix/dht-shutdown-hang-under-traffic) resolved the DHT shutdown hang for 98% of nodes. The remaining 2% were stuck at engine.shutdown().await in the replication engine, where the neighbor sync task was mid-round when shutdown fired.

Test plan

  • Both fixes together tested on 151-node testnet (27 AWS + 14 Vultr + 15 DO standard + 16 DO NAT + 7 bootstrap) with 3 clients uploading 10MB files continuously
  • 151/151 services restarted cleanly, 0 hangs, 0 upload failures throughout the 2-hour rollout window
  • cargo check clean

Depends on: saorsa-labs/saorsa-core PR for the DHT shutdown fix (same root cause pattern, different code layer)

🤖 Generated with Claude Code

…ctive traffic

The neighbor sync loop ran `run_neighbor_sync_round()` outside of its
`tokio::select!` block. When shutdown was signalled mid-round, the task
could not notice until the entire sync round completed, which involves
multiple network round-trips to peers that may themselves be shutting
down, causing extended blocking.

Wrap the sync round in a `tokio::select!` with `shutdown.cancelled()` so
in-progress operations are cancelled immediately when shutdown fires.

Also add a 10-second timeout to the replication engine's task joins in
`shutdown()` as defense in depth, matching the same pattern applied to
`DhtNetworkManager::stop()`.

Discovered during auto-upgrade testing on a 151-node testnet with active
client uploads. The DHT shutdown fix (saorsa-core) resolved 98% of
hangs; this fix resolved the remaining 2%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jacderida jacderida force-pushed the fix/replication-shutdown-hang-under-traffic branch from 95d6dcf to 34366cf on April 14, 2026 13:33
@jacderida
Collaborator Author

Addressed the same timeout-detach issue flagged by Greptile on the companion saorsa-core PR (saorsa-labs/saorsa-core#81). The replication engine's shutdown() had the same pattern — tokio::time::timeout(dur, handle) moves the handle, so a timeout drop detaches rather than aborts. Fixed in 34366cf by passing &mut handle and calling handle.abort() on timeout.

@jacderida jacderida merged commit 87b7d91 into rc-2026.4.1 on Apr 14, 2026
1 check passed
@jacderida jacderida deleted the fix/replication-shutdown-hang-under-traffic branch April 14, 2026 13:46