Conversation
9e31084 to
395a527
Compare
There was a problem hiding this comment.
Pull request overview
This PR implements a maximum reconnection attempt limit for peer connections in the Ouroboros Network peer selection governor. Peers that fail to connect repeatedly are now forgotten after exceeding a configurable retry limit, with exceptions for localroot peers, bootstrap relays, and manually configured public root peers which can retry indefinitely.
Key changes:
- Adds
policyMaxConnectionRetriesfield (default: 5) toPeerSelectionPolicy - Implements retry limit enforcement in the cold peer promotion failure handler
- Adds a
forgottenboolean flag to promotion failure trace events to indicate when a peer is removed - Updates all trace pattern matches and JSON serialization to handle the new parameter
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
ouroboros-network/lib/Ouroboros/Network/PeerSelection/Governor/Types.hs |
Adds policyMaxConnectionRetries field and forgotten parameter to trace constructors |
ouroboros-network/lib/Ouroboros/Network/PeerSelection/Governor/EstablishedPeers.hs |
Implements retry limit logic to forget peers after max failures, exempting localroots and bootstrap peers |
ouroboros-network/lib/Ouroboros/Network/Diffusion/Policies.hs |
Sets default policyMaxConnectionRetries value to 5 |
ouroboros-network/orphan-instances/Ouroboros/Network/OrphanInstances.hs |
Updates JSON serialization to include forgotten field |
dmq-node/src/DMQ/Diffusion/PeerSelection.hs |
Adds policyMaxConnectionRetries to DMQ policy |
cardano-diffusion/tests/lib/Test/Cardano/Network/PeerSelection/MockEnvironment.hs |
Updates test trace patterns for new forgotten parameter |
cardano-diffusion/tests/lib/Test/Cardano/Network/PeerSelection.hs |
Updates test trace pattern matches |
cardano-diffusion/tests/lib/Test/Cardano/Network/Diffusion/Testnet.hs |
Updates test trace pattern matches |
ouroboros-network/changelog.d/20251128_094205_karl.fb.knutsson_max_recon_main.md |
Adds changelog entry |
dmq-node/changelog.d/20251128_104246_karl.fb.knutsson_max_recon_main.md |
Adds changelog entry |
cardano-diffusion/changelog.d/20251128_104055_karl.fb.knutsson_max_recon_main.md |
Adds changelog entry |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
| unForgetAble = LocalRootPeers.member peeraddr localRootPeers || | ||
| (memberExtraPeers peeraddr (PublicRootPeers.getExtraPeers publicRootPeers)) | ||
| (publicRootPeers', knownPeers'', forgotten) = | ||
| if unForgetAble || failCount < policyMaxConnectionRetries |
There was a problem hiding this comment.
[nitpick] The variable name unForgetAble has inconsistent capitalization. Consider using unforgettable or unForgettable for better consistency with Haskell naming conventions (camelCase for local variables).
| unForgetAble = LocalRootPeers.member peeraddr localRootPeers || | |
| (memberExtraPeers peeraddr (PublicRootPeers.getExtraPeers publicRootPeers)) | |
| (publicRootPeers', knownPeers'', forgotten) = | |
| if unForgetAble || failCount < policyMaxConnectionRetries | |
| unForgettable = LocalRootPeers.member peeraddr localRootPeers || | |
| (memberExtraPeers peeraddr (PublicRootPeers.getExtraPeers publicRootPeers)) | |
| (publicRootPeers', knownPeers'', forgotten) = | |
| if unForgettable || failCount < policyMaxConnectionRetries |
| unForgetAble = LocalRootPeers.member peeraddr localRootPeers || | ||
| (memberExtraPeers peeraddr (PublicRootPeers.getExtraPeers publicRootPeers)) | ||
| (publicRootPeers', knownPeers'', forgotten) = | ||
| if unForgetAble || failCount < policyMaxConnectionRetries | ||
| then ( publicRootPeers | ||
| , KnownPeers.setConnectTimes (Map.singleton peeraddr (delay `addTime` now)) | ||
| knownPeers' | ||
| , False | ||
| ) | ||
| else ( PublicRootPeers.difference differenceExtraPeers publicRootPeers | ||
| (Set.singleton peeraddr) | ||
| , KnownPeers.delete (Set.singleton peeraddr) knownPeers' | ||
| , True | ||
| ) |
There was a problem hiding this comment.
The new behavior of forgetting peers after policyMaxConnectionRetries failed connection attempts lacks test coverage. Consider adding a test that verifies:
- Peers that are not localroots or bootstrap peers are forgotten after the maximum number of connection failures
- Localroot peers and bootstrap peers are never forgotten regardless of connection failure count
- The
forgottenflag is correctly set toTruein the trace when a peer is forgotten
4344590 to
72a8bdb
Compare
f41e82c to
8981df7
Compare
Enforce a maximum limit on the number of times we will attempt to promote a peer to warm. Localroot peers, bootstrap relays and manually configured public root peers are exempt from this limit. The clearing of the reconnection counter is delayd until a connection has managed to be active for a specific time (currently 120s).
Incase of an error use a shorter timeout when waiting for chainsync to exit.
Exclude shutdown peers in active peers calculations. It can take a while for peers to exit because blockfetch has to sync with chainsync as it exits. But we shouldn't count those peers as active or preferred anymore.
With p2p peerselection and the keepalive protocol we are not that dependant on chainsync timeout for detecting bad upstream peers. By bumping the timeout from between 135s and 269s to between 601s and 911s we change the false positive rate from something that happens a few times per epoch to something that happens less than once in a decade.
0d9de7f to
48bdfb3
Compare
Since #5253, peer selection can forget know peers. This patch relaxes the property. Peer selection now logs the set of forgotten peers via `TraceForgottenPeers`, which is used by the property.
Since #5253, peer selection can forget know peers. This patch relaxes the property. Peer selection now logs the set of forgotten peers via `TraceForgottenPeers`, which is used by the property.
Description
This is a series a changes that makes the node more robust.
Implements #5252
Checklist
Quality
Maintenance
ouroboros-networkproject.