Skip to content

More robust diffusion#5253

Merged
karknu merged 5 commits intomainfrom
karknu/max_recon_main
Jan 12, 2026
Merged

More robust diffusion#5253
karknu merged 5 commits intomainfrom
karknu/max_recon_main

Conversation

@karknu
Copy link
Copy Markdown
Contributor

@karknu karknu commented Nov 28, 2025

Description

This is a series a changes that makes the node more robust.

Implements #5252

Checklist

Quality

  • Commit sequence makes sense and have useful messages, see ref.
  • New tests are added and existing tests are updated.
  • Self-reviewed the PR.

Maintenance

  • Linked an issue or added the PR to the current sprint of ouroboros-network project.
  • Added labels.
  • Updated changelog files.
  • The documentation has been properly updated, see ref.

@github-project-automation github-project-automation Bot moved this to In Progress in Ouroboros Network Nov 28, 2025
@karknu karknu force-pushed the karknu/max_recon_main branch from 9e31084 to 395a527 Compare November 28, 2025 09:46
@karknu karknu added the outbound-governor Issues / PRs related to outbound-governor label Nov 28, 2025
@karknu karknu marked this pull request as ready for review November 28, 2025 10:35
@karknu karknu requested a review from a team as a code owner November 28, 2025 10:35
Copy link
Copy Markdown
Collaborator

@coot coot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread dmq-node/changelog.d/20251128_104246_karl.fb.knutsson_max_recon_main.md Outdated
Comment thread ouroboros-network/changelog.d/20251128_094205_karl.fb.knutsson_max_recon_main.md Outdated
Comment thread cardano-diffusion/changelog.d/20251128_104055_karl.fb.knutsson_max_recon_main.md Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a maximum reconnection attempt limit for peer connections in the Ouroboros Network peer selection governor. Peers that fail to connect repeatedly are now forgotten after exceeding a configurable retry limit, with exceptions for localroot peers, bootstrap relays, and manually configured public root peers which can retry indefinitely.

Key changes:

  • Adds policyMaxConnectionRetries field (default: 5) to PeerSelectionPolicy
  • Implements retry limit enforcement in the cold peer promotion failure handler
  • Adds a forgotten boolean flag to promotion failure trace events to indicate when a peer is removed
  • Updates all trace pattern matches and JSON serialization to handle the new parameter

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
ouroboros-network/lib/Ouroboros/Network/PeerSelection/Governor/Types.hs Adds policyMaxConnectionRetries field and forgotten parameter to trace constructors
ouroboros-network/lib/Ouroboros/Network/PeerSelection/Governor/EstablishedPeers.hs Implements retry limit logic to forget peers after max failures, exempting localroots and bootstrap peers
ouroboros-network/lib/Ouroboros/Network/Diffusion/Policies.hs Sets default policyMaxConnectionRetries value to 5
ouroboros-network/orphan-instances/Ouroboros/Network/OrphanInstances.hs Updates JSON serialization to include forgotten field
dmq-node/src/DMQ/Diffusion/PeerSelection.hs Adds policyMaxConnectionRetries to DMQ policy
cardano-diffusion/tests/lib/Test/Cardano/Network/PeerSelection/MockEnvironment.hs Updates test trace patterns for new forgotten parameter
cardano-diffusion/tests/lib/Test/Cardano/Network/PeerSelection.hs Updates test trace pattern matches
cardano-diffusion/tests/lib/Test/Cardano/Network/Diffusion/Testnet.hs Updates test trace pattern matches
ouroboros-network/changelog.d/20251128_094205_karl.fb.knutsson_max_recon_main.md Adds changelog entry
dmq-node/changelog.d/20251128_104246_karl.fb.knutsson_max_recon_main.md Adds changelog entry
cardano-diffusion/changelog.d/20251128_104055_karl.fb.knutsson_max_recon_main.md Adds changelog entry

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.

Comment thread ouroboros-network/lib/Ouroboros/Network/PeerSelection/Governor/Types.hs Outdated
Comment on lines +518 to +521
unForgetAble = LocalRootPeers.member peeraddr localRootPeers ||
(memberExtraPeers peeraddr (PublicRootPeers.getExtraPeers publicRootPeers))
(publicRootPeers', knownPeers'', forgotten) =
if unForgetAble || failCount < policyMaxConnectionRetries
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The variable name unForgetAble has inconsistent capitalization. Consider using unforgettable or unForgettable for better consistency with Haskell naming conventions (camelCase for local variables).

Suggested change
unForgetAble = LocalRootPeers.member peeraddr localRootPeers ||
(memberExtraPeers peeraddr (PublicRootPeers.getExtraPeers publicRootPeers))
(publicRootPeers', knownPeers'', forgotten) =
if unForgetAble || failCount < policyMaxConnectionRetries
unForgettable = LocalRootPeers.member peeraddr localRootPeers ||
(memberExtraPeers peeraddr (PublicRootPeers.getExtraPeers publicRootPeers))
(publicRootPeers', knownPeers'', forgotten) =
if unForgettable || failCount < policyMaxConnectionRetries

Copilot uses AI. Check for mistakes.
Comment on lines +518 to +531
unForgetAble = LocalRootPeers.member peeraddr localRootPeers ||
(memberExtraPeers peeraddr (PublicRootPeers.getExtraPeers publicRootPeers))
(publicRootPeers', knownPeers'', forgotten) =
if unForgetAble || failCount < policyMaxConnectionRetries
then ( publicRootPeers
, KnownPeers.setConnectTimes (Map.singleton peeraddr (delay `addTime` now))
knownPeers'
, False
)
else ( PublicRootPeers.difference differenceExtraPeers publicRootPeers
(Set.singleton peeraddr)
, KnownPeers.delete (Set.singleton peeraddr) knownPeers'
, True
)
Copy link

Copilot AI Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new behavior of forgetting peers after policyMaxConnectionRetries failed connection attempts lacks test coverage. Consider adding a test that verifies:

  1. Peers that are not localroots or bootstrap peers are forgotten after the maximum number of connection failures
  2. Localroot peers and bootstrap peers are never forgotten regardless of connection failure count
  3. The forgotten flag is correctly set to True in the trace when a peer is forgotten

Copilot uses AI. Check for mistakes.
@karknu karknu force-pushed the karknu/max_recon_main branch from 4344590 to 72a8bdb Compare December 1, 2025 08:46
@coot coot moved this from In Progress to In Review in Ouroboros Network Dec 8, 2025
@coot coot moved this from In Review to In Progress in Ouroboros Network Dec 8, 2025
@karknu karknu force-pushed the karknu/max_recon_main branch 2 times, most recently from f41e82c to 8981df7 Compare January 12, 2026 10:48
@karknu karknu changed the title Enforce max reconnection attempts More robust diffusion Jan 12, 2026
Copy link
Copy Markdown
Collaborator

@coot coot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread ouroboros-network/lib/Ouroboros/Network/PeerSelection/State/KnownPeers.hs Outdated
@karknu karknu added this pull request to the merge queue Jan 12, 2026
@karknu karknu removed this pull request from the merge queue due to a manual request Jan 12, 2026
Enforce a maximum limit on the number of times we will attempt to
promote a peer to warm. Localroot peers, bootstrap relays and manually
configured public root peers are exempt from this limit.

The clearing of the reconnection counter is delayd until a connection
has managed to be active for a specific time (currently 120s).
Incase of an error use a shorter timeout when waiting for chainsync to
exit.
Exclude shutdown peers in active peers calculations.
It can take a while for peers to exit because blockfetch has to
sync with chainsync as it exits. But we shouldn't count those peers as
active or preferred anymore.
With p2p peerselection and the keepalive protocol we are not that
dependant on chainsync timeout for detecting bad upstream peers.

By bumping the timeout from between 135s and 269s to between 601s
and 911s we change the false positive rate from something that
happens a few times per epoch to something that happens less than
once in a decade.
@karknu karknu force-pushed the karknu/max_recon_main branch from 0d9de7f to 48bdfb3 Compare January 12, 2026 11:55
@karknu karknu enabled auto-merge January 12, 2026 11:55
@karknu karknu added this pull request to the merge queue Jan 12, 2026
Merged via the queue into main with commit cbd2526 Jan 12, 2026
12 checks passed
@karknu karknu deleted the karknu/max_recon_main branch January 12, 2026 12:42
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Ouroboros Network Jan 12, 2026
@coot coot linked an issue Feb 18, 2026 that may be closed by this pull request
coot added a commit that referenced this pull request Mar 19, 2026
Since #5253, peer selection can forget know peers.  This patch relaxes
the property.  Peer selection now logs the set of forgotten peers via
`TraceForgottenPeers`, which is used by the property.
coot added a commit that referenced this pull request Mar 19, 2026
Since #5253, peer selection can forget know peers.  This patch relaxes
the property.  Peer selection now logs the set of forgotten peers via
`TraceForgottenPeers`, which is used by the property.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

outbound-governor Issues / PRs related to outbound-governor

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Too long timeout when blockfetch fails.

3 participants